Job Description:
About Us: We provide expert support and consulting for open-source analytics and data infrastructure platforms.
We serve mission-critical, high-volume systems and rely on our team to keep them fast, stable, and available. Our remote-first team works across multiple time zones (US, Brazil, Europe, India, Philippines) supporting 100+ customer environments with SLAs ranging from advisory support to 24/7 incident coverage.
About the Role: This is a hands-on role where you will be close to real incidents, engineers, and customers. You'll be expected to bring in practices you've already used successfully in previous service or managed-services environments.
Your Key Responsibilities include:
- Designing an on-call plan that ensures all critical skills are available when needed
- Owning the incident management process for your accounts
- Defining key service metrics (e.g., MTTA, MTTR, SLA compliance, backlog health)
- Acting as incident lead/coordinator during major incidents, keeping engineers focused and customers informed
- Creating and maintaining SOPs, runbooks, and triage guides for SRE engineers, covering common incident types and operational tasks
- Training and coaching first-line and SRE teams so they can handle initial triage, basic troubleshooting, and clear communication, escalating only when needed

Configuration Management & Readiness:
- Establishing a configuration management process that tracks each customer's environment, platforms, clusters, regions, configs, access, monitoring, and key contacts
- Proactively closing information gaps by working directly with customers and engineers
- Ensuring configuration information is accessible and trustworthy during incidents and when onboarding new engineers
Customer Communication & Governance:
- Being the primary operational contact for enterprise customers
- Leading regular service reviews and status calls, presenting SLA performance, key incidents, risks, and improvement actions
- Presenting and agreeing on incident management processes with customers: channels, priorities, escalation paths, and expectations
- Working closely with Account Management and Sales on renewals, expansions, and expectation management
- Clarifying what is in scope vs. out of scope, and working with customers and Sales to shape paid change requests as additional work arises
- Monitoring effort vs. contract and helping protect margins by flagging risks early: under-scoped contracts, chronic over-use, and under-utilized capacity
- Working in a matrix environment, coordinating with different technical teams to staff and deliver engagements effectively

Onboarding & Training:
- Designing and maintaining onboarding paths (shadowing, training, SOPs and environment overviews, certification for certain incident types)
- Ensuring new team members reach a productive, independent state quickly and safely

You will contribute significantly towards delivering these responsibilities.