We are seeking an experienced Service Delivery Manager to join our team.
Job Description:
This is a hands-on role that requires ownership of service operations, including SLAs and incident processes, on-call and skills coverage, SOPs and first-line/SRE enablement, configuration management, SLA metrics and reporting, and coordination between customers and our engineering teams.
About the Role:
* Design and maintain an on-call and coverage plan that ensures all critical skills are available when needed.
* Own the incident management process for your accounts, including priorities, roles, communication cadence, escalations, and post-incident reviews.
* Define and monitor key service metrics (e.g., MTTA, MTTR, SLA compliance, backlog health) and drive improvements based on them.
* Act as incident lead/coordinator during major incidents, keeping engineers focused and customers informed.
SOPs, Runbooks & First-Line Enablement:
* Create and maintain SOPs, runbooks, and triage guides for first-line/SRE teams, covering common incident types and operational tasks.
* Train and coach first-line/SRE teams so they can confidently handle initial triage, basic troubleshooting, and clear communication, escalating only when needed.
* Continuously refine documentation based on real incident experience and feedback.
Configuration Management & Readiness:
* Establish and run a configuration management process that keeps track of each customer's environment (platforms in use, clusters, regions, configs, access, monitoring, key contacts).
* Proactively close information gaps by working directly with customers and engineers.
* Ensure configuration information is available and trustworthy during incidents and for onboarding new engineers.
Customer Communication & Governance:
* Be the primary operational contact for a set of enterprise customers.
* Lead regular service reviews and status calls, presenting SLA performance, key incidents, risks, and improvement actions.
* Present and agree on the incident management process with customers (channels, priorities, escalation paths, expectations).
Required Skills and Qualifications:
* 5+ years in a Service Delivery, Managed Services, IT Operations, or Enterprise Support role serving external customers.
* Experience with 24/7 or extended-hours operations, including on-call or follow-the-sun setups.
* Hands-on experience with incident management and ITSM practices (incident/problem/change), ideally in an ITIL-inspired environment.
* A track record of creating or improving SOPs/runbooks and training first-line / SRE teams.
* Experience maintaining configuration/environment data for customer systems.
Benefits:
This role offers a remote-first environment with flexible contract structures. Regular overlap with European and North American business hours is expected.
Nice to Have:
* Background with data, analytics, or streaming platforms (e.g., Druid, Kafka, Flink, StarRocks, ClickHouse, TiDB, Hadoop, cloud data warehouses).
* Experience working in small, fast-moving, remote teams.
What Success Looks Like in 6–12 Months:
* On-call coverage is clear, predictable, and sustainable; engineers know when they're on and what's expected.
* First-line/SREs handle a meaningful share of incidents without escalation, using well-maintained runbooks.
* You can open a customer's configuration, see an accurate picture, and use it during incidents and planning.
* SLA and incident metrics are tracked, reported, and discussed regularly with customers and internally.