We're looking for a professional to take ownership of our service operations: service-level agreements (SLAs), incident processes, on-call and skills coverage, standard operating procedures (SOPs), configuration management, SLA metrics, and reporting.
About the Role
This is a hands-on position that requires close collaboration with engineers and customers. You'll be responsible for designing and maintaining an on-call plan, owning incident management processes, defining key service metrics, and driving improvements.
1. Service Operations, On-Call & Incidents
o Design and maintain an on-call plan to ensure critical skills are available when needed.
o Own incident management processes, including priorities, roles, communication cadence, escalations, and post-incident reviews.
o Define and monitor key service metrics (e.g., MTTA, MTTR, SLA compliance, backlog health) and drive improvements.
o Act as incident lead/coordinator during major incidents, keeping engineers focused and customers informed.
Responsibilities
You will:
1. Create and maintain SOPs, runbooks, and triage guides for SRE engineers, covering common incident types and operational tasks.
2. Train and coach first-line/SRE teams to handle initial triage, basic troubleshooting, and clear communication, escalating only when needed.
3. Continuously refine documentation based on real incident experience and feedback.
4. Establish and run a configuration management process to keep track of each customer's environment (platforms in use, clusters, regions, configs, access, monitoring, key contacts).
5. Proactively close information gaps by working directly with customers and engineers.
6. Ensure configuration information is available and trustworthy during incidents and for onboarding new engineers.
7. Be the primary operational contact for a set of enterprise customers.
8. Lead regular service reviews and status calls, presenting SLA performance, key incidents, risks, and improvement actions.
9. Present and agree on the incident management process with customers (channels, priorities, escalation paths, expectations).
10. Work closely with Account Management / Sales on renewals, expansions, and expectation management.
11. Clarify what is in scope vs. out of scope and work with customers and Sales to shape paid change requests when additional work is needed.
12. Monitor effort vs. contract, help protect margins, and flag risks early (under-scoped contracts, chronic over-use, under-utilized capacity).
13. Work in a matrix environment, coordinating with different technical teams to staff and deliver engagements effectively.
14. Design and maintain onboarding paths for new engineers joining support/delivery.
15. Ensure new team members reach a productive, independent state quickly and safely.
What Success Looks Like
In 6–12 months, you'll achieve:
* A clear, predictable, and sustainable on-call coverage plan.
* First-line/SRE teams handling a meaningful share of incidents without escalation, using well-maintained runbooks.
* The ability to open a customer's configuration, see an accurate picture, and use it during incidents and planning.
* SLA and incident metrics tracked, reported, and discussed regularly with customers and internally.
* Customers having a clear understanding of how incidents are handled and feeling confident in the process.
* New engineers ramping up faster thanks to structured onboarding and training.
Requirements
To succeed in this role, you'll need:
* 5+ years in a Service Delivery, Managed Services, IT Operations, or Enterprise Support role serving external customers.
* Experience with 24/7 or extended-hours operations, including on-call or follow-the-sun setups.
* Hands-on experience with incident management and ITSM practices (incident/problem/change), ideally in an ITIL-inspired environment.
* A track record of creating or improving SOPs/runbooks and training first-line / SRE teams.
* Experience maintaining configuration / environment data for customer systems.
* Comfort discussing technical topics with engineers (cloud, distributed systems, data platforms) and explaining them in clear business terms to customers.
* Experience in commercial delivery: scope boundaries, change requests, effort vs. revenue, working alongside Sales / Account Management.
* Strong communication skills in English, both written and spoken.
Nice to have:
* Background with data, analytics, or streaming platforms (e.g., Druid, Kafka, Flink, StarRocks, ClickHouse, TiDB, Hadoop, cloud data warehouses).
* Experience working in small, fast-moving, remote teams.
Location & Working Style
We're a remote-first team, collaborating online across multiple time zones. The role requires regular overlap with European and North American business hours. We're flexible on contract structure (direct employment, employment via a global payroll partner, or a contractor/B2B arrangement), depending on your location and preference.