Job Overview
The Site Reliability Engineer will be responsible for ensuring the stability and performance of our systems.
* Incident Management: Handle major incidents through the Critical Issue Response System (CIRS), providing frequent updates until resolution.
* Troubleshooting: Perform deep-dive application troubleshooting to identify preventive actions and improve overall system reliability.
* CIRS Requests: Manage CIRS-related requests including deployments, feature toggles, and data fixes.
* Production Support: Follow up on major production incidents and coordinate with cross-functional teams.
* Monitoring: Enhance monitoring capabilities using tools like Dynatrace, Kibana, and Splunk.
* Scripting: Write and improve monitoring scripts and alerts based on incident learnings.
* Customer Escalations: Handle customer escalations and coordinate with Support & Engineering teams.
Requirements
* DevOps Expertise: Deep experience in DevOps and Production Support is required.
* Automation Skills: Experience with automation and CI/CD practices is essential.
* Cloud Platforms: Familiarity with cloud platforms (GCP, AWS, or Azure) is preferred.
* Monitoring Tools: Hands-on experience with monitoring tools such as Dynatrace, Kibana, and Splunk is required.
* Troubleshooting Ability: Strong troubleshooting skills and the ability to dive deep into application issues are necessary.