Site Reliability Professional
Your primary responsibility will be to manage critical incidents through our Critical Issue Response System (CIRS), providing regular status updates until each incident is resolved.
You will be accountable for conducting in-depth application troubleshooting, identifying preventive measures, and handling CIRS-related requests including deployments, feature toggles, and data corrections.
Enhancing monitoring capabilities with tools such as Dynatrace, Kibana, and Splunk is an essential part of this role, as is developing and refining monitoring scripts and alerts based on lessons learned from incidents.
You will handle customer escalations, collaborate with the Support and Engineering teams, support planned activities, and respond to ad-hoc requests from other teams.
Key Requirements and Qualifications:
* Extensive experience in DevOps and Production Support
* Experience in automation and CI/CD practices
* Familiarity with cloud platforms (GCP, AWS, or Azure preferred)
* Hands-on experience with monitoring tools such as Dynatrace, Kibana, and Splunk
* Strong analytical skills and the ability to investigate application issues in depth
* Excellent communication and coordination skills across teams