Job Overview
As a Site Reliability Engineer, you play a vital role in ensuring the smooth operation of our systems. Your primary responsibility is to identify and resolve production issues efficiently.
Key Responsibilities
* Incident Management:
Handle major incidents through our Critical Issue Response System (CIRS) and provide frequent status updates until resolution.
* Deep-Dive Troubleshooting:
Perform deep-dive application troubleshooting and identify preventive actions to minimize future occurrences.
* CIRS Follow-Up and Collaboration:
Follow up on major production incidents and coordinate with cross-functional teams to ensure effective issue resolution.
* Monitoring Enhancement:
Enhance monitoring capabilities using tools like Dynatrace, Kibana, and Splunk.
* Scripting and Alert Writing:
Write and improve monitoring scripts and alerts based on incident learnings.
* Customer Escalations:
Handle customer escalations and coordinate with Support & Engineering teams to provide timely resolutions.
* Support and Planning:
Support planned activities and respond to ad-hoc requests from CES teams.
Requirements and Qualifications
* DevOps and Production Support:
Deep experience in DevOps and Production Support.
* Automation and CI/CD Practices:
Experience in automation and CI/CD practices.
* Cloud Platforms:
Familiarity with cloud platforms (GCP, AWS, or Azure preferred).
* Monitoring Tools:
Hands-on experience with monitoring tools such as Dynatrace, Kibana, and Splunk.
* Troubleshooting Skills:
Strong troubleshooting skills and the ability to dive deep into application issues.
* Communication Skills:
Excellent communication and coordination skills across teams.
Please submit your resume in English.