Site Reliability Engineering (SRE) Role
This position involves creating and maintaining scalable and highly reliable software systems. The role requires a combination of software engineering principles and infrastructure management expertise.
Main Responsibilities:
* Design and implement monitoring tools to ensure continuous system reliability and performance.
* Respond promptly to incidents affecting system reliability, perform root cause analysis, and implement corrective actions.
* Streamline change management processes to enhance system performance and reliability.
* Collaborate with development teams to identify and resolve system-related issues and automate routine tasks.
* Ensure the scalability and reliability of systems, meeting high performance and efficiency standards.
Key Skills and Qualifications:
* Proficiency in monitoring tools like Azure Monitor, Application Insights, Prometheus, and Grafana.
* Experience with Infrastructure as Code (Terraform, ARM/Bicep, Pulumi, etc.) and release management tooling (ArgoCD, Harness, Octopus, etc.).
* Knowledge of incident alerting tools (PagerDuty, Opsgenie) and container orchestration platforms like Kubernetes and AKS.
About the Job:
The ideal candidate will have a strong background in SRE principles, experience with DevOps practices, and excellent problem-solving skills.
Requirements:
* Strong understanding of SRE principles and best practices.
* Experience with cloud platforms (AWS, Azure, Google Cloud).
* Proficiency in programming languages like Python, Java, or C++.
* Familiarity with agile methodologies and version control systems like Git.
Benefits:
This is an exciting opportunity to work on challenging projects, collaborate with talented professionals, and grow your career in the field of Site Reliability Engineering.