About the Role
We are seeking an experienced Site Reliability Engineer to join our team. In this role, you will be responsible for designing, implementing, and maintaining scalable and highly reliable software systems.
Key Responsibilities
* Develop and implement monitoring tools to ensure continuous uptime and reliability of applications;
* Act swiftly in response to emergency situations impacting system reliability in production environments, and perform root cause analysis for ongoing incidents;
* Oversee and streamline change management processes to enhance system performance and reliability, and own releases to production environments;
* Collaborate with development teams throughout the software lifecycle, helping to resolve system-related issues and eliminating toil by automating routine tasks;
* Focus on the reliability and scalability of systems, ensuring high performance and efficiency standards.
Required Skills and Qualifications
* Proficiency in monitoring tools such as Azure Monitor, Application Insights, Prometheus, and Grafana, and in project tracking and version control tools such as Jira, SVN, and GitHub;
* Expertise with Infrastructure as Code (Terraform, ARM/Bicep, Pulumi, etc.) and release management tooling (ArgoCD, Harness, Octopus, etc.);
* Experience with incident alerting tools (PagerDuty, Opsgenie) and container orchestration platforms such as Kubernetes and AKS.
Benefits and Other Information
This is a full-time work-from-home opportunity. The ideal candidate will have excellent communication skills and be able to work independently. We offer a competitive salary and benefits package.