Mission
Our Cloud and SRE department is looking for a specialist to reinforce our team. This person will be key to ensuring the stability, performance, and resilience of our SaaS environments, and to promoting automation and operational excellence at scale.
The mission of this position is to ensure that our systems are always available, performant, and secure. The selected candidate will be responsible for applying software engineering principles to solve operational challenges and for promoting a culture of data-driven decisions, automation, and reliability.
This role is open to people with disabilities.
Responsibilities
 * Define and monitor SLIs/SLOs and reliability indicators by product and environment;
 * Contribute to the design and continuous improvement of observability pipelines (tracing, metrics, and logs);
 * Lead the incident response process, conduct post-mortems, and drive corrective actions;
 * Collaborate with development teams to build resilience in from the application design stage;
 * Automate operational tasks, autoscaling, and capacity management in cloud environments;
 * Help write and share runbooks, playbooks, and disaster recovery strategies;
 * Support a culture of operational excellence with a focus on continuous improvement, predictability, and failure prevention;
 * Use application performance monitoring (APM) tools to diagnose and mitigate bottlenecks.
Requirements
 1. Experience with multi-cloud environments (OCI and AWS) and their management and automation tools;
 2. Solid experience in SRE, DevOps, or Production Engineering in critical environments;
 3. Strong command of observability practices: metrics, logs, tracing, and alerts (e.g., Datadog, Prometheus, Grafana);
 4. Advanced knowledge of automation and IaC (Terraform, Ansible, CDK, etc.);
 5. Familiarity with CI/CD pipelines (e.g., GitHub Actions, GitLab, Azure DevOps);
 6. Experience with containers and orchestration (Docker, Kubernetes, ECS, EKS);
 7. Solid foundation in distributed systems, networking, scalability, and capacity management;
 8.