Site Reliability Engineering Manager

Campinas

Posted on April 26

Description

Site Reliability Engineering Manager

- Provisioning: Automates secure delivery of databases, queues, caches, and AI services/resources; builds self-service and guardrails.
- Reliability: Ensures availability, performance, and resilience of data, middleware, and AI services in production; builds observability, auto-healing, and disaster recovery.

What You Will Do

- Lead and grow the team: Manage, mentor, and develop SRE engineers across Provisioning and Reliability.
- Lead cross-team initiatives:
  - Align roadmaps with Architects and partner teams; ensure adoption of architecture standards.
  - Run design reviews and architecture sign-offs; surface and mitigate risks and complexity early.
  - Translate standards into guardrails and automation (policy-as-code, self-service) for consistent delivery.
  - Apply lightweight RACI and clear escalation paths to resolve trade-offs quickly.
- Drive roadmap and execution:
  - Provisioning: Self-service engines, Crossplane/Terraform-based automation, policy-as-code, secured pipelines, access management, backups/restore, Azure AI resource provisioning, and quota management.
  - Reliability: Observability integrations (New Relic, Azure Monitor), performance tuning, auto-healing, DR/BCP, resilience testing, failover automation, SLOs, and reliability for Azure AI services and messaging platforms (Kafka, Event Hubs, Service Bus).
- Establish engineering excellence:
  - Infrastructure as Code (Terraform/Crossplane) and CI/CD best practices.
  - Change management with safe deploys and rollback strategies.
  - Incident and problem management with blameless postmortems.
  - Continuous improvement loops measured by DORA/SRE metrics.
- Champion security and compliance by design:
  - Policy as code, least-privilege access, and secrets/identity hygiene.
  - Guardrails in self-service flows; auditability and evidence collection.
  - Partnership with Security/Compliance for standards and reviews.
- Partner with the Architects:
  - Co-develop platform architecture and service standards.
  - Define SLIs/SLOs and capacity/reliability patterns for core services.
  - Align roadmaps and run design reviews for high-impact changes.
- Own delivery outcomes:
  - Navigate competing priorities across a broad platform scope, balancing reactive operational load (incidents, toil, on-call) against proactive platform investment (self-service, automation, resilience) without a clean separation between the two.
  - Make and communicate prioritization decisions under ambiguity, with partial information, across teams that have conflicting urgency.
  - Maintain a defensible, visible backlog that reflects real risk and business impact, not just the loudest stakeholder.
- Operate a healthy on-call:
  - High-quality playbooks and automation-first troubleshooting (AI-assisted).
  - Actionable alerts with SLO-based paging and noise reduction.
  - Regular resilience testing and post-incident hardening.

Initiatives You Will Lead

- Self-service provisioning for databases, queues, and caches with golden configurations and policy guardrails.
- AI-assisted troubleshooting for provisioning and production incidents.
- Platform-wide observability integration for data, middleware, and AI services (New Relic, Azure Monitor).
- Automated DR runbooks and resilience/chaos testing in production.
- Performance tuning at service and query layers, including automated tuning workflows.
- Standardization of provisioning via Terraform/Crossplane for databases, messaging, and AI services.
- Governance for Azure AI services (quotas, access, safety guardrails) with clear consumption patterns for product teams.

Success Metrics

- Reduced MTTR and incident count; improved SLO attainment for data/middleware services.
- Improved lead time for change and change failure rate; increased automation coverage and reduced toil.
- Faster time to provision and higher first-success rate for self-service requests.
- Measurable improvements in cost efficiency, performance, and capacity predictability.
- Team health: engagement, growth, hiring velocity, stress levels, and retention.
- SLO attainment for AI endpoints and messaging services; reduced alert noise via improved observability (New Relic/Azure Monitor).

What You Will Bring

- Proven experience managing SRE/platform/infrastructure teams delivering production-critical services.
- Deep familiarity with Azure and the team stack: PostgreSQL, MongoDB, Cosmos DB; Redis; messaging systems such as CloudAMQP/RabbitMQ, Kafka, Event Hubs, and Service Bus.
- Strong reliability fundamentals: SLOs/SLIs, incident and problem management, capacity, DR/BCP, performance tuning.
- Solid automation background: IaC (Terraform/Crossplane), CI/CD (Azure DevOps), GitOps, policy-as-code, secrets and identity, RBAC.
- Track record of building self-service platforms and reducing toil.
- Excellent cross-functional leadership with product, security, and compliance partners.
- Experience operating Azure AI services in production (Azure OpenAI, Cognitive Services, AI Search).
- Observability experience with New Relic, Azure Monitor, and OpenTelemetry.

What You'll Need

- Advanced skills in English.
- Experience with AI-assisted operations and troubleshooting.
- Observability expertise (Prometheus/Grafana, New Relic, Azure Monitor, OpenTelemetry).
- Database performance engineering and query optimization.
- Experience in regulated environments and security frameworks.
- FinOps capabilities (cost governance, forecasting, rightsizing, quotas, budgets, chargeback/showback).

How We Work

- Collaboration first, with a strong partnership between the Engineering Manager and the Architects.
- Automation by default; security and reliability are non-negotiable.
- Blameless postmortems, continuous learning, and measurable outcomes.
- Participation in an equitable on-call rotation with high-quality runbooks and automation.

Tech Environment

- Azure (AKS, managed databases, storage, networking, identity).
- Azure AI services (Azure OpenAI, Cognitive Services, AI Search).
- Azure DevOps for CI/CD.
- CloudAMQP (RabbitMQ).
- Databases: PostgreSQL, MongoDB, Cosmos DB.
- Caches: Redis.
- Queues/Brokers: Kafka, Event Hubs, Service Bus.
- Terraform/Crossplane, GitOps.
- Observability: New Relic, Azure Monitor, logs, metrics, traces, alerting workflows.
