Key Responsibilities:
* Oversee the continuous health, performance, and availability of enterprise monitoring tools such as Splunk, Dynatrace, and New Relic.
* Perform routine maintenance, upgrades, and configuration tuning to ensure optimal system performance.
* Triage and resolve monitoring-related incidents and service tickets promptly and efficiently.
* Collaborate with cross-functional teams, including application, infrastructure, and DevOps, to integrate monitoring solutions and enhance visibility.
* Develop and maintain dashboards, alerts, and reports to support operational and business objectives.
* Participate in on-call rotations and support incident response efforts.
* Document operational procedures, runbooks, and knowledge base articles.
* Identify and implement automation opportunities to reduce manual effort and improve reliability.
Required Skills and Qualifications:
* 5+ years of experience in systems engineering or enterprise monitoring roles.
* Hands-on experience with Splunk, Dynatrace, and New Relic in production environments.
* Strong understanding of IT operations, incident management, and ticketing systems such as ServiceNow and Jira.
* Proficiency in scripting languages such as Python, PowerShell, and Bash for automation and tool integration.
* Familiarity with cloud platforms (AWS, Azure, or GCP) and containerized environments (Kubernetes and Docker).
* Excellent troubleshooting skills and a bias for action in high-pressure situations.
* Strong written and verbal communication skills in English; Portuguese is an asset.