Posted on: 20/11/2025
Job Description :
Key Responsibilities :
- Build, integrate, and optimize AIOps platforms to improve incident detection, root cause analysis, and automated remediation.
- Develop and maintain automation scripts for repetitive operational tasks using Python, Shell, or PowerShell.
- Implement intelligent anomaly detection and predictive analytics using ML/AI models.
- Configure, manage, and enhance monitoring tools such as Datadog, New Relic, Dynatrace, Prometheus, Grafana, Splunk, ELK, or similar.
- Develop advanced dashboards and observability pipelines for logs, metrics, traces, and events.
- Ensure high availability and performance of monitoring and alerting systems.
- Lead triage during critical incidents and work with engineering teams to reduce MTTR.
- Perform root cause analysis using AI-driven insights and create preventive action plans.
- Build self-healing workflows and automated incident response systems.
- Work closely with DevOps teams to integrate AIOps into CI/CD pipelines.
- Manage and support cloud infrastructure on AWS / Azure / GCP.
- Implement and optimize cloud-native services for logging, monitoring, and autoscaling.
- Build and manage data pipelines for operational data across logs, metrics, traces, events, and alerts.
- Work with ML engineers to integrate models into operational workflows.
- Ensure data quality, integrity, and real-time processing for AIOps systems.
- Collaborate with SRE, DevOps, Infrastructure, and Security teams to enhance operational intelligence.
- Create clear documentation, runbooks, workflows, and automation playbooks.
- Train internal teams on AIOps processes and automation tools.
Required Skills & Experience :
Technical Expertise :
- 4+ years of experience in AIOps, DevOps, SRE, or Cloud Operations roles.
- Hands-on experience with at least one AIOps or observability platform : Dynatrace, Moogsoft, Datadog, BigPanda, Splunk ITSI, New Relic, Elastic APM, etc.
- Strong scripting/programming skills in Python, Shell, Go, or PowerShell.
- Experience with ML/AI models for anomaly detection, forecasting, or event correlation (preferred, not required).
- Strong knowledge of cloud platforms (AWS, Azure, GCP) and cloud-native monitoring services.
- Strong understanding of Kubernetes, Docker, and microservices monitoring.
- Experience with CI/CD tools such as Jenkins, GitHub Actions, GitLab CI, Azure DevOps, etc.
- Strong understanding of Infrastructure as Code tools : Terraform, CloudFormation, Ansible.
- Solid knowledge of logs, metrics, traces, APM tools, synthetic monitoring, and alerting frameworks.
- Experience with log aggregation tools (Splunk, ELK, Loki, etc.).
Soft Skills :
- Strong analytical thinking and problem-solving abilities.
- Excellent communication skills to collaborate with cross-functional teams.
- Ability to work in a fast-paced environment and handle critical incidents calmly.
Posted in: DevOps / SRE
Functional Area: ML / DL / AI Research
Job Code: 1577901