Posted on: 13/11/2025
Description :
Location : Pan India Except Mumbai
About the Role :
We are looking for a highly experienced Reliability Architect with strong expertise in proactive monitoring, observability, automation, AIOps/MLOps, and large-scale infrastructure management.
The ideal candidate will drive system reliability, performance optimization, and cross-functional collaboration while leading incident response and mentoring support teams.
Key Responsibilities :
Monitoring & Automation :
- Proactively monitor software systems to prevent incidents and reduce manual intervention.
- Automate routine operational tasks to improve efficiency.
Effective Monitoring & Alerting :
- Design intelligent monitoring systems that trigger symptom-based alerts for early issue detection.
- Configure alert thresholds, anomaly detection rules, and escalation workflows.
Application Performance Monitoring (APM) :
- Implement and manage APM tools such as New Relic, Dynatrace, and AppDynamics.
- Track application performance, identify bottlenecks, and optimize resource utilization.
Log Analysis & Troubleshooting :
- Leverage Splunk (or similar tools) for log analysis, anomaly detection, and incident debugging.
- Improve system reliability through continuous log insights and root cause analysis.
Dashboards & Reporting :
- Build intuitive dashboards visualizing system health, KPIs, and operational metrics.
- Automate scheduled reports for performance trends, reliability metrics, and risk indicators.
Reliability Metrics & Observability :
- Define and track SLOs, SLIs, error budgets, and other reliability benchmarks.
- Apply full-stack observability practices including logs, metrics, distributed tracing, and event correlation.
AI-Driven Monitoring (AIOps/MLOps) :
- Use AIOps to detect anomalies, automate incident response, and build self-healing workflows.
- Integrate ML models with observability tools for predictive insights and performance optimization.
Cross-Team Collaboration :
- Collaborate with development, DevOps, and support teams to enhance service reliability.
- Strengthen release processes through rigorous testing, reviews, and monitoring integration.
Capacity Planning & Performance :
- Participate in architecture and design reviews.
- Ensure systems are scalable, resilient, and optimized for peak performance.
Debugging, Incident Response & Rollbacks :
- Lead major incident response efforts with structured troubleshooting and RCA.
- Manage controlled rollbacks of faulty deployments and ensure minimal service impact.
Mentoring & Knowledge Sharing :
- Mentor L1/L2 support teams, establishing best practices for monitoring and observability.
- Promote a culture of reliability engineering and continuous improvement.
Infrastructure & Tooling :
- Manage infrastructure using tools such as Chef, Ansible, Terraform, Kubernetes, and GitLab CI/CD.
- Support automation, configuration management, and infrastructure-as-code workflows.
Documentation :
- Maintain detailed documentation of processes, architectures, SOPs, and troubleshooting guides.
Proactive Mindset :
- Drive reliability initiatives with ownership, enthusiasm, and a forward-thinking approach.
Desired Skills & Tools :
- AIOps/MLOps platforms
- Splunk, Grafana, Kibana, Prometheus
- New Relic, Dynatrace, AppDynamics
- Terraform, Ansible, Chef
- GitLab CI/CD, Jenkins
- Kubernetes, Docker
- Strong debugging and RCA skills
- Excellent communication and cross-functional collaboration
Posted in : DevOps / SRE
Functional Area : DevOps / Cloud
Job Code : 1574683