Description :

Location : Pan India (except Mumbai)

About the Role :

We are looking for a highly experienced Reliability Architect with strong expertise in proactive monitoring, observability, automation, AIOps/MLOps, and large-scale infrastructure management.

The ideal candidate will drive system reliability, performance optimization, and cross-functional collaboration while leading incident response and mentoring support teams.

Key Responsibilities :

Monitoring & Automation :

- Proactively monitor software systems to prevent incidents and reduce manual intervention.

- Automate routine operational tasks to improve efficiency.

Effective Monitoring & Alerting :

- Design intelligent monitoring systems that trigger symptom-based alerts for early issue detection.

- Configure alert thresholds, anomaly detection rules, and escalation workflows.

Application Performance Monitoring (APM) :

- Implement and manage APM tools such as New Relic, Dynatrace, and AppDynamics.

- Track application performance, identify bottlenecks, and optimize resource utilization.

Log Analysis & Troubleshooting :

- Leverage Splunk (or similar tools) for log analysis, anomaly detection, and incident debugging.

- Improve system reliability through continuous log insights and root cause analysis (RCA).

Dashboards & Reporting :

- Build intuitive dashboards visualizing system health, KPIs, and operational metrics.

- Automate scheduled reports for performance trends, reliability metrics, and risk indicators.

Reliability Metrics & Observability :

- Define and track SLOs, SLIs, error budgets, and other reliability benchmarks.

- Apply full-stack observability practices including logs, metrics, distributed tracing, and event correlation.
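
For candidates less familiar with the terms, the relationship between an SLO and its error budget can be sketched as below. This is only an illustration: the function names, the 30-day window, and the request-based SLI are assumptions made for the example, not a description of the tooling or targets used in this role.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window.

    slo_target is a fraction, e.g. 0.999 for "99.9% available".
    The 30-day default window is an assumption for illustration.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent for a request-based SLI.

    The SLI here is good_events / total_events. A result of 1.0 means the
    budget is untouched, 0.0 means it is exhausted, and a negative value
    means the SLO has been breached.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    allowed_bad = (1.0 - slo_target) * total_events
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad


if __name__ == "__main__":
    # A 99.9% SLO over 30 days leaves roughly 43 minutes of allowed downtime.
    print(round(error_budget_minutes(0.999), 1))
    # 500 failed requests out of 1,000,000 spends about half of a 99.9% budget.
    print(round(budget_remaining(0.999, 999_500, 1_000_000), 2))
```

In practice the SLI would come from a metrics backend rather than hard-coded counts, and the remaining budget would typically feed alerting and release decisions.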

AI-Driven Monitoring (AIOps/MLOps) :

- Use AIOps to detect anomalies, automate incident response, and build self-healing workflows.

- Integrate ML models with observability tools for predictive insights and performance optimization.

Cross-Team Collaboration :

- Collaborate with development, DevOps, and support teams to enhance service reliability.

- Strengthen release processes through rigorous testing, reviews, and monitoring integration.

Capacity Planning & Performance :

- Participate in architecture and design reviews.

- Ensure systems are scalable, resilient, and optimized for peak performance.

Debugging, Incident Response & Rollbacks :

- Lead major incident response efforts with structured troubleshooting and RCA.

- Manage controlled rollbacks of faulty deployments and ensure minimal service impact.

Mentoring & Knowledge Sharing :

- Mentor L1/L2 support teams, establishing best practices for monitoring and observability.

- Promote a culture of reliability engineering and continuous improvement.

Infrastructure & Tooling :

- Manage infrastructure using tools such as Chef, Ansible, Terraform, Kubernetes, and GitLab CI/CD.

- Support automation, configuration management, and infrastructure-as-code workflows.

Documentation :

- Maintain detailed documentation of processes, architectures, SOPs, and troubleshooting guides.

Proactive Mindset :

- Drive reliability initiatives with ownership, enthusiasm, and a forward-thinking approach.

Desired Skills & Tools :

- AIOps/MLOps platforms

- Splunk, Grafana, Kibana, Prometheus

- New Relic, Dynatrace, AppDynamics

- Terraform, Ansible, Chef

- GitLab CI/CD, Jenkins

- Kubernetes, Docker

- Strong debugging and RCA skills

- Excellent communication and cross-functional collaboration