- Reliability Architect with over 10 years of experience in proactive monitoring, automation, and observability.

- Skilled in AIOps/MLOps, infrastructure management, and performance optimization using modern tools and practices.

- Adept at leading incident response, mentoring support teams, and driving cross-functional collaboration to ensure system reliability and scalability.

Key Responsibilities:

- Monitoring and Automation: Proactively monitor software systems to prevent incidents and automate routine operational tasks.

- Effective Monitoring: Design monitoring systems that alert on early symptoms rather than waiting for outages, ensuring early detection and resolution.

- Application Performance Monitoring (APM): Implement and manage APM tools such as New Relic or Dynatrace to track application performance, identify bottlenecks, and optimize resource usage.

- Log Analysis with Splunk: Use Splunk to analyze logs for troubleshooting and anomaly detection, and to improve system reliability.

- Dashboard Preparation: Build intuitive dashboards to visualize system health, performance metrics, and operational KPIs.

- Alert Setup: Configure intelligent alerts based on thresholds and anomalies to ensure timely incident response.

- Report Scheduling: Automate regular reporting to provide insights into system performance, reliability, and trends.

- Reliability Metrics: Define and track metrics such as SLOs, SLIs, and error budgets to measure and maintain system reliability.

- Observability Skills: Apply observability practices, including distributed tracing, logging, and metrics collection, to gain deep insights into system behavior.

- AI-Driven Monitoring & Automation: Use AIOps techniques to proactively detect anomalies, automate incident response, and enable self-healing systems through intelligent alerting and predictive analytics.

- Observability & ML Integration: Integrate machine learning models with observability tools to enhance system insights, optimize performance, and ensure the reliability of AI-powered services in production.

- Cross-Team Collaboration: Work closely with development and support teams to enhance service reliability through rigorous testing and release procedures.

- Capacity Planning: Participate in system design reviews and capacity planning to ensure scalability and performance.

- Debugging and Incident Response: Lead incident response efforts, analyze debugging information, and manage rollbacks of faulty software deployments.

- Mentoring Support Teams: Guide and mentor L1/L2 support teams to establish best practices in monitoring and observability.

- Infrastructure Management: Manage infrastructure using tools such as Chef, Ansible, Terraform, GitLab CI/CD, and Kubernetes.

- Documentation: Maintain comprehensive documentation of processes and procedures to ensure operational consistency and reduce redundancy.

- Proactive Mindset: Approach challenges with enthusiasm, ownership, and a continuous improvement mindset.