About MyOperator :

MyOperator is a Business AI Operator platform that enables businesses, teams, and AI agents to work together seamlessly for customer operations such as Sales, Support, Escalations, Feedback, and Refund processes. With 12,000+ businesses using our platform, we operate at meaningful scale and power mission-critical communication workflows including voice bots, WhatsApp automation, and intelligent call routing.

We are building for reliability, speed, and impact. MyOperator values ownership, critical thinking, and execution. This is a high-expectation, high-learning environment where engineers are empowered to solve complex problems and build systems that directly affect customer outcomes.

Role Overview :

We are looking for a skilled and proactive Site Reliability Engineer (SRE) to take end-to-end ownership of production reliability, observability, and performance engineering across MyOperator's AI-powered communication infrastructure.

This role is not operational-only; it requires strong system design thinking, deep troubleshooting ability, and a production ownership mindset. You will define reliability standards, build observability frameworks, lead incident response, and drive SLO-based engineering practices across distributed AWS and Kubernetes environments.

Key Responsibilities :

- Own production reliability, uptime, latency, and error budgets across critical services.

- Design and manage production-grade monitoring using Grafana, VictoriaMetrics (Prometheus), PromQL, and AWS CloudWatch.

- Define and enforce SLIs, SLOs, and SLA thresholds for AI communication systems (voice bots, WhatsApp APIs, call routing).

- Build real-time operational dashboards for incident response, capacity planning, and leadership visibility.

- Implement end-to-end distributed tracing using OpenTelemetry (OTEL Collector).

- Design and maintain centralized logging with strong correlation between logs, metrics, and traces.

- Create SLO-based alerting systems with minimal noise and fast incident detection.

- Lead the incident response lifecycle: alert triage, mitigation, RCA documentation, and preventive improvements.

- Drive MTTR reduction through structured monitoring, automation, and reliability engineering practices.

- Monitor and troubleshoot AWS EKS (Kubernetes) production workloads.

- Instrument and monitor LLM API integrations, AI inference pipelines, and messaging systems.

- Analyze logs using OpenSearch / ELK for anomaly detection and root cause identification.

- Automate operational workflows using Python or Bash to eliminate manual toil.

- Drive performance optimization, scalability improvements, and capacity planning.

- Collaborate with engineering teams to instrument new services from day one.

Required Skills & Qualifications :

- 3-6 years of experience in Site Reliability Engineering, DevOps, or Platform Engineering roles.

- Hands-on experience with :

1. VictoriaMetrics / Prometheus (time-series monitoring)

2. Grafana dashboards and visualization

3. PromQL for writing complex queries and alerts

- Experience implementing distributed tracing using OpenTelemetry (Mandatory).

- Strong experience with centralized logging systems (ELK / OpenSearch / Loki).

- Experience with alerting frameworks such as Alertmanager or Grafana Alerts.

- Strong understanding of SLIs, SLOs, SLA design, and reliability engineering principles.

- Hands-on experience managing AWS production workloads (EC2, RDS, ELB, CloudWatch, IAM).

- Experience with Kubernetes (AWS EKS preferred).

- Familiarity with CI/CD pipelines and automation tools.

- Good understanding of Linux systems, networking, and cloud infrastructure.

- Experience handling production incidents and participating in on-call rotations.

- Ability to automate operational tasks using Python or Bash.

Good to Have :

- Experience with OpenSearch / ELK log pipelines and anomaly detection.

- Kubernetes monitoring (pod health, node metrics, autoscaling behavior).

- CI/CD observability integration (Jenkins, GitHub Actions).

- Experience in monitoring LLM APIs and AI inference pipelines.

- Familiarity with MLOps or AI observability tools (Arize, WhyLabs, etc.).

- Service mesh exposure (Istio).

- Infrastructure as Code (Terraform, CloudFormation).

- Experience with chaos engineering or load testing tools.

- Multi-cluster or multi-region architecture exposure.

Key Expectations :

- Ownership of production systems and high availability.

- Strong troubleshooting and debugging skills.

- Focus on automation and reliability improvements.

- Proactive approach to incident prevention.

- Ability to reduce alert noise and improve signal quality.

- Data-driven approach to reliability engineering.

This Role Is Not For :

- Candidates with purely development experience and no production ownership.

- Candidates without real incident response or on-call experience.

- Freshers or candidates with less than 3 years of experience.

The job is for:

May work from home