Posted on: 22/12/2025
Description :
Job Title : Senior Staff Site Reliability Engineer
Location : Bangalore
About Movius :
- We are the leading global provider of Secure Communication as a Service (SCaaS).
- Our flagship solution, MultiLine, enhances workflows, resolves compliance gaps and unifies cross-channel messaging.
- Movius AI-powered solutions enable businesses to build strong and lasting relationships with their customers in a company-owned, controllable system.
- Welcome to Phone 3.0.
- Headquartered in Alpharetta, GA, with offices in Silicon Valley, New York, London, and Bangalore, India, Movius partners with leading global wireless carriers such as T-Mobile, Vodafone, TELUS, BT, Singtel, and more.
Your Opportunity :
- In this role, you will be responsible for improving the reliability, scalability, and performance of our production and pre-production systems.
- You will work hands-on in designing and implementing SRE frameworks, automating key reliability workflows, and building a culture of operational excellence.
- You will also work closely with product engineering, QA, and DevOps teams to define SLOs/SLIs, enhance monitoring and alerting, and strengthen our overall reliability practices.
What You'll Do :
Reliability Engineering & Architecture :
- Design and maintain highly available, fault-tolerant systems on AWS.
- Implement service reliability models based on SLOs, SLIs, and error budgets.
- Continuously improve system performance, scalability, and resilience.
Automation & Infrastructure-as-Code (IaC) :
- Develop reusable IaC modules for multi-account and multi-environment AWS setups.
- Automate operational processes for provisioning, scaling, monitoring, and recovery.
Observability & Monitoring :
- Define observability standards and create dashboards using Elastic Stack, Grafana, or Prometheus.
- Implement intelligent alerting using AIOps and anomaly detection tools.
- Work with development teams to ensure proper telemetry and trace coverage.
Incident Management & RCA :
- Lead major incident response and ensure quick service restoration.
- Conduct blameless post-incident reviews and implement preventive actions.
- Create and maintain runbooks, escalation matrices, and reliability playbooks.
Performance & Capacity Planning :
- Analyse performance bottlenecks and propose tuning or optimization strategies.
- Lead capacity forecasting and ensure the system can handle growth demands.
Collaboration & Mentorship :
- Partner with development, QA, and DevOps teams to embed SRE principles.
- Coach and mentor junior engineers on reliability engineering and automation.
Documentation & Knowledge Management :
- Maintain detailed architecture diagrams, design documents, and operational procedures.
- Document SLOs, automation workflows, and change management reports.
Technical Leadership :
- Promote a code-driven, automation-first reliability culture across teams.
What You Bring :
Education :
- Bachelor's degree in Computer Science, Information Technology, or equivalent experience.
Experience :
- 8+ years in SRE or DevOps roles managing large-scale distributed systems.
- Proven hands-on experience in cloud operations (AWS preferred), automation, and CI/CD pipelines.
- Experience in the Telecom domain is an added advantage.
Technical Skills :
- Deep knowledge of AWS (EKS, EC2, RDS, IAM, VPC, Kafka, CloudWatch, API Gateway, Lambda, WAF, KMS).
- Strong Linux administration and networking fundamentals.
- Skilled in Terraform, Jenkins, Git, and scripting (Python, Bash).
- Solid understanding of observability tools (Grafana, Elastic Stack, Prometheus).
- Experience with container orchestration (Kubernetes) and microservices-based systems.
Certifications (Preferred) :
- Terraform Associate or Certified Kubernetes Administrator (CKA).
- SRE Foundation or Google SRE certification is desirable.
Why Join Movius? :
- Work on a global-scale platform serving enterprise customers.
- Be part of a high-performing, innovation-driven engineering team.
- Competitive pay, benefits, and opportunities for professional growth.
Posted in : DevOps / SRE
Functional Area : Site Reliability Engineering
Job Code : 1593980