At Cautio, we are building secure, scalable, and reliable cloud-native solutions that empower businesses to grow confidently. As a Bangalore-based startup, we thrive on solving complex challenges, moving fast, and working collaboratively.

We are looking for a DevOps / Site Reliability Engineer (SRE) who can take ownership of our cloud infrastructure and bring expertise in building highly reliable systems following SRE best practices.

Key Responsibilities :

- Design, build, and maintain Infrastructure as Code (IaC) using Terraform for consistent and scalable environments.

- Develop, monitor, and improve AWS-based infrastructure including API Gateway, ECS, EC2, S3, RDS, and EventBridge.

- Create and manage robust CI/CD pipelines for seamless application deployment and rollbacks.

- Implement and manage containerized workloads with Docker and orchestrate them using Kubernetes (basic to intermediate level).

- Apply SRE best practices : Define and measure SLIs, SLOs, and error budgets; set up monitoring, logging, and alerting; drive incident response and postmortems; automate operational tasks.

- Work with development teams to enable high availability, scalability, and security for applications.

- Apply strong Linux administration and networking knowledge to troubleshoot and optimize systems.

- Continuously analyze system behavior and recommend improvements to optimize performance and cost efficiency.

Required Skills & Experience :

- 2-4 years of hands-on experience in DevOps and cloud infrastructure roles.

- Proficiency in Terraform for infrastructure provisioning.

- Strong expertise with AWS services (API Gateway, ECS, EC2, S3, RDS, EventBridge).

- Experience implementing and maintaining CI/CD pipelines (Jenkins, GitHub Actions, GitLab CI/CD, etc.).

- Good understanding of Linux systems and networking fundamentals (TCP/IP, DNS, VPNs).

- Hands-on experience with Docker and Kubernetes (basic/intermediate level) for containerization and orchestration.

- Practical knowledge of SRE principles : SLIs/SLOs, error budgets, reliability engineering, incident management.

- Excellent debugging, troubleshooting, and problem-solving skills.

- Self-starter with the ability to take ownership and work in a dynamic, fast-paced environment.

- Mandatory : Must be willing to work onsite at our Bangalore office.

Nice-to-Have (Good to Bring, Not Mandatory) :

- Experience with observability tools (Prometheus, Grafana, Datadog, ELK).

- Knowledge of DevSecOps practices for integrating security into CI/CD and infrastructure.

- Scripting experience in Python or Shell for automation.

- Exposure to designing high availability and disaster recovery strategies.

This is a hands-on role where you'll have the opportunity to directly influence how infrastructure and reliability are built for scale in a startup environment.