- System Design and Architecture : Collaborate with cross-functional teams to design and implement scalable, reliable, and efficient systems. Participate in system architecture discussions, provide recommendations, and drive improvements to meet business objectives.

- System Monitoring and Performance : Develop and implement robust monitoring systems to proactively identify and resolve performance bottlenecks, service disruptions, and other issues affecting system reliability. Continuously monitor system performance metrics and optimize resource utilization.

- Incident Response and Troubleshooting : Respond to and resolve production incidents in a timely manner, utilizing strong troubleshooting skills and collaborating with other teams. Conduct root cause analysis to prevent future incidents and implement corrective actions.

- Automation and Tooling : Develop automation tools and scripts to streamline deployment, configuration, and monitoring processes. Implement and maintain CI/CD pipelines to ensure efficient and reliable software delivery.

- Capacity Planning and Scalability : Work closely with development teams to forecast system capacity requirements and plan for scalability. Conduct performance testing and capacity analysis to ensure systems can handle increased loads and peak traffic.

- Security and Compliance : Implement and maintain security measures and best practices to protect our infrastructure and data. Stay up to date with the latest security vulnerabilities and apply necessary patches and upgrades.

- Collaboration and Documentation : Foster strong collaboration with cross-functional teams, including developers, operations, and QA. Document system configurations, processes, and procedures to facilitate knowledge sharing and ensure a smooth handover of responsibilities.

Qualifications and Skills :

- Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent practical experience).

- Strong experience in a Site Reliability Engineering role or a similar capacity, managing large-scale, highly available production systems.

- Proficiency in programming and scripting languages (e.g., Python, Bash, Ruby).

- Deep understanding of Linux/Unix systems and networking concepts.

- Experience with cloud platforms (e.g., AWS, Azure, GCP) and containerization technologies (e.g., Docker, Kubernetes).

- Familiarity with infrastructure-as-code tools (e.g., Terraform, Ansible) and configuration management tools (e.g., Chef, Puppet).

- Knowledge of monitoring and logging tools (e.g., Prometheus, ELK stack) and incident management systems (e.g., PagerDuty).

- Strong problem-solving and analytical skills, with the ability to quickly identify and resolve complex technical issues.

- Excellent communication and collaboration skills, with the ability to work effectively in a team-oriented environment.

Posted by

Shubi

Talent Acquisition at Driffle

Last Active: 29 Apr 2026

Job Views: 285

Applications: 189

Recruiter Actions: 35

Posted in

DevOps / SRE

Functional Area

DevOps / Cloud

Job Code

1630755
