Posted on: 16/07/2025
Key Responsibilities :
- Infrastructure & Automation : Design, deploy, and maintain highly reliable and scalable systems and infrastructure. Automate routine tasks and workflows to improve operational efficiency, using scripting languages such as Python, PowerShell, and Go.
- Monitoring & Incident Management : Build and manage monitoring systems, identify key metrics, and respond to incidents in a timely manner. Lead post-mortem analysis to prevent future incidents and improve system reliability.
- Performance Optimization : Analyze system performance and implement improvements for latency, throughput, and system resource usage.
- Collaboration & Support : Work closely with development teams to ensure that application architectures are robust, scalable, and easy to monitor. Provide guidance on best practices for code deployment and maintenance.
- Capacity Planning : Monitor and forecast infrastructure usage and capacity to ensure systems can handle future demand. Recommend and implement changes to optimize resource allocation.
- Disaster Recovery & Business Continuity : Develop and implement disaster recovery and business continuity plans to ensure that critical services remain available in the event of failures.
- Security & Compliance : Collaborate with security teams to ensure infrastructure and applications meet security best practices and compliance requirements.
Skills and Qualifications :
- Experience : 3-6 years of experience in Site Reliability Engineering, DevOps, or a similar field, with a solid understanding of both software development and system administration.
Technical Expertise :
- Proficient with cloud platforms (AWS, GCP, Azure) and containerization technologies (Docker, Kubernetes).
- Strong experience with monitoring and alerting tools (Prometheus, Grafana, Datadog, etc.).
- Proficiency with infrastructure-as-code and configuration management tools (Terraform, Ansible, Puppet, Chef).
- Experience with CI/CD pipeline management (Jenkins, GitLab, CircleCI).
- Strong knowledge of scripting languages (Python, PowerShell, Go, etc.) for automation tasks.
- Strong understanding of networking, security, and system architecture principles.
Problem-Solving Skills : Excellent analytical and troubleshooting skills, able to diagnose complex technical issues and identify solutions quickly.
Communication : Strong verbal and written communication skills. Ability to explain complex technical concepts to both technical and non-technical stakeholders.
Team Player : Ability to work collaboratively in a cross-functional team, mentoring junior team members and contributing to team success.
Preferred Qualifications :
- Cloud certifications (e.g., AWS Certified Solutions Architect, Google Professional Cloud Architect) are a plus.
- Experience with distributed systems and large-scale infrastructure is highly desirable.
- Experience with service meshes, load balancing, and fault-tolerant architectures.
- Understanding of the software development lifecycle and Agile methodologies.
Posted in : DevOps / SRE
Functional Area : Site Reliability Engineering
Job Code : 1513961