Posted on: 31/10/2025
Key Responsibilities :
Observability & Monitoring :
- Dashboards & Metrics : Design and implement comprehensive dashboards covering OS/platform-level and application-level monitoring, organized into primary indicators (RED: rate, errors, duration) and secondary indicators (USE: utilization, saturation, errors).
- Availability & Reliability : Establish and maintain SLIs, SLOs, and error budgets for the service.
- Performance Monitoring : Build alerting systems and performance monitoring to proactively identify and resolve issues before they impact users.
- Incident Response : Participate in on-call rotations, lead incident response efforts (including post-mortem analysis and remediation), maintain on-call routing, and assign application-level problems to engineering teams.
Infrastructure Automation & Deployment :
- CI/CD Pipeline Management : Build and optimize CI/CD pipelines for speed and resilience.
- Infrastructure as Code : Develop and maintain infrastructure using tools like Terraform, Ansible, or similar.
- Configuration Management : Automate system configuration and ensure consistency across environments.
- Implement and recommend best practices for configuration control.
Security & Compliance :
- Security Automation : Ensure security scanning systems are in place and review escalated vulnerabilities.
- Access Control : Maintain proper authentication, authorization, and audit logging systems.
- Compliance Reporting : Ensure systems meet regulatory and industry standards.
- Security Incident Response : Participate in security incident response and remediation efforts.
Cost Optimization :
- Resource Management : Monitor and optimize cloud resource usage and costs.
- Capacity Planning : Analyze usage patterns and plan for future capacity needs.
- Cost Analysis : Provide recommendations for cost-effective architecture and resource allocation.
- Right-sizing : Implement automated scaling and resource optimization strategies.
Common Services & Platform Engineering :
- Shared Infrastructure : Build and maintain common services (notification systems, caching layers, message queues, or third-party stacks).
- Database Operations : Manage database reliability, performance, and scaling (where not handled by DB teams).
- Service Mesh & Networking : Implement and maintain service discovery, load balancing, and network policies.
- Developer Tools : Create and maintain tools and platforms that improve developer productivity and reliability.
Required Qualifications :
Technical Skills :
- Programming Languages : Proficiency in at least two of Python, Shell, Java, NodeJS, or similar.
- Cloud Platforms : Experience with AWS, GCP, or Azure.
- Containerization : Hands-on experience with Docker, Kubernetes, and container orchestration.
- Monitoring & Observability : Experience with Prometheus, Grafana, ELK stack, or similar tools.
- Infrastructure as Code : Proficiency with Ansible, Terraform, Helm, or similar.
- Version Control : Expert-level Git usage and collaborative development practices.
- CI/CD Pipelines : Hands-on experience with GitLab CI/CD, GitHub Actions, or similar.
SRE-Specific Knowledge :
- Experience defining and maintaining SLOs and SLIs.
- Understanding and implementation of error budget policies.
- Proven track record in toil reduction and automation.
- Experience with capacity planning and performance testing.
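As an illustration of the error budget arithmetic referred to above, here is a minimal sketch. The function names, the 30-day window, and the request counts are assumptions chosen for the example, not part of any specific policy used by the team.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Downtime (in minutes) that an availability SLO permits over the window.

    slo_target is a fraction, e.g. 0.999 for "three nines".
    The 30-day default window is an assumption for this sketch.
    """
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of a request-based error budget still unspent.

    1.0 means nothing has been spent, 0.0 means the budget is exhausted,
    and a negative value means the SLO has been breached.
    """
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures == 0:
        # A 100% target (or zero traffic) leaves no budget to spend.
        return 0.0
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # Hypothetical numbers: a 99.9% SLO over 30 days, 1,000,000 requests, 250 failures.
    print(round(error_budget_minutes(0.999), 1))
    print(round(budget_remaining(0.999, 1_000_000, 250), 2))
```

With a 99.9% target over 30 days, the budget comes to roughly 43 minutes of downtime. An error budget policy typically defines what happens as the remaining fraction approaches zero, for example pausing feature releases in favor of reliability work.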
Preferred Qualifications :
- Bachelor's degree in Computer Science, Engineering, or equivalent experience.
- Experience with microservices and distributed systems.
- Knowledge of security best practices and compliance frameworks.
- Experience with chaos engineering and reliability testing.
- Prior experience in an SRE or DevOps role at a tech company.
- Contributions to open-source projects or technical communities.
Success Metrics :
- Maintain or improve service availability and reliability metrics.
- Demonstrate a reduction in manual operational work through automation.
- Effective participation in incident response and prevention.
- High-quality, well-tested code contributions.
- Strong collaboration with development teams to improve system reliability.
Team Culture & Values :
- Blameless Post-Mortems : Learn from failures without blame.
- Automation First : Prefer automated solutions over manual processes.
- Measure Everything : Data-driven decisions and continuous improvement.
- Knowledge Sharing : Document and share expertise.
- Work-Life Balance : Sustainable on-call practices and a reasonable workload.
Growth Opportunities :
- Work on cutting-edge infrastructure and reliability challenges.
- Exposure to large-scale distributed systems and modern cloud technologies.
- Clear career path toward Senior SRE, Staff Engineer, or Management roles.
- Collaboration with engineering teams across the organization.
Posted in: DevOps / SRE
Functional Area: Site Reliability Engineering
Job Code: 1567946