Key Responsibilities :

Observability & Monitoring :

- Dashboards & Metrics : Design and implement comprehensive dashboards covering OS/platform-level and application-level monitoring, organized into primary indicators (RED: rate, errors, duration) and secondary indicators (USE: utilization, saturation, errors).

- Availability & Reliability : Establish and maintain SLIs, SLOs, and error budgets for the service.

- Performance Monitoring : Build alerting and performance monitoring systems that surface issues early, so they can be resolved before they impact users.

- Incident Response : Participate in on-call rotations, lead incident response efforts (including post-mortem analysis and remediation), maintain on-call routing, and assign application-level problems to engineering teams.

Infrastructure Automation & Deployment :

- CI/CD Pipeline Management : Build and optimize CI/CD pipelines for speed and resilience.

- Infrastructure as Code : Develop and maintain infrastructure using tools like Terraform, Ansible, or similar.

- Configuration Management : Automate system configuration and ensure consistency across environments.

- Configuration Control : Recommend and implement best practices for configuration control.

Security & Compliance :

- Security Automation : Ensure security scanning systems are in place and review escalated vulnerabilities.

- Access Control : Maintain proper authentication, authorization, and audit logging systems.

- Compliance Reporting : Ensure systems meet regulatory and industry standards.

- Security Incident Response : Participate in security incident response and remediation efforts.

Cost Optimization :

- Resource Management : Monitor and optimize cloud resource usage and costs.

- Capacity Planning : Analyze usage patterns and plan for future capacity needs.

- Cost Analysis : Provide recommendations for cost-effective architecture and resource allocation.

- Right-sizing : Implement automated scaling and resource optimization strategies.

Common Services & Platform Engineering :

- Shared Infrastructure : Build and maintain common services (notification systems, caching layers, message queues, or third-party stacks).

- Database Operations : Manage database reliability, performance, and scaling (where not handled by DB teams).

- Service Mesh & Networking : Implement and maintain service discovery, load balancing, and network policies.

- Developer Tools : Create and maintain tools and platforms that improve developer productivity and reliability.

Required Qualifications :

Technical Skills :

- Programming Languages : Proficiency in at least two of the following: Python, shell scripting, Java, Node.js, or similar.

- Cloud Platforms : Experience with AWS, GCP, or Azure.

- Containerization : Hands-on experience with Docker and with container orchestration using Kubernetes.

- Monitoring & Observability : Experience with Prometheus, Grafana, ELK stack, or similar tools.

- Infrastructure as Code : Proficiency with Ansible, Terraform, Helm, or similar.

- Version Control : Expert-level Git usage and collaborative development practices.

- CI/CD Pipelines : Hands-on experience with GitLab CI/CD, GitHub Actions, or similar.

SRE-Specific Knowledge :

- Experience defining and maintaining SLOs and SLIs.

- Understanding and implementation of error budget policies.

- Proven track record in toil reduction and automation.

- Experience with capacity planning and performance testing.

Preferred Qualifications :

- Bachelor's degree in Computer Science, Engineering, or equivalent experience.

- Experience with microservices and distributed systems.

- Knowledge of security best practices and compliance frameworks.

- Experience with chaos engineering and reliability testing.

- Prior experience in an SRE or DevOps role at a tech company.

- Contributions to open-source projects or technical communities.

Success Metrics :

- Sustained or improved service availability and reliability.

- Demonstrated reduction in manual operational work through automation.

- Effective participation in incident response and prevention.

- High-quality, well-tested code contributions.

- Strong collaboration with development teams to improve system reliability.

Team Culture & Values :

- Blameless Post-Mortems : Learn from failures without blame.

- Automation First : Prefer automated solutions over manual processes.

- Measure Everything : Data-driven decisions and continuous improvement.

- Knowledge Sharing : Document and share expertise.

- Work-Life Balance : Sustainable on-call practices and a reasonable workload.

Growth Opportunities :

- Work on cutting-edge infrastructure and reliability challenges.

- Exposure to large-scale distributed systems and modern cloud technologies.

- Clear career path toward Senior SRE, Staff Engineer, or management roles.

- Collaboration with engineering teams across the organization.