
Job Description

Role : Site Reliability Engineer

About the Role :


We are seeking an experienced and passionate Senior Site Reliability Engineer (SRE) to join our team. In this role, you will work on improving the availability, reliability, and scalability of our services, systems, and infrastructure. You will collaborate closely with development, operations, and security teams to ensure that our systems are robust, performant, and maintainable. As a Senior SRE, you will also play a key role in mentoring junior engineers and driving best practices across the organization.

Key Responsibilities :

- Infrastructure & Automation : Design, deploy, and maintain highly reliable and scalable systems and infrastructure. Automate routine tasks and workflows with scripts written in languages such as Python, PowerShell, or Go to improve operational efficiency.

- Monitoring & Incident Management : Build and manage monitoring systems, identify key metrics, and respond to incidents in a timely manner. Lead post-mortem analysis to prevent future incidents and improve system reliability.

- Performance Optimization : Analyze system performance and implement improvements for latency, throughput, and system resource usage.

- Collaboration & Support : Work closely with development teams to ensure that application architectures are robust, scalable, and easy to monitor. Provide guidance on best practices for code deployment and maintenance.

- Capacity Planning : Monitor and forecast infrastructure usage and capacity to ensure systems can handle future demand. Recommend and implement changes to optimize resource allocation.

- Disaster Recovery & Business Continuity : Develop and implement disaster recovery and business continuity plans to ensure that critical services remain available in the event of failures.

- Security & Compliance : Collaborate with security teams to ensure infrastructure and applications meet security best practices and compliance requirements.

Skills and Qualifications :

- Experience : 3-6 years of experience in Site Reliability Engineering, DevOps, or a similar field, with a solid understanding of both software development and system administration.

Technical Expertise :

- Proficient with cloud platforms (AWS, GCP, Azure) and containerization technologies (Docker, Kubernetes).

- Strong experience with monitoring and alerting tools (Prometheus, Grafana, Datadog, etc.).

- Proficiency with infrastructure-as-code and configuration management tools (Terraform, Ansible, Puppet, Chef).

- Experience with CI/CD pipeline management (Jenkins, GitLab, CircleCI).

- Strong knowledge of scripting languages (Python, PowerShell, Go, etc.) for automation tasks.

- Strong understanding of networking, security, and system architecture principles.

Problem-Solving Skills : Excellent analytical and troubleshooting skills, able to diagnose complex technical issues and identify solutions quickly.

Communication : Strong verbal and written communication skills. Ability to explain complex technical concepts to both technical and non-technical stakeholders.

Team Player : Ability to work collaboratively in a cross-functional team, mentoring junior team members and contributing to team success.

Preferred Qualifications :

- Cloud certifications (e.g., AWS Certified Solutions Architect, Google Professional Cloud Architect) are a plus.

- Experience with distributed systems and large-scale infrastructure is highly desirable.

- Experience with service meshes, load balancing, and fault-tolerant architectures.

- Understanding of software development lifecycle and Agile methodologies.

