Posted on: 30/07/2025
Job Description:
Responsibilities:
- Develop and maintain real-time monitoring, alerting, and logging systems to detect and resolve issues before they impact customers.
- Automate manual operations, including application deployment, configuration, scaling, and recovery.
- Collaborate with software engineering teams to integrate reliability best practices into the development lifecycle.
- Conduct root cause analysis (RCA) and implement preventive measures to mitigate recurring issues.
- Support a 24/7 distributed enterprise environment across multiple global data centers.
- Work closely with the Support team to enhance incident response processes, ensuring fast and effective resolution of technical escalations.
- Participate in on-call rotations to support critical application issues and outages.
- Maintain and optimize CI/CD pipelines to ensure fast and reliable application releases.
- Enhance system security by managing SSL certificates, encryption, and authentication mechanisms.
- Foster a culture of continuous improvement by evaluating new tools, frameworks, and methodologies to enhance system reliability.
Requirements:
- 4+ years of experience in a similar role focusing on application reliability, automation, and performance optimization.
- Strong expertise in Linux and Windows system administration.
- Proficiency in at least one scripting language (e.g., Python, Shell, Perl, JavaScript).
- Experience with containerization and orchestration technologies such as Docker and Kubernetes.
- Familiarity with CI/CD tools like Jenkins and deployment automation frameworks.
- Hands-on experience with monitoring and observability tools (e.g., Prometheus, Grafana, ELK Stack, New Relic, Datadog).
- Understanding of networking concepts (TCP/IP, DNS, load balancing, firewalls).
- Experience with configuration management tools like Ansible, Salt, or Puppet.
- Strong debugging and troubleshooting skills across application, database, and infrastructure layers.
- Ability to work in a fast-paced, high-pressure environment with multiple priorities.
- Excellent communication and collaboration skills to work effectively with engineering and support teams.
Posted in: DevOps / SRE
Functional Area: Site Reliability Engineering
Job Code: 1521999