Posted on: 29/09/2025
Job Description:
Responsibilities:
- Ensure the smooth operation of production systems through proactive monitoring and maintenance.
- Respond promptly to system alerts and incidents, minimizing downtime and impact.
- Implement and manage observability tools such as Prometheus, Grafana, ELK (Elasticsearch, Logstash, Kibana), and
- Develop and maintain dashboards and alerts to track key performance indicators (KPIs).
- Analyze monitoring data to identify trends, potential issues, and areas for optimization.
- Manage and analyze system logs to troubleshoot errors and identify root causes.
- Implement log aggregation and analysis solutions to improve error detection and resolution.
- Develop and maintain error handling procedures and documentation.
- Develop and maintain automation scripts using Python, Bash, or other scripting languages to streamline operations.
- Automate routine tasks, deployments, and infrastructure management to improve efficiency and reduce manual effort.
- Implement Infrastructure as Code (IaC) principles to manage infrastructure configurations.
- Investigate and resolve production incidents, performing root cause analysis and implementing effective solutions.
- Collaborate with development and operations teams to resolve complex issues and ensure timely resolution.
- Enhance and maintain UI automation and testing frameworks to improve testing efficiency and coverage.
- Participate in deployment processes, ensuring smooth and reliable releases.
- Implement and maintain CI/CD pipelines to automate software delivery.
- Identify and address performance bottlenecks in production systems.
- Implement performance tuning and optimization strategies to improve system efficiency and responsiveness.
- Monitor and analyze system performance metrics to identify areas for improvement.
- Create and maintain comprehensive documentation for system configurations, procedures, and troubleshooting guides.
- Share knowledge and best practices with team members through training sessions and knowledge base articles.
- Contribute to the development and improvement of internal tools and processes.
Requirements:
Essential:
- 2+ years of experience in Production Engineering, DevOps, or Testing Automation.
- Strong scripting skills in Python, Bash, or similar languages.
- Hands-on experience with observability tools like Prometheus, Grafana, ELK, or Datadog.
- Experience with log management and error handling.
- Familiarity with UI automation and testing frameworks.
- Strong problem-solving and analytical skills.
- Ability to work independently and as part of a team.
- Good communication and interpersonal skills.
Posted in: DevOps / SRE
Functional Area: DevOps / Cloud
Job Code: 1554145