Posted on: 24/08/2025
Roles & Responsibilities :
- Maintain and monitor the availability of cloud infrastructure; troubleshoot, identify, and resolve production-level infrastructure issues.
- Develop and maintain automation for provisioning, configuration management, and deployment using Infrastructure as Code (IaC) tools.
- Establish and maintain monitoring and alerting systems for the detection and response to incidents.
- Demonstrate strong customer focus.
- Collaborate with internal teams and customers during incidents: explain the issue, recommend immediate mitigations, and provide long-term solutions.
- Investigate customer escalations and work closely with the engineering, support, and sales teams to implement a solution.
- Perform a postmortem analysis of system failures and implement corrective measures as necessary.
- Participate in the rotational on-call schedule and be available in emergencies as needed.
- A demonstrated track record of optimising cloud infrastructure costs. Monitor and control the use of cloud resources, implement cost-saving measures, and provide recommendations for optimising cloud costs.
- Experience implementing security best practices and compliance measures in production environments.
- Experience with security audits, vulnerability assessments, and the implementation of security controls to protect sensitive data and ensure regulatory compliance.
Candidate Profile :
- Experience designing, architecting, and running large-scale cloud infrastructure.
- Experience working with reverse proxy, web server, load balancing, and CDN services.
- Familiarity with security best practices and compliance frameworks such as PCI DSS.
- Strong interpersonal and communication skills (including oral, written, and listening skills).
- Experience with stress testing and tuning production systems using tools such as k6 and Locust.
- Experience in using AWS Cost Explorer, AWS Budgets, and AWS Cost and Usage Reports and optimising costs to ensure efficient resource use.
Technical skills :
- Experience with scripting languages such as Python and Bash.
- Experience managing reverse proxies/web servers on a large-scale production level.
- Experience with Infrastructure as Code tools such as Terraform/CloudFormation.
- Experience working with Kafka, Elasticsearch, and RabbitMQ.
- Experience with observability tools such as Prometheus, Grafana, and Loki.
Posted in: DevOps / SRE
Functional Area: Site Reliability Engineering
Job Code: 1534650