Posted on: 10/04/2026
Description :
We are looking for a highly skilled Site Reliability Engineer to manage and scale our on-premises payments infrastructure.
You will work in a hybrid environment spanning virtual machines and containerized workloads on bare metal, ensuring high availability, security, and performance for mission-critical systems.
Key Responsibilities :
- Operate and optimize virtualized environments (VMs) and containerized workloads (Docker on bare metal)
- Manage and scale middleware systems such as :
1. Nginx (traffic routing, reverse proxy, load balancing)
2. Redis (caching, HA setup)
3. Kafka (streaming, partitioning, fault tolerance)
- Build and maintain CI/CD pipelines using Jenkins
- Manage infrastructure and application configurations using Git-based version control
- Ensure high availability, resilience, and performance tuning across systems
- Perform Linux system administration (RHEL/CentOS/Ubuntu)
- Implement and maintain automation frameworks using :
1. Ansible
2. Shell scripting
- Manage and troubleshoot networking components :
1. TCP/IP, DNS, load balancing
2. Firewalls, WAF policies
3. Akamai
- Handle security and compliance requirements
- Maintain accurate inventory and asset management systems
- Participate in incident response, RCA, and system reliability improvements
- Collaborate with application, security, and DevOps teams
Required Skills & Qualifications :
Core Infrastructure :
- Strong hands-on experience with Linux system administration
- Experience managing on-prem data center environments
- Solid understanding of :
1. Virtualization (VMware / KVM or similar)
2. Bare metal provisioning
Containers & Middleware :
- Experience running Docker in production (non-Kubernetes setups preferred)
- Strong operational knowledge of :
1. Nginx
2. Redis
3. Kafka
4. RDBMS
5. Java
Observability, Alerting & Reliability :
- Design and manage observability platforms :
1. Elastic Stack (ELK)
2. Grafana / Prometheus stack
- Build and maintain :
1. Metrics, logs, and tracing pipelines
2. Dashboards for system health and business KPIs
- Develop intelligent alerting strategies :
1. Reduce noise (alert fatigue)
2. Improve signal quality
- Build correlation mechanisms / alert aggregation systems to :
1. Reduce MTTD (Mean Time to Detect)
2. Reduce MTTR (Mean Time to Recover)
- Drive proactive monitoring and anomaly detection
- Lead incident response, debugging, and RCA with data-driven insights
CI/CD & Version Control :
- Hands-on experience with :
1. Git (branching strategies, code reviews, infra-as-code workflows)
2. Jenkins (pipeline creation, build automation, deployment orchestration)
Networking & Security :
- Good understanding of :
1. Networking fundamentals (L3/L4 concepts)
2. Firewalls and WAF (rule tuning, debugging)
- Experience handling secure production environments
Automation :
- Hands-on experience with :
1. Ansible
2. Shell scripting (bash)
Operations :
- Experience with monitoring, alerting, and logging systems
Posted in : DevOps / SRE
Functional Area : Site Reliability Engineering
Job Code : 1627494