Posted on: 11/02/2026
Description :
- Collaborate with engineering and product teams to ensure reliable and scalable system design
- Lead technical discussions and propose implementation strategies for reliability improvements
- Participate in incident response and on-call rotation
- Take ownership of minor production incidents and contribute to post-incident reviews
- Perform infrastructure-level application support :
a. Connectivity troubleshooting (port checks, firewall rules, VLAN checks)
b. Load balancer troubleshooting
c. Certificate management (renewals, CA creation, certificate deployment)
- Develop automation scripts using Python, Bash, or Perl
- Build multi-threaded automation scripts for scheduling and orchestration of applications
- API management: create and invoke APIs, implement health checks
- Identify operational toil and eliminate it through automation
- Contribute to Disaster Recovery (DR) and resiliency testing initiatives
- Support migration of applications to Google Cloud Platform (GCP)
- Provision and deprovision GCE instances and GKE clusters
- Build and manage :
a. Dockerized environments
b. Jenkins pipelines for CI/CD deployments
c. Ansible playbooks for parallel automation workflows
- Mentor L1 and L2 SRE team members
- Contribute reliability improvement ideas to product backlogs
Required Skills & Experience :
- Strong experience with Linux-based systems
- Hands-on experience with GCP (GCE, GKE) or other cloud platforms
- Strong scripting/programming skills (Python, Bash; multi-threading knowledge preferred)
- Good understanding of :
a. Application architectures
b. Messaging protocols
c. Distributed systems concepts
- Knowledge of networking fundamentals :
a. TCP/UDP/IP
b. HTTP/HTTPS
c. Load balancing
- Experience with CI/CD tools like Jenkins
- Hands-on experience with :
a. Docker
b. Kubernetes
c. Ansible
- Strong troubleshooting and analytical skills
- Experience handling production incidents
- Excellent communication and stakeholder collaboration skills
Good to Have :
- Experience with monitoring and observability tools :
a. OpenTelemetry
b. Splunk
c. Prometheus
d. Grafana
- Experience in high-availability or low-latency systems
- Exposure to financial/trading systems
- Experience working in Agile environments
Posted in
DevOps / SRE
Functional Area
Site Reliability Engineering
Job Code
1611694