Posted on: 18/08/2025
Role: SRE Manager
- Must have managed a team of 15 or more people.
- Must be hands-on with SRE practices.
- 8+ years of experience.
Required Skill Sets:
- Excellent Verbal & Non-verbal communication.
- Experience with or knowledge of infrastructure monitoring and support (observability).
- Experience with or knowledge of application support.
- Must have worked with large enterprise customers/applications.
- Understand the support process
- Experience working with large enterprise applications and awareness of the L1 support process for enterprise and other large applications.
- Experience with Kubernetes administration
- This is from the support perspective.
- Candidates should be able to perform basic operations on k8s while supporting the application.
- Candidates must be aware of k8s concepts and have hands-on experience managing basic k8s operations.
- Experience working with event-driven applications.
- Understanding of Kafka.
- Understanding of Redis.
- Understanding of MongoDB.
- Should understand from the perspective of supporting the application
- Should be aware of how queues work and know about basic building blocks of the event driven application.
- Exposure to one of the cloud technologies (Amazon/Google Cloud/Azure)
- GCP is required; the others are good to have.
- MUST have excellent troubleshooting and problem-solving skills.
- Experience in Linux System Administration.
- Experience in Bash / Python Scripting.
- Should be able to run and update existing automations.
- Experience with DevOps tools (Jenkins, SumoLogic, GitHub, Opsgenie, Box, DropBox, Cisco Spark, Rancher)
- Experience with analysis and visualization tools (Grafana/Prometheus/ELK); must understand observability concepts and have applied them in practice.
- Experience creating dashboards and alerts using the above observability tools.
- MUST be available for regular weekly support (24-7 environment).
- This is an L1-only position, and the engineer will work in shifts to support the application.
- Good understanding of networking concepts and network commands.
- This is from an L1 perspective: troubleshooting issues using the existing observability tooling and the runbooks.
Roles & Responsibilities:
- Monitoring Critical & Non-Critical applications
- Acknowledge, Triage & troubleshoot alerts within scope adhering to the set SLA.
- Provide on-call support for mission-critical issues: investigate, troubleshoot, and drive towards resolution.
- Follow escalation procedures as per the runbook (RB) and escalate alerts.
- Ensuring web-scale systems are highly available & fault-tolerant.
- Improve the performance of micro-services and solve scaling/performance issues.
- Capacity management and planning.
- Strong interpersonal communication skills (including listening, speaking, and writing).
- Ability to work well in a diverse, team-focused environment with other SREs & developers.
- Knowledge base engineering: developing and updating runbooks (preferred).
- Patch deployment as per schedule.
- Backup and cleanup of file storage on servers on a scheduled basis.
- Scheduling jobs, both manual and automated.
- Jenkins automation and maintenance on a scheduled basis.
- Investigate monitoring alerts, logs, and Grafana patterns; take a proactive approach to addressing false positives; forecast potential threats through data analysis; submit bugs and enhancement requests to the development team on demand.
- Create reports and automation scripts as requested by various teams, and maintain and update existing reports and automation scripts.
- Communicate with various other departments on day-to-day operations and needs using internal tools and emails.
- Need to be very creative and propose solutions for identified gaps.
- Responsible for capacity planning, shift management and people management.
Posted in: DevOps / SRE
Functional Area: Site Reliability Engineering
Job Code: 1530968