Role : SRE Manager

- Must have managed a team of 15 or more people

- Must be hands-on with SRE practices

- 8+ years of experience.

Required Skill Sets :

- Excellent verbal and non-verbal communication skills.

- Experience with / knowledge of infrastructure monitoring and support (observability).

- Experience with / knowledge of application support.

- Must have worked with large enterprise customers/applications.

- Must understand the support process

- Experience working with large enterprise applications and awareness of the L1 support process for such applications

- Experience with Kubernetes administration

- This is from the support perspective.

- Candidate should be able to perform basic operations on k8s while supporting the application

- Candidates must know core k8s concepts and have hands-on experience with basic k8s administration.

- Experience working with event-driven applications.

- Understanding of Kafka.

- Understanding of Redis.

- Understanding of MongoDB.

- Should understand these technologies from the perspective of supporting the application

- Should understand how queues work and know the basic building blocks of an event-driven application.

- Exposure to one of the cloud platforms (AWS/Google Cloud/Azure)

- GCP is required; others are good to have.

- MUST have excellent troubleshooting and problem-solving skills.

- Experience in Linux System Administration.

- Experience in Bash / Python Scripting.

- Should be able to run and update existing automations

- Experience with DevOps tools (Jenkins, SumoLogic, GitHub, Opsgenie, Box, Dropbox, Cisco Spark, Rancher)

- Experience with analysis and visualization tools (Grafana/Prometheus/ELK); must understand observability concepts and have applied them in practice.

- Experience creating dashboards and alerts using the above observability tools.

- MUST be available for regular weekly support (24x7 environment).

- This is an L1-only position, and the engineer will work in shifts to support the application

- Good understanding of networking concepts and network commands

- This is from an L1 perspective: troubleshooting issues using the existing observability tools and runbooks.

Roles & Responsibilities :

- Monitor critical and non-critical applications

- Acknowledge, triage, and troubleshoot in-scope alerts while adhering to the set SLA.

- Provide on-call support for mission-critical issues: investigate, troubleshoot, and drive them towards resolution.

- Follow escalation procedures as per the runbook (RB) and escalate alerts.

- Ensure web-scale systems are highly available and fault-tolerant.

- Improve the performance of microservices and solve scaling/performance issues.

- Capacity management and planning.

- Strong interpersonal communication skills (including listening, speaking, and writing).

- Ability to work well in a diverse, team-focused environment with other SREs & developers.

- Knowledge-base engineering (developing/updating runbooks) is preferred.

- Patch deployment as per schedule.

- Backup and cleanup of file storage on servers on a scheduled basis.

- Schedule jobs (manual/automated).

- Jenkins automation and maintenance on a scheduled basis.

- Investigate monitoring alerts, logs, and Grafana patterns; take a proactive approach to addressing false positives; forecast potential threats through data analysis; and submit bugs/enhancement requests to the development team on demand.

- Create reports and automation scripts as requested by various teams, and maintain and update existing reports and automation scripts.

- Communicate with various other departments on day-to-day operations and needs using internal tools and emails.

- Must be creative and propose solutions for identified gaps

- Responsible for capacity planning, shift management and people management.