Posted on: 11/11/2025
Description:
Role Overview:
We are seeking an experienced L2 TechOps Engineer to manage and support large-scale production systems across Linux, Big Data, and containerized environments. The role requires a strong foundation in Linux administration, SQL optimization, Big Data tools (Hive, Spark), and OpenShift (OCP). The ideal candidate will be adept at monitoring, troubleshooting, automation, and incident response, ensuring the stability, scalability, and resilience of mission-critical systems.
Key Responsibilities
1. Production Environment Management:
- Manage, monitor, and maintain production systems to ensure optimal performance, uptime, and reliability.
- Perform system performance tuning, capacity planning, and load management to prevent bottlenecks.
- Ensure system compliance with security and governance policies across all environments.
2. Monitoring & Observability:
- Design and maintain monitoring dashboards in tools such as Grafana, Kibana, and Prometheus.
- Set up alerting frameworks to proactively detect performance degradations or failures.
- Perform root cause analysis (RCA) using metrics, logs, and distributed tracing tools.
3. Troubleshooting & Incident Management:
- Analyze and debug complex production issues using Airflow logs, Spark UI, Hive performance metrics, and system logs.
- Lead incident triage, drive restoration efforts, and document post-incident analysis reports.
- Collaborate with application, data, and infrastructure teams for quick resolution of issues.
4. Automation & Optimization:
- Develop automation scripts using Shell scripting, Python, or Ansible to reduce manual intervention and operational overhead.
- Automate repetitive tasks such as system checks, deployment verifications, and data validations.
- Contribute to CI/CD pipeline improvements to enhance deployment reliability.
5. Container & Cloud Platform Operations:
- Monitor, troubleshoot, and maintain OpenShift (OCP) clusters, including pods, nodes, and services.
- Collaborate with DevOps and platform teams to ensure smooth application deployments.
6. Data Platform Operations:
- Support and optimize Big Data workloads involving Hive, Spark, and related data frameworks.
- Write, tune, and debug SQL queries to analyze large datasets and identify data inconsistencies.
- Work with data engineering teams to maintain healthy data pipelines and job schedules.
7. Reliability & Process Excellence:
- Participate in on-call rotations, incident management, and 24/7 support coverage as required.
- Establish and document SOPs (Standard Operating Procedures) for key operational workflows.
- Contribute to continuous improvement initiatives in monitoring, automation, and fault tolerance.
Required Skills & Experience:
- 4+ years of experience in Production Support, DevOps, or IT Operations roles.
- Strong hands-on expertise in Linux server management and troubleshooting.
- Proficiency in SQL and experience with Hive, Spark, or other Big Data ecosystems.
- Experience managing OpenShift (OCP) or Kubernetes-based environments.
- Strong understanding of monitoring and logging tools (Grafana, Kibana, Prometheus, ELK Stack).
- Experience with Airflow for workflow orchestration and debugging.
- Scripting experience in Shell, Python, or equivalent automation frameworks.
- Solid understanding of incident, change, and problem management practices (ITIL or similar).
- Excellent analytical, communication, and collaboration skills.
Posted in: DevOps / SRE
Functional Area: DevOps / Cloud
Job Code: 1572115