Capco - Site Reliability Engineer - Monitoring Tools

Capco Technologies Pvt Ltd
Multiple Locations
6 - 12 Years

Posted on: 14/07/2025

Job Description

Job Overview :

We are seeking a highly skilled and proactive Site Reliability Engineer (SRE) to join our team. You will play a critical role in ensuring the reliability, scalability, and performance of our production systems and applications. This position requires a strong blend of software engineering expertise, operational acumen, and a deep commitment to automation and continuous improvement. You will contribute to our mission of providing a highly available and efficient platform for our users.


Responsibilities :


- Design and implement highly available and scalable systems, ensuring the reliability and performance of the company's website or application.

- Collaborate with cross-functional teams to define and establish Service Level Objectives (SLOs) and Service Level Agreements (SLAs) for critical systems.

- Monitor systems and applications proactively, identifying and resolving any performance bottlenecks or availability issues before they impact users.

- Develop and maintain robust monitoring tools, alerts, and dashboards to provide comprehensive visibility into system health and performance.

- Conduct post-incident analyses to identify root causes and implement preventive measures to avoid future incidents, fostering a culture of learning.

- Automate repetitive tasks and processes to improve operational efficiency, reduce manual intervention, and increase system reliability.

- Create and maintain clear, comprehensive documentation for system architecture, configurations, and troubleshooting procedures.

- Perform capacity planning and resource allocation to ensure optimal system performance and scalability for future growth.

- Collaborate with development teams to implement and deploy new features and enhancements, ensuring they meet reliability, performance, and operational standards.

- Stay up to date with industry best practices, new technologies, and emerging trends in Site Reliability Engineering (SRE).

- Provide primary operational support and engineering for multiple large-scale distributed software applications.


Objectives of this Role :


- Run the production environment by monitoring availability and taking a holistic view of system health.

- Build software and systems to manage platform infrastructure and applications.


- Improve reliability, quality, and time-to-market of our suite of software solutions.

- Measure and optimize system performance, with an eye toward pushing our capabilities forward, getting ahead of customer needs, and innovating for continual improvement.


Primary Skills Required :


- Strong knowledge of Linux/Unix systems and command-line tools.

- Proficiency in scripting languages such as Python, Shell, or Perl.

- Experience with configuration management tools like Ansible, Puppet, or Chef.

- Familiarity with cloud platforms like AWS, Azure, or Google Cloud.

- Understanding of networking principles and protocols (TCP/IP, HTTP, DNS, etc.).

- Knowledge of containerization and orchestration technologies (Docker, Kubernetes).

- Expertise in monitoring and logging tools such as Prometheus, Grafana, ELK stack, or Splunk (optional, but good to know).

- Experience with Citrix technologies such as XenApp, XenDesktop, and NetScaler.

- Ability to support the administration and engineering of the Citrix environment.

- Experience working with Citrix Provisioning Server, SQL Database, and Citrix License Server.

- Hands-on knowledge of virtualization technologies such as VMware or Hyper-V.

- Strong problem-solving and troubleshooting skills, with the ability to analyze and resolve complex technical issues.

- Basic Terraform syntax and GitLab CI/CD configuration (pipelines and jobs).

- Provisioning and configuration of cloud resources through CLI/API.

- Ability to run basic queries in log tools to answer general questions.

- Operating system (Linux) configuration, package management, startup, and troubleshooting.

- Block and object storage configuration.

- Networking: VPCs, proxies, and CDNs.


Secondary Skills Required :


- Bachelor's degree in Computer Science, Engineering, or a related field.

- Proven experience as a Site Reliability Engineer or a similar role.

- Solid understanding of software development methodologies and DevOps principles.

- Experience with agile and iterative development processes.

- Certification in relevant technologies or frameworks is a plus (e.g., AWS Certified DevOps Engineer, Certified Kubernetes Administrator).

- Familiarity with continuous integration/continuous deployment (CI/CD) pipelines.

- Experience with source control systems such as Git or SVN.

- Knowledge of security best practices and experience implementing security measures in a production environment.

- Ability to work independently and handle multiple projects and priorities simultaneously.

- Strong analytical and problem-solving skills, with a focus on continuous improvement and automation.

- Excellent communication and collaboration skills to work effectively with cross-functional teams.

- Strong attention to detail and ability to work in a fast-paced, dynamic environment.

