We are seeking a Site Reliability Engineer (SRE) with a strong focus on customer-facing technical support. In this role, you will be the primary point of contact for our enterprise SaaS customers, addressing and resolving technical issues to ensure optimal system performance and user satisfaction. Your responsibilities will encompass managing incoming support tickets, providing timely solutions, and maintaining high system uptime and application availability.

This position requires a deep understanding of systems engineering principles, extensive Linux system administration expertise, and the ability to monitor and manage large-scale cloud clusters. Your technical acumen, combined with excellent communication skills, will be crucial in delivering a superior support experience and contributing to the reliability and efficiency of our SaaS platform.

Responsibilities:

- Serve as the first line of support for customer-reported technical issues related to our SaaS platform, including data connectivity issues, report errors, performance concerns, access problems, data inconsistencies, software bugs, and integration challenges.

- Understand and empathise with the challenges ThoughtSpot users face, offering tailored solutions to improve their user experience.

- Provide prompt and accurate updates, meet SLAs, and resolve customer issues in a timely manner via tickets and calls.

- Create knowledge-base articles to document knowledge and help customers with self-service.

- Maintain, monitor, and troubleshoot ThoughtSpot cloud infrastructure.

- Monitor system health and performance through metrics, logs, and dashboards, using tools such as Prometheus and Grafana to detect and prevent issues early.

- Work with Engineering teams to define and implement tools to enhance debuggability, supportability, availability, scalability, and performance.

- Be an expert in cloud and on-premise infrastructure by developing automation and best practices.

- Participate in the on-call rotation for critical SRE systems, and lead incident reviews and root cause analysis.

Requirements:

- B.S. degree in Computer Science or relevant industry experience.

- Exceptional communication skills, both written and verbal, to effectively engage with cross-functional teams, customers, and stakeholders.

- Relevant work experience in troubleshooting complex Linux systems and managing distributed systems.

- Experience in virtualisation and cloud technologies.

- Experience in enterprise customer support, participating in on-call rotations for critical SRE systems, and leading incident reviews and root cause analysis.

- Ability to diagnose technical problems and work with Engineering on escalated issues.

- Strong problem-solving skills, algorithmic thinking and a strong foundation in how systems should work.

- Understanding of the tools and frameworks required to operate and manage cloud infrastructure.

- Strong customer service skills.

- Solid communication skills and ability to work independently.

- Ability to leverage automation, monitoring and data analysis to ensure high availability.

- Familiarity with scripting languages such as Python, JavaScript or Bash.

- Exposure to infrastructure and service monitoring tools.

Ideal Candidate Profile:

- You thrive in dynamic, customer-facing environments and are passionate about ensuring system reliability and customer satisfaction.

- You combine technical expertise in cloud operations with a proven record of handling support incidents and end-user queries, which sets you apart from candidates with purely systems or cloud engineering backgrounds.