Description:

About the role:

ThoughtSpot is seeking an experienced Staff Engineer to lead the architecture and evolution of our Cloud Infrastructure and Observability control plane.

You will lead the design of a multi-cloud control plane (AWS, GCP, Azure) that powers our Business Intelligence (BI) application, ensuring it is resilient, cost-efficient, and deeply observable.

This role is ideal for a distributed systems expert who wants to solve complex challenges like Multi-Cloud Disaster Recovery, AI-Driven Operations, and FinOps-as-Code, while enabling engineering velocity through self-service platforms.

What you will do:

- Architect the Next-Gen Observability Stack.

- Build the "Single Pane of Glass": Design and operationalise a cutting-edge observability pipeline (Logs, Metrics, Traces) using Prometheus, ELK/EFK, Kafka, and OpenTelemetry.

- AI-Powered Operations: Lead the development of a customer-facing Operations Portal that incorporates AI agents and analytics to provide real-time health insights, automated root cause analysis, and QoS visibility to our customers.

- No-Touch Operations: Drive the platform toward "no-touch/low-touch" operations by implementing self-healing mechanisms and symptom-based alerting.

- Control Plane Engineering: Architect scalable microservices that orchestrate tenancy, feature flags, and configuration across AWS, GCP, and Azure.

- Multi-Cloud & Hybrid Cloud Strategy.

- Drive the architecture and implementation of multi-cloud disaster recovery (DR) frameworks for both multi-tenant and single-tenant SaaS offerings.

- Create SDLC frameworks that allow for seamless deployment across multiple clouds without requiring redundant testing.

- Develop an app modernisation framework to migrate applications from legacy infrastructure to modern Kubernetes-based platforms.

- Automation & Infrastructure as a Service.

- Implement Infrastructure-as-Code (IaC) solutions using tools such as Terraform, Ansible, and CloudFormation to automate provisioning and deployments.

- Provide automation and tools for both customer workflows and internal software development lifecycle (SDLC) processes.

- Integrate open-source technologies and custom-developed modules to build a state-of-the-art infrastructure stack.

- Customer-Obsessed Engineering.

- Ensure our observability tracks the customer experience, not just server health.

- You will instrument key user journeys (Login, Search, Checkout) to detect customer pain before customers file a ticket.

- Leadership & Collaboration.

- Provide technical leadership to a team of developers by conducting architecture and code reviews and sharing best practices in cloud-native software development.

- Lead cross-functional collaborations to ensure infrastructure is built for scalability, performance, and security.

- Mentor and develop team members, driving a culture of technical excellence.

What you'll bring:

- Experience: 10+ years of engineering experience, with at least 3 years in a Staff/Principal role scaling enterprise SaaS platforms.

- Cloud Native Mastery: Deep hands-on expertise with Kubernetes and Docker.

- You have built and operated large-scale infrastructure on AWS, GCP, or Azure.

- Coding Proficiency: Expert-level skills in Go (Golang, preferred), Java, or Python.

- You can write production-grade microservices and K8s operators.

- Observability Deep Dive: You understand the internals of monitoring frameworks.

- You have scaled Prometheus federation, tuned Elasticsearch/Kafka for massive log ingestion, and implemented distributed tracing.

- IaC Expert: You treat infrastructure as software.

- Advanced proficiency with Terraform and Ansible is required.

- Distributed Systems Knowledge: You have a strong grasp of the CAP theorem, consensus algorithms (Raft/Paxos), distributed storage, and networking fundamentals.

- Strategic Thinking: Experience building "Single Pane of Glass" solutions and managing the trade-offs between speed, cost, and reliability in a multi-cloud environment.