Posted on: 16/12/2025
Description :
About the role :
ThoughtSpot is seeking an experienced Staff Engineer to lead the architecture and evolution of our Cloud Infrastructure and Observability control plane.
You will lead the design of a multi-cloud control plane (AWS, GCP, Azure) that powers our Business Intelligence (BI) application, ensuring it is resilient, cost-efficient, and deeply observable.
This role is ideal for a distributed systems expert who wants to solve complex challenges like Multi-Cloud Disaster Recovery, AI-Driven Operations, and FinOps-as-Code, while enabling engineering velocity through self-service platforms.
What you will do :
- Architect the Next-Gen Observability Stack.
- Build the "Single Pane of Glass": Design and operationalise a cutting-edge observability pipeline (Logs, Metrics, Traces) using Prometheus, ELK/EFK, Kafka, and OpenTelemetry.
- AI-Powered Operations: Lead the development of a customer-facing Operations Portal that incorporates AI agents and analytics to provide real-time health insights, automated root cause analysis, and QoS visibility to our customers.
- No-Touch Operations: Drive the platform toward "no-touch/low-touch" operations by implementing self-healing mechanisms and symptom-based alerting.
- Control Plane Engineering: Architect scalable microservices that orchestrate tenancy, feature flags, and configuration across AWS, GCP, and Azure.
- Multi-Cloud & Hybrid Cloud Strategy.
- Drive the architecture and implementation of multi-cloud disaster recovery (DR) frameworks for both multi-tenant and single-tenant SaaS offerings.
- Create SDLC frameworks that allow for seamless deployment across multiple clouds without requiring redundant testing.
- Develop an app modernisation framework to migrate applications from legacy infrastructure to modern Kubernetes-based platforms.
- Automation & Infrastructure as a Service.
- Implement Infrastructure-as-Code (IaC) solutions using tools such as Terraform, Ansible, and CloudFormation to automate provisioning and deployments.
- Provide automation and tools for both customer workflows and internal software development lifecycle (SDLC) processes.
- Integrate open-source technologies and custom-developed modules to build a state-of-the-art infrastructure stack.
- Customer-Obsessed Engineering.
- Ensure our observability watches the customer experience, not just the servers.
- You will instrument key user journeys (Login, Search, Checkout) to detect customer pain before customers file a ticket.
- Leadership & Collaboration.
- Provide technical leadership to a team of developers by conducting architecture and code reviews and sharing best practices in cloud-native software development.
- Lead cross-functional collaborations to ensure infrastructure is built for scalability, performance, and security.
- Mentor and develop team members, driving a culture of technical excellence.
What you'll bring :
- Experience: 10+ years of engineering experience, with at least 3 years in a Staff/Principal role scaling enterprise SaaS platforms.
- Cloud Native Mastery: Deep hands-on expertise with Kubernetes and Docker.
- You have built and operated large-scale infrastructure on AWS, GCP, or Azure.
- Coding Proficiency: Expert-level skills in Go (preferred), Java, or Python.
- You can write production-grade microservices and K8s operators.
- Observability Deep Dive: You understand the internals of monitoring frameworks.
- You have scaled Prometheus federation, tuned Elasticsearch/Kafka for massive log ingestion, and implemented distributed tracing.
- IaC Expert: You treat infrastructure as software.
- Advanced proficiency with Terraform and Ansible is required.
- Distributed Systems Knowledge: You have a strong grasp of CAP theorem, consensus algorithms (Raft/Paxos), distributed storage, and networking fundamentals.
- Strategic Thinking: Experience building "Single Pane of Glass" solutions and managing the trade-offs between speed, cost, and reliability in a multi-cloud environment.
Posted in
DevOps / SRE
Functional Area
DevOps / Cloud
Job Code
1590432