We are looking for someone who can build Observability systems that engineers love to work with.

In this role, you will play a key part in shaping the future of our platform by developing tooling and providing hands-on technical expertise to design, deploy, and optimize our services in a compliant and cost-effective way in the cloud.

The ideal candidate will have a programming background in a cloud environment, a strong understanding of cloud automation, Observability, and security best practices, as well as the ability to collaborate effectively with cross-functional teams.

Roles & Responsibilities :

- Develop and analyze various business and technical scenarios to drive the highest levels of executive decision-making around Observability resources.

- Drive consensus and decisions with stakeholders.

- Develop and implement automation to provision, configure, deploy, and monitor Observability services.

- Create reusable integrations for third-party tools (e.g., CI/CD systems, monitoring platforms, container registries, and many more) to consolidate workflows.

- Communicate risks and progress in a timely manner to your reporting supervisor.

- Ensure efficient resource utilization and continuously improve processes by leveraging automation and internal tools, resulting in enhanced product delivery, maturity, and scalability.

- Support delivered features by debugging production issues, writing root cause analyses (RCAs), and working toward short-term and long-term fixes.

- On-Call Rotation: Participate in an on-call rotation to provide 24/7 support for critical systems.

Required Experience/Skills :

- Professional degree in Computer Science from a reputable college, with a consistent academic record.

- 4-6 years of professional experience in DevOps or software engineering roles, with a focus on configuring, deploying, and maintaining Kubernetes in AWS.

- Strong proficiency in infrastructure as code (IaC) using Terraform, AWS CloudFormation, or similar tools.

- Experience with scripting and automation using languages such as Python.

- Experience with CI/CD pipelines and automation tools such as Concourse, Jenkins, or Ansible.

- Experience on teams that have delivered observability and telemetry tools and practices, such as Prometheus, Grafana, ELK stack, distributed tracing, and performance monitoring.

- Experience with cloud-native tools such as Istio, Argo CD, External Secrets Operator, KEDA, and Karpenter.

- Understanding of SRE principles, including monitoring, alerting, error budgets, fault analysis, and automation.

- Familiarity with the concepts of SLIs (Service Level Indicators), SLOs (Service Level Objectives), and SLAs (Service Level Agreements), and the ability to define SLIs, SLOs, and error budgets.

- Excellent problem-solving skills and attention to detail.

About Our Culture :

Smarsh hires lifelong learners with a passion for innovating with purpose, humility and humor.

Collaboration is at the heart of everything we do.

We work closely with the most popular communications platforms and the world's leading cloud infrastructure platforms.

We use the latest in AI/ML technology to help our customers break new ground at scale.

We are a global organization that values diversity, and we believe that providing opportunities for everyone to be their authentic self is key to our success.