DevOps Engineer

About Us :

We build a collaborative, real-time workspace platform enabling teams to organize content, manage projects, and communicate at scale. Our platform is a cloud-native SaaS product running on AWS, serving users across multiple regions through a microservices architecture.

Our engineering team moves fast. We ship continuously to a Kubernetes-based infrastructure with a fully automated CI/CD pipeline and take infrastructure quality as seriously as product quality. We value engineers who treat infrastructure as code, own reliability end-to-end, and proactively improve the systems they work on.

About the Role :

Experience Level : Mid to Senior | Minimum: 5+ years in DevOps / Platform Engineering

We are looking for an experienced DevOps Engineer to own and evolve the infrastructure that powers our platform. You will work closely with our backend and frontend engineers to keep our systems reliable, secure, observable, and cost-efficient.

You will manage a production-grade AWS environment spanning 16+ microservices (Go, Node.js) on Kubernetes (EKS), with infrastructure provisioned entirely through Terraform and deployments managed via Helm and GitHub Actions. The role covers everything from infrastructure design and CI/CD pipelines to monitoring, incident response, and security hardening.

This is a hands-on engineering role: you will write real Terraform, maintain Helm charts, build and improve CI/CD pipelines, and debug production issues from CloudWatch logs to Kubernetes pod events.

What You'll Do :

Infrastructure as Code (Terraform) :

- Own, maintain, and evolve a large library of Terraform modules that provision the entire AWS environment across development and production accounts

- Manage EKS cluster configurations including managed node groups and spot/fleet instance node groups (cost-optimized, achieving up to 70% savings vs on-demand)

- Provision and maintain supporting infrastructure : VPC, subnets, security groups, ALB, ACM certificates, Route53 DNS, SQS queues, SES email, and EFS volumes

- Add new modules for evolving infrastructure requirements and ensure all resources are reproducible and version-controlled

- Apply Terraform changes safely across environments using Terraform workspaces and remote state backends

Kubernetes & Container Orchestration :

- Operate and maintain the AWS EKS cluster with both spot/fleet and on-demand worker node groups

- Deploy and manage 16+ microservices on Kubernetes using Helm charts (4 custom charts : generic deployments, one-time jobs, cron jobs, and ingress)

- Configure and tune Horizontal Pod Autoscalers (HPA), Pod Disruption Budgets (PDB), and Persistent Volume Claims (PVC) per service

- Manage Kubernetes ingress, service accounts, RBAC, and ConfigMaps/Secrets

- Maintain the Helm chart repository (versioning, publishing, GitHub Actions pipeline)

- Debug pod failures, resource constraints, and node scheduling issues

CI/CD Pipeline Management :

- Own multiple GitHub Actions workflows covering PR validation, auto-deployment to dev, and production releases

- Enforce a two-part release flow: (1) PR checks (build, unit tests, commit linting, manual approvals), then (2) auto-deploy to the dev environment on merge to development, with semver-tagged (vx.y.z) releases for production

- Maintain build pipelines for Go microservices (multi-stage Docker builds), Node.js services, and Helm charts

- Manage AWS ECR image repositories: pushing, tagging, and lifecycle policies

- Configure Slack notifications for deployment failures and pipeline events

- Build and improve deployment automation, reducing manual intervention in release processes
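
To make the release flow above concrete, here is a minimal sketch of the routing rule it implies: a push to the development branch goes to dev, a semver tag goes to production, and anything else deploys nowhere. It is an illustration, not our actual pipeline code. The function name `deploy_target` is hypothetical, the refs follow the `refs/heads/...` and `refs/tags/...` form that GitHub Actions exposes, and treating only plain vx.y.z tags (no pre-release suffix) as releasable is an assumption.

```python
import re
from typing import Optional

# Assumption: production tags are plain vX.Y.Z with no pre-release or build suffix.
SEMVER_TAG_REF = re.compile(r"^refs/tags/v(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)$")

# The branch name "development" comes from the release flow described above.
DEV_BRANCH_REF = "refs/heads/development"


def deploy_target(git_ref: str) -> Optional[str]:
    """Map a pushed Git ref to a deployment target, or None if nothing deploys."""
    if git_ref == DEV_BRANCH_REF:
        return "dev"
    if SEMVER_TAG_REF.match(git_ref):
        return "production"
    return None


if __name__ == "__main__":
    for ref in ("refs/heads/development", "refs/tags/v1.4.0", "refs/heads/feature/x"):
        print(ref, "->", deploy_target(ref))
```

In a real workflow the same rule would more likely be expressed through branch and tag filters on the workflow trigger than in a script. The sketch only shows which refs are meant to reach which environment.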

Monitoring & Observability :

- Operate SigNoz for APM: configure service traces, metrics dashboards, and alerts across all microservices

- Manage CloudWatch log groups per service (integrated via Fluent Bit log shipping from Kubernetes)

- Maintain Grafana dashboards for infrastructure-level metrics

- Monitor Prometheus metrics exposed by backend services

- Maintain StatusPage.io public status pages for our services

- Define alerting rules and on-call runbooks; own incident response and post-mortems

Security & Secrets Management :

- Manage AWS Secrets Manager for all service credentials (MongoDB, Wasabi, application configs)

- Administer AWS Client VPN with SSO integration for secure developer access to private infrastructure

- Maintain IAM roles, policies, and service accounts following least-privilege principles

- Manage ACM certificates and ensure TLS is enforced across all ingress endpoints

- Operate ClamAV for malware scanning of user-uploaded files

- Support the SpiceDB fine-grained authorization service and its migration tooling

- Participate in compliance reviews and apply security best practices across the AWS account

Networking & Cloud Architecture :

- Manage multi-VPC architecture : separate VPCs for dev and production environments with VPC peering for controlled cross-environment access

- Configure MongoDB Atlas PrivateLink connectivity ensuring database clusters are accessible only from within the designated VPC

- Maintain bastion host configuration for emergency database access

- Design and implement network segmentation, security group rules, and NACLs

- Manage DNS via Route53 and ALB routing rules

Collaboration with Engineering Teams :

- Partner with Go and Node.js backend engineers to containerize new services and onboard them to the deployment pipeline

- Work with frontend engineers on AWS Amplify deployments for the Nuxt.js / Vue 3 PWA

- Provide runbooks and documentation for common debugging workflows (e.g., CloudWatch log tailing, VPN access, EKS pod debugging)

- Define and enforce infrastructure standards, naming conventions, and tagging strategies across environments

Our Stack - You'll Work With These Every Day :

Cloud Platform - AWS :

- EKS (Kubernetes managed control plane)

- EC2 (managed and custom/fleet node groups)

- ECR (container image registry)

- ALB (Application Load Balancer)

- CloudWatch (logging and metrics)

- Secrets Manager

- SQS (message queues)

- SES (transactional email)

- ACM (SSL/TLS certificates)

- Route53 (DNS)

- EFS (persistent storage for Kubernetes)

- Client VPN (developer access)

- AWS SSO (identity federation)

- AWS FIS (Fault Injection Simulator, chaos engineering)

- AWS Amplify (frontend CI/CD and hosting)

Container Orchestration & Packaging :

- Kubernetes (EKS): fleet/spot + on-demand node groups

- Helm (4 custom charts : generic deployments, one-time jobs, cron jobs, ingress)

- Docker (multi-stage builds for Go and Node.js services)

- HPA, PDB, PVC, Ingress, RBAC

Infrastructure as Code :

- Terraform: modular components, multi-environment (dev + prod), remote state backend

CI/CD & Automation :

- GitHub Actions (multiple workflows)

- Semver-based release tagging (vx.y.z) for production promotions

- Slack for pipeline notifications

Monitoring & Observability :

- SigNoz (APM, distributed tracing, dashboards, alerts)

- CloudWatch (log aggregation, per-service log streams)

- Fluent Bit (Kubernetes log shipping to CloudWatch)

- Grafana (infrastructure dashboards)

- Prometheus (per-service metrics)

- StatusPage.io (public incident communication)

Data & Storage :

- MongoDB Atlas (cloud MongoDB with PrivateLink, per-environment isolation)

- Aurora PostgreSQL and MySQL (via Amazon RDS)

- Redis (ElastiCache, single-instance and cluster mode)

- Wasabi (S3-compatible object storage with HA configuration)

- EFS (Elastic File System for Kubernetes PVCs)

Security & Access :

- AWS Secrets Manager

- AWS Client VPN + AWS SSO

- IAM (service roles, least-privilege policies)

- ACM (TLS certificates)

- Security Groups and NACLs

- SpiceDB (fine-grained authorization service)

- ClamAV (antivirus scanning)

Services Architecture :

- 14 Go microservices (gRPC inter-service communication via Protocol Buffers)

- 1 Node.js service (document generation)

- gRPC (primary inter-service transport)

- REST/HTTP (client-facing APIs)

- MongoDB change streams (event-driven data sync)

- Asynq/Redis (async task queues)

Frontend Deployment :

- AWS Amplify (Nuxt.js 3 / Vue 3 PWA web application frontend)

- Node.js 22+

- GitHub Actions for Amplify CI/CD

What We're Looking For :

Minimum Experience Requirements at a Glance :

- DevOps / Platform Engineering (overall): 5+ years

- Terraform (module-level IaC): 2+ years

- Kubernetes in production: 2+ years

- AWS (EKS, ECR, CloudWatch, IAM, etc.): 2+ years

- CI/CD pipeline ownership (GitHub Actions or equivalent): 1+ year

Must Have :

Experience & General Skills :

- 5+ years of hands-on DevOps or Platform Engineering experience in a production environment

- Strong ownership mentality: you don't wait to be asked to fix something that's broken

- Comfortable working in a fast-moving startup environment with evolving infrastructure requirements

- Clear written communication (runbooks, post-mortems, documentation)

Cloud - AWS (2+ years) :

- Solid experience with AWS core services : EKS, EC2, ALB, ECR, CloudWatch, Secrets Manager, IAM, SQS, Route53, ACM

- Understanding of AWS networking : VPC design, subnets, security groups, VPC peering, PrivateLink

- Experience managing multi-environment AWS accounts (dev / prod separation)

Kubernetes & Containers (2+ years) :

- Production Kubernetes experience: deploying, scaling, and debugging workloads

- Helm chart authoring and maintenance (not just helm install)

- Docker: writing efficient multi-stage Dockerfiles for compiled (Go) and interpreted (Node.js) applications

- Familiarity with HPA, PDB, resource limits/requests, and pod scheduling

Infrastructure as Code - Terraform (2+ years) :

- 2+ years writing and maintaining Terraform at module level

- Experience with remote state, workspaces, and multi-environment Terraform layouts

- Ability to read existing module code, understand dependencies, and extend it safely

CI/CD (1+ year) :

- GitHub Actions: building and maintaining workflows (jobs, steps, secrets, environments, reusable workflows)

- Experience implementing gated release pipelines with automated checks and manual approval gates

- Container build and push pipelines to ECR or similar registries

Monitoring & Observability :

- Practical experience with log aggregation (CloudWatch, Fluent Bit, or similar)

- Alerting configuration: defining meaningful alerts that do not cause alert fatigue

- Experience debugging production issues from logs and metrics

Security :

- Secrets management best practices (Secrets Manager or Vault)

- IAM least-privilege design

- VPN and SSO administration basics

Nice to Have :

- Experience with SigNoz or OpenTelemetry-based APM platforms

- Experience with MongoDB Atlas including PrivateLink and cluster management

- Familiarity with SpiceDB or Zanzibar-style authorization systems

- Experience with AWS FIS or other chaos engineering tools

- Knowledge of Wasabi or S3-compatible storage beyond AWS native S3

- Experience with AWS Amplify for frontend deployments

- Exposure to gRPC service-based architectures (understanding of how Protocol Buffer services are deployed and scaled)

- Experience running cost optimization programs on EKS using spot/fleet instances

- Familiarity with ClamAV integration in Kubernetes environments

- Go or Node.js: enough to read service code, identify issues in Dockerfiles, and help debug build failures

What We Offer :

- Ownership over a production-grade, cloud-native infrastructure stack, not just ticket execution

- Exposure to a modern, well-structured microservices architecture with 16+ services

- A team that treats infrastructure quality as a first-class concern

- Flexible, asynchronous-friendly work culture

- Opportunity to shape DevOps practices and tooling from an early stage