- Engage with our product teams to understand requirements, design, and implement resilient and scalable infrastructure solutions

- Operate, monitor, and triage all aspects of our production and non-production environments

- Collaborate with other engineers on code, infrastructure, design reviews, and process enhancements.

- Evaluate and integrate new technologies to improve system reliability, security, and performance

- Develop and implement automation to provision, configure, deploy, and monitor Apple services

- Participate in an on-call rotation providing hands-on technical expertise during service-impacting events

- Design, build, and maintain highly available and scalable infrastructure

- Implement and improve monitoring, alerting, and incident response systems

- Automate operations tasks and develop efficient workflows

- Conduct system performance analysis and optimization

- Collaborate with development teams to ensure smooth deployment and release processes

- Implement and maintain security best practices and compliance standards

- Troubleshoot and resolve system and application issues

- Participate in capacity planning and scaling efforts

- Stay up to date with the latest trends, technologies, and advancements in SRE practices

- Contribute to capacity planning, scale testing, and disaster recovery exercises.

- Approach operational problems with a software engineering mindset

- BS degree in computer science or an equivalent field, with 5+ years of experience

- 5+ years in an Infrastructure Ops, Site Reliability Engineering, or DevOps-focused role.

- Knowledge of Linux operating system principles, networking fundamentals, and systems management.

- Demonstrable fluency in at least one of the following languages: Java, Python, or Go

- Experience managing and scaling distributed systems in a public, private, or hybrid cloud environment

- Develop and implement automation tools and apply best practices for system reliability.

- Own the availability and scalability of our services, and manage disaster recovery and other operational tasks.

- Collaborate with the development team to improve observability by instrumenting the application codebase with logging, metrics, and traces.

- Collaborate with data science teams and other business units to design, build, and maintain the infrastructure that runs machine learning and generative AI workloads.

- Influence architectural decisions with a focus on security, scalability, and performance.

- Find and fix problems in production, and work to prevent them from recurring

Preferred Qualifications:

- Familiarity with microservices architecture and container orchestration with Kubernetes.

- Awareness of key security principles, including encryption and cryptographic keys (key types and exchange protocols).

- Understanding of SRE principles, including monitoring, alerting, error budgets, fault analysis, and automation.

- Strong sense of ownership, with a desire to communicate and collaborate with other engineers and teams.

- Ability to identify and communicate technical and architectural problems, while working with partners and their teams to iteratively find solutions.