Posted on: 07/05/2026
Job Description :
As an SRE Manager at Ford Motor Company, you will lead and empower a team of Site Reliability Engineers dedicated to optimizing the performance, reliability, and scalability of our diverse and critical application portfolio within the Marketing and Sales organization. This extensive landscape includes both on-premise and cloud-based applications, supporting a wide range of internal and external customers, including our vital eCommerce business. In this leadership role, you will be instrumental in defining and implementing SRE best practices, fostering a culture of operational excellence, and driving cross-functional collaboration with engineering, product management, and operations teams across Marketing and Sales.
Leveraging your deep expertise in site reliability, you will guide strategic initiatives for continuous improvement, incident management, and proactive system health monitoring across the entire Marketing and Sales application ecosystem. Your leadership will be key to maintaining Ford's position as an innovator within the automotive industry, helping us set new standards in digital commerce and internal customer satisfaction across all our marketing and sales initiatives. Your strategic direction and your team's contributions will directly ensure the smooth operation and evolutionary growth of our comprehensive Marketing and Sales capabilities, reinforcing Ford's commitment to excellence and innovation.
As an SRE Manager for the Marketing and Sales organization, your responsibilities will include :
Define SRE Strategy & Vision :
- Translate technical metrics into meaningful business health indicators and ensure alignment with Ford's strategic imperatives, particularly during the 12-16 month scaling initiative to encompass the entire Marketing and Sales department.
- Establish and enforce enterprise-wide SRE principles, policies, Service Level Objective (SLO) and Service Level Indicator (SLI) standards, and error budget frameworks, ensuring consistent application and adherence across all portfolios (on-prem and cloud).
Lead the "Paved Road" Initiative & Platform Engineering :
- Treat the SRE platform as a product, continuously developing and improving it for internal customers, focusing on automation, self-service, and minimizing operational toil.
- Champion an automation-first mindset across the organization, guiding the team in developing automated solutions for infrastructure provisioning and operational tasks to maximize efficiency and reliability.
Drive Observability, AIOps & Performance Strategy :
- Own the overall observability strategy and roadmap for the diverse Marketing and Sales application ecosystem, ensuring comprehensive monitoring, alerting, and logging across on-prem and cloud solutions.
- Guide the team in continuously measuring and optimizing system performance to exceed customer needs, advance capabilities, and ensure a holistic view of system health.
Architectural Leadership & Collaboration for Reliability :
- Influence architectural decisions across product teams to ensure services are designed for observability, failure tolerance, and efficient operation for a global customer base.
Oversee Incident Management & Resilience :
- Establish and continuously refine robust incident management processes, including fostering a blameless post-mortem culture to drive continuous learning and improvement.
- Lead cross-domain incident coordination for critical, systemic failures across Marketing and Sales, driving continuous improvement in Mean Time To Recovery (MTTR).
- Champion proactive resilience testing, including Chaos Engineering, to identify and address vulnerabilities before they impact customers.
- Lead the 24x7 First Responder team, managing the Marketing and Sales Command Center and ensuring rapid response and resolution for production incidents.
Team Leadership & Development :
- Recruit, mentor, and develop a high-performing team of highly skilled SREs, fostering a culture of continuous learning, innovation, and psychological safety.
- Provide technical guidance, coaching, and career development opportunities to team members, enabling them to grow their expertise across the diverse Marketing and Sales technologies.
- Manage team capacity, workload, and priorities effectively to support the expanding scope of SRE across Marketing and Sales.
Cross-Functional Engagement & Governance :
- Build strong partnerships with development, QA, product management, security, and infrastructure teams across Marketing and Sales to embed SRE principles throughout the entire software development lifecycle.
- Ensure compliance with Ford's security, regulatory, and operational standards, overseeing the implementation and regular testing of disaster recovery and business continuity processes for all Marketing and Sales applications.
- Foster a vibrant SRE Community of Practice for all SREs, promoting knowledge sharing, best practices, and continuous upskilling.
Reporting, Vendor & Budget Management :
- Provide executive-level reporting on the overall health, reliability, and performance of the Marketing and Sales ecosystem, identifying systemic trends and presenting strategic recommendations to senior leadership.
- Manage relationships with key technology vendors for SRE-related tools and platforms.
- Oversee the budget for central SRE initiatives and drive cloud spend optimization strategies.
EXPERIENCES / COMPETENCIES :
Education Qualification :
- Bachelor's degree in Computer Science, Engineering, or a related technical field (Master's degree preferred).
Number of Years of Experience :
Progressive Leadership :
- 10+ years of progressive experience in Site Reliability Engineering, including at least 5 years of proven leadership experience managing and mentoring SRE teams.
Cloud Expertise :
- Extensive experience designing, deploying, and operating mid to large-scale public cloud environments.
- GCP expertise is a must-have, with additional experience in AWS or Azure being a significant advantage.
Infrastructure as Code (IaC) :
- Demonstrated expertise and hands-on experience in implementing and driving Infrastructure as Code (IaC) strategies, particularly with Terraform Enterprise.
SRE Frameworks & Observability :
- Strong track record of defining and implementing comprehensive SRE frameworks, including Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets.
- Proven experience in developing and implementing robust observability solutions (monitoring, logging, tracing) using tools such as Dynatrace, Grafana, Prometheus, and native cloud monitoring services.
Modern Application Architectures :
- Experience with microservices architectures, Spring Boot, and both NoSQL and SQL datastores.
Enterprise CMS (Plus) :
- Familiarity with Adobe Experience Manager (AEM) or similar enterprise Content Management System (CMS) platforms is a plus.
Leadership Skills and Personality Traits :
The ideal SRE Manager will demonstrate :
Strategic & Visionary Leadership :
- An exceptional strategic thinker with a proven ability to translate complex business requirements into tangible technical solutions and SRE initiatives.
- Possesses the ability to dissect problems from multiple angles and guide the team towards the most efficient and impactful solutions that align with the long-term SRE strategy and Ford's business objectives.
Proactive & Anticipatory :
- Takes initiative to identify and mitigate potential problems before they impact operations, continuously seeking opportunities to enhance system performance and reliability across the diverse Marketing and Sales landscape.
Decisive & Resilient :
- Maintains composure and provides clear direction under pressure, making confident and rapid decisions during critical incidents to minimize disruption.
- Demonstrates resilience in bouncing back from setbacks, maintaining focus on achieving and exceeding reliability goals.
Change Agent :
- Possesses a strong ability to drive significant organizational and technical change, fostering a culture of reliability, accountability, and continuous improvement throughout Marketing and Sales.
Dynamic Prioritization :
- Proven experience managing multiple competing priorities effectively in a fast-paced, dynamic environment, ensuring strategic focus and operational execution.
Collaboration & Communication Excellence :
- Possesses exceptional leadership, communication, and interpersonal skills with the ability to influence and collaborate effectively with senior leadership, product teams, and individual contributors.
- Articulates complex technical information clearly and concisely to diverse audiences, from highly technical engineers to executive leadership and non-technical stakeholders.
Cross-Functional Partnership :
- Builds and nurtures strong, collaborative relationships with engineering, product, operations, security, and business teams, championing SRE principles and fostering a shared understanding of reliability goals across the entire Marketing and Sales ecosystem.
Global & Distributed Team Leadership :
- Proficiently leads and manages a geographically dispersed or remote team, leveraging tools and best practices to ensure seamless communication, collaboration, and productivity regardless of location.
Team Development & Cultural Stewardship :
- Fosters a supportive, inclusive, and psychologically safe team environment where diverse perspectives are valued, and team members feel empowered to contribute, innovate, and take calculated risks.
Dedicated Mentor & Coach :
- Demonstrates a deep commitment to the professional growth of team members, providing continuous guidance, constructive feedback, and opportunities for development, helping SREs expand their expertise across the varied Marketing and Sales technologies.
Inspirational & Motivational :
- Inspires the team to strive for technical excellence, challenge the status quo, and continuously innovate, cultivating a culture of curiosity, learning, and blameless continuous improvement.
Technical Acumen & Innovation Driver :
- Detail-Oriented : Instills a culture of meticulous attention to detail within the team, ensuring that small issues are identified and addressed proactively before they can escalate into larger systemic problems.
- Adaptable & Agile : Exhibits flexibility in navigating unexpected technical challenges and shifts in project or technology directions, guiding the team through change with a pragmatic and solution-oriented approach.
Accountability & Ownership :
- Unwavering Accountability : Takes full ownership and responsibility for the reliability, performance, and overall health of the Marketing and Sales application portfolio and the success of the SRE team, consistently driving for measurable outcomes and fostering a culture of accountability.
Functional/Technical Skills :
The SRE Manager will possess :
SRE Strategy & Best Practices :
- SLO/SLI Mastery : Proven ability to partner effectively with product management and development teams to establish meaningful Service Level Objectives and Service Level Indicators, utilizing Key Performance Indicators to drive the effective use of error budgets and ensure maximum application availability and uptime.
- Resilience Engineering Leadership : Strategic leadership in designing, implementing, and overseeing resilience strategies, including proactive resilience testing, to enhance the fault tolerance and recoverability of critical systems.
- Incident Management Leadership : Demonstrated expertise in leading incident response efforts for critical outages, focusing on rapid diagnosis, containment, resolution (minimizing Mean Time To Recovery), and conducting thorough blameless post-mortems to drive continuous improvement. Experience in leading a 24x7 first responder team.
Architecture & Modern Platform Engineering :
Expert-Level Architecture Understanding :
- Expert-level understanding of cloud-native architectures, distributed systems, and resilience patterns.
- Proven ability to lead and guide teams in solving complex architectural, design, and business problems, focusing on simplification, optimization, and bottleneck removal across diverse application landscapes (multi-tier, microservices).
Hybrid Cloud Expertise :
- Extensive experience managing teams working with and strategically leveraging both on-premise and leading public cloud platforms (e.g., GCP, AWS, Azure), including deep understanding of their services, deployment models, and operational best practices for hybrid environments.
Modern Application Stacks :
- Comprehensive understanding and strategic application of modern application development stacks, including experience with diverse programming languages (e.g., Java, Python), frameworks (e.g., Spring Boot), and data stores (NoSQL/SQL).
Microservices & API Design :
- Deep understanding and strategic application of microservices architectures, RESTful API design principles, and event-driven systems to build scalable and resilient platforms.
Containerization & Orchestration Mastery :
- Expert-level knowledge of, or experience guiding teams in, containerization technologies (e.g., Docker) and container orchestration platforms (e.g., Kubernetes/GKE), including multi-tenant scaling, security, and operational best practices.
Automation, CI/CD & Observability :
Automation Leadership :
- Proven ability to architect, design, and lead the development of automation solutions to significantly reduce toil, improve recoverability, availability, latency, and scalability of supported applications.
- This includes experience with Infrastructure as Code (IaC) tools (e.g., Terraform) and proficiency in scripting/automation languages (e.g., Python, Go, Ansible, Node.js).
DevOps & CI/CD Expertise :
- Knowledge of CI/CD pipelines and DevOps practices, with experience in driving their implementation and optimization (Tekton experience is an advantage).
Observability & AIOps Strategy :
- Expertise in defining and driving comprehensive observability strategies, including the selection, implementation, and optimization of APM tools and monitoring solutions (e.g., Dynatrace, New Relic, ELK, Splunk, Prometheus, DataDog).
- Demonstrated experience with AIOps platforms and strategies for intelligent alerting and event correlation, leveraging AI/ML for anomaly detection and predictive analytics.
Data-Driven Optimization :
- Strong analytical skills with the ability to lead the analysis of complex performance data to identify systemic issues, predict future challenges, and drive continuous system performance improvements.
- Understanding of Mean Time To Detection (MTTD) and Mean Time To Recovery (MTTR).
Infrastructure & Security :
- Network & Infrastructure Acumen : Strong grasp of network architecture, protocols, and security practices to guide the design and operation of robust, secure, and compliant hybrid systems.
- Database Management : Comprehensive knowledge of database administration, management, and scaling strategies for both SQL and NoSQL datastores in high-availability environments.
- Disaster Recovery & Business Continuity : Extensive experience in developing, implementing, and regularly testing comprehensive disaster recovery and business continuity strategies to ensure data integrity and availability for critical applications across diverse infrastructure.
Posted in
DevOps / SRE
Functional Area
Site Reliability Engineering
Job Code
1634259