Posted on: 27/04/2026
Description :
Title : Product Architect (Cloud Infrastructure Intelligence)
About the Product :
We're building a distributed engine designed to remediate cloud waste at scale. This isn't a monitoring tool; it's an intelligence layer that executes risk-aware decisions across complex enterprise environments.
The Work :
This is a high-leverage, part-time role focused on Product Strategy rather than daily implementation. You will define the rules of engagement for our AI.
Key Focus Areas :
- Detection Models : Define how we score confidence and reduce false positives across CPU, network, and disk metrics.
- Risk Modeling : Categorize environment tiers (Dev vs. Prod) and set the automation boundaries for each.
- System Behavioral Logic : Set the evaluation windows and telemetry requirements needed for the system to act accurately.
- Edge Case Engineering : Architect for "noise" such as batch processing, shared storage (EFS), and Auto Scaling Groups.
Required Background :
- AWS Mastery : Deep knowledge of EC2, IAM, CloudWatch, and Organizations.
- Systems Design : Experience building automation workflows or complex cloud architectures.
- The "FinOps" Mindset : A background in cost optimization or observability is highly preferred.
Responsibilities & Duties :
1. Architecting the Core Intelligence Logic :
- Design Multi-Signal Detection Frameworks : Develop the mathematical and logical models that determine "idleness" by synthesizing data from CPU, network throughput, disk I/O, and memory.
- Establish Confidence Scoring : Create the scoring engine that weighs different signals to produce a "remediation confidence" percentage, ensuring the system only acts when data is conclusive.
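To give a sense of the kind of logic this covers, here is a minimal sketch of a weighted confidence score. It is an illustration only, not our engine: the weights, idle thresholds, and function names below are hypothetical assumptions, and the linear scoring curve is just one possible choice.

```python
# Hypothetical signal weights (sum to 1.0) and idle thresholds.
# All values are illustrative assumptions, not production settings.
WEIGHTS = {"cpu": 0.40, "network": 0.25, "disk_io": 0.20, "memory": 0.15}
IDLE_THRESHOLDS = {"cpu": 5.0, "network": 1.0, "disk_io": 1.0, "memory": 20.0}


def signal_idleness(value, threshold):
    """Score one signal: 1.0 at zero usage, falling linearly to 0.0 at 2x threshold."""
    return max(0.0, min(1.0, 1.0 - value / (2 * threshold)))


def remediation_confidence(metrics):
    """Combine per-signal idleness scores into a 0-100 confidence percentage.

    If any expected signal is missing, the data is treated as inconclusive
    and the function returns 0.0 so that nothing is acted on.
    """
    if any(name not in metrics for name in WEIGHTS):
        return 0.0
    total = sum(
        WEIGHTS[name] * signal_idleness(metrics[name], IDLE_THRESHOLDS[name])
        for name in WEIGHTS
    )
    return round(100 * total, 1)
```

Returning 0.0 when a signal is missing is one way to express "only act when data is conclusive"; a real design might instead track data completeness as a separate dimension.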
2. Risk Modeling & Operational Guardrails :
- Define Environment Risk Tiers : Establish distinct "Rules of Engagement" for different environment types (Sandbox, Dev, Staging, and Production) to ensure safety.
- Set Automation Boundaries : Design the "Hard Stop" logic: identify the specific conditions under which the system should pause and escalate for human approval rather than proceed with automated remediation.
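As a rough illustration of how risk tiers and hard stops could fit together, consider the sketch below. The tier thresholds, the flag names, and the `decide` function are hypothetical assumptions made for this example; defining the real rules is the substance of the role.

```python
from enum import Enum


class Action(Enum):
    AUTO_REMEDIATE = "auto_remediate"
    ESCALATE = "escalate"      # pause and ask a human for approval
    NO_ACTION = "no_action"


# Hypothetical policy: minimum confidence (0-100) required to act automatically.
# None means the tier never allows automated remediation.
TIER_POLICY = {"sandbox": 70.0, "dev": 80.0, "staging": 90.0, "production": None}


def decide(tier, confidence, hard_stop_flags=()):
    """Return the action for a flagged resource under the assumed policy."""
    if tier not in TIER_POLICY:
        return Action.ESCALATE          # unknown environment: never guess
    if hard_stop_flags:
        return Action.ESCALATE          # any hard-stop condition forces review
    threshold = TIER_POLICY[tier]
    if threshold is None:
        return Action.ESCALATE          # e.g. production always needs approval
    if confidence >= threshold:
        return Action.AUTO_REMEDIATE
    return Action.NO_ACTION
```

For example, under these assumed values `decide("dev", 85.0)` allows automation, while `decide("dev", 85.0, ["efs_mounted"])` and `decide("production", 99.0)` both escalate to a human.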
3. Strategic System Design & Telemetry :
- Specify Telemetry Requirements : Define exactly which metrics (from CloudWatch, Flow Logs, etc.) the engineering team needs to ingest to enable high-fidelity decision-making.
- Bridge Product & Engineering : Translate high-level business goals (e.g., "reduce spend by 20%") into technical logic requirements that the engineering team can build and scale.
- Advise on Multi-Account Strategy : Architect how the intelligence layer should behave across complex AWS Organizations and cross-account IAM roles.
4. Explainability & Decision Transparency :
- Design the "Evidence Chain" : Define the specific data points required to justify every decision. Ensure that when a resource is flagged, the system provides a human-readable "Why."
5. Edge Case Engineering & Validation :
- Scenario Modeling : Develop logic to handle "noisy" infrastructure, including Auto Scaling Groups (ASGs), ephemeral workloads, and EFS-backed instances that appear idle but are mission-critical.
- False Positive Reduction : Continuously analyze system outputs to identify patterns that lead to false positives (e.g., monthly cron jobs or quarterly backups) and refine the detection models accordingly.
Posted in : DevOps / SRE
Functional Area : DevOps / Cloud
Job Code : 1631464