Description :
We are looking for a Senior Linux/Systems Engineer to design, build, and operate bare-metal AI/HPC GPU clusters. You'll own platform bring-up (UEFI/BIOS, bootloaders, OS), kernel/device enablement, low-level networking (RoCEv2/InfiniBand), GPU/accelerator stack readiness, and repeatable automation for provisioning and compliance.
This role suits someone who enjoys getting their hands dirty in firmware, kernel, and PCIe, and then scaling that knowledge with Ansible/Python.
Responsibilities :
- Own server bring-up : UEFI/BIOS configuration, Secure Boot/TPM/Measured Boot, GRUB, PXE/iPXE flows. Integrate and automate BMC/IPMI/Redfish workflows for out-of-band provisioning and fleet management.
- OS and Kernel Engineering : Build, customise, and harden Ubuntu images (cloud-init, Debos) and tune systemd/init for low-latency, high-throughput workloads. Diagnose and fix kernel/user-space issues using perf, ftrace, eBPF/bpftrace; configure NUMA, IRQ affinity, cgroups/namespaces.
- PCIe/Driver Enablement : Validate PCIe topologies and features (ACS/ARI/ATS), SR-IOV, IOMMU/VFIO; bring up NIC/GPU drivers and firmware. Root-cause device initialisation and performance regressions across kernel, drivers, and userspace.
- Provisioning and Automation at Scale : Author idempotent Ansible playbooks/roles; implement Python/Pytest test harnesses for pre-/post-provision validation. Build golden images and repeatable pipelines for server provisioning, configuration drift detection, and remediation.
- GPU/Accelerator and HPC Stack Readiness : Enable NVIDIA CUDA/NCCL/GPUDirect RDMA and AMD ROCm; validate multi-GPU/multi-node performance. Stand up and tune NCCL/UCX, MPI (OpenMPI), torchrun/PyTorch for distributed training workloads.
- Containers and Build Tooling : Build and maintain minimal, reproducible Docker images and Docker Compose environments for CI and validation. Use C/Go/Python, Make/CMake, and CI (GitHub Actions/GitLab CI) to publish and maintain validation and automation tools.
- High-Performance Networking : Configure and tune RoCEv2 and/or InfiniBand fabrics; validate rdma-core/libibverbs paths end-to-end. Optimise congestion control, MTU/jumbo frames, NUMA/RSS/IRQ steering for consistent throughput/latency.
- Security and Compliance : Apply CIS hardening baselines; maintain Secure Boot policy, measured boot attestations, and patch compliance. Implement access controls and auditability across firmware, OS, and cluster automation.
Requirements :
- Educational Background : Bachelor's or Master's degree in Computer Science, Software Engineering, or a related field.
Technical Skills :
- Platform/Boot : UEFI/BIOS, GRUB, Secure Boot, PXE/iPXE, BMC/IPMI/Redfish.
- OS/Kernel : Linux (Ubuntu), systemd/init, eBPF, perf/ftrace/bpftrace, cgroups, namespaces, NUMA, IRQ affinity.
- Drivers/PCIe : PCIe fundamentals (ACS/ARI/ATS), SR-IOV, VFIO, IOMMU, NIC/GPU drivers.
- Provisioning/Automation : Ansible, Python, Pytest, Debos, cloud-init.
- Containers : Docker, docker-compose.
- Build/Dev : C, Python, Go (optional), Make, CMake, CI (GitHub Actions/GitLab CI).
- Networking (HPC) : RoCEv2/InfiniBand, libibverbs/rdma-core, NCCL/UCX, MPI (OpenMPI).
- GPU/Accel : NVIDIA (CUDA, NCCL, GPUDirect RDMA), AMD ROCm.
- Security/Compliance : CIS hardening, Secure Boot, TPM/Measured Boot.
Professional Experience :
- 7+ years in Linux systems engineering, including kernel/userspace debugging and performance tuning.
- Proven ownership of bare-metal server bring-up and fleet-scale provisioning via Ansible/Python.
- Hands-on with PCIe device enablement (SR-IOV/VFIO/IOMMU) and NIC/GPU driver stacks.
- Demonstrated success enabling multi-GPU/multi-node training over RoCEv2 or InfiniBand.
- Track record building reproducible OS images and container artefacts for production use.
Soft Skills :
- Ability to mentor peers, partner with researchers/ML engineers, and influence cross-functional roadmaps.
- Clear, concise documentation habits; you turn tribal knowledge into automation and runbooks.
Preferred Qualifications :
- Experience in cloud-based AI solutions and infrastructure.
- Familiarity with performance benchmarking and optimisation.
- Knowledge of modern development practices and Agile methodologies.
Posted in : DevOps / SRE
Functional Area : Embedded / Kernel Development
Job Code : 1586723