Research Computing GPU Systems Engineer

Stanford University

Business Affairs: University IT (UIT)
Posted May 9, 2026
Job ID: 30782

About this position

Position Description

System Operations & Management

Lead day-to-day operations of the GPU Cluster, ensuring optimal uptime and performance. Architect monitoring, alerting, and observability solutions using Prometheus, Grafana, DCGM, and Base Command Manager. Manage job scheduling and resource allocation using Slurm, implementing advanced GPU partitioning and configurations. Coordinate maintenance windows, system upgrades, and capacity expansions; lead incident response and root cause analyses. Manage system storage, including optimization, benchmarking, and observability reporting.

Performance Optimization & Engineering

Design performance tuning strategies for GPU utilization, job throughput, and system efficiency. Optimize NVIDIA GPU fabric configurations, including NVLink, NVSwitch, and InfiniBand RDMA networking. Develop containerization strategies using NVIDIA NGC, Docker, and Singularity/Apptainer. Engineer solutions for deep learning frameworks (PyTorch, TensorFlow, JAX) and CUDA application optimization. Benchmark system performance and collaborate with NVIDIA on optimization programs.

User Support & Research Enablement

Serve as primary technical consultant for researchers using GPU-accelerated computing. Develop documentation, best practices guides, and training materials; deliver workshops on GPU computing workflows. Profile and optimize user workloads, scaling applications from single-GPU to multi-node distributed training.

Team Leadership & Strategy

Mentor junior engineers and contribute to strategic planning for GPU infrastructure expansion. Evaluate emerging GPU technologies and manage vendor relationships with NVIDIA and hardware suppliers. Represent SRC in ongoing interactions with the Stanford Data Sciences group on AI/ML infrastructure; participate in the on-call rotation.

Qualifications

Expert knowledge of NVIDIA GPU architecture, CUDA, and GPU computing principles (NVLink, MIG, GPUDirect)
Advanced Linux administration (RHEL, Ubuntu); expertise with Slurm job scheduler
Experience with high-performance networking (InfiniBand, RoCE) and parallel filesystems (Lustre, GPFS)
Strong scripting (Python, Bash) and containerization experience (Docker, Singularity, Kubernetes)
Familiarity with AI/ML frameworks (PyTorch, TensorFlow) and distributed training techniques
Experience with monitoring tools (Prometheus, Grafana) and NVIDIA DCGM

Application Requirements

Constantly perform desk-based computer tasks. Frequently sit and grasp lightly or perform fine manipulation. Occasionally stand or walk and write by hand. Rarely use a telephone or lift, carry, push, or pull objects that weigh up to 10 pounds.

* Consistent with its obligations under the law, the University will provide reasonable accommodations to applicants and employees with disabilities. Applicants requiring a reasonable accommodation for any part of the application or hiring process should contact Stanford University Human Resources by submitting a contact form.