Staff ML Performance Engineer (Training Efficiency)
Wayve · Sunnyvale, California USA
The role
We are looking for a Staff ML Performance Engineer to join our Training Tech team, which optimizes large-scale ML jobs so that we can scale our models to the next order of magnitude. A successful candidate will increase the efficiency of training and inference workloads, allowing Wayve to train larger models faster.
Key responsibilities:
- Profile ML workloads to identify their bottlenecks, e.g. using NVIDIA Nsight Systems
- Design and implement efficiency improvements to maximize MFU and throughput, e.g. parallelism, model compilation, mixed precision
- Design and implement observability tools to identify bottlenecks and drive performance improvements, e.g. tracking MFU, throughput, and latency
- Design and implement benchmarking tools, e.g. to track efficiency gains or regressions
- Collaborate closely with Research teams to integrate training efficiency improvements and create a culture of performance optimization
About you
In order to set you up for success in this role, we’re looking for the following skills and experience.
Essential
- 10+ years of industry experience driving performance engineering across ML systems, GPU compute infrastructure, distributed platforms, or a similar field.
- Experience optimizing large-scale jobs on GPU compute clusters.
- Experience working in platform teams and collaborating with research teams.
- Experience in writing, reporting, and tracking performance benchmarks in an open and accessible way.
- Ability to write high-quality, well-structured, and tested Python code.
- BS or MS in Machine Learning, Computer Science, Engineering, or a related technical discipline, or equivalent experience.
Desirable
- Experience working with concurrent, parallel, and distributed computing.
- Experience using NVIDIA Nsight Systems or other system profilers.
- Experience implementing GPU kernels (CUDA, Triton, etc.).
- Knowledge of computing fundamentals: what makes code fast, secure, and reliable.
This is a full-time, hybrid role based in Sunnyvale, CA. The reasonably estimated salary for this role ranges from $336,400 to $359,000, plus a competitive equity package. Actual compensation is based on the candidate's skills, qualifications, and experience.