Staff Machine Learning Infrastructure Engineer
Company: Dyna Robotics
Location: Redwood City
Posted on: May 7, 2025
Job Description:
Company Overview:
Dyna Robotics is at the forefront of revolutionizing
robotic manipulation with cutting-edge foundation models. Our
mission is to empower businesses by automating repetitive,
stationary tasks with affordable, intelligent robotic arms.
Leveraging the latest advancements in foundation models, we're
driving the future of general-purpose robotics, one manipulation
skill at a time.
Dyna Robotics was founded by industry leaders who
previously achieved a $350 million exit in grocery deep tech as
well as top robotics researchers from DeepMind and Nvidia. Our team
blends world-class research, engineering, and product innovation to
drive the future of robotic manipulation. With more than $20 million in funding,
we're positioned to redefine the landscape of robotic automation.
Join us to shape the next frontier of AI-driven robotics.
Position Overview:
We are seeking an experienced Machine Learning
Infrastructure Engineer to join our team and help scale our ML
training platform. In this role, you will be responsible for
designing, implementing, and maintaining large-scale ML
infrastructure to accelerate model iteration and improve training
performance across an expanding GPU ecosystem. You will work on
cutting-edge high-performance computing systems, optimizing
distributed training environments, and ensuring system reliability
as we scale.
Key Responsibilities:
- Infrastructure Design & Scalability:
- Architect and implement large-scale ML training pipelines that
leverage parallel GPU processing on platforms like GCP or AWS.
- Enhance our existing infrastructure to fully exploit
parallelism and design for future expansion, ensuring that our
system is ready to support growth.
- High-Performance ML Computing & Distributed Systems:
- Manage and optimize high-performance computing resources.
- Develop robust distributed computing solutions, addressing
challenges like race conditions, memory optimization, and resource
allocation.
- Optimize model training with techniques such as mixed precision,
ZeRO, and LoRA.
- Job Scheduling & Reliability:
- Design systems for job rescheduling, automated retries, and
failure recovery to maximize uptime and training efficiency.
- Implement intelligent job queuing mechanisms to optimize
training workloads and resource utilization.
- Evaluate tradeoffs between local and networked storage
solutions, and implement the options that improve data throughput
and access.
- Develop strategies for caching training data to optimize
performance.
- Work closely with ML researchers and data scientists to
understand training requirements and bottlenecks.
- Continuously monitor system performance, identify areas for
improvement, and implement best practices to enhance scalability
and reliability.
Required Qualifications:
- Bachelor's degree or higher in Computer Science or a related
field.
- At least 7 years of professional experience in the software
industry, with a minimum of 2 years in a tech lead role.
- Proven experience with high-performance computing environments
and distributed systems.
- Demonstrated ability to scale ML training systems and optimize
resource utilization.
- Hands-on experience with job scheduling systems and managing
cloud GPU environments (GCP, AWS, etc.).
- Deep understanding of distributed computing concepts, including
race conditions, memory optimization, and parallel processing.
- Hands-on experience in ML model tuning for performance.
- Experience with common ML training and inference tools
such as PyTorch, TensorRT, Triton, and Accelerate.
- Strong analytical and problem-solving skills with the ability
to troubleshoot complex system issues.
- Excellent communication skills to collaborate effectively with
cross-functional teams.
Preferred Qualifications:
- Experience with container orchestration tools (e.g.,
Kubernetes) and infrastructure-as-code frameworks.
If you're
passionate about building scalable ML systems and optimizing
high-performance computing infrastructure, we'd love to hear from
you.