Staff Machine Learning Infrastructure Engineer

Company: Dyna Robotics
Location: Redwood City
Posted on: May 7, 2025

Job Description:

Company Overview:
Dyna Robotics is at the forefront of revolutionizing robotic manipulation with cutting-edge foundation models. Our mission is to empower businesses by automating repetitive, stationary tasks with affordable, intelligent robotic arms. Leveraging the latest advancements in foundation models, we're driving the future of general-purpose robotics, one manipulation skill at a time.

Dyna Robotics was founded by industry leaders who previously achieved a $350 million exit in grocery deep tech, as well as top robotics researchers from DeepMind and Nvidia. Our team blends world-class research, engineering, and product innovation to drive the future of robotic manipulation. With more than $20 million in funding, we're positioned to redefine the landscape of robotic automation. Join us to shape the next frontier of AI-driven robotics.

Position Overview:
We are seeking an experienced Machine Learning Infrastructure Engineer to join our team and help scale our ML training platform. In this role, you will be responsible for designing, implementing, and maintaining large-scale ML infrastructure to accelerate model iteration and improve training performance across an expanding GPU ecosystem. You will work on cutting-edge high-performance computing systems, optimizing distributed training environments, and ensuring system reliability as we scale.

Key Responsibilities:

Infrastructure Design & Scalability:
Architect and implement large-scale ML training pipelines that leverage parallel GPU processing on platforms like GCP or AWS.
Enhance our existing infrastructure to fully exploit parallelism and design for future expansion, ensuring that our system is ready to support growth.
High-Performance ML Computing & Distributed Systems:
Manage and optimize high-performance computing resources.
Develop robust distributed computing solutions, addressing challenges like race conditions, memory optimization, and resource allocation.
Optimize model training with techniques like mixed precision, ZeRO, LoRA, etc.
Job Scheduling & Reliability:
Design systems for job rescheduling, automated retries, and failure recovery to maximize uptime and training efficiency.
Implement intelligent job queuing mechanisms to optimize training workloads and resource utilization.
Evaluate and implement tradeoffs between different local and networked storage solutions to improve data throughput and access.
Develop strategies for caching training data to optimize performance.
Work closely with ML researchers and data scientists to understand training requirements and bottlenecks.
Continuously monitor system performance, identify areas for improvement, and implement best practices to enhance scalability and reliability.

Required Qualifications:
- Bachelor's degree or higher in Computer Science or a related field.
- At least 7 years of professional experience in the software industry, with a minimum of 2 years in a tech lead role.
- Proven experience with high-performance computing environments and distributed systems.
- Demonstrated ability to scale ML training systems and optimize resource utilization.
- Hands-on experience with job scheduling systems and managing cloud GPU environments (GCP, AWS, etc.).
- Deep understanding of distributed computing concepts, including race conditions, memory optimization, and parallel processing.
- Hands-on experience in ML model tuning for performance.
- Experience with common ML training and inference tools including PyTorch, TensorRT, Triton, Accelerate, etc.
- Strong analytical and problem-solving skills with the ability to troubleshoot complex system issues.
- Excellent communication skills to collaborate effectively with cross-functional teams.

Preferred Qualifications:
- Experience with container orchestration tools (e.g., Kubernetes) and infrastructure-as-code frameworks.

If you're passionate about building scalable ML systems and optimizing high-performance computing infrastructures, we'd love to hear from you.



