What you’ll do
- Design and implement highly efficient distributed training systems for large-scale deep learning models.
- Optimize parallelism strategies to improve performance and scalability across hundreds or thousands of GPUs.
- Develop low-level systems components and algorithms to maximize throughput and minimize memory and compute bottlenecks.
- Productionize the training systems on the CentML Platform.
- Collaborate with researchers and engineers to productionize cutting-edge model architectures and training techniques.
- Contribute to the design of APIs, abstractions, and UX that make it easier to scale models while maintaining usability and flexibility.
- Profile, debug, and tune performance at the system, model, and hardware levels.
- Participate in design discussions, code reviews, and technical planning to ensure the product aligns with business goals.
- Stay up to date with the latest advancements in large-scale model training and help translate research into practical, robust systems.

What you’ll need to be successful
- Bachelor’s, Master’s, or PhD degree in Computer Science/Engineering, Software Engineering, or a related field, or equivalent work experience.
- 3+ years of experience in software development, preferably with Python and C++.
- Deep understanding of machine learning pipelines and workflows, distributed systems, parallel computing, and high-performance computing principles.
- Hands-on experience with large-scale training of deep learning models using frameworks like PyTorch, Megatron Core, and DeepSpeed.
- Experience optimizing compute, memory, and communication performance in large model training workflows.
- Familiarity with GPU programming, CUDA, NCCL, and performance profiling tools.
- Solid grasp of deep learning fundamentals, especially as they relate to transformer-based architectures and training dynamics.
- Experience working with cloud platforms (AWS, GCP, or Azure) and containerization tools (Docker, Kubernetes).
- Ability to work closely with both research and engineering teams, translating evolving needs into robust infrastructure.
- Excellent problem-solving skills, with the ability to debug complex systems.
- A passion for building high-impact tools that push the boundaries of what’s possible with large-scale AI.

Bonus points if you have
- Experience building tools or platforms for ML model training or fine-tuning.
- Experience building backends (e.g., using FastAPI) and frontends (e.g., using React).
- Experience building and optimizing LLM inference engines (e.g., vLLM, SGLang).
- Exposure to DevOps practices, CI/CD pipelines, and infrastructure as code.
- Familiarity with MLOps concepts, including model versioning and serving.

Job Type
Hybrid role
Location
Toronto, ON
Date Posted
April 17, 2025
CentML is seeking a Software Engineer to design and implement efficient distributed training systems for large-scale deep learning models. This role involves optimizing performance and collaborating with teams to enhance AI model development on the CentML Platform.