Contact us

info@happytechies.com

Dallas, TX

Copyright © 2025 Happy Techies

All Rights Reserved

Software Engineer - LLM Training

Role: AI Engineer

Company: CentML

Job description

What you’ll do

• Design and implement highly efficient distributed training systems for large-scale deep learning models.

• Optimize parallelism strategies to improve performance and scalability across hundreds or thousands of GPUs.

• Develop low-level systems components and algorithms to maximize throughput and minimize memory and compute bottlenecks.

• Productionize the training systems onto the CentML Platform.

• Collaborate with researchers and engineers to productionize cutting-edge model architectures and training techniques.

• Contribute to the design of APIs, abstractions, and UX that make it easier to scale models while maintaining usability and flexibility.

• Profile, debug, and tune performance at the system, model, and hardware levels.

• Participate in design discussions, code reviews, and technical planning to ensure the product aligns with business goals.

• Stay up to date with the latest advancements in large-scale model training and help translate research into practical, robust systems.

What you’ll need to be successful

• Bachelor’s, Master’s, or PhD degree in Computer Science/Engineering, Software Engineering, or a related field, or equivalent working experience.

• 3+ years of experience in software development, preferably with Python and C++.

• Deep understanding of machine learning pipelines and workflows, distributed systems, parallel computing, and high-performance computing principles.

• Hands-on experience with large-scale training of deep learning models using frameworks like PyTorch, Megatron Core, and DeepSpeed.

• Experience optimizing compute, memory, and communication performance in large model training workflows.

• Familiarity with GPU programming, CUDA, NCCL, and performance profiling tools.

• Solid grasp of deep learning fundamentals, especially as they relate to transformer-based architectures and training dynamics.

• Experience working with cloud platforms (AWS, GCP, or Azure) and containerization tools (Docker, Kubernetes).

• Ability to work closely with both research and engineering teams, translating evolving needs into robust infrastructure.

• Excellent problem-solving skills, with the ability to debug complex systems.

• A passion for building high-impact tools that push the boundaries of what’s possible with large-scale AI.

Bonus points if you have

• Experience building tools or platforms for ML model training or fine-tuning.

• Experience building backends (e.g., using FastAPI) and frontends (e.g., using React).

• Experience building and optimizing LLM inference engines (e.g., vLLM, SGLang).

• Exposure to DevOps practices, CI/CD pipelines, and infrastructure as code.

• Familiarity with MLOps concepts, including model versioning and serving.

Requirements

• Bachelor's, Master's, or PhD in Computer Science/Engineering or related field.

• 3+ years of software development experience, preferably with Python and C++.

• Deep understanding of machine learning pipelines and workflows.

• Experience with large-scale training of deep learning models using frameworks like PyTorch.

• Familiarity with GPU programming, CUDA, and performance profiling tools.

• Solid grasp of deep learning fundamentals, especially transformer-based architectures.

• Experience working with cloud platforms (AWS, GCP, or Azure).

• Ability to collaborate with research and engineering teams.

• Excellent problem-solving skills for debugging complex systems.

• A passion for building high-impact tools for large-scale AI.

Similar jobs

Business Intelligence Developer/Power BI/Azure/Capital Markets/Hedge Fund- NY

May 7, 2025

Care IT Services Inc

New York, New York

Care IT Services Inc is seeking a Business Intelligence Developer with expertise in Power BI and Azure, specifically within the capital markets or hedge fund sectors. The role involves designing and delivering data solutions and reports to support business decision-making.

Senior Lead AI Engineer

May 9, 2025

Capital One

Richmond, Virginia

Capital One is seeking a Senior Lead AI Engineer to develop and deploy AI-powered products that enhance customer interactions and internal processes. The role involves collaborating with cross-functional teams to innovate and optimize AI systems.

W2 - Power BI Administrator/Developer with Microsoft Fabric & ADF Exp

May 6, 2025

My3Tech

Bolingbrook, Illinois

My3Tech is seeking a Power BI Administrator/Developer with experience in Microsoft Fabric and Azure Data Factory for a hybrid role in Bolingbrook, IL. The candidate will manage Power BI infrastructure and develop insightful reports and dashboards.

Job Type

Hybrid role

Skills required

No particular skills mentioned.

Location

Toronto, ON

Salary

No salary information was found.

Date Posted

April 17, 2025

CentML

CentML is seeking a Software Engineer to design and implement efficient distributed training systems for large-scale deep learning models. This role involves optimizing performance and collaborating with teams to enhance AI model development on the CentML Platform.
