Real Jobs. Real Change. See What's Next.
A living directory of real jobs that didn't exist 5 years ago. Curated for leaders, builders, and the curious.
LLM Training Frameworks and Optimization Engineer
Architects and optimizes distributed training infrastructure to enable efficient large-scale language model development across thousands of nodes and petabytes of data
Key Responsibilities:
- Design, implement, and optimize distributed training frameworks tailored to large language models
- Optimize communication patterns including gradient synchronization and all-reduce operations in distributed training environments
- Implement advanced techniques such as mixed precision, tensor parallelism, pipeline parallelism, and sharded training (see the sketch after this list)
- Conduct in-depth profiling and debugging of training jobs to identify and resolve performance bottlenecks
- Ensure training systems scale efficiently to thousands of nodes while maintaining fault-tolerant and checkpointed pipelines
- Collaborate with hardware teams to optimize performance for GPUs, TPUs, and other specialized accelerators
- Work with researchers and platform teams to ensure frameworks meet evolving model and workload requirements
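
To make the first few responsibilities concrete, here is a minimal sketch of one distributed training step using PyTorch DDP with mixed precision and periodic checkpointing. The model, batch shape, learning rate, and checkpoint path are illustrative placeholders, not details from the posting.

```python
# Minimal sketch: a DDP training step with mixed precision and
# periodic checkpointing. All model/data specifics are placeholders.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(4096, 4096).cuda()   # stand-in for a real LLM
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    scaler = torch.cuda.amp.GradScaler()         # loss scaling for fp16

    for step in range(1000):
        x = torch.randn(8, 4096, device="cuda")  # stand-in for a real batch
        with torch.autocast(device_type="cuda", dtype=torch.float16):
            loss = model(x).square().mean()      # stand-in for a real loss
        optimizer.zero_grad(set_to_none=True)
        scaler.scale(loss).backward()            # DDP all-reduces gradients here
        scaler.step(optimizer)
        scaler.update()

        # Fault tolerance: rank 0 writes a checkpoint every 100 steps.
        if step % 100 == 0 and dist.get_rank() == 0:
            torch.save({"step": step,
                        "model": model.module.state_dict(),
                        "optim": optimizer.state_dict()},
                       f"ckpt_{step}.pt")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()  # launch with: torchrun --nproc_per_node=8 train.py
```

Launched with torchrun, each process owns one GPU, and DDP overlaps gradient all-reduce with the backward pass. Tuning exactly that overlap is the communication-pattern work the second responsibility describes.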
Skills & Tools:
- Distributed training frameworks (PyTorch DDP, DeepSpeed, Megatron-LM, TensorFlow XLA)
- Parallelism techniques (data, tensor, pipeline, ZeRO-based parallelism)
- Programming languages (Python, C++, CUDA) for high-performance computing
- Memory optimization techniques such as activation checkpointing and gradient sharding (see the sketch after this list)
- GPU/TPU hardware and deep learning performance optimization
- Graph optimization and compiler-level performance tuning
- Training dynamics and hyperparameter optimization for large-scale LLMs
- Open-source deep learning and distributed training project contributions
- Low-level hardware optimizations (kernel fusion, custom CUDA kernels)
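
As one example of the memory-optimization skills above, the following sketch uses PyTorch's torch.utils.checkpoint to trade recomputation for activation memory. The residual block and its sizes are hypothetical stand-ins for transformer layers.

```python
# Minimal sketch of activation checkpointing: recompute a block's
# activations during backward instead of storing them. Block
# architecture and dimensions are illustrative only.
import torch
from torch.utils.checkpoint import checkpoint

class Block(torch.nn.Module):
    """A residual feed-forward block standing in for a transformer layer."""
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.ff = torch.nn.Sequential(
            torch.nn.Linear(dim, 4 * dim),
            torch.nn.GELU(),
            torch.nn.Linear(4 * dim, dim),
        )

    def forward(self, x):
        return x + self.ff(x)

blocks = torch.nn.ModuleList(Block() for _ in range(4))
x = torch.randn(8, 1024, requires_grad=True)
for blk in blocks:
    # Intermediate activations inside blk are discarded after forward
    # and recomputed during backward, trading compute for memory.
    x = checkpoint(blk, x, use_reentrant=False)
x.mean().backward()
```

With use_reentrant=False, each block's forward pass is replayed during backward, so peak activation memory scales with one block rather than the full depth. ZeRO-style gradient sharding (as in DeepSpeed) attacks optimizer and gradient memory in a complementary way.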
Where This Role Has Appeared:
- Together AI (AI Research/Infrastructure, Remote, $160k-$230k, July 2025)
Variants & Related Titles:
- Distributed ML Systems Engineer
- LLM Infrastructure Engineer
- Large-Scale Training Engineer
- AI Platform Optimization Engineer
- ML Performance Engineer
Why This Role Is New:
The LLM Training Frameworks and Optimization Engineer role emerged in 2022-2023 as language models scaled beyond what traditional ML infrastructure could handle. It addresses the unique challenges of training models with hundreds of billions of parameters across distributed clusters, demanding specialized expertise in memory optimization, communication patterns, and fault tolerance that conventional machine learning engineering did not require.
Trend Insight:
As AI models continue to grow in size and complexity, companies are creating highly specialized infrastructure roles to keep training efficient and cost-effective. That makes distributed training optimization a critical competitive advantage in the AI industry.
Seen this role elsewhere? Submit an example or share your story.