Real Jobs. Real Change. See What's Next.
A living directory of real jobs that didn't exist 5 years ago. Curated for leaders, builders, and the curious.
LLM Training Frameworks and Optimization Engineer
Architects and optimizes distributed training infrastructure to enable efficient large-scale language model development across thousands of nodes and petabytes of data
Key Responsibilities:
- Design, implement, and optimize distributed training frameworks tailored to large language models
- Optimize communication patterns including gradient synchronization and all-reduce operations in distributed training environments
- Implement advanced techniques such as mixed precision, tensor parallelism, pipeline parallelism, and sharded training (see the sketch after this list)
- Conduct in-depth profiling and debugging of training jobs to identify and resolve performance bottlenecks
- Ensure training systems scale efficiently to thousands of nodes while maintaining fault-tolerant and checkpointed pipelines
- Collaborate with hardware teams to optimize performance for GPUs, TPUs, and other specialized accelerators
- Work with researchers and platform teams to ensure frameworks meet evolving model and workload requirements
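
To make the first few responsibilities concrete, here is a minimal sketch of one distributed training step using PyTorch DDP with mixed precision and periodic checkpointing. The model, batch shape, learning rate, and checkpoint path are illustrative placeholders, not details from the posting.

```python
# Minimal sketch: a DDP training step with mixed precision and
# periodic checkpointing. All model/data specifics are placeholders.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(4096, 4096).cuda()   # stand-in for a real LLM
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    scaler = torch.cuda.amp.GradScaler()         # loss scaling for fp16

    for step in range(1000):
        x = torch.randn(8, 4096, device="cuda")  # stand-in for a real batch
        with torch.autocast(device_type="cuda", dtype=torch.float16):
            loss = model(x).square().mean()      # stand-in for a real loss
        optimizer.zero_grad(set_to_none=True)
        scaler.scale(loss).backward()            # DDP all-reduces gradients here
        scaler.step(optimizer)
        scaler.update()

        # Fault tolerance: rank 0 writes a checkpoint every 100 steps.
        if step % 100 == 0 and dist.get_rank() == 0:
            torch.save({"step": step,
                        "model": model.module.state_dict(),
                        "optim": optimizer.state_dict()},
                       f"ckpt_{step}.pt")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()  # launch with: torchrun --nproc_per_node=8 train.py
```

Launched with torchrun, each process owns one GPU, and DDP overlaps gradient all-reduce with the backward pass. Tuning exactly that overlap is the communication-pattern work the second responsibility describes.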
Skills & Tools:
- Distributed training frameworks (PyTorch DDP, DeepSpeed, Megatron-LM, TensorFlow XLA)
- Parallelism techniques (data, tensor, pipeline, ZeRO-based parallelism)
- Programming languages (Python, C++, CUDA) for high-performance computing
- Memory optimization techniques such as activation checkpointing and gradient sharding (see the sketch after this list)
- GPU/TPU hardware and deep learning performance optimization
- Graph optimization and compiler-level performance tuning
- Training dynamics and hyperparameter optimization for large-scale LLMs
- Open-source deep learning and distributed training project contributions
- Low-level hardware optimizations (kernel fusion, custom CUDA kernels)
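
As one example of the memory-optimization skills above, the following sketch uses PyTorch's torch.utils.checkpoint to trade recomputation for activation memory. The residual block and its sizes are hypothetical stand-ins for transformer layers.

```python
# Minimal sketch of activation checkpointing: recompute a block's
# activations during backward instead of storing them. Block
# architecture and dimensions are illustrative only.
import torch
from torch.utils.checkpoint import checkpoint

class Block(torch.nn.Module):
    """A residual feed-forward block standing in for a transformer layer."""
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.ff = torch.nn.Sequential(
            torch.nn.Linear(dim, 4 * dim),
            torch.nn.GELU(),
            torch.nn.Linear(4 * dim, dim),
        )

    def forward(self, x):
        return x + self.ff(x)

blocks = torch.nn.ModuleList(Block() for _ in range(4))
x = torch.randn(8, 1024, requires_grad=True)
for blk in blocks:
    # Intermediate activations inside blk are discarded after forward
    # and recomputed during backward, trading compute for memory.
    x = checkpoint(blk, x, use_reentrant=False)
x.mean().backward()
```

With use_reentrant=False, each block's forward pass is replayed during backward, so peak activation memory scales with one block rather than the full depth. ZeRO-style gradient sharding (as in DeepSpeed) attacks optimizer and gradient memory in a complementary way.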
Where This Role Has Appeared:
- Together AI (AI Research/Infrastructure, Remote, $160k-$230k, July 2025)
Variants & Related Titles:
- Distributed ML Systems Engineer
- LLM Infrastructure Engineer
- Large-Scale Training Engineer
- AI Platform Optimization Engineer
- ML Performance Engineer
Why This Role Is New:
The LLM Training Frameworks and Optimization Engineer role emerged in 2022-2023 as language models scaled beyond what traditional ML infrastructure could handle. It addresses the unique challenges of training models with hundreds of billions of parameters across distributed clusters, demanding specialized expertise in memory optimization, communication patterns, and fault tolerance that conventional machine learning engineering did not require.
Trend Insight:
As AI models continue to grow in size and complexity, companies are creating highly specialized infrastructure roles to keep training efficient and cost-effective. That makes distributed training optimization a critical competitive advantage in the AI industry.
Seen this role elsewhere? Submit an example or share your story.