Real Jobs. Real Change. See What's Next.

A living directory of real jobs that didn't exist 5 years ago. Curated for leaders, builders, and the curious.

Research Engineer, Tokens

Optimizes massive training datasets and investigates scaling laws to improve the efficiency and performance of large language model development.

AI Safety
LLM Research

Key Responsibilities:

  • Conduct pretraining data research, including analysis of data trends, scaling laws, and optimal data mixture strategies
  • Investigate and evaluate potential new sources of training data for large language model development
  • Build research tools and frameworks to analyze experimental results and understand model training dynamics
  • Scale data processing jobs to thousands of machines while maintaining data quality and processing efficiency
  • Design and execute machine learning experiments focused on training data optimization and ablation studies
  • Create interactive visualizations and analysis tools for semantic clusters and patterns in training datasets
  • Optimize pretraining data processing workflows to maximize compute efficiency and model performance
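The scaling-law research described in these responsibilities can be illustrated with a minimal sketch: fitting a saturating power law L(D) = E + B·D^(-β) (loss as a function of training tokens D) to loss measurements. Everything here is synthetic and assumed for illustration: the constants E, B, and β, the token counts, and the simplification that the irreducible loss term E is already known, which makes the fit linear in log space.

```python
import numpy as np

# Synthetic loss-vs-token-count observations (assumed values, for illustration):
# loss follows L(D) = E + B * D**(-beta) with E=1.7, B=400, beta=0.3.
tokens = np.logspace(8, 11, 12)  # 100M to 100B training tokens
E_true, B_true, beta_true = 1.7, 400.0, 0.3
loss = E_true + B_true * tokens ** (-beta_true)

# With the irreducible term E assumed known (in practice it would be estimated
# or grid-searched), the power law is linear in log space:
#   log(L - E) = log(B) - beta * log(D)
log_excess = np.log(loss - E_true)
log_tokens = np.log(tokens)
slope, intercept = np.polyfit(log_tokens, log_excess, 1)

beta_fit = -slope          # recovered exponent
B_fit = np.exp(intercept)  # recovered coefficient
print(f"beta = {beta_fit:.3f}, B = {B_fit:.1f}")
```

On noise-free synthetic data the fit recovers the generating parameters exactly; real pretraining runs would add noise and require estimating E as well.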

Skills & Tools:

  • Large-scale machine learning systems and distributed computing
  • Language modeling with transformer architectures
  • Large-scale ETL (Extract, Transform, Load) data processing
  • High-performance computing and parallel processing frameworks
  • ML experiment design and research methodology
  • Data analysis and visualization tools
  • Software engineering for research infrastructure
  • Understanding of scaling laws and training dynamics
  • Experience with ML competitions and quantitative data analysis
  • Multimodal dataset creation and management
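The "optimal data mixture" work referenced above can be sketched in a few lines: drawing training examples from weighted corpus sources. The source names and mixture weights below are assumptions chosen for illustration, not any real production mixture.

```python
import random

# Hypothetical pretraining mixture: weights up-weight high-quality sources
# relative to their raw size on disk (assumed values, for illustration).
mixture = {"web_crawl": 0.55, "code": 0.20, "books": 0.15, "academic": 0.10}

def sample_source(rng: random.Random) -> str:
    """Draw one corpus source according to the mixture weights."""
    return rng.choices(list(mixture), weights=list(mixture.values()), k=1)[0]

rng = random.Random(42)
draws = [sample_source(rng) for _ in range(10_000)]
for src in mixture:
    print(src, draws.count(src) / len(draws))
```

Over many draws the empirical proportions converge to the mixture weights; in a real pipeline the same weighting would typically be applied at the shard or batch level rather than per example.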

Where This Role Has Appeared:

  • Anthropic (AI Safety Research, San Francisco, CA, $315k-$340k, July 2025)

Variants & Related Titles:

  • LLM Data Research Engineer
  • Training Data Scientist
  • AI Dataset Engineer
  • Pre-training Research Scientist
  • LLM Data Optimization Engineer

Why This Role Is New:

The Research Engineer, Tokens role emerged in 2022-2023 as AI companies discovered that careful curation and optimization of training data could dramatically improve model performance while reducing compute costs. It reflects the realization that, with frontier models costing hundreds of millions of dollars to train, optimizing the data they learn from is as critical as optimizing the models themselves.

Trend Insight:

As LLM training costs soar into the hundreds of millions of dollars, companies are investing heavily in specialized research roles focused on data optimization, recognizing that the right training data mixture can achieve better results with significantly less compute than brute-force scaling.

Seen this role elsewhere? Submit an example or share your story.