Real Jobs. Real Change. See What's Next.

A living directory of real jobs that didn't exist 5 years ago. Curated for leaders, builders, and the curious.

Research Engineer, Tokens

Optimizes massive training datasets and investigates scaling laws to improve the efficiency and performance of large language model development.

AI Safety
LLM Research

Key Responsibilities:

  • Conduct pretraining data research, including analysis of data trends, scaling laws, and optimal data mixture strategies
  • Investigate and evaluate potential new sources of training data for large language model development
  • Build research tools and frameworks to analyze experimental results and understand model training dynamics
  • Scale data processing jobs to thousands of machines while maintaining data quality and processing efficiency
  • Design and execute machine learning experiments focused on training data optimization and ablation studies
  • Create interactive visualizations and analysis tools for semantic clusters and patterns in training datasets
  • Optimize pretraining data processing workflows to maximize compute efficiency and model performance
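The scaling-law research described in these responsibilities can be illustrated with a minimal sketch: fitting a saturating power law L(D) = E + B·D^(-β) (loss as a function of training tokens D) to loss measurements. Everything here is synthetic and assumed for illustration: the constants E, B, and β, the token counts, and the simplification that the irreducible loss term E is already known, which makes the fit linear in log space.

```python
import numpy as np

# Synthetic loss-vs-token-count observations (assumed values, for illustration):
# loss follows L(D) = E + B * D**(-beta) with E=1.7, B=400, beta=0.3.
tokens = np.logspace(8, 11, 12)  # 100M to 100B training tokens
E_true, B_true, beta_true = 1.7, 400.0, 0.3
loss = E_true + B_true * tokens ** (-beta_true)

# With the irreducible term E assumed known (in practice it would be estimated
# or grid-searched), the power law is linear in log space:
#   log(L - E) = log(B) - beta * log(D)
log_excess = np.log(loss - E_true)
log_tokens = np.log(tokens)
slope, intercept = np.polyfit(log_tokens, log_excess, 1)

beta_fit = -slope          # recovered exponent
B_fit = np.exp(intercept)  # recovered coefficient
print(f"beta = {beta_fit:.3f}, B = {B_fit:.1f}")
```

On noise-free synthetic data the fit recovers the generating parameters exactly; real pretraining runs would add noise and require estimating E as well.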

Skills & Tools:

  • Large-scale machine learning systems and distributed computing
  • Language modeling with transformer architectures
  • Large-scale ETL (Extract, Transform, Load) data processing
  • High-performance computing and parallel processing frameworks
  • ML experiment design and research methodology
  • Data analysis and visualization tools
  • Software engineering for research infrastructure
  • Understanding of scaling laws and training dynamics
  • Experience with ML competitions and quantitative data analysis
  • Multimodal dataset creation and management
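The "optimal data mixture" work referenced above can be sketched in a few lines: drawing training examples from weighted corpus sources. The source names and mixture weights below are assumptions chosen for illustration, not any real production mixture.

```python
import random

# Hypothetical pretraining mixture: weights up-weight high-quality sources
# relative to their raw size on disk (assumed values, for illustration).
mixture = {"web_crawl": 0.55, "code": 0.20, "books": 0.15, "academic": 0.10}

def sample_source(rng: random.Random) -> str:
    """Draw one corpus source according to the mixture weights."""
    return rng.choices(list(mixture), weights=list(mixture.values()), k=1)[0]

rng = random.Random(42)
draws = [sample_source(rng) for _ in range(10_000)]
for src in mixture:
    print(src, draws.count(src) / len(draws))
```

Over many draws the empirical proportions converge to the mixture weights; in a real pipeline the same weighting would typically be applied at the shard or batch level rather than per example.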

Where This Role Has Appeared:

  • Anthropic (AI Safety Research, San Francisco, CA, $315k-$340k, July 2025)

Variants & Related Titles:

  • LLM Data Research Engineer
  • Training Data Scientist
  • AI Dataset Engineer
  • Pre-training Research Scientist
  • LLM Data Optimization Engineer

Why This Role Is New:

The Research Engineer, Tokens role emerged in 2022-2023 as AI companies discovered that careful curation and optimization of training data could dramatically improve model performance while reducing compute costs. It reflects the realization that, with frontier models costing hundreds of millions of dollars to train, optimizing the data they learn from is as critical as optimizing the models themselves.

Trend Insight:

As LLM training costs soar into the hundreds of millions of dollars, companies are investing heavily in specialized research roles focused on data optimization, recognizing that the right training data mixture can achieve better results with significantly less compute than brute-force scaling.

Seen this role elsewhere? Submit an example or share your story.