Role Overview

Lead the design and execution of evaluation frameworks, benchmarks, datasets, and tooling that measure and improve next-generation coding agents. This hands-on role combines AI research, software engineering, and model evaluation to define how frontier coding models are assessed and advanced.

Key Responsibilities

Design and own evaluation frameworks for coding agents, including benchmark specifications, scoring methodologies, rubrics, and quality standards.
Lead end-to-end research initiatives to measure and improve coding model performance across diverse software engineering tasks.
Develop high-quality datasets, golden examples, and evaluation protocols that enable reliable assessment of frontier coding systems.
Analyze model behavior and failure modes, identify systematic weaknesses, and translate findings into actionable improvements for training and evaluation.
Build tooling and infrastructure to support large-scale experimentation, data generation, review workflows, and evaluation pipelines.
Establish best practices for coding-agent assessment, ensuring methodological rigor, reproducibility, and measurement quality.
Partner with researchers, engineers, and applied AI teams to design experiments and evaluate emerging model capabilities.
Contribute to technical reports, benchmark studies, and client-facing research deliverables that communicate model performance and insights.

Qualifications

3+ years of experience in software engineering, machine learning, AI research, evaluation, or related technical disciplines.
Strong software engineering skills, with proficiency in Python, C++, or a comparable programming language.
Experience designing, reviewing, or validating technical assessments, benchmarks, coding tasks, or evaluation methodologies.
Familiarity with large language models, coding agents, reinforcement learning, model evaluation, or related AI systems.
Proven ability to build tooling, automate workflows, and improve technical processes through systematic experimentation.
Strong analytical skills, with the ability to investigate model behavior and derive insights from complex systems.
Excellent written and verbal communication skills, with the ability to present technical findings to diverse audiences.
Comfortable operating in fast-moving research environments with significant ambiguity and evolving priorities.
Preferred, but not required: experience working on frontier AI systems, coding agents, or model evaluation research; track record of driving ambiguous research or technical projects end to end; experience designing benchmarks or datasets at scale; familiarity with agentic workflows, tool use, reinforcement learning, or post-training methodologies; publications, open-source contributions, or demonstrated technical leadership in AI.

Work Terms

Full-time position, fully remote. The role is part of the Coding Research core team and involves hands-on research and engineering work focused on coding-agent evaluation.

Compensation

Annual compensation range: $400,000 to $800,000.

Software Engineer for AI Model Evaluation
