About this role
This role focuses on advancing the evaluation and development of cutting-edge coding agents. You will operate at the intersection of AI research, software engineering, and model evaluation, designing the benchmarks, methodologies, and data systems that shape how next-generation coding models are measured and improved.
Key Responsibilities
- Design and own evaluation frameworks for coding agents, including benchmark specifications, scoring methodologies, rubrics, and quality standards.
- Lead end-to-end research initiatives aimed at measuring and enhancing coding model performance across various software engineering tasks.
- Develop high-quality datasets, golden examples, and evaluation protocols that facilitate reliable assessment of frontier coding systems.
- Analyze model behavior and failure modes, identifying systematic weaknesses and translating findings into actionable improvements for training and evaluation.
- Build tooling and infrastructure that support large-scale experimentation, data generation, review workflows, and evaluation pipelines.
- Establish best practices for coding-agent assessment, ensuring methodological rigor, reproducibility, and measurement quality.
- Collaborate closely with researchers, engineers, and applied AI teams to design experiments and evaluate emerging model capabilities.
- Contribute to technical reports, benchmark studies, and client-facing research initiatives that communicate model performance and insights.
Qualifications
- Strong software engineering background with expertise in Python, C++, or comparable programming languages.
- 3+ years of experience in software engineering, machine learning, AI research, evaluation, or related technical disciplines.
- Experience designing, reviewing, or validating technical assessments, benchmarks, coding tasks, or evaluation methodologies.
- Familiarity with large language models, coding agents, reinforcement learning, model evaluation, or related AI systems.
- Proven ability to build tooling, automate workflows, and enhance technical processes through systematic experimentation.
- Strong analytical skills with the capacity to investigate model behavior and derive insights from complex technical systems.
- Excellent written and verbal communication skills, with the ability to clearly articulate technical findings to diverse audiences.
- Comfortable operating in fast-paced research environments with significant ambiguity and evolving priorities.
Full-time, remote position.
Compensation
Annual salary ranges from $220,000 to $500,000.
Eligibility
Open to candidates with the required skills and experience, regardless of location.