Role Overview

Lead hands-on engineering work to build and validate verifiable software engineering tasks drawn from public repository histories, and evaluate how large language models (LLMs) perform on realistic code problems. The role combines practical repository triage, environment automation, and test quality assessment, and involves working with researchers to expand dataset coverage across languages and difficulty levels.

Key Responsibilities

Analyze and triage GitHub issues across trending open-source libraries, prioritizing tasks suitable for reproducible LLM evaluation.
Set up and configure repositories for evaluation, including Dockerization and development environment automation.
Assess unit test coverage and test quality for candidate repositories and issues.
Run, modify, and debug real-world codebases locally to reproduce bugs and evaluate LLM bug-fixing performance.
Collaborate with researchers to identify repositories and issue types that are challenging for LLMs.
Mentor or lead small teams of junior engineers on project tasks and dataset construction when opportunities arise.

Qualifications

Strong software engineering experience at a senior or tech lead level, with demonstrable work on well-maintained, widely used public repositories, ideally ones with 500 or more stars on GitHub.
Proficiency in at least one of these languages: Python, JavaScript, Java, Go, Rust, C, C++, C#, or Ruby.
Practical experience with Git, Docker, and setting up basic software pipelines and local development environments.
Ability to understand, navigate, and modify complex codebases, and to run and test projects locally.
Comfortable evaluating test suites, measuring coverage, and judging test quality.
Experience contributing to or reviewing open-source projects is a plus.

Nice to Have

Previous participation in LLM research, evaluation projects, or dataset construction.
Experience building or testing developer tools, automation agents, or CI workflows.

Work Terms

Location: fully remote.
Commitment required: 20 hours per week, with some overlap with Pacific Standard Time (PST) working hours.
Employment type: independent contractor assignment; no medical benefits or paid leave are provided.
Contract duration: 3 months, with an expected start date as early as next week.

Eligibility

This engagement is for independent contractors able to work remotely and meet the weekly hour and PST overlap requirements.
No employer-provided medical benefits or paid leave are included with this contractor assignment.

Senior Software Engineer for LLM Evaluation
