Role Overview

Author and verify golden reference solutions for the CritPt benchmark (arXiv:2509.26574v3). Depending on the role, you will solve research-level problems end-to-end, audit peer submissions, or adjudicate between parallel solution attempts. The result is 100% human-verified reference data used to evaluate large language models on frontier physics reasoning.

Physics Subdomains Covered

High Energy Physics and Mathematical Physics
Biophysics and Statistical Physics
Condensed Matter and AMO
Gravitation, Cosmology, Astrophysics
Quantum Information
Optical Properties of Materials
Magnetic Materials
Measurements in Quantum Mechanics

Key Responsibilities

Solve research-level physics challenges end-to-end, producing verifiable derivations, working code, and citations to peer-reviewed literature
Decompose complex challenges into standalone checkpoint sub-problems that require genuine physical reasoning
Author Python answer templates that include auto-grading functions for symbolic or numerical answers
Audit submitted solutions for correctness, scope, and methodological soundness, providing actionable feedback across iterations
Adjudicate between parallel solver attempts and determine which solution becomes the golden reference
Document chain-of-thought reasoning, specify error tolerances, list equivalent symbolic forms, and produce verification test cases
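
As an illustration of the answer-template and auto-grading responsibilities above, here is a minimal sketch using only the Python standard library. It is not the benchmark's actual template format: the names `answer`, `reference`, `grade_numeric`, and `grade_expression`, the placeholder physics expression, and the tolerance values are all hypothetical. It grades a numerical answer against a stated tolerance and accepts an equivalent symbolic form by comparing both expressions at random sample points. A real template might instead use SymPy for an exact symbolic check.

```python
import math
import random


def answer(omega: float, t: float) -> float:
    """Submitted closed form (placeholder expression for illustration)."""
    return 2.0 * math.sin(omega * t) * math.cos(omega * t)


def reference(omega: float, t: float) -> float:
    """Golden reference, written in an equivalent form of the same result."""
    return math.sin(2.0 * omega * t)


def grade_numeric(submitted: float, expected: float,
                  rel_tol: float = 1e-6, abs_tol: float = 1e-12) -> bool:
    """Accept a numerical answer that lies within the documented tolerance."""
    return math.isclose(submitted, expected, rel_tol=rel_tol, abs_tol=abs_tol)


def grade_expression(candidate, golden, n_samples: int = 50, seed: int = 0,
                     rel_tol: float = 1e-9, abs_tol: float = 1e-9) -> bool:
    """Accept a candidate expression that matches the golden one at every
    randomly sampled point. This is a numerical equivalence test, not a proof."""
    rng = random.Random(seed)  # fixed seed keeps grading reproducible
    for _ in range(n_samples):
        omega = rng.uniform(0.1, 10.0)  # assumed parameter range
        t = rng.uniform(0.0, 5.0)       # assumed parameter range
        if not math.isclose(candidate(omega, t), golden(omega, t),
                            rel_tol=rel_tol, abs_tol=abs_tol):
            return False
    return True


if __name__ == "__main__":
    # Verification test cases: one equivalent form and one wrong form.
    print(grade_expression(answer, reference))
    print(grade_expression(lambda w, t: math.sin(w * t), reference))
```

Sampling with a fixed seed keeps the grader deterministic, and the explicit `rel_tol` and `abs_tol` arguments make the error tolerance part of the documented reference rather than an implicit choice.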

Qualifications

Solver: PhD or postdoc in the relevant subfield preferred; senior PhD student at minimum
Auditor: postdoc or junior professor in the relevant subfield preferred; PhD at minimum
Adjudicator: full professor or industry research PI in the relevant subfield preferred; senior postdoc or junior professor at minimum
Hands-on familiarity with at least two canonical methods in the target subfield, demonstrable via publications; broader coverage is strongly preferred
3 to 5 representative publications provided as arXiv IDs or DOIs, ideally within the last ~5 years and in the target subfield
Working proficiency with LaTeX, Python, Jupyter, and SymPy
Strong written English, CEFR B2 or higher; native or near-native preferred

Work Terms

Remote, asynchronous work
Hourly engagement, with an expected commitment of approximately 10 hours per week
Work is organized in task pools, with each task pool requiring sustained participation across an 8 to 10 week window

Compensation

Pay range is $80 to $135 per hour, determined by the specific role (solver, auditor, adjudicator) and demonstrated expertise.

Eligibility

Candidates must meet the listed qualification thresholds for the role they apply to and be able to commit to the stated weekly hours and the 8 to 10 week task pool window. Remote participation is supported.

Physics Researcher for AI Model Evaluation
