About this Opportunity
We are seeking an experienced software engineer to collaborate with AI researchers and contribute to the development of high-quality evaluation and training datasets for Large Language Models (LLMs). The project focuses on curating verifiable software engineering challenges from public GitHub repository histories through a human-in-the-loop process.
Key Responsibilities
* Review and compare model-generated code responses for each task using a structured ranking system (see the hypothetical sketch after this list)
* Evaluate code diffs for correctness, code quality, style, and efficiency
* Provide clear, detailed rationales explaining the reasoning behind each ranking decision
* Maintain high consistency and objectivity across evaluations
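To give a flavor of the judgment involved, below is a minimal, purely hypothetical sketch in Python. It is not taken from the project's actual tasks: it shows two model-generated candidate patches for the same small problem (deduplicating a list while preserving order) and the kind of rationale a reviewer might record when ranking them.

```python
# Hypothetical illustration only -- not an actual project task.
# Two model-generated candidates for the same change, plus the reasoning
# a reviewer might record when ranking them against each other.

from typing import Iterable, List, TypeVar

T = TypeVar("T")


def dedupe_candidate_a(items: Iterable[T]) -> List[T]:
    """Candidate A: preserves order, O(n) using a seen-set (items assumed hashable)."""
    seen = set()
    result: List[T] = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


def dedupe_candidate_b(items: Iterable[T]) -> List[T]:
    """Candidate B: also correct, but membership checks on a list make it O(n^2)."""
    result: List[T] = []
    for item in items:
        if item not in result:
            result.append(item)
    return result


if __name__ == "__main__":
    data = [3, 1, 3, 2, 1]
    assert dedupe_candidate_a(data) == dedupe_candidate_b(data) == [3, 1, 2]
    # Example ranking rationale: both candidates are functionally correct,
    # but A is preferred because its linear-time approach scales to large
    # inputs, whereas B's repeated linear scans are quadratic overall.
```

In practice the evaluated changes arrive as diffs against real repositories rather than standalone functions, but the expectation is the same: rank candidates and justify the ranking in terms of correctness, quality, style, and efficiency.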
Requirements
* 7+ years of professional software engineering experience at top-tier product companies
* Strong fundamentals in software design, coding best practices, and debugging
* Excellent ability to assess code quality, correctness, and maintainability
* Proficient with code review processes and reading diffs in real-world repositories
* Exceptional written communication skills to articulate evaluation rationale clearly
Bonus Points
* Experience in LLM research, developer agents, or AI evaluation projects
* Background in building or scaling developer tools or automation systems
Engagement Details
* Commitment: ~20 hours/week
* Type: Contractor
* Duration: 1 month (with potential extensions)