**About Our Initiative**
We're pushing the boundaries of AI-assisted software development, building the datasets that will train and benchmark the next generation of AI systems.
**Project Overview**
We're building high-quality evaluation and training datasets to improve how Large Language Models (LLMs) interact with realistic software engineering tasks. A key focus of this project is curating verifiable software engineering challenges from public GitHub repository histories using a human-in-the-loop process.
**Why This Role Is Unique**
* You'll collaborate directly with AI researchers shaping the future of AI-powered software development.
* You'll work with high-impact open-source projects and evaluate how LLMs perform on real bugs, issues, and developer tasks.
* You'll influence dataset design that will train and benchmark next-gen LLMs.
* Day-to-day tasks include (see the illustrative sketch after this list):
  * Reviewing and comparing 3–4 model-generated code responses for each task using a structured ranking system.
  * Evaluating code diffs for correctness, code quality, style, and efficiency.
  * Providing clear, detailed rationales explaining the reasoning behind each ranking decision.
  * Maintaining high consistency and objectivity across evaluations.
  * Collaborating with the team to identify edge cases and ambiguities in model behavior.
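
For a sense of the deliverable, here is a minimal sketch of what one structured ranking record might capture: a best-to-worst ordering of model responses plus the written rationale behind it. The posting doesn't specify the actual tooling or schema, so every name and field below is hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class RankingDecision:
    """One evaluation record: an ordered ranking of model-generated
    responses for a task, with per-response rationale notes.
    Hypothetical schema for illustration; the project's real format
    may differ."""
    task_id: str                                              # e.g. a repo/issue identifier
    ranking: list[str] = field(default_factory=list)          # response IDs, best first
    rationale: dict[str, str] = field(default_factory=dict)   # per-response notes

# Example: ranking four model-generated patches for one task.
decision = RankingDecision(
    task_id="example-repo/issue-42",
    ranking=["resp_b", "resp_d", "resp_a", "resp_c"],
    rationale={
        "resp_b": "Correct fix, matches repository style, adds a regression test.",
        "resp_c": "Compiles, but misses the empty-input edge case and duplicates a helper.",
    },
)
```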
**Required Skills**
* 7+ years of professional software engineering experience at top-tier product companies.
* Strong fundamentals in software design, coding best practices, and debugging.
* Excellent ability to assess code quality, correctness, and maintainability.
* Proficient with code review processes and reading diffs in real-world repositories.
* Exceptional written communication skills to articulate evaluation rationale clearly.
* Prior experience with LLM-generated code or evaluation work is a plus.
**Bonus Points**
* Experience in LLM research, developer agents, or AI evaluation projects.
* Background in building or scaling developer tools or automation systems.
**Engagement Details**
* Commitment: ~20 hours/week.
* Type: Contractor.
* Duration: 1 month (potential extensions based on performance and fit).