About our mission

We're building high-quality evaluation and training datasets to improve how Large Language Models (LLMs) handle realistic software engineering tasks.

Key project focus

This project centers on curating verifiable software engineering challenges from public GitHub repository histories, using a human-in-the-loop process.

Why you'll enjoy this role

* Collaborate directly with AI researchers shaping the future of AI-powered software development.
* Work with high-impact open-source projects and evaluate how LLMs perform on real bugs, issues, and developer tasks.
* Influence the design of datasets that will train and benchmark next-generation LLMs.

What you'll do