We're looking for a Senior ML/AI Engineer to own and evolve our LLM-powered user experience. You'll work directly with our technical co-founder to build, optimize, and monitor agent systems that parse workout descriptions, provide scaling recommendations, and enable conversational data retrieval - all with production-grade accuracy and speed.
This is a hands-on role focused on the ML/AI engineering side: prompt engineering, model optimization, agent orchestration, and continuous improvement based on real-world usage patterns.
What You’ll Do
Core Responsibilities
* Own the workout parsing system: improve the accuracy of our fine-tuned model (currently Qwen-based) that converts natural language workout descriptions into structured schemas (see the illustrative sketch after this list)
* Design and implement agent workflows for workout scaling recommendations and performance tracking
* Build observability workflows using Langfuse to identify and systematically address model performance issues
* Optimize agent response latency while maintaining accuracy across our tool-based reasoning system
* Collaborate on agent architecture decisions, including potential migration to frameworks like DSPy
* Ship production features: workout entry system, scaling recommendations, and score reporting
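To give a concrete sense of the parsing target, here is a minimal, hypothetical sketch of the kind of structured schema a workout description might map to. The field names, the use of Pydantic, and the example workout are illustrative assumptions for this posting, not our production schema:

```python
# Hypothetical sketch only: field names and structure are illustrative,
# not the actual production schema.
from typing import Literal, Optional
from pydantic import BaseModel, Field

class Movement(BaseModel):
    name: str                        # e.g. "pull-up", "thruster"
    reps: Optional[int] = None       # reps per round, if specified
    load_lb: Optional[float] = None  # prescribed load, if any

class Workout(BaseModel):
    format: Literal["AMRAP", "For Time", "EMOM", "Strength"]
    time_cap_minutes: Optional[int] = None
    rounds: Optional[int] = None
    movements: list[Movement] = Field(default_factory=list)

# "Cindy": 20-minute AMRAP of 5 pull-ups, 10 push-ups, 15 air squats
example = Workout(
    format="AMRAP",
    time_cap_minutes=20,
    movements=[
        Movement(name="pull-up", reps=5),
        Movement(name="push-up", reps=10),
        Movement(name="air squat", reps=15),
    ],
)
print(example.model_dump_json(indent=2))
```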
What We’re Looking For
Required
* 5+ years of ML/AI engineering experience with at least 2 years working with LLMs in production
* Strong prompt engineering and model optimization skills
* Experience building and deploying agent systems with tools/functions
* Proven ability to use observability platforms to diagnose and improve model performance
* Experience with model fine-tuning (any framework/approach)
* Strong Python programming skills
* Active CrossFit participant who understands standard movements and workout structures
Strongly Preferred
* Experience with agent orchestration frameworks (DSPy, LlamaIndex, or similar)
* Background in production ML operations and monitoring
* Experience with Modal.com or similar serverless ML platforms
* Track record of iteratively improving LLM systems based on user feedback and metrics
* Experience fine-tuning open-source LLMs similar to our current Qwen-based model
Success in First 6 Months
* Ship workout entry system with improved parsing accuracy
* Launch basic workout scaling recommendations
* Implement user score reporting and retrieval
* Establish robust monitoring workflows to catch and address model failures and poor user experiences
* Contribute to agent architecture decisions as we scale