Senior Data Scientist : AI Training Data (2-4 Month Contract)
Company : BespokeLabs (VC-backed; founded by IIT & Ivy League alumni)
Location : Remote
Role Type : Contract (2-4 Months)
Time Commitment : 40 hrs / week (Full-time availability required)
Compensation : Hyper-competitive hourly rate (matching top-tier Senior Data Scientist bands)
Experience : 6+ years
About BespokeLabs
BespokeLabs is a premier, VC-backed AI Research lab with an exceptionally talent-dense team of IIT and Ivy League alumni.
We don't just build tooling around AI—we build the massive-scale data systems and reasoning architectures that directly power next-generation models.
Our research shapes the frontier of AI : we've published breakthroughs like GEPA, driven foundational datasets like OpenThoughts, and shipped state-of-the-art models including Bespoke-MiniCheck and Bespoke-MiniChart.
More on our website, bespokelabs.ai :)
Role Overview
We are looking for a high-impact Senior Data Scientist for an intensive 2-4 month sprint.
You will leverage your deep expertise in production-grade machine learning and applied statistics to develop the algorithms and logic that curate and evaluate datasets for advanced AI model training.
This is not a traditional model-building or research role.
We need a seasoned practitioner who has already owned the end-to-end DS lifecycle at scale.
You will use your intuition for feature engineering, statistical validity, and large-scale data processing to programmatically generate, shape, and validate AI training data.
What You Will Do (The Contract)
Algorithm Design : Design and implement custom statistical models and programmatic logic (e.g., anomaly detection, active learning, similarity scoring) to evaluate data quality, complexity, and redundancy at scale.
Hands-on At-Scale Coding : Write scalable PySpark and Python (NumPy / Pandas) code to apply these algorithms across massive datasets, translating experimental logic into reliable, large-scale workflows.
Metric Formulation : Develop custom quantitative metrics and heuristic benchmarks to rigorously assess the fidelity and suitability of data subsets for specific AI training objectives.
Validation & Iteration : Run high-speed validation cycles, analyzing the output of data-curation algorithms to diagnose skew, bias, or noise, and iteratively refining the logic.
High-Level Curation : Apply senior-level domain expertise in predictive modeling and feature engineering to ensure the final training inputs meet the strict standards required for state-of-the-art ML systems.
What You Bring to the Table (Your Past Experience)
To be successful in this contract, you must have a track record of :
The End-to-End DS Lifecycle : Framing problems, modeling, validation, production, and iteration.
Production Ownership : Building and deploying ML and statistical models on large-scale datasets.
Large-Scale Data Processing : Working with Apache Spark to develop scalable feature pipelines and offline training workflows.
Experimentation : Designing and analyzing rigorous experiments (A/B tests, causal inference).
Impact : Translating complex model outputs into clear product and business decisions.
Required Qualifications (Non-Negotiable)
Experience : 6+ years as a Data Scientist or Applied Scientist.
Production Background : Proven ownership of models running in production environments.
Applied Statistics : Strong background in applied statistics and experimentation frameworks.
Core Technical Skills
Languages : Python (NumPy, Pandas, Scikit-learn, PyTorch / TensorFlow) and strong SQL.
Big Data : Apache Spark (PySpark or Spark SQL) for large-scale data processing.
Methodologies : Feature engineering, model evaluation, statistical modeling, and hypothesis testing.
Strong Signals (Highly Valued)
Scale : Models trained on TB-scale datasets.
Domain Specificity : Experience in high-complexity domains such as recommendations, pricing, fraud / risk, search / ranking, or growth & experimentation.
Collaboration : Experience deploying models alongside data engineering pipelines.
Out of Scope (Who Should Not Apply)
BI / reporting-only roles
SQL-only analysts
Research-only ML roles with no production ownership
Early-career profiles