Senior Data Scientist : AI Training Data (2-4 Months Contract)
Company : BespokeLabs (VC-backed; founded by IIT & Ivy League alumni)
Location : Remote
Role Type : Contract (2-4 Months)
Time Commitment : 40 hrs / week (Full-time availability required)
Compensation : Hyper-competitive hourly rate (matching top-tier Senior Data Scientist bands)
Experience : 6+ years
About BespokeLabs
BespokeLabs is a premier, VC-backed AI research lab with an exceptionally talent-dense team of IIT and Ivy League alumni. We don’t just build tooling around AI; we build the massive-scale data systems and reasoning architectures that directly power next-generation models. Our research shapes the frontier of AI: we’ve published breakthroughs like GEPA, driven foundational datasets like OpenThoughts, and shipped state-of-the-art models including Bespoke-MiniCheck and Bespoke-MiniChart. More on our website : bespokelabs.ai
Role Overview
We are looking for a high-impact Senior Data Scientist for an intensive 2-4 month sprint. You will leverage your deep expertise in production-grade machine learning and applied statistics to develop the algorithms and logic that curate and evaluate datasets for advanced AI model training.
This is not a traditional model-building or research role. We need a seasoned practitioner who has already owned the end-to-end DS lifecycle at scale. You will use your intuition for feature engineering, statistical validity, and large-scale data processing to programmatically generate, shape, and validate AI training data.
What You Will Do (The Contract)
Algorithm Design : Design and implement custom statistical models and programmatic logic (e.g., anomaly detection, active learning, similarity scoring) to evaluate data quality, complexity, and redundancy at scale.
Hands‑on At‑Scale Coding : Write scalable PySpark and Python (NumPy / Pandas) code to apply these algorithms across massive datasets, translating experimental logic into reliable, large-scale workflows.
Metric Formulation : Develop custom quantitative metrics and heuristic benchmarks to rigorously assess the fidelity and suitability of data subsets for specific AI training objectives.
Validation & Iteration : Run high-speed validation cycles, analyzing the output of data‑curation algorithms to diagnose skew, bias, or noise, and iteratively refining the logic.
High‑Level Curation : Apply senior‑level domain expertise in predictive modeling and feature engineering to ensure the final training inputs meet the strict standards required for state‑of‑the‑art ML systems.
What You Bring to the Table (Your Past Experience)
The End‑to‑End DS Lifecycle : Framing problems, modeling, validation, production, and iteration.
Production Ownership : Building and deploying ML and statistical models on large‑scale datasets.
Large‑Scale Data Processing : Working with Apache Spark to develop scalable feature pipelines and offline training workflows.
Experimentation : Designing and analyzing rigorous experiments (A / B tests, causal inference).
Impact : Translating complex model outputs into clear product and business decisions.
Required Qualifications (Non‑Negotiable)
Experience : 6+ years as a Data Scientist or Applied Scientist.
Production Background : Proven ownership of models running in production environments.
Applied Statistics : Strong background in applied statistics and experimentation frameworks.
Core Technical Skills
Languages : Python (NumPy, Pandas, Scikit‑learn, PyTorch / TensorFlow) and strong SQL.
Big Data : Apache Spark (PySpark or Spark SQL) for large‑scale data processing.
Methodologies : Feature engineering, model evaluation, statistical modeling, and hypothesis testing.
Strong Signals (Highly Valued)
Scale : Models trained on TB‑scale datasets.
Domain Specificity : Experience in high‑complexity domains such as recommendations, pricing, fraud / risk, search / ranking, or growth and experimentation.
Collaboration : Experience deploying models alongside data engineering pipelines.
Out of Scope (Who Should Not Apply)
BI / reporting‑only roles
SQL‑only analysts
Research‑only ML roles with no production ownership
Early‑career profiles