Job Title: ML Data Pipeline Engineer
This role combines systems engineering, data quality automation, and hands-on problem-solving in a production environment. As an ML Data Pipeline Engineer, you will manage the end-to-end pipeline that collects, synchronizes, validates, and prepares IMU sensor and video data for machine learning model training.
The successful candidate will maintain and enhance our multi-source data collection system, which combines IMU sensors (via a mobile app) with synchronized video streams from gym-based cameras. This involves making the video capture software more robust, particularly in how it handles network interruptions, and improving operational monitoring. They will also deploy and monitor services in remote Linux environments using appropriate DevOps practices.
In terms of data quality and validation, the engineer will evolve our Python-based QC engine that validates data pre- and post-annotation. They will also implement checks for IMU-video time synchronization, sensor health, and measurement consistency. Furthermore, they will apply digital signal processing techniques to identify sensor failures, connectivity issues, and measurement irregularities.
The ideal candidate will perform ad-hoc analysis on 1,200+ workout tasks to classify failure modes, identifying whether issues stem from pipeline bugs, sensor problems, or annotation errors. They will prioritize engineering work based on data quality impact and coordinate with the annotation team on fixes.
In terms of tooling and visualization, the engineer will maintain and extend our NextJS UI, which serves annotators, data scientists, and stakeholders. They will create visualizations (Chart.js) for QC metrics and signal analysis. Finally, they will handle integration with the LabelStudio annotation interface.
Requirements:
* Strong Python programming skills, particularly for data processing pipelines.
* Experience with time-series data and digital signal processing.
* Comfortable working in Linux environments and deploying/monitoring remote services.
* Ability to debug complex multi-component systems (sensors, video, networks, sync).
* Data quality mindset: designing validation rules, tracking metrics, investigating anomalies.
* SQL/database experience for managing pipeline metadata.
Highly Valued:
* Video processing experience (RTSP streams, encoding, OCR).
* Working with sensor/IoT data and handling connectivity challenges.
* NextJS or modern web frameworks for data tooling.
* DevOps practices: containerization, monitoring, logging, alerting.
* Experience with annotation pipelines and machine learning training data workflows.
* Background in biomechanics, sports science, or wearable sensors.
Tech Stack:
* Languages: Python (primary), JavaScript/TypeScript (NextJS UI).
* Data: IMU sensor streams, video (RTSP), time-series analysis, DSP.
* Tools: LabelStudio, Chart.js, Linux/bash, OCR libraries.
* Infrastructure: Remote deployment, monitoring systems.
You'll Thrive Here If You:
* Enjoy detective work: diagnosing why data doesn't match expectations.
* Balance pragmatism with quality: shipping improvements while maintaining reliability.
* Communicate well across technical and non-technical stakeholders.
* Can work autonomously in a small, mission-driven team.