About FirstIgnite
FirstIgnite is the AI-powered business development platform for university technology transfer offices (TTOs). We help research institutions turn breakthroughs into partnerships, licenses, and companies by combining deep LLM-driven workflows with the relationships that actually move deals forward. Our product suite spans expert discovery, grants search, and AI-driven outreach — all built on a modern, agentic stack.
We ship fast, we measure everything, and we believe evaluations are the difference between AI features that demo well and AI features that work in production.
The Role
We're hiring an AI Evaluation Engineer to own the quality bar for every LLM-powered feature we ship. You'll design, build, and scale the infrastructure that tells us — with evidence — whether a prompt change, model swap, or agent refactor made things better or worse.
This is a high-leverage role. Every customer-facing AI capability at FirstIgnite flows through your evals. You'll work directly with the Head of Engineering and partner closely with product, applied AI, and the full-stack team to establish evaluation as a first-class discipline across the company.
What You'll Do
Build evaluation infrastructure: Design and maintain eval suites using Promptfoo, LLM-as-judge methodologies, and custom harnesses for features like our expert search system, natural language grants search, and AI SDR agents.
Define what 'good' means: Partner with product and domain experts to translate fuzzy customer outcomes ("does this surface the right principal investigator?") into precise, measurable rubrics (see the judge sketch after this list).
Own the feedback loop: Instrument production traffic, curate golden datasets from real customer interactions, and build pipelines that turn user behavior into regression tests.
Ship quickly under uncertainty: We routinely run 48-hour eval sprints for greenfield features with no production traffic. You'll be comfortable bootstrapping quality signal from scratch.
Model and prompt evaluation: Run rigorous A/B comparisons across models (OpenAI, Anthropic, open-weight), prompt strategies, and agent architectures. Quantify tradeoffs between cost, latency, and quality.
Agent evaluation: Help us measure multi-step agent behavior built on the OpenAI Agents SDK, Vercel AI SDK, and Temporal Cloud — including tool-use correctness, trajectory quality, and end-to-end task completion.
Raise the floor for the team: Create templates, documentation, and tooling so every engineer can write and run evals as part of normal development. Evals should feel as natural as unit tests.
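To make the eval work concrete, here is a minimal sketch of the kind of LLM-as-judge check described above. It is illustrative only, not FirstIgnite's actual harness: the rubric text, the golden_set examples, and the judge_search_result helper are hypothetical, and it assumes the OpenAI Python SDK with a small judge model.

```python
# Minimal LLM-as-judge sketch (illustrative only; not FirstIgnite's actual harness).
# Assumes the OpenAI Python SDK; the rubric, golden set, and helper names are hypothetical.
from openai import OpenAI

client = OpenAI()

RUBRIC = """You are grading an expert-search result.
Score 1 if the surfaced researcher plausibly matches the query's field and intent,
otherwise score 0. Reply with a single digit: 1 or 0."""

# A tiny hand-curated golden set; in practice this would come from real customer queries.
golden_set = [
    {"query": "battery electrolyte degradation", "result": "Prof. A. Lee, solid-state electrolytes"},
    {"query": "battery electrolyte degradation", "result": "Prof. B. Cruz, medieval literature"},
]

def judge_search_result(query: str, result: str) -> int:
    """Ask a judge model to grade one (query, result) pair against the rubric."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Query: {query}\nResult: {result}"},
        ],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip()[0])

if __name__ == "__main__":
    scores = [judge_search_result(ex["query"], ex["result"]) for ex in golden_set]
    print(f"pass rate: {sum(scores)}/{len(scores)}")
```

In practice a framework like Promptfoo would manage the dataset, assertions, and judge-bias checks; the point is simply that a fuzzy outcome becomes a scored, repeatable test.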
Requirements
3+ years of engineering experience, with at least 1 year focused on LLM/ML evaluation, applied AI, or data quality systems.
Hands-on experience with LLM evaluation frameworks — Promptfoo, Braintrust, LangSmith, OpenAI Evals, DeepEval, or equivalent in-house tooling.
Strong grasp of LLM-as-judge methodology, including its failure modes (position bias, verbosity bias, judge-model drift) and how to mitigate them.
Statistical literacy — you know the difference between a real regression and noise, and you can design experiments that answer the question actually being asked (see the sketch after this list).
Product instincts. You can sit with a customer success call transcript, identify the three failure modes that matter, and ship an eval for each by end of week.
Strong written communication. Evals are useless if the engineers shipping features don't trust or read the results.
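On the "real regression versus noise" point, here is a small sketch of a paired bootstrap over per-example eval scores. The scores and dataset size are made up; it shows the shape of the analysis, not a prescribed method.

```python
# Paired bootstrap sketch for "is this prompt change a real regression or just noise?"
# The per-example pass/fail scores below are made up for illustration.
import random

baseline = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1]   # old prompt, per golden example
candidate = [1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1]  # new prompt, same examples

def bootstrap_win_rate(a, b, iterations=10_000, seed=0):
    """Resample paired examples and return the fraction of resamples where b beats a."""
    rng = random.Random(seed)
    n = len(a)
    wins = 0
    for _ in range(iterations):
        idx = [rng.randrange(n) for _ in range(n)]
        delta = sum(b[i] - a[i] for i in idx) / n
        if delta > 0:
            wins += 1
    return wins / iterations

if __name__ == "__main__":
    observed = (sum(candidate) - sum(baseline)) / len(baseline)
    p_better = bootstrap_win_rate(baseline, candidate)
    print(f"observed delta: {observed:+.2f}, fraction of resamples where candidate wins: {p_better:.2f}")
```

With only a dozen examples the uncertainty is wide, which is exactly why eval set size and pairing matter when calling a change a regression.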
Preferred Qualifications
Experience evaluating retrieval systems (RAG, hybrid search, reranking) — especially over structured or semi-structured domains like research, grants, or patents (see the metrics sketch after this list).
Exposure to agent orchestration frameworks (Temporal, LangGraph, OpenAI Agents SDK) and the specific challenges of evaluating multi-step, tool-using systems.
Background in information retrieval, search relevance, or a research-adjacent domain.
Experience building internal tooling or dashboards that non-engineers (PMs, domain experts) actually use to label and review model outputs.
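For the retrieval bullet above, a hypothetical sketch of the basic offline metrics (recall@k and mean reciprocal rank) over a labeled query set; the query IDs, document IDs, and relevance labels are placeholders, not real data.

```python
# Recall@k and MRR sketch for retrieval evaluation (placeholder data, no real index).
labeled_queries = [
    # (query id, ranked document ids returned by the system, set of relevant doc ids)
    ("q1", ["d3", "d7", "d1"], {"d7"}),
    ("q2", ["d2", "d9", "d4"], {"d5"}),
    ("q3", ["d8", "d2", "d6"], {"d8", "d6"}),
]

def recall_at_k(ranked, relevant, k):
    """Fraction of relevant docs that appear in the top-k results."""
    return len(set(ranked[:k]) & relevant) / len(relevant)

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant doc, or 0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1 / rank
    return 0.0

if __name__ == "__main__":
    recalls = [recall_at_k(ranked, rel, k=3) for _, ranked, rel in labeled_queries]
    rrs = [reciprocal_rank(ranked, rel) for _, ranked, rel in labeled_queries]
    print(f"recall@3: {sum(recalls) / len(recalls):.2f}")
    print(f"MRR: {sum(rrs) / len(rrs):.2f}")
```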
Why This Role
You'll be the first dedicated evals hire. The scope, standards, and tooling are yours to define.
AI quality is existential for our product. This isn't a compliance role tucked into a corner — it's directly on the critical path to revenue.
Small, senior team. ~10 engineers, distributed globally, with a strong bias toward shipping and measuring.
Direct access to real-world, high-stakes LLM use cases — research discovery, grants, outbound — across a customer base that deeply values accuracy.
This is a fully remote position. We have engineers across the Americas, Europe, and Africa. Overlap with US working hours is expected.