AI Engineer, Quality (Evals)

About Us

Fieldguide is establishing a new state of trust for global commerce and capital markets by automating and streamlining the work of assurance and audit practitioners, specifically within cybersecurity, privacy, and financial audit. Put simply, we build software for the people who enable trust between businesses.

We’re based in San Francisco, CA, but built as a remote-first company that enables you to do your best work from anywhere. We're backed by top investors including Growth Equity at Goldman Sachs Alternatives, Bessemer Venture Partners, 8VC, Floodgate, Y Combinator, DNX Ventures, Global Founders Capital, Justin Kan, Elad Gil, and more.

We value diversity — in backgrounds and in experiences. We need people from all backgrounds and walks of life to help build the future of audit and advisory. Fieldguide’s team is inclusive, driven, humble and supportive. We are deliberate and self-reflective about the kind of team and culture that we are building, seeking teammates that are not only strong in their own aptitudes but care deeply about supporting each other's growth.

As an early-stage startup employee, you’ll have the opportunity to build out the future of business trust. We make audit practitioners’ lives easier by bringing together up to 50% of their work and giving them better work-life balance. If you share our values and enthusiasm for building a great culture and product, you will find a home at Fieldguide.

About the Role

Fieldguide is building AI agents for the most complex audit and advisory workflows. We're a San Francisco-based Vertical AI company building in a $100B+ market undergoing rapid transformation. Over 50 of the top 100 accounting and consulting firms trust us to power their most mission-critical work. We're backed by Bessemer Venture Partners, 8VC, Floodgate, Y Combinator, Elad Gil, and other top-tier investors.

As an AI Engineer, Quality, you will own the evaluation infrastructure that ensures our AI agents perform reliably at enterprise scale. This role is 100% focused on making evaluations a first-class engineering capability: building the unified platform, automated pipelines, and production feedback loops that let us evaluate any new model against all critical workflows within hours. You'll work at the intersection of ML engineering, observability, and quality assurance to ensure our agents meet the rigorous standards our customers demand.

We're hiring across all levels. We'll calibrate seniority during interviews based on your background and what you're looking to own. This role is for engineers who value in-person collaboration at our San Francisco, CA office.

What You'll Own

Measurable AI Agents

Design and build a unified evaluation platform that serves as the single source of truth for all of our agentic systems and audit workflows
Build observability systems that surface agent behavior, trace execution, and failure modes in production, and feedback loops that turn production failures into first-class evaluation cases
Own the evaluation infrastructure stack, including integration with LangSmith and LangGraph
Translate customer problems into concrete agent behaviors and workflows
Integrate and orchestrate LLMs, tools, retrieval systems, and logic into cohesive, reliable agent experiences

Rapid Model Evaluation

Build automated pipelines that evaluate new models against all critical workflows within hours of release
Design evaluation harnesses for our most complex agentic systems and workflows
Implement comparison frameworks that measure effectiveness, consistency, latency, and cost across model versions
Design guardrails and monitoring systems that catch quality regressions before they reach customers

AI-Native Engineering Execution

Use AI as core leverage in how you design, build, test, and iterate
Prototype quickly to resolve uncertainty, then harden systems for enterprise-grade reliability
Build evaluations, feedback mechanisms, and guardrails so agents improve over time
Work with SMEs and ML Engineers to create evaluation datasets by curating production traces
Design prompts, retrieval pipelines, and agent orchestration systems that perform reliably at scale

Ownership of Quality and Large Product Areas

Define and document evaluation standards, best practices, and processes for the engineering organization
Advocate for evaluation-driven development and make it easy for the team to write and run evals
Partner with product and ML engineers to integrate evaluation requirements into agent development from day one
Take full ownership of large product areas rather than executing on narrow tasks

Who You Are

You are an engineer who believes that evaluations are foundational to building reliable AI systems, not a nice-to-have. The following operating principles should resonate with you:

Evaluation-first mindset: You understand that for an AI company, not being able to evaluate a new model quickly is unacceptable
AI-native instincts: You treat LLMs, agents, and automation as fundamental building blocks and parts of the craft of engineering
Data-driven rigor: You make decisions based on metrics and are obsessed with measuring what matters
Production-oriented: You understand that evaluations must work on real production behavior, not just offline datasets
Strong product judgment: You can decide what matters and why, not just how to implement it, without waiting for guidance
Bias to building: You move fast and build working systems rather than perfect specifications

Experience

We care more about capability and trajectory than years on a resume, but most strong candidates will have:

Multiple years of experience shipping production software in complex, real-world systems
Experience with TypeScript, React, Python, and Postgres
Built and deployed LLM-powered features serving production traffic
Implemented evaluation frameworks for model outputs and agent behaviors
Designed observability or tracing infrastructure for AI/ML systems
Worked with vector databases, embedding models, and RAG architectures
Experience with evaluation platforms (LangSmith, Langfuse, or similar)
Comfort operating in ambiguity and taking responsibility for outcomes
Deep empathy for professional-grade, mission-critical software (experience with audit and accounting workflows is not required)

What Should Excite You

Agent reliability at enterprise scale: Building systems that professionals depend on
Balancing automation with human oversight: Knowing when to automate and when to surface decisions to experts
Production feedback loops: Turning real-world agent failures into systematic improvements
Explaining AI decisions: Making all forms of AI outputs and agent reasoning transparent and trustworthy
Evaluation for nuanced domains: Structuring data and feedback for workflows where ground truth requires expert judgment
High-impact visibility: Your work directly enables leadership to confidently communicate AI quality to the board and customers

More about Fieldguide:

Fieldguide is a values-based company. Our values are:

Fearless - Inspire & break down seemingly impossible walls.
Fast - Launch fast with excellence, iterate to perfection.
Lovable - Deliver happiness & 11-star experiences.
Owners - Execute & run the business with ownership.
Win-win - Create mutual value & earn trust for life.
Inclusive - Scale the best ideas with inclusive teams.

Some of our benefits include:

Competitive compensation packages with meaningful ownership
Flexible PTO
401k
Wellness benefits, including a bundle of free therapy sessions
Technology & Work from Home reimbursement
Flexible work schedules

Average salary estimate

$200,000 / year (est.)

Min: $160,000

Max: $240,000

If an employer mentions a salary or salary range on their job, we display it as an "Employer Estimate". If a job has no salary data, Rise displays an estimate if available.

Fieldguide

Increase trust in commerce and capital markets by building superpowers for assurance and advisory practitioners.

FUNDING

Series A

DEPARTMENTS

Software Engineering

SENIORITY LEVEL REQUIREMENT

Senior Level

TEAM SIZE