Browse 138 Evaluation jobs hiring now. Companies hiring include METR, Artificial Intelligence Underwriting Company, and Tech Firefly, in locations such as Worcester, Los Angeles, and Cleveland.
METR is seeking experienced researchers and research leads to develop benchmarks, run evaluations, and build infrastructure to measure and mitigate risks from advanced AI systems.
Design and ship production-grade evaluation infrastructure for cutting-edge AI agents while leading customer-facing certifications and shaping product strategy at AIUC.
Lead the technical architecture and cross-domain dependency mapping for a fast-paced, remote contract engagement supporting an academic medical center’s multi-year healthcare technology rollout.
Lead a U.S.-based team to migrate template systems to LLM autoraters and optimize model performance using advanced prompt engineering and evaluation methods.
Welo Data is seeking US-based English speakers to remotely evaluate and rate search results to improve search relevancy and AI performance.
Picogrid seeks a Strategy & Business Operations Lead to design and run the internal systems, metrics, and cross-functional programs that will let the company scale efficiently during rapid growth.
Lead the design and implementation of evaluation infrastructure and observability for enterprise-grade AI agents powering audit and assurance workflows at Fieldguide's San Francisco office.
Apply state-of-the-art AI to financial workflows at Rowspace by building retrieval systems, agentic pipelines, and evaluation frameworks that turn unstructured data into actionable investment insights.
Latitude seeks a PhD AI research intern to build a benchmark library and evaluate SOTA LLM behavior within our story engine, producing publishable results and a public report.
Energy Trust of Oregon seeks an Engineer, Planning & Evaluation to perform measure development, cost-benefit analyses, pilot design, and technical review to support cost-effective energy-efficiency programs.
Help scale Chime's AI-powered Jade assistant by building platform tooling, backend services, and observability systems as a Senior Full-Stack Engineer.
Experienced systems engineering and test & evaluation advisor needed to provide SETA support to the government for verification, test planning, execution, and evaluation of DoD systems.
Northwestern Medicine is hiring a licensed Occupational Therapist (OTR/L) for per-diem inpatient care in Winfield, IL to provide evaluations, treatment, documentation, and interdisciplinary collaboration.
Support ACS’s Employee Wellness program by coordinating and delivering on-site wellness activities across NYC locations while tracking participation and reporting outcomes.
Lead data-driven program performance analysis and provide actionable recommendations to support DoD and civilian federal programs as a Senior Program Management Analyst at One Federal Solution.
GoodAtNumbers is hiring a US-based remote Machine Learning Engineer Intern to push ML research into production by building, evaluating, and deploying reliable LLM-driven features during a paid 12-week summer internship.
Lead the design and delivery of scalable, secure AI-native systems for sophisticated legal customers as a Staff Software Engineer / Architect on Thomson Reuters' CoCounsel FDE team.
Sony AI’s Research Ethics team is hiring a remote Research Intern to work on generative AI ethics, evaluation, and harm-mitigation research with opportunities for publication.
Serve as Foster America's South Carolina Site Lead to coordinate partners, drive implementation of the OPT-In initiative, and translate learning into sustained local impact for families.
Evaluate luxury brand experiences in the Seattle/Bellevue area through short, flexible missions for CXG and help top brands improve service.
Lead the architecture and productionization of Spotify’s shared Agent Engine to power scalable, reliable agent-based experiences across the platform.
Lead the People Development team at National Vision to design and deliver scalable, measurable learning solutions for corporate, retail, manufacturing, and clinical associates.
Lead and build the agentic AI platform that enables pods of engineers and AI agents to safely and reliably deliver production software at scale.
LanguageWire is hiring an AI Engineer to design and productionize LLM-based translation workflows and bridge ML experimentation with production engineering.
Evaluate luxury brand experiences for CXG through flexible in-store or online missions that provide actionable feedback to premium brands.
Work on a mission-driven fintech team to build and ship core AI products (LLM/VLM and evaluation pipelines) that power eligibility and compliance for education savings accounts.
Iambic Therapeutics seeks a Software Engineer II to co-develop and harden ML training, evaluation, and productization workflows that enable AI-driven drug discovery.
Lead and grow an Applied AI engineering team at Mercor to build scalable evaluation and data systems that measurably improve frontier model performance.
Application Engineering Intern at Renesas Hi-Rel to perform lab-based evaluations of power/ADC products, produce technical analysis, and present findings.
Evaluate machine-translated English (US) to Japanese (Japan) song lyrics for meaning, fluency, and cultural accuracy on a flexible, remote freelance project with Welo Data.
Anduril seeks an experienced manager to lead flight test integration and operations for UAS platforms, overseeing system integration, mesh networking, and Flight Test Operations as an RPIC.
Senior NDE Engineer (Radiography Testing) to design, prototype, and deploy advanced radiography and automated inspection solutions to improve manufacturing quality and flight reliability at SpaceX.
Lead the product vision and engineering for clinician-facing AI tools at knownwell, building and operating RAG-based clinical decision support with full product ownership and direct clinician partnership.
Experienced technical product leader needed to own prioritization, quality, and stakeholder alignment for LLM-driven products while staying hands-on with architecture, code reviews, and AI cost optimization.
Help build and deploy production AI agent platforms that power personalized financial advisory workflows for institutional clients at Arta.
Contract freelance raters in the United States will evaluate personalized map and search recommendations using their Google Maps activity history and follow project guidelines to rate relevance and usefulness.
Welo Data is building a flexible, remote contributor network of native English speakers to annotate, evaluate, and create prompts that improve AI systems.
Carilion Clinic is hiring a part-time Community Outreach Specialist to deliver evidence-based pediatric health education and support community partnerships across the Roanoke area.
Evaluate machine-translated English (US) to German (Germany) song lyrics for accuracy, fluency, and cultural appropriateness in a remote freelance role.
Lead Slack's search and AI platform as VP Product to set strategy, drive model and infrastructure decisions, and deliver reliable, scalable AI-powered search and knowledge services for enterprise users.
Lead AbbVie's Neurosciences Search & Evaluation team to identify, assess, and advance high-value external partnering opportunities that strengthen the company’s neuroscience pipeline and strategic goals.
NiCE is hiring a Forward Deployed Engineer to design, ship, and operate production-scale conversational AI agents that solve high-impact enterprise problems.
Montefiore is hiring a licensed Psychologist (PhD/PsyD) to conduct disability-related psychological assessments and clinical consultations for participants in the WeCARE employment-focused program.
Experienced domain experts in Business Operations & Communications or Education and Academic Research are needed for a remote, retainer-based 2‑week role evaluating and crafting prompts for AI writing models with US-contextual standards.
Join an early-stage AI safety startup as a founding Forward Deployed Engineer to design rigorous AI evals, lead customer implementations, and shape product strategy for certification of real-world AI agents.
Work as a freelance luxury brand evaluator for CXG, discreetly assessing boutique and online experiences to help premium brands refine their service.
Serve as the MHPSS Technical Advisor for IRC RAI, providing evidence-based guidance, training, and partnership support to improve mental health and psychosocial services for forcibly displaced populations in the U.S.
Lead and develop a remote evaluation team in WGU’s School of Technology to ensure accurate, scalable competency-based assessment and continuous improvement for Electrical and Computer Engineering programs.
Epoch AI is hiring remote Researchers and Senior Researchers to conduct data-driven investigations, build benchmarks, and forecast AI capabilities and trends.
Visa is hiring a Product Analyst to define and scale generative AI platform capabilities, combining product analytics, prototyping, and cross-functional collaboration to deliver responsible, enterprise-grade AI solutions.
Below 50k*: 4 | 50k-100k*: 9 | Over 100k*: 19