Job Description

Senior QA Engineer — AI Systems & Platform (Contract)

Position Overview

This is a hands-on contract QA role. Working directly with the founding engineering team, you'll write test code, design evaluation pipelines, and set the quality bar before the platform reaches a client. Quality is not just about whether buttons work: you are validating whether AI-generated analysis and agent recommendations are accurate enough to show to a CFO or CEO.

The Challenge: QA for AI Is a Different Problem

Traditional QA assumes deterministic outputs. AI agents don't give you that. You will be validating quality in an environment where:

92 AI agents coordinate across business functions; agent outputs must be accurate, auditable, and aligned with human-in-the-loop governance at every critical step.
Multi-model routing (Claude, GPT, and others) means the same input can produce different outputs depending on which model handles it, and all of them need to meet the same quality bar.
The Company X-Ray is the highest-stakes deliverable: a detailed analysis of a client's operations backed by their own data. Every finding must be reliable before it goes in front of a leadership team.
End users are CEOs and operations leaders who have never used a terminal. A confusing output or a wrong recommendation doesn't just create a bug ticket; it kills adoption.

What You'll Own & Test

Establish the Testing Foundation

Establish the testing framework: unit, integration, end-to-end, and AI-specific evaluation pipelines using Playwright and Vitest.
Define quality standards, test coverage requirements, and documentation practices in partnership with the Lead Engineer.
Audit the existing platform and identify the highest-risk surfaces before the next client deployment.

AI Agent & Knowledge Graph Testing

Design evaluation frameworks for non-deterministic LLM outputs — including prompt regression testing, model drift detection, and output quality scoring.
Build automated test suites for the agent orchestration layer, including governance-agent audit-trail integrity and human-override behavior.
Validate the Company Brain (Memgraph + Qdrant) for data accuracy, retrieval quality, and failure modes under real enterprise data, including entity resolution across systems and temporal data patterns.
Test the Analysis Engine pipeline that surfaces Company X-Ray findings, ensuring insights are not just technically accurate but reliable enough to present to a client.

Platform & Integration Testing

Own end-to-end testing of the data ingestion pipelines that connect to client systems (CRM, email, calls, calendars, documents, financial systems) through Nango's 700+ connector integration layer.
Test multi-model routing logic to confirm cost-optimized task allocation behaves correctly across LLM providers via LiteLLM.
Validate streaming response handling, latency thresholds, and graceful degradation when a model is unavailable or slow.
Own file ingestion pipeline testing (Word, Excel, PowerPoint, PDF), including encryption, formatting edge cases, and audit-trail continuity.

Required Qualifications

7+ years of QA engineering experience, with at least 3 years in a senior or lead capacity where you shaped process and standards, not just executed them.
You have tested AI/LLM-powered applications. You understand prompt sensitivity, output variance, and how to build eval pipelines that catch regressions across model updates.
You speak in ownership: you've built the eval pipeline, owned model quality, gated the release — not just run someone else's test suite.
You write test code. Python is your primary tool. You have built and maintained CI/CD-integrated test suites, and you don't wait for someone to file a bug to find one.
You have hands-on experience with Playwright and Vitest in a production environment, and you've built automation frameworks from scratch, not just inherited them.
Comfortable testing complex API chains, async/streaming responses, and multi-service workflows. Data pipelines and knowledge graph outputs don't intimidate you.
You test for confusion and trust failure, not just broken functionality. Your end users are non-technical executives, and you advocate for them.
US-based, able to overlap roughly 5 hours per day with EDT, and available for full-time contract hours.

Preferred Qualifications

You have experience with LLM evaluation frameworks (e.g., LangSmith, DeepEval, Promptfoo, RAGAS, or custom eval pipelines).
You have tested agent frameworks or orchestration layers in a production environment.
You have a background in a regulated industry (insurance, finance, healthcare) where audit-trail integrity is non-negotiable.
You have worked alongside Forward Deployed or solutions engineering teams and understand field deployment risk.

Technology Stack

AI/LLM: Anthropic Claude, OpenAI GPT, LiteLLM (multi-model routing), custom agent orchestration with reinforcement learning
Backend: Python (FastAPI), async agent runtime, Pydantic
Data & Graph: Memgraph, Neo4j, Qdrant, PostgreSQL, Redis
Frontend: React/Next.js, TypeScript, Tailwind CSS
Integrations: Nango (700+ connectors)
Infrastructure: Google Cloud Platform (Cloud Run, GCE, Firebase), GitHub Actions CI/CD, Docker
Testing: Playwright, Vitest

Work Arrangement

Fully Remote, US-Based
Atlanta, GA a Plus
Contract (~5 hrs/day EDT Overlap)
Early Engineering Team

Compensation & Engagement

Compensation: 1099 hourly rate depending on experience (contract).
Structure: Contract, full-time hours (~40/week). Contract-to-hire possible if the engagement goes well.

About the Company

About Peach Pilot

Most AI companies sell tools. Peach Pilot transforms how businesses run.

Peach Pilot builds a platform that ingests everything about how a company operates—every system, every process, every signal—and constructs a Company Brain: a living knowledge graph that connects people, decisions, and outcomes across the entire organization. The platform deploys 92 pre-built AI agents that work together across every business function, governed by humans at every critical step. The system gets smarter with every interaction.

Peach Pilot doesn't sell software licenses. The company embeds into a client's operation, learns their business in weeks, shows them what's broken backed by their own data, and redesigns their highest-impact business functions with AI. The first vertical is insurance, and the first client engagement is already scoped and funded.

Leadership

Peach Pilot is co-founded by Mario Montag (Predikto, acquired by a Fortune 50; McKinsey, PwC) and JP James (Hive Financial Assets, Georgia Tech, TITAN 100). The company has a working platform with live infrastructure and a proven data-to-insights methodology.