How Kepler built verifiable AI for financial services with Claude
Inside a platform that indexes 26M+ SEC filings, earnings call transcripts, IR presentations, consensus estimates, and private data across 14,000+ companies and 27 global markets, and how the team behind it built AI that validates every number down to the exact filing, page, and line item.
In our series, How startups build with Claude, we highlight how startups are transforming their industries with AI. In this article, we share how Kepler built a trust and verification layer for AI in financial services.
The quick pitch
Name
Kepler
Founded
2025
Founders
Vinoo Ganesh (CEO) and John McRaven (CTO)
Stack
AWS, Rust, Python, containers for orchestration
Growth
Indexed 26M+ SEC filings, 50M+ public documents, 1M+ private documents, and 14,000+ companies across 27 global markets in less than three months.
Financial firms operate in a heavily regulated environment where reporting has to be auditable and accountable. Every figure in a regulatory filing, deal pitch, or research report needs to be verifiable against source documents.
The tools the financial industry has traditionally relied on can pull data, but verification still falls to analysts. A conventional analytics system can't interpret a freeform question, decompose it into steps, or work out that a single metric requires pulling three different line items across specific fiscal periods. AI systems can do that interpretation, but they handle it in the same step as the computation, so the numbers they produce come from the model itself, which can make mistakes.
Vinoo Ganesh and John McRaven spent years at Palantir building data systems for defense, energy, and financial firms. That work shaped how they think about trust in environments where answers must be verifiable. Before founding Kepler, they spoke with 147 financial firms, including private equity, hedge funds, and investment banks, and heard the same thing at nearly all of them: everyone wanted to use AI for research, but nobody trusted the output. As one managing director told them, "How am I supposed to trust something I can't audit?"
The duo’s answer was to build deterministic infrastructure that serves as a trust and verification layer for AI. That infrastructure, together with Claude as the reasoning and interpretation layer, powers Kepler Finance: a research platform for financial services used by analysts to ask questions in plain English and receive instantly verifiable answers.
Handling long, multi-step tasks and flagging ambiguity
Financial analysis involves complex, multi-step calculations, dense data, and overloaded terminology, and has no tolerance for error. With that in mind, Kepler needed a model that could hold a long plan together without drift and flag ambiguity.
For example, if an analyst asks for a company’s inventory days outstanding over the last eight quarters, the model needs to figure out what the answer needs: the right formula, correct fiscal periods, and any restatements that might affect the numbers.
The team benchmarked across all frontier models. They found that on straightforward queries, models performed comparably. But when it came to long, multi-step plans with interdependencies, all but Claude started taking shortcuts or losing track of constraints by the fourth or fifth step. "On our workloads, Claude was the model that consistently held the plan together," Ganesh says. "Other models would start strong and then quietly drop a constraint by step five."
The clearest difference was how each model handled uncertainty and kept humans in the loop. For example, in situations where one term can have two different meanings, most models picked one meaning and kept going. Claude stopped and asked the analyst to decide. "That behavior matters more than any benchmark score," Ganesh says. "One wrong assumption early in a financial analysis breaks everything downstream."
Engineering the context around Claude
The Kepler team found that Claude produced better results when given precisely defined tasks enhanced with structured domain knowledge, definitions, and hard boundaries on what to resolve versus what to escalate. "In finance, the model can't be the whole system. We treat it as one stage in a pipeline whose job is to hand the model exactly what it needs to succeed at exactly that stage," says McRaven. "Prompt engineering optimizes a call while context engineering optimizes the system around it."
The team built deterministic execution environments that Claude can invoke for every operation that needs to be provably correct, such as computing a ratio or resolving a fiscal period. They developed a proprietary ontology that maps financial concepts to precise definitions and formulas, customizable on a per-use basis. Security and access control restrictions are enforced at every step, governing which data sources each user can pull from. On top of this, they built recurring, customizable skills for the most common workflows in their pipeline, such as enterprise value calculations across complex capital structures (e.g. handling preferred shares, convertibles, and minority interests) and segment revenue waterfall reconciliation across reporting period changes. These skills coordinate between deterministic and nondeterministic stages and are idempotent by design: the same input will always generate the same output.
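The idempotent-skill pattern can be sketched as a pure function over an immutable input: the model's job ends at producing validated inputs, and the arithmetic never touches the model. The class, field names, and the simplified enterprise value formula below are illustrative assumptions, not Kepler's actual skill interface.

```python
# Hypothetical sketch of a deterministic, idempotent skill.
# The model supplies validated inputs; the computation is a pure function,
# so the same input always produces the same output.
from dataclasses import dataclass

@dataclass(frozen=True)  # immutable inputs support idempotency
class CapitalStructure:
    market_cap: float
    total_debt: float
    preferred_equity: float
    minority_interest: float
    cash_and_equivalents: float

def enterprise_value(cs: CapitalStructure) -> float:
    """Simplified EV: equity + debt + preferred + minority interest - cash."""
    return (cs.market_cap + cs.total_debt + cs.preferred_equity
            + cs.minority_interest - cs.cash_and_equivalents)

ev = enterprise_value(
    CapitalStructure(market_cap=100.0, total_debt=40.0, preferred_equity=5.0,
                     minority_interest=3.0, cash_and_equivalents=20.0)
)  # 128.0
```

Because the skill is a pure function of its inputs, re-running it during an audit reproduces the original figure exactly.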
Next, they decomposed their workflows into a multi-stage pipeline, matching different Claude models to different stages: Opus 4.7 for complex reasoning like decomposing intent, resolving ambiguity, and producing structured execution plans, and Sonnet 4.6 for higher-throughput stages where tasks are more constrained. They also trained their own specialized models for recall (some use Claude as the foundation, some are proprietary to Kepler), scoring 94% accuracy on tasks like mapping financial statement labels to standardized taxonomy codes, compared with the 38-46% accuracy achieved by other models.
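At its simplest, per-stage model matching is a routing table. The stage names below come from the article; the model identifiers are placeholders, not real API model strings or Kepler's configuration.

```python
# Sketch of per-stage model routing. Stage names follow the article;
# model identifiers are placeholders, not actual API model strings.
STAGE_MODELS = {
    "decompose_intent": "opus",         # complex reasoning
    "resolve_ambiguity": "opus",
    "produce_execution_plan": "opus",
    "extract_line_items": "sonnet",     # constrained, high-throughput
    "map_statement_labels": "kepler-recall",  # specialized recall model
}

def model_for(stage: str) -> str:
    return STAGE_MODELS[stage]
```

Keeping the mapping explicit makes it cheap to re-route a single stage after benchmarking a new model release, without touching the rest of the pipeline.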
The team tests every prompt change, model upgrade, and context modification against thousands of cases before going to production. They’ve built automated evaluation pipelines that compare Claude's output against known-correct answers at every stage, checking both the structured plan and the final computed result. When a test fails, they can trace whether the issue was in Claude's reasoning, the context provided, or the downstream execution. When Anthropic ships a new model version, Kepler benchmarks it within hours and knows exactly which stages improve, which regress, and which need prompt adjustments.
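A stage-level regression check of this kind can be sketched as a loop over golden cases with known-correct answers. The harness below is a minimal illustration under assumed names (`run_stage`, the case schema); it is not Kepler's evaluation pipeline.

```python
# Minimal sketch of a stage-level regression check against golden cases.
# run_stage stands in for whichever pipeline stage is under test;
# the case schema ({"input", "expected"}) is an illustrative assumption.

def evaluate_stage(run_stage, golden_cases, tolerance=1e-6):
    failures = []
    for case in golden_cases:
        got = run_stage(case["input"])
        if abs(got - case["expected"]) > tolerance:
            failures.append({"input": case["input"],
                             "expected": case["expected"], "got": got})
    return failures  # an empty list means the stage passes

golden = [{"input": 2, "expected": 4}, {"input": 5, "expected": 24}]
failures = evaluate_stage(lambda x: x * x, golden)
# One failure: input 5 produced 25, not the expected 24
```

Running a harness like this per stage, plus end-to-end, is what lets a failure be traced to the reasoning, the context, or the downstream execution.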
Scaling with Claude
Kepler Finance has indexed more than 26 million SEC filings across 14,000+ companies, 50M+ public documents, and 1M+ private documents spanning 27 global markets. Claude makes that volume of unstructured data usable, interpreting questions against the entire corpus and reconciling differences in terminology across companies and time periods. Kepler's retrieval layer then pulls figures from verified SEC filings, computes the result, and assembles the output into the desk's Excel template, where with a single click analysts can trace each number back to its exact line item, highlighted in the source document.
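One way to picture that traceability is a figure that carries its citations with it. The record types and field names below are illustrative assumptions about what such a provenance structure might look like, not Kepler's schema.

```python
# Illustrative provenance records: every computed figure carries pointers
# back to its sources. Field names are assumptions, not Kepler's schema.
from dataclasses import dataclass

@dataclass(frozen=True)
class Citation:
    filing_accession: str   # e.g. an SEC EDGAR accession number
    page: int
    line_item: str

@dataclass(frozen=True)
class VerifiedFigure:
    value: float
    citations: tuple        # every source input behind the value

revenue = VerifiedFigure(
    value=1.384e9,
    citations=(Citation(filing_accession="0000000000-24-000000",
                        page=28, line_item="Net sales"),),
)
```

A figure without a citation chain simply never reaches the Excel template, which is what makes the one-click trace-back possible.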
The separation between Claude's reasoning and Kepler's deterministic infrastructure lets a small team build at this scale. Claude handles the interpretation layer that would otherwise require many domain-specific NLP engineers and Kepler's infrastructure handles the rest. New capabilities that would take a large team months to ship can be built in weeks because the architecture is modular: the team improves the reasoning at one stage without touching the rest of the pipeline.
As financial institutions require compliance infrastructure before they engage, Kepler has built full audit logging, siloed customer environments, and end-to-end provenance from the start, and has SOC 2 Type II certification, with ISO 27001 certification underway.
Kepler’s platform is domain-agnostic by design. The team started in finance deliberately: it’s one of the most demanding environments for AI, with dense data, overloaded terminology, complex calculations, and zero tolerance for error. The architecture built to survive that scrutiny applies wherever professionals need verifiable answers from large document collections. From healthcare providers reconciling clinical trial data against treatment protocols to legal teams tracing precedent across decades of case law, the pattern is the same: Claude reasons about the question and infrastructure guarantees the answer.
"Kepler Finance is our first product," says Ganesh. "It won’t be the last."
Best practices from the Kepler team
Give Claude the right job
Retrieval is a job for a query engine. Computation is a job for a formula engine. Ask Claude to interpret, decompose, or reason.
Match models to stages
Use Opus for complex reasoning and Sonnet for constrained, high-throughput tasks. Running everything on one model leaves either quality or cost on the table.
Invest in evaluation before prompts
Build automated pipelines that test Claude's output against known-correct answers at every stage. Test each stage independently and the full pipeline end-to-end. In finance, a silent regression is how you lose a client permanently.
Build for provenance from day one
Professionals are trained to verify everything. Provenance has to shape the entire system, not get added at the end.