Large Language Models (LLMs) demonstrate impressive mathematical reasoning abilities, but their solutions frequently contain errors that cannot be automatically verified. Formal theorem proving systems such as Lean 4 offer automated verification with complete accuracy, motivating recent efforts to build specialized prover LLMs that generate verifiable proofs in formal languages. However, a significant gap remains: current prover LLMs solve substantially fewer problems than general-purpose LLMs operating in natural language. We introduce Hilbert, an agentic framework that bridges this gap by combining the complementary strengths of informal reasoning and formal verification. Our system orchestrates four components: an informal LLM that excels at mathematical reasoning, a specialized prover LLM optimized for Lean 4 tactics, a formal verifier, and a semantic theorem retriever. Given a problem that the prover is unable to solve, Hilbert employs recursive decomposition to split the problem into subgoals that it solves with the prover or reasoner LLM. It leverages verifier feedback to refine incorrect proofs as necessary. Experimental results demonstrate that Hilbert substantially outperforms existing approaches on key benchmarks, achieving 99.2% on miniF2F, 6.6 percentage points above the best publicly available method. Hilbert also achieves the best known result on PutnamBench: it solves 462/660 problems (70.0%), outperforming proprietary approaches such as SeedProver (50.4%) and achieving a 422% improvement over the best publicly available baseline. Thus, Hilbert effectively narrows the gap between informal reasoning and formal proof generation.
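The recursive decomposition described above can be illustrated with a minimal sketch. This is not Hilbert's implementation: the function `prove`, its parameters, the `max_depth` cutoff, and the toy stand-ins for the prover, verifier, and decomposer are all hypothetical names chosen for illustration. The sketch omits the retriever and the verifier-feedback refinement loop.

```python
from typing import Callable, List, Optional

# Hypothetical interfaces (assumptions, not the Hilbert API).
Prover = Callable[[str], Optional[str]]   # goal -> candidate proof, or None
Verifier = Callable[[str, str], bool]     # (goal, proof) -> accepted?
Decomposer = Callable[[str], List[str]]   # goal -> list of subgoals


def prove(goal: str, prover: Prover, verifier: Verifier,
          decompose: Decomposer, depth: int = 0,
          max_depth: int = 3) -> Optional[str]:
    """Try the prover directly; on failure, split the goal and recurse."""
    candidate = prover(goal)
    if candidate is not None and verifier(goal, candidate):
        return candidate
    if depth >= max_depth:
        return None  # give up rather than recurse forever
    subgoals = decompose(goal)
    if not subgoals:
        return None
    sub_proofs = []
    for sub in subgoals:
        proof = prove(sub, prover, verifier, decompose, depth + 1, max_depth)
        if proof is None:
            return None  # one unsolved subgoal fails the parent goal
        sub_proofs.append(proof)
    # Assumption: concatenation stands in for assembling the subproofs;
    # a real system would build a Lean proof and re-verify it.
    return "\n".join(sub_proofs)


# Toy stand-ins: the "prover" only handles goals without " and ".
toy_prover = lambda g: None if " and " in g else f"proof of {g}"
toy_verifier = lambda g, p: p == f"proof of {g}"
toy_decompose = lambda g: g.split(" and ") if " and " in g else []

result = prove("A and B", toy_prover, toy_verifier, toy_decompose)
```

In this toy setup the direct attempt on `"A and B"` fails, so the goal is split into `"A"` and `"B"`, each of which the stand-in prover handles. The depth limit is one simple way to bound the recursion; the source does not state which stopping rule Hilbert uses.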
- †UC San Diego
- ** Work done while at Apple







