
Poetic AI: Recursive Self-Improvement for LLMs Outperforms Fine-Tuning

20 min · AI summary & structured breakdown

Summary

Poetic AI introduces a recursively self-improving system that significantly outperforms traditional fine-tuning for large language models (LLMs) by generating optimized reasoning harnesses. This approach allows startups to achieve state-of-the-art performance at a fraction of the cost and time, avoiding the 'bitter lesson' of model obsolescence. The system has demonstrated superior results on benchmarks like ARC AGI v2 and Humanity's Last Exam, offering a powerful alternative to expensive model retraining.

Key Takeaways

  1. Poetic AI develops recursively self-improving systems that make AI smarter faster and more cheaply than traditional methods.
  2. Its approach avoids training new LLMs from scratch, which costs hundreds of millions of dollars and months of effort.
  3. Poetic's 'harnesses', or 'agentic systems', sit on top of existing LLMs, consistently outperforming them and remaining compatible with new frontier models.
  4. The system scored 54% on ARC AGI v2 at $32 per problem, significantly better than DeepMind's Gemini 3 at 45% and over $70 per problem.
  5. Poetic recently scored 55% on Humanity's Last Exam, surpassing Anthropic's Claude Opus 4.6 (53.1%), with an optimization cost under $100k.
  6. The core technology automates the generation of optimized code, prompts, and data, making it faster and cheaper than manual agent development.
  7. Reasoning strategies implemented in code provide a far larger performance boost (from 5% to 95% on one hard task) than prompt optimization alone.

Poetic AI: Recursive Self-Improvement

Poetic AI is building recursively self-improving AI reasoning harnesses for LLMs. The system aims at the 'holy grail' of AI, where the AI makes itself smarter, but significantly faster and more cheaply than previously proposed methods. Unlike approaches that require training a new LLM from scratch for every improvement step, Poetic's method avoids that massive expense and time commitment.

The traditional method of fine-tuning LLMs can cost hundreds of millions of dollars and take months, with the risk of becoming obsolete with the next frontier model release. Poetic's system offers a solution by providing a 'harness' that enhances existing models, ensuring continuous performance improvement without the need for constant, expensive retraining. This allows startups and companies to leverage the latest models without being locked into an outdated fine-tuned version.

The Alternative to Fine-Tuning

Poetic's system automatically generates specialized systems for specific problems that consistently outperform underlying language models. This bypasses the typical process of collecting tens of thousands of examples for fine-tuning, which is both costly and prone to obsolescence as new models emerge. The generated 'harness' or 'agentic system' sits on top of one or more language models, providing superior performance.

When a new, more powerful model is released, the same harness remains compatible and can deliver an even greater performance bump. Poetic can further optimize the harness for the new model, making it even better, all at a much lower cost than fine-tuning. This approach protects investments by ensuring that improvements are not lost when foundational models evolve.
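The harness idea described above can be sketched in a few lines. This is not Poetic's actual code; it is a minimal illustration, with `ask` standing in for any chat-completion API and the two stub models standing in for an older and a newer frontier model. The point it demonstrates is that the harness is just a reasoning procedure parameterized by the underlying model, so swapping in a newer model requires no retraining:

```python
from typing import Callable

# A model is anything that maps a question string to an answer string.
Model = Callable[[str], str]

def majority_vote_harness(ask: Model, question: str, samples: int = 3) -> str:
    """A minimal harness: sample several answers, return the most common one."""
    answers = [ask(question) for _ in range(samples)]
    return max(set(answers), key=answers.count)

# Stubbed models: the same harness wraps either one unchanged.
old_model: Model = lambda q: "4"
new_model: Model = lambda q: "4"

print(majority_vote_harness(old_model, "What is 2 + 2?"))  # harness on the old model
print(majority_vote_harness(new_model, "What is 2 + 2?"))  # same harness, newer model
```

A real harness would be far richer (retrieval, tool use, verification passes), but the compatibility property is the same: only the `ask` callable changes when the foundation model does.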

Impressive Benchmark Achievements

Poetic AI has demonstrated significant capabilities on challenging benchmarks. On ARC AGI v2, it achieved 54% accuracy at a cost of $32 per problem, outperforming DeepMind's Gemini 3, which scored 45% at over $70 per problem. Poetic's result was achieved with the cheaper Gemini 3 Pro model, highlighting the efficiency of the approach.

More recently, Poetic scored 55% on Humanity's Last Exam, a set of 2,500 difficult questions designed to challenge PhDs. That is almost two percentage points above the previous state of the art of 53.1%, set by Anthropic's Claude Opus 4.6. The optimization run cost less than $100k, a stark contrast to the hundreds of millions typically spent on training large foundation models.

Underlying Mechanism: Automated Optimization

Poetic's core technology is a recursively self-improving 'meta-system' that generates systems to solve hard problems. These generated systems are composed of code, prompts, and data built on top of one or more language models. While such harnesses can be built manually, Poetic automates this process, making it faster and significantly cheaper than hiring a dedicated team.

The meta-system handles the optimization process, including context stuffing, example generation, and identifying robust reasoning strategies. This means the AI itself understands the dataset, identifies failure modes, and determines how to improve performance, rather than relying on human prompt engineers. This automated optimization can be applied to existing agents, optimizing prompts, reasoning strategies, or other components based on specific needs.
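The outer optimization loop described above can be sketched as a simple search over candidate harnesses. This is a hypothetical toy, not Poetic's meta-system: `propose_revision` stands in for an LLM that rewrites the harness's code, prompts, or data, and the "harness" here is just a function scored against held-out examples. The structure it illustrates is evaluate, propose, keep the best:

```python
import random

def evaluate(harness, dataset):
    """Fraction of (question, answer) examples the harness gets right."""
    return sum(harness(q) == a for q, a in dataset) / len(dataset)

def propose_revision(harness):
    """Stand-in for an LLM proposing a revised harness; here it just
    guesses a small tweak (an additive bias) instead of rewriting code."""
    bias = random.choice([0, 1])
    return lambda q, b=bias: q + b

def optimize(harness, dataset, steps=20):
    """Greedy hill-climbing: keep a candidate only if it scores higher."""
    best, best_score = harness, evaluate(harness, dataset)
    for _ in range(steps):
        candidate = propose_revision(best)
        score = evaluate(candidate, dataset)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score
```

In the real setting each evaluation is an expensive benchmark run and each proposal is an LLM analyzing failure modes, but the loop's shape, measure, revise, retain, is the same.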

The Power of Reasoning Strategies

While automated prompt optimization can yield some gains, Poetic's research indicates that reasoning strategies implemented in code provide a far larger boost. In one instance, manual prompt optimization on a hard task reached only 5% accuracy; adding code-based reasoning strategies raised it to 95%. This highlights that complex problem-solving requires more than well-crafted prompts.

Poetic's meta-system focuses on developing these sophisticated reasoning strategies, which are often written in code rather than just being better prompts. This allows for deeper and more robust problem-solving capabilities, moving beyond the limitations of simple prompt engineering to achieve significant performance gains.

Advice for Aspiring AI Engineers

For engineers looking to enter the AI field and build startups, the advice is to constantly experiment and engage with AI technologies. The field is evolving rapidly, so daily interaction with AI tools is crucial. Pushing the boundaries of what AI is capable of and building desired applications is key.

An example given is using GPT-5 to build an iPhone app in a weekend, a task that would previously have required a decade of experience. This demonstrates the speed and ease with which AI can now facilitate development. The overarching message is to not limit oneself and to explore how far AI can take any imaginative project, contributing to making the world better.

FAQ

What is Poetic AI's recursively self-improving system?

Poetic AI develops recursively self-improving AI reasoning harnesses for large language models. The system lets AI make itself smarter significantly faster and more cheaply than traditional methods, avoiding expensive, months-long retraining cycles.

How did Poetic AI perform on ARC AGI v2 compared to DeepMind?

Poetic AI achieved a 54% score on ARC AGI v2 at a cost of $32 per problem. This significantly outperformed DeepMind's Gemini 3, which scored 45% at over $70 per problem, showcasing the efficiency of their approach.

Why does Poetic AI emphasize reasoning strategies over prompt optimization?

Poetic AI's research indicates that reasoning strategies implemented in code provide a substantially larger performance boost (from 5% to 95% on one hard task) than prompt optimization alone. Complex problem-solving requires more than well-crafted prompts to achieve significant gains.

Key Learning

Explore automating the generation of optimized code and prompts based on the principles discussed by Poetic AI. This approach can help your AI projects achieve state-of-the-art performance at a fraction of the traditional cost and time for LLM development.
