
Fundamental Conflict: LLMs vs. Experiential Learning

LLMs are powerful but frozen: they learn from text, not experience. Richard Sutton, Andrej Karpathy, and Ilya Sutskever keep circling the same point from different angles: the next decade of AI isn't about bigger models; it's about systems that can actually learn from their mistakes. Here's why current AI agents feel unreliable, and what you can do to build better systems today.

Tags: artificial-intelligence, machine-learning, reinforcement-learning, ai-agents, large-language-models

Why your "smart AI" still feels kind of dumb in real life


I keep thinking about this simple, slightly annoying question:

If these models are so powerful, why do they still feel like a very smart intern who forgets yesterday?

They write code, answer questions, pass exams. But ask them to own a workflow for weeks, adapt over time, not repeat the same mistake… and they wobble.

Underneath that feeling is a real fight:

LLMs learn from text. Real intelligence learns from experience.

And three people keep circling around this in different ways: Richard Sutton (reinforcement learning legend), Andrej Karpathy (ex-Tesla / ex-OpenAI, "decade of agents" guy), and Ilya Sutskever (ex-OpenAI chief scientist, now Safe Superintelligence).

Let's unpack what's going on, but in normal language.


1. What are LLMs actually doing?

Very roughly: LLMs are trained on huge piles of text (the internet, code, books, papers, etc.). Their main job in training is to guess the next word. They get really good at compressing patterns in human writing.

That gives you something like a frozen world model. It "knows" how people talk about medicine, law, finance, games, etc. It can remix that knowledge in flexible ways. But it does not live a life. It doesn't act, see what happens, feel consequences, and adjust.

So it's like giving a brain every Wikipedia page and GitHub repo, but never letting it actually do anything for itself.
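If you want to see the "guess the next word" idea in miniature, here's a toy sketch: count which word tends to follow which in some text, then predict the most common continuation. Real models use neural networks over tokens instead of word counts, but the training signal has the same shape. (The tiny corpus here is made up purely for illustration.)

```python
from collections import Counter, defaultdict

# Toy corpus standing in for "huge piles of text".
corpus = "the cat sat on the mat the cat chased the dog".split()

# Count, for each word, which words follow it and how often.
next_word_counts = defaultdict(Counter)
for current, following in zip(corpus, corpus[1:]):
    next_word_counts[current][following] += 1

def predict_next(word: str) -> str:
    """Return the most frequent continuation seen in training."""
    candidates = next_word_counts.get(word)
    return candidates.most_common(1)[0][0] if candidates else "<unknown>"

print(predict_next("the"))   # 'cat', the most common pattern in this corpus
print(predict_next("sat"))   # 'on'
```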

This is where the conflict starts.


2. Sutton: "Supervised learning doesn't happen in nature"

Richard Sutton represents the hardline experience-first view. His core claim is very direct: Intelligence equals achieving goals in the world. To do that, you must learn from your own stream of experience (sense, act, get reward or pain, update). Animals don't get labeled datasets. Squirrels don't go to school. They just experiment.

So when he looks at LLMs, he sees a system that copies what humans wrote. No real goals, no real environment, no actual stakes. Just very good mimicry. Impressive, but not the thing he cares about.

He's famous for the "Bitter Lesson": whenever we try to hand-engineer clever knowledge, in the long run it loses to simple methods that just exploit more compute and learn directly from experience. From his angle, LLMs are on the wrong side of history (too much human data, not enough raw experience).


3. Karpathy: "Crappy evolution" and the slop era of agents

Andrej Karpathy is not anti-LLM, but he's also not blind to their limits. His take: Pre-training LLMs on internet text is like "crappy evolution" (a hacky shortcut to get a neural network from random noise to something with useful representations). Once you have that, then you try to build a "cognitive core" on top, a system that relies less on raw memorized facts and more on algorithms for thinking.

He's very honest about agents right now. Today's agents (the ones chaining tool calls, browsing, running code) look cool in demos. But errors compound across steps, they lack proper memory, and they're unreliable. He literally called them "slop" and said we're at least a decade away from agents that "actually work" as real digital employees.

Also, he doesn't think humans learn complex reasoning through basic RL either. The usual RL with one final reward after a long trajectory is, in his words, like "sucking supervision through a straw" (way too crude for real thinking).

So his position is: LLMs are necessary as a starting point (representation). Pure RL is too naive in its current form. The next 10 years are the "decade of agents" (a hard slog on memory, reliability, and continual learning, not just bigger models).


4. Sutskever: from "age of scaling" to "age of research"

Ilya Sutskever used to be the face of "just make it bigger." Now he's basically saying: that phase is ending.

His main points start with the shift from the age of scaling to the age of research. 2020 to 2025 was the "age of scaling" (same training recipe, more data, more GPUs). We're running into data limits and diminishing returns, and another 100× of scale probably won't magically fix everything.

Then there's the generalization problem. These models "generalize dramatically worse than people." They crush benchmarks, then do something dumb in real applications. That's not a small bug; that's the central problem.

Finally, he sees superintelligence as a fast learner, not an all-knower. The endgame isn't a model that already knows every job. It's a human-like learning agent that can learn any job quickly (a "superintelligent 15-year-old" that can pick up new roles fast and keep improving).

And because digital systems can share and merge what they learn, many copies doing different jobs can periodically sync up. You get an AI that effectively collects the experience of millions of careers. That's his picture of future superintelligence.


5. The real conflict: text vs. lived experience

If you put all three together, you get this tension:

LLMs give you a big, flexible world model trained on human history, but mostly frozen at deployment, with weak real-world learning and shaky reliability outside benchmarks.

Experiential learning (RL and beyond) gives you the right shape of learning (act, see outcome, update). But in practice it's hard, data-hungry, and often dumb in long, complex tasks.
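To make "act, see outcome, update" concrete, here's that loop in its most stripped-down form: a made-up two-armed bandit where the agent's estimates improve only because it tries things and sees what happens. A toy, not a claim about how frontier systems are trained.

```python
import random

# Hypothetical environment: two actions with different (hidden) payoff rates.
true_payoff = {"A": 0.3, "B": 0.7}

values = {"A": 0.0, "B": 0.0}   # the agent's running estimates
counts = {"A": 0, "B": 0}

for step in range(1000):
    # Act: mostly exploit the best-looking action, sometimes explore.
    action = random.choice(["A", "B"]) if random.random() < 0.1 else max(values, key=values.get)

    # See outcome: the reward comes from the world, not from a labeled dataset.
    reward = 1.0 if random.random() < true_payoff[action] else 0.0

    # Update: nudge the estimate toward what actually happened.
    counts[action] += 1
    values[action] += (reward - values[action]) / counts[action]

print(values)  # estimates drift toward the true payoff rates, roughly 0.3 and 0.7
```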

Right now, we are over-optimizing the first and under-building the second.

That's why current "agents" feel off. They look active but don't truly learn from their own mistakes. They don't build deep, personal experience; they just reuse the same frozen brain over and over. When they fail, we tweak prompts and guardrails, not the underlying learning rule.

So you get this weird gap:

Benchmarks say "super smart." Real workflows say "eh, not yet."


6. What can we actually learn from this?

Boiling it down:

First, LLMs are necessary, but not the final story. They're amazing world models and reasoning engines. But as long as they're mostly frozen and offline, they won't become real "workers" that grow with experience.

Second, experience isn't optional; it's the missing nutrient. Every serious vision of "AI that can learn every job" assumes continuous interaction with the world, real feedback, and the ability to improve over time. We don't get that by only shoveling in more static text.

Third, generalization beats parameter count. We're beyond the phase where "just bigger" gives you qualitatively new behavior. The hard part now is: can the system handle small shifts it hasn't seen before, reliably, without collapsing? That's the real bar.

Fourth, safety and learning are pulling against each other. To be safe, we freeze models. To be truly intelligent, we need them to keep learning. That tension is exactly why "age of research" is a thing now (you can't solve this by scale alone).


7. So what should we do next?

If you build or think about AI systems, here are concrete things you can start doing today:

Build memory systems, not just prompts. Stop relying on giant context windows. Create structured memory: save conversation summaries, track failures, store user preferences in a database. Tools like Pinecone, Weaviate, or even Postgres with embeddings work. When your AI messes up on Monday, it shouldn't make the same mistake on Tuesday.
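Here's a minimal sketch of what that can look like, using SQLite just to keep it self-contained (the table layout and the remember_failure helper are made up for illustration, not a specific tool's API):

```python
import sqlite3

conn = sqlite3.connect("agent_memory.db")
conn.execute("""CREATE TABLE IF NOT EXISTS failures (
    task TEXT, what_went_wrong TEXT, created_at TEXT DEFAULT CURRENT_TIMESTAMP)""")
conn.execute("""CREATE TABLE IF NOT EXISTS preferences (
    user_id TEXT, key TEXT, value TEXT)""")

def remember_failure(task: str, what_went_wrong: str) -> None:
    """Record a mistake so tomorrow's run can check it before acting."""
    conn.execute("INSERT INTO failures (task, what_went_wrong) VALUES (?, ?)",
                 (task, what_went_wrong))
    conn.commit()

def recall_failures(task: str) -> list[str]:
    """Pull past mistakes for this task and feed them into the prompt."""
    rows = conn.execute(
        "SELECT what_went_wrong FROM failures WHERE task = ?", (task,)).fetchall()
    return [row[0] for row in rows]

remember_failure("weekly-report", "Used last month's numbers instead of this week's.")
print(recall_failures("weekly-report"))
```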

Log everything and review the failures. Set up logging for every LLM call (input, output, latency, cost). Then look at the failures. Not just hard errors: the silently wrong outputs that looked confident but were garbage. Build a review queue to flag bad outputs. This is how you find real failure patterns.
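A rough sketch of that kind of logging, with a placeholder call_model standing in for whatever LLM client you actually use (the function, the file name, and the review heuristic are assumptions, not any specific library's API):

```python
import json, time

def call_model(prompt: str) -> str:
    """Placeholder for your real LLM client call."""
    return "..."

def logged_call(prompt: str, log_path: str = "llm_calls.jsonl") -> str:
    start = time.time()
    output = call_model(prompt)
    record = {
        "ts": start,
        "prompt": prompt,
        "output": output,
        "latency_s": round(time.time() - start, 3),
        # Flag suspicious outputs for the human review queue, not just hard errors.
        "needs_review": len(output.strip()) < 5 or "as an AI" in output,
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return output

logged_call("Summarize this support ticket in two sentences.")
```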

Use LLMs as a component, not the whole system. The LLM is one piece in a bigger machine. Have it generate options, not make final decisions. Wrap it with validation: regex checks, API calls to verify facts, human approval for anything important. The system around the LLM is where reliability lives.
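One way that wrapping can look, sketched with a made-up refund example (the regex, the $100 threshold, and the request_human_approval helper are all illustrative assumptions):

```python
import re

def request_human_approval(proposal: str) -> bool:
    """Stand-in for a real approval step (Slack message, review UI, etc.)."""
    return input(f"Approve? {proposal} [y/N] ").lower() == "y"

def handle_refund_suggestion(llm_output: str) -> str:
    # The LLM only proposes; the system decides.
    match = re.search(r"refund\s+\$?(\d+(?:\.\d{2})?)", llm_output, re.IGNORECASE)
    if not match:
        return "rejected: no parseable refund amount"
    amount = float(match.group(1))
    if amount > 100:                      # anything important goes to a human
        if not request_human_approval(f"Refund ${amount:.2f}"):
            return "rejected by reviewer"
    return f"queued refund of ${amount:.2f}"

print(handle_refund_suggestion("I suggest we refund $25.00 for the delay."))
```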

Create feedback loops, even manual ones. Just collect feedback. Add thumbs up/down buttons, let users report bad responses, track which outputs get edited. Once a week, review what went wrong and update your prompts or fine-tune on corrected examples. This is "micro-experience" in practice.
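A bare-bones version, appending to a JSONL file you skim once a week (the file name and fields are placeholders):

```python
import json
from collections import Counter

FEEDBACK_FILE = "feedback.jsonl"

def record_feedback(prompt: str, output: str, rating: str, edited_to: str | None = None) -> None:
    """Called from your thumbs up/down buttons or your edit handler."""
    with open(FEEDBACK_FILE, "a") as f:
        f.write(json.dumps({"prompt": prompt, "output": output,
                            "rating": rating, "edited_to": edited_to}) + "\n")

def weekly_review() -> list[dict]:
    """Surface the bad cases; these become prompt fixes or fine-tuning examples."""
    with open(FEEDBACK_FILE) as f:
        rows = [json.loads(line) for line in f]
    print(Counter(r["rating"] for r in rows))          # quick health check
    return [r for r in rows if r["rating"] == "down" or r["edited_to"]]

record_feedback("Draft a reply to this complaint.", "Dear valued customer...", "down",
                edited_to="Hi Sam, sorry about the mix-up...")
print(len(weekly_review()), "cases to look at this week")
```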

Separate brainstorming from execution. Let the LLM think creatively, but don't let it execute directly. If it writes code, show diffs and require approval. If it books meetings, show the invite first. Keep a human (or rules engine) between "AI says" and "system does."
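For the code case, a small sketch of that gate using difflib from the standard library (the apply step is deliberately left as a stub):

```python
import difflib

def propose_change(path: str, current: str, suggested: str) -> bool:
    """Show the LLM's suggested edit as a diff and wait for an explicit yes."""
    diff = difflib.unified_diff(
        current.splitlines(keepends=True),
        suggested.splitlines(keepends=True),
        fromfile=f"{path} (current)",
        tofile=f"{path} (AI suggestion)",
    )
    print("".join(diff))
    return input("Apply this change? [y/N] ").lower() == "y"

current = "def total(items):\n    return sum(items)\n"
suggested = "def total(items):\n    return sum(i.price for i in items)\n"

if propose_change("billing.py", current, suggested):
    print("apply it (write the file, open a PR, etc.)")
else:
    print("discarded; nothing touched")
```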

Test on real workflows, not toy examples. Run it on 50 actual examples from your work. "Can it handle this support ticket?" "Does it break when APIs return weird data?" "What happens when users change their mind?" Count how many times you'd trust it unsupervised.
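That can be as simple as a loop over a file of real cases with a human scoring each one; run_agent and the file format here are placeholders for whatever your system actually is:

```python
import json

def run_agent(ticket: dict) -> str:
    """Placeholder for your actual agent or pipeline."""
    return "drafted reply..."

def evaluate(cases_path: str = "real_tickets.jsonl") -> None:
    trusted = 0
    cases = [json.loads(line) for line in open(cases_path)]
    for case in cases:
        output = run_agent(case)
        print("---\nINPUT:", case.get("summary", ""), "\nOUTPUT:", output)
        # The honest metric: would you have shipped this without looking?
        if input("Would you trust this unsupervised? [y/N] ").lower() == "y":
            trusted += 1
    print(f"Trusted {trusted}/{len(cases)}: that's your real reliability number.")

# evaluate()  # run against ~50 real examples from your own work, not synthetic demos
```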

Start a postmortem document. Just a Google Doc titled "AI Failures." Every time your system does something dumb, write what happened and why. After 20 failures, you'll see patterns that tell you what to fix next.

Use structured outputs and validation. Need JSON? Use function calling or structured output modes. Then validate with Pydantic or Zod. This catches hallucinated fields before they cause problems.
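A minimal sketch with Pydantic (v2-style API; the schema itself is just an example):

```python
from pydantic import BaseModel, ValidationError

class SupportTicketTriage(BaseModel):
    category: str
    priority: int          # e.g. 1 to 5
    needs_human: bool

# Pretend this came back from a structured-output / function-calling response.
llm_response = '{"category": "billing", "priority": 2, "needs_human": false}'

try:
    triage = SupportTicketTriage.model_validate_json(llm_response)
    print(triage.category, triage.priority)
except ValidationError as err:
    # Hallucinated or missing fields get caught here, before they hit your database.
    print("Bad output, send back for retry or review:", err)
```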

Accept this is a multi-year grind. The easy wins from scaling are mostly taken. The next phase is slower, more about engineering and data than "one big model jump." But you can start making progress today.


Final thought

Right now, LLMs are like ghosts filled with second-hand memories. They're powerful, but they haven't really lived.

The next real step is to let these ghosts get their hands dirty (touch the world, make mistakes, learn, recover) without burning the house down.

That's the tension between LLMs and experiential learning.

And that's where all the interesting work is going to be.
