Are Autoregressive LLMs Really Doomed? (A Commentary on Yann LeCun's Recent Keynote)

· 4 min read
Yam Marcovitz
Parlant Tech Lead, CEO at Emcie

Yann LeCun, Chief AI Scientist at Meta and a respected pioneer in AI research, recently stated that autoregressive LLMs (Large Language Models) are doomed, because the probability of generating a sequence of tokens that represents a satisfying answer decreases exponentially with each generated token. While I hold LeCun in especially high regard, and resonate with many of the insights he shared at the summit, I disagree with him on this particular point.

Yann LeCun giving a keynote at the AI Action Summit

Although he qualified his statement with "assuming independence of errors" (in each token generation), this was precisely the wrong turn in his analysis. Autoregressive LLMs do not actually diverge the way he implied, and we can demonstrate it.

What is Autoregression?

Under the hood, an LLM is a statistical prediction model that is trained to generate a completion for a given text of any (practical) length. We can say that an LLM is a function that accepts text up to a pre-defined length (a context) and outputs a single token out of a pre-defined vocabulary. Once it has generated a new token, it feeds it back into its input context, and generates the next one, and so on and so forth, until something tells it to stop, thus generating (hopefully) coherent sentences, paragraphs, and pages of text.

For a deeper walkthrough of this process, see our recent post on autoregression.

Convergent or Divergent?

What LeCun is saying, then, can be unpacked as follows.

  1. Given the set C of all completions of length N (tokens),
  2. Given the subset A ⊂ C of all "acceptable" completions within C (A = C - U, where U ⊂ C is the subset of unacceptable completions)
  3. Let Ci be the completion we are now generating, token by token. Assume that Ci currently contains K < N completed tokens, such that Ci is (still) an acceptable completion (Ci ∈ A)
  4. Suppose some independent constant E (for error) as the probability that generating the next token causes Ci to diverge and become unacceptable (Ci ∈ U)
  5. Then, generating the next token of Ci at K+1 is (1-E) likely to maintain the acceptability of Ci as a valid and correct completion
  6. Likewise, generating all remaining tokens R = N - K such that Ci stays acceptable has probability (1-E)^R

In Simpler Terms

If we always have, say, a 99% chance to generate a single next token such that the completion stays acceptable, then generating 100 next tokens brings our chance down to 0.99^100, or roughly 37%. If we generate 1,000 tokens, then by this logic there is only a 0.004% chance that our final completion is acceptable!
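To make the arithmetic concrete, here's a quick sanity check in plain Python:

# LeCun's premise: a constant, independent per-token error rate
per_token_success = 0.99

print(f"{per_token_success ** 100:.2%}")   # 36.60% after 100 tokens
print(f"{per_token_success ** 1000:.4%}")  # 0.0043% after 1,000 tokens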

Do you see the problem here? Many of us have generated 1,000-token completions that came out perfectly fine. Could we all have landed on the lucky side of 0.004%, or is something else going on? Moreover, what about techniques like Chain-of-Thought (CoT) and reasoning models? They generate hundreds if not thousands of tokens before converging to a response that is often more correct, not less.

The problem here is precisely with assuming that E is constant. It is not.

LLMs, due to their attention mechanism, have a way to bounce back even from initial completions that we would find unacceptable. This is exactly what techniques like CoT or CoV (Chain-of-Verification) do—they lead the model to generate new tokens that will actually increase the completion's likelihood to converge and ultimately be acceptable.

We know this first-hand from developing the Attentive Reasoning Queries (ARQs) technique, which we use in Parlant. We get the model to generate, on its own, a structured thinking process of our design, which keeps it convergent throughout the generation process.

Depending on your prompting technique and completion schema, not only do you not have to drop to a 0.004% acceptance rate; you can actually stay quite close to 100%.
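As a toy illustration (with illustrative numbers, not measurements of any particular model), compare a constant per-token error rate with one that decays as the completion accumulates self-anchoring context, the way CoT-style reasoning tokens do:

from math import prod

def survival(n_tokens, error_rate):
    # Exact probability that no token among n_tokens derails the completion
    return prod(1 - error_rate(k) for k in range(n_tokens))

# Constant, independent 1% error per token (LeCun's premise)
print(survival(1000, lambda k: 0.01))  # ~0.00004

# An error rate that decays as earlier tokens anchor later ones
print(survival(1000, lambda k: 0.01 / (1 + 0.05 * k)))  # ~0.45

Once E is allowed to fall as the completion grows, the survival probability stops collapsing toward zero.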

What Is Autoregression in LLMs?

· 5 min read
Yam Marcovitz
Parlant Tech Lead, CEO at Emcie

Under the hood, an LLM is a statistical prediction model that is trained to generate a completion for a given text of any (practical) length. We can say that an LLM is a function that accepts text up to a pre-defined length (a context) and outputs a single token out of a pre-defined vocabulary. Once it has generated a new token, it feeds it back into its input context, and generates the next one, and so on and so forth, until something tells it to stop, thus generating (hopefully) coherent sentences, paragraphs, and pages of text.

Let's break that down carefully, step by step. An LLM:

  1. Accepts an input context
  2. Predicts a single token out of a vocabulary
  3. Feeds that token back into the input context
  4. Repeats the process again until something tells it to stop

On Accepting an Input Context

It must be understood that LLMs, for various reasons, do not look at their input (or output, for that matter) as text per se, but as numbers.

Each of these numbers is called a token, and each token directly corresponds to some language fragment, depending on the architecture of the model. These tokens form the basic unit by which LLMs understand and predict text. The set of all supported values of these tokens is called the model's vocabulary.

In some LLMs, a token corresponds to a proper word. In theory, it could even correspond to full sentences or paragraphs. In practice, however, tokens are most commonly word-parts (i.e., not even full words). So while "vocabulary" would be a perfectly literal name for a model whose tokens correspond to words, it's important to remember that in most LLMs today the vocabulary is a set of word-parts. Among other benefits, this allows us to keep the vocabulary relatively small (around 200k word-part tokens).

When you prompt an LLM, the prompt first undergoes tokenization so that the LLM can understand it using its own language—its vocabulary. Tokenization breaks down a text into a series of tokens, or, again, numbers, each representing a unique word-part in the vocabulary. To illustrate this concept:

tokens = tokenizer.encode("I like bananas")

for t in tokens:
    print(f"{t} = {tokenizer.decode(t)}")

# Output (for example):
# 4180 = "I"
# 5 = " "
# 918 = "lik"
# 5399 = "e"
# 5 = " "
# 882 = "ba"
# 76893 = "nana"
# 121 = "s"

On Predicting the Next Token

The next step is to do what the model is supposed to do: predict the next token, also known as generating a completion for the input context (or at least the fundamental building block of this completion process).

We won't go into the internal prediction and attention mechanisms here. Instead, I want to focus on the very last stage of the prediction process.

When an LLM has done its best to "figure out" the meaning in its input context, it provides what's called a probability distribution over its vocabulary. This means that every single token is assigned the likelihood that it, among all others, should be chosen as the next predicted token in the completion—a great honor indeed.

The following snippet illustrates what that distribution might look like:

context_tokens = tokenizer.encode("The quick brown fox jumps over the lazy dog ")
next_token_probabilities = prompt(context_tokens)

for token, probability in next_token_probabilities:
    print(f"{token} ({tokenizer.decode(token)}) = {probability}")

# Output (for example):
# ...
# 5 (" ") = 0.0001 (0.01%)
# ...
# 121 ("s") = 0.002 (0.2%)
# ...
# 4180 ("I") = 0.008 (0.8%)
# ...
# 882 ("ba") = 0.013 (1.3%)
# ...
# 918 ("lik") = 0.0004 (0.04%)
# ...
# 1000 ("dog") = 0.975 (97.5%)
# ...
# 76893 ("nana") = 0.0006 (0.06%)

Once we have this probability distribution, we need to actually decide which token to use. A naive approach would simply choose the one with the highest probability. It turns out, however, that this is a mistake. Not only can it cause models to repeat themselves robotically (a rather uninspiring application of such complex beasts); worse yet, due to biases and issues that often lurk within these models, the token they consider most probable is not necessarily the one that most of us would deem most reasonable. This is because any such statistical machine has inherent flaws and inaccuracies in its representation of our complex world.

Thus, we don't simply choose the token with the highest probability. Instead, we choose one using various sampling techniques that introduce randomization into the choice process while respecting the assigned probabilities (kind of like rolling a loaded die that's weighted toward the sides representing the more likely tokens).

Two things we must take note of here:

  1. The token with the highest probability won't necessarily get selected at every prediction iteration—it will just be more likely to get selected
  2. The token with the highest probability isn't even necessarily the most reasonable one from a Human/AI alignment perspective
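
To make this concrete, here's a minimal sketch of such a sampler (the completion loop in the next section uses a function like this); the temperature parameter shown is one common way to control how much randomness the sampling introduces:

import math
import random

def weighted_random_choice(token_probabilities, temperature=1.0):
    # token_probabilities: (token, probability) pairs; probabilities are
    # assumed positive and normalized
    tokens, probs = zip(*token_probabilities)

    if temperature != 1.0:
        # Lower temperature sharpens the distribution; higher flattens it
        logits = [math.log(p) / temperature for p in probs]
        peak = max(logits)
        weights = [math.exp(l - peak) for l in logits]
    else:
        weights = list(probs)

    return random.choices(tokens, weights=weights, k=1)[0]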

On Feeding the Predicted Token Back into the Context

The final step of the process—before rinsing and repeating—is to take the newly generated token and feed it back into the input context, appending it to the end. The completion process goes something like this (conceptually):

context_tokens = tokenizer.encode("The quick brown fox jumps over the lazy dog ")

while True:
    next_token_probabilities = prompt(context_tokens)
    next_token = weighted_random_choice(next_token_probabilities)
    context_tokens.append(next_token)

    if next_token == STOP_TOKEN:
        break

This iterative feedback loop is directly related to how autoregressive models are trained. During training, they are essentially asked to predict, or "fill out," the next token in the context. Once they improve on that, the training process moves on to the next token, and so forth. The inference process follows the same principle.
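
In pseudocode, that training objective looks something like this (conceptual only; real training batches this over huge corpora and operates on raw logits, and model here is a stand-in for the network's forward pass):

import math

def next_token_loss(model, token_sequence):
    # Average negative log-likelihood of each token given its prefix
    total = 0.0
    for i in range(1, len(token_sequence)):
        probabilities = model(token_sequence[:i])  # distribution over the vocabulary
        total += -math.log(probabilities[token_sequence[i]])
    return total / (len(token_sequence) - 1)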

This is what autoregression is!

Rethinking How We Build Customer-Facing AI Agents

· 7 min read
Yam Marcovitz
Parlant Tech Lead, CEO at Emcie

Building customer-facing AI agents can seem deceptively simple at first.

Modern LLMs make it easy to create impressive demos that handle basic conversations well. But as many development teams discover, the journey from demo to production reveals deeper challenges that traditional approaches struggle to address.

Let's examine three common methodologies for building AI agents, understanding not just their mechanics, but why they fall short in real-world applications. This understanding will help us recognize what's truly needed for successful production deployments.

The Promise and Pitfalls of Fine-Tuning

Fine-tuning—at best—is somewhat like teaching someone your company's way of doing things by having them study thousands of past conversations. The idea seems logical: if an AI model learns from your actual customer interactions, shouldn't it naturally adopt your specific approach?

To understand why this fails in practice, let's consider how a customer service department actually works. When a company launches a new product or updates a policy, they don't retrain their entire staff from scratch—they simply communicate the changes. But with fine-tuned models, even small changes to your service approach require retraining the entire model, a process that's both expensive and time-consuming.

Consider a real-world scenario: A financial services company fine-tunes their model on thousands of support conversations. Three months later, they need to adjust how their agent handles sensitive customer information. With a human team, this would be a simple policy update. With a fine-tuned model, they're looking at collecting new training data, running expensive training jobs, and testing everything again. This makes rapid iteration—the kind needed for real customer service improvement—practically impossible.

Model Degradation

There's also a more subtle but technically significant problem with fine-tuning that many developers discover too late. When you fine-tune an LLM for specific conversational behaviors, you often inadvertently degrade its general capabilities—particularly its ability to generate structured outputs like API calls or database queries.

This degradation becomes particularly problematic when you need your agent to interact with other systems, especially for RAG (Retrieval-Augmented Generation) implementations. The original LLM might have been excellent at formatting database queries or generating precise API parameters, but the fine-tuned version often struggles with these structured tasks.

The Knowledge-Behavior Gap in RAG Systems

Retrieval-Augmented Generation (RAG) solves the knowledge update problem in a seemingly elegant way. Instead of baking information into the model itself, RAG systems simply reference relevant documents during conversations, much like how a human agent might consult a knowledge base. This approach also works better when you need access-control for your data.

But in real-life scenarios, RAG has also been revealing an interesting truth about customer service: having access to information isn't the same as knowing how to use it effectively. Think about training a new customer service representative. You wouldn't just hand them a manual and consider them ready for customer interactions. They need to understand when and how to apply that information, how to handle different customers, when to escalate issues, and how to maintain the company's tone and values. This tends to change from use case to use case, company to company, and from time to time.

This is where classic RAG systems fall short. They ensure your agent can access your refund policy, but they don't help it understand whether now is the right time to mention refunds, or how to communicate that policy to an already frustrated customer. These behavioral aspects—the "why" and "how" rather than the "what"—remain unaddressed.

The Traps of Graph-Based Conversations

Graph-based frameworks like LangGraph attempt to solve the behavior problem through structured conversation flows. Imagine creating a detailed flowchart for every possible customer interaction. In theory, this ensures your agent always knows what to do next.

But anyone who's worked in customer service knows that real conversations rarely follow neat, predictable paths. A customer might ask about pricing in the middle of a technical support discussion, or bring up a completely unrelated issue while you're explaining a feature. Graph-based systems force developers to either create increasingly complex graphs to handle every possible scenario, or accept that their agent will feel rigid and unnatural in real conversations.

This complexity doesn't just make development harder—it makes maintenance and improvement nearly impossible. When customer feedback suggests your agent needs to handle a situation differently, changing the behavior means navigating and modifying an intricate web of states and transitions, often with unforeseen consequences.

Understanding What's Really Needed

Again, these approaches all reveal something important about building effective AI agents: the challenge isn't just technical—it's about understanding how customer service actually works in practice. Real customer service excellence comes from clear guidelines, consistent behavior, and the ability to adapt quickly based on feedback.

This insight points to what's really needed: a way to directly shape agent behavior without the indirection of training data, the limitations of pure information retrieval, or the complexity of predetermined conversation paths. We need systems that let us update behavior as easily as we update documentation, that maintain consistency while allowing natural conversation flow, and that separate the "what" of information from the "why" and "how" of interaction.

A Different Approach with Parlant

When we built Parlant, we started with a fundamental observation: LLMs are like highly knowledgeable strangers who know countless ways to handle any situation. This vast knowledge is both their strength and their challenge—without clear guidance, they make reasonable but arbitrary choices that may not align with what we actually need.

Understanding the Core Challenge

Think about how organizations actually improve their customer service. They don't rewrite their training manuals from scratch every time they want to adjust how representatives handle a situation. Instead, they provide clear guidelines about specific scenarios and how to handle them. These guidelines evolve based on customer feedback, business needs, and learned best practices.

This is where Parlant's approach becomes particularly relevant. Instead of trying to bake behavior into a model through training data or control it through complex conversation graphs, Parlant provides direct and reliable mechanisms for specifying and updating how your agent should behave in different situations.

The Power of Immediate Feedback

Consider a real example from our market research into pre-LLM agents: a team in a large enterprise noticed that a simple change in how their agent addressed customers led to a 10% increase in engagement. With traditional approaches, implementing and testing such a change would have been a significant undertaking. With Parlant, it's as simple as updating a guideline.

This immediacy transforms how teams can develop and refine their AI agents. Product managers can suggest changes, customer service experts can provide insights, and developers can implement these improvements instantly—all while maintaining the consistency and reliability needed for production systems.

Beyond Simple Instructions

Parlant isn't just about giving instructions to an AI. It's built around understanding how real software teams work. Behavioral modifications are stored as JSON files, which means changes can be version-controlled through Git. Teams can branch and merge behavioral changes just like they do with code, review modifications before they go live, and roll back changes if needed.

More importantly, Parlant ensures coherence across all these behavioral specifications. When you add new guidelines, Parlant automatically checks for conflicts with existing ones. This prevents the kind of contradictory behaviors that often emerge in complex prompt-engineering setups.

A Foundation for the Future

While LLM technology continues to advance—with costs dropping and response times improving—the fundamental challenge of aligning AI behavior with human intentions remains. Parlant provides the infrastructure needed for building AI agents that can truly serve as reliable representatives of your organization.

The result is a development process that finally matches how organizations actually work: iterative, collaborative, and responsive to real-world feedback. It's about building AI agents that don't just work in demos, but excel in production, consistently delivering the kind of customer experience your organization aims to provide.