
The Big Idea: What If LLMs Didn't Generate Token by Token?

Every large language model you've interacted with — GPT-4, Claude, Llama — generates text one token at a time. This autoregressive approach is simple but inherently slow: to generate 1000 tokens, the model must run 1000 forward passes.

Yann LeCun, Chief AI Scientist at Meta, has been advocating for a fundamentally different architecture: JEPA (Joint Embedding Predictive Architecture). Instead of predicting the next token, JEPA predicts the next chunk of meaning — in embedding space. The model generates the entire next sequence in a single forward pass.
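
To make the cost difference concrete, here is a minimal sketch that counts forward passes under the two generation styles. `CountingModel` is a hypothetical toy stand-in, not a real network, and its "predictions" are arbitrary arithmetic; only the call counts matter.

```python
from typing import List


class CountingModel:
    """Hypothetical toy model that only counts how often it is called."""

    def __init__(self) -> None:
        self.forward_passes = 0

    def next_token(self, context: List[int]) -> int:
        # One forward pass yields exactly one token.
        self.forward_passes += 1
        return (sum(context) + 1) % 97  # arbitrary deterministic output

    def next_chunk(self, context: List[int], length: int) -> List[int]:
        # One forward pass yields the whole target sequence.
        self.forward_passes += 1
        return [(sum(context) + i) % 97 for i in range(length)]


def generate_autoregressive(model: CountingModel, prompt: List[int], n: int) -> List[int]:
    tokens = list(prompt)
    for _ in range(n):
        # Each new token depends on all tokens generated so far.
        tokens.append(model.next_token(tokens))
    return tokens[len(prompt):]


def generate_one_shot(model: CountingModel, prompt: List[int], n: int) -> List[int]:
    return model.next_chunk(list(prompt), n)


ar_model, one_shot_model = CountingModel(), CountingModel()
ar_out = generate_autoregressive(ar_model, [1, 2, 3], 1000)
os_out = generate_one_shot(one_shot_model, [1, 2, 3], 1000)
print(ar_model.forward_passes, one_shot_model.forward_passes)  # prints: 1000 1
```

The autoregressive loop is inherently sequential because each call needs the previous output, while the one-shot call has no such dependency.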

We built a prototype called JEPALM (JEPA Language Model) and trained it on Shakespeare. Here's what we learned.

The Architecture

JEPALM has three components:

  1. TextEncoder: Converts the source text into embeddings (4-layer Transformer)
  2. Predictor: Maps source embeddings to target embeddings in one shot (4-layer cross-attention Transformer)
  3. TextDecoder: Converts predicted embeddings back to text tokens (4-layer Transformer)

[Source Text] → Encoder → [Source Embeddings]
                              |
                        Predictor (one-shot!)
                              |
                        [Target Embeddings]
                              |
                        Decoder → [Target Text]

Total parameters: 10.1 million — tiny by modern LLM standards, but sufficient for a proof of concept.
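
As a rough sanity check on that figure, the sketch below counts the weights of three 4-layer Transformer stacks. It rests on assumptions the post does not confirm: a model width of 256 for all three components (the post only mentions 256 dimensions for the predictor), a feed-forward width of 4x the model width, self-attention-only encoder and decoder layers, and no embedding tables or output head.

```python
def attention_params(d_model: int) -> int:
    # Q, K, V and output projections: four d_model x d_model matrices plus biases.
    return 4 * (d_model * d_model + d_model)


def feedforward_params(d_model: int, d_ff: int) -> int:
    # Two linear layers (d_model -> d_ff -> d_model) with biases.
    return (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)


def layernorm_params(d_model: int) -> int:
    # Scale and shift vectors.
    return 2 * d_model


def self_attention_layer(d_model: int, d_ff: int) -> int:
    return (attention_params(d_model)
            + feedforward_params(d_model, d_ff)
            + 2 * layernorm_params(d_model))


def cross_attention_layer(d_model: int, d_ff: int) -> int:
    # Self-attention plus cross-attention, so two attention blocks and three norms.
    return (2 * attention_params(d_model)
            + feedforward_params(d_model, d_ff)
            + 3 * layernorm_params(d_model))


D_MODEL = 256          # assumed for all components
D_FF = 4 * D_MODEL     # assumed feed-forward width
N_LAYERS = 4           # from the component list above

encoder = N_LAYERS * self_attention_layer(D_MODEL, D_FF)
predictor = N_LAYERS * cross_attention_layer(D_MODEL, D_FF)
decoder = N_LAYERS * self_attention_layer(D_MODEL, D_FF)
total = encoder + predictor + decoder
print(f"{total:,}")  # prints: 10,531,840
```

Under these assumptions the estimate lands within about 5% of the reported 10.1 million, so the stated size is plausible for this layout; the exact layer widths behind the reported number are not given in the post.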

The Training Setup

Parameter       Value
Dataset         Tiny Shakespeare (1.1M characters, 97 unique chars)
Model size      10.1M parameters
Training        10 epochs × 2,500 batches = 25,000 batches
GPU             NVIDIA A100 80GB
Training time   ~8.6 hours total
Learning rate   1e-3 with cosine decay
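
For readers unfamiliar with cosine decay, the schedule in the last row can be sketched as follows. The post does not say whether a warmup phase or a non-zero floor was used, so this version assumes neither: it decays from 1e-3 to 0 over the 25,000 batches.

```python
import math


def cosine_lr(step: int,
              total_steps: int = 25_000,
              base_lr: float = 1e-3,
              min_lr: float = 0.0) -> float:
    """Cosine-decayed learning rate at a given batch index.

    Assumptions: no warmup, and min_lr = 0 (neither is stated in the post).
    """
    # Clamp so steps outside the schedule return the endpoint values.
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 12_500, 25_000):
    print(step, cosine_lr(step))
```

The rate starts at the base value, passes through half of it at the midpoint, and reaches the floor at the final batch.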

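The loss function has three components: a cross-entropy (CE) term on the decoded text, an embedding MSE term on the predicted target embeddings, and a length MSE term on the predicted sequence length.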
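
A minimal sketch of how three such terms (cross-entropy on the decoded text, MSE on the predicted embeddings, MSE on the predicted length) might be combined into one scalar. The weighted-sum form and the weights `w_ce`, `w_emb` and `w_len` are assumptions for illustration; the post does not give the actual weighting.

```python
import math
from typing import Sequence


def cross_entropy(probs: Sequence[float], target: int) -> float:
    """Negative log-probability assigned to the correct token."""
    return -math.log(probs[target])


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def total_loss(ce: float, emb_mse: float, len_mse: float,
               w_ce: float = 1.0, w_emb: float = 1.0, w_len: float = 1.0) -> float:
    # Hypothetical weights: equal weighting is an assumption, not the post's setting.
    return w_ce * ce + w_emb * emb_mse + w_len * len_mse


# Toy values, chosen only to exercise the functions.
ce = cross_entropy([0.7, 0.2, 0.1], target=0)
emb = mse([1.0, 2.0], [1.0, 4.0])
length = mse([12.0], [10.0])
loss = total_loss(ce, emb, length)
print(ce, emb, length, loss)
```

With a plain sum like this, whichever term has the largest raw magnitude dominates the gradient, which is relevant to the results that follow.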

The Results: Three Surprising Discoveries

Discovery 1: The Model Learned Shakespeare Almost Perfectly

The cross-entropy loss, which measures how well the model reproduces text, converged to essentially zero within the first 500 batches. This means that, given the right embeddings, the decoder can reconstruct Shakespeare's text with near-perfect accuracy.

Figure (CE Convergence): CE loss drops from 0.029 to ~0 by batch 500, then stays flat for the remaining 24,500 batches.

Discovery 2: Embedding Prediction Works — But Slowly

The core innovation, predicting target embeddings from source embeddings, showed steady but modest improvement. Over 10 epochs, the embedding MSE decreased by about 25% (from 242 to 181).

Figure (Embedding Trend): Each line represents one epoch. The clear downward trend across epochs confirms the predictor is learning.

Discovery 3: The Length Prediction Task Is Broken

The length prediction loss was highly volatile, fluctuating between 0 and 1,500+ even in the stable phase. It contributed 42.5% of the total loss — nearly half — while providing minimal benefit to generation quality.

Figure (Length Volatility): Length MSE spikes unpredictably throughout training. The 20-batch moving average shows a slight upward trend, the opposite of what we want.
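
For reference, a 20-batch moving average of the kind used to smooth the length loss can be computed as below. This is a generic sketch, not the plotting code from the experiment, and the sample series is synthetic.

```python
from collections import deque
from typing import Iterable, List


def moving_average(values: Iterable[float], window: int = 20) -> List[float]:
    """Trailing moving average; early entries average over fewer points."""
    buf: deque = deque(maxlen=window)
    running = 0.0
    out: List[float] = []
    for v in values:
        if len(buf) == window:
            # The oldest value is about to be evicted from the window.
            running -= buf[0]
        buf.append(v)
        running += v
        out.append(running / len(buf))
    return out


# Synthetic series: 19 quiet batches followed by one spike of 1,500.
series = [0.0] * 19 + [1500.0]
smoothed = moving_average(series, window=20)
print(smoothed[-1])  # prints: 75.0
```

A single spike of 1,500 still moves a 20-batch average by 75, so frequent spikes can push the smoothed curve around noticeably.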

Loss Composition: What's Actually Training?

After the first few hundred batches, the loss composition stabilized to:

This means the model stopped optimizing for text quality very early and spent the remaining ~98% of training (24,500 of 25,000 batches) on embedding and length prediction. The CE loss essentially became irrelevant.
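
The arithmetic behind a loss-composition breakdown is straightforward and can be sketched as follows. The component values in the example are illustrative only, picked so that the length term comes out at the 42.5% share reported above; they are not the logged values from the run.

```python
from typing import Dict


def loss_shares(components: Dict[str, float]) -> Dict[str, float]:
    """Fraction of the total loss contributed by each component."""
    total = sum(components.values())
    if total == 0:
        return {name: 0.0 for name in components}
    return {name: value / total for name, value in components.items()}


# Illustrative numbers only (not taken from the training logs).
shares = loss_shares({"ce": 0.0, "embedding": 115.0, "length": 85.0})
print(shares)

# If CE is treated as converged at batch 500 of 25,000, this is the share of
# batches in which it no longer contributes a meaningful gradient.
remaining_fraction = (25_000 - 500) / 25_000
print(remaining_fraction)  # prints: 0.98
```

Tracking these shares per batch during training would make it visible early when one term has stopped contributing.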

Feasibility Assessment

What Worked

What Didn't Work

Key Lessons Learned

  1. Multi-task loss weighting matters enormously. When one task (CE) converges much faster than others, it stops contributing to gradients. Dynamic weighting or curriculum learning could help.

  2. Length prediction may be the wrong auxiliary task. Predicting exact token count from embeddings is inherently noisy. A classification-based approach or removing it entirely might be better.

  3. The predictor needs more capacity. With only 4 layers and 256 dimensions, the predictor likely lacks the capacity for the complex mapping from source to target embeddings.
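
One simple form of the dynamic weighting mentioned in the first lesson is to divide each loss by its own running average, so every task contributes on a similar scale regardless of raw magnitude. The sketch below is one possible scheme, not what JEPALM uses; `EmaLossBalancer`, its `decay`, and the length value of 900 are all hypothetical.

```python
from typing import Dict, Iterable, Optional


class EmaLossBalancer:
    """Hypothetical helper: weight each loss by 1 / (its running average)."""

    def __init__(self, names: Iterable[str], decay: float = 0.99, eps: float = 1e-8) -> None:
        self.decay = decay
        self.eps = eps
        self.ema: Dict[str, Optional[float]] = {name: None for name in names}

    def update(self, losses: Dict[str, float]) -> Dict[str, float]:
        """Update the running averages and return the current per-task weights."""
        weights = {}
        for name, value in losses.items():
            prev = self.ema[name]
            # Seed the average with the first observation, then smooth.
            self.ema[name] = value if prev is None else (
                self.decay * prev + (1.0 - self.decay) * value)
            weights[name] = 1.0 / (self.ema[name] + self.eps)
        return weights

    def combine(self, losses: Dict[str, float]) -> float:
        # In a real training loop the weights would be treated as constants
        # (no gradient flowing through them).
        weights = self.update(losses)
        return sum(weights[name] * losses[name] for name in losses)


balancer = EmaLossBalancer(["ce", "embedding", "length"])
# CE and embedding values echo the starting magnitudes reported above;
# the length value is made up for illustration.
combined = balancer.combine({"ce": 0.029, "embedding": 242.0, "length": 900.0})
print(combined)
```

On the first call each term is normalized to roughly 1, so the combined value is close to 3 even though the raw losses differ by four orders of magnitude. The trade-off is that a noisy task still receives as much weight as a useful one, so this does not by itself address the length-prediction issue from the second lesson.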

What's Next

The next iteration of JEPALM will focus on:

Conclusion

JEPALM offers early evidence that embedding-space sequence prediction can work as an alternative to autoregressive generation, which is consistent with LeCun's core hypothesis. The architecture trains, but the loss function design needs significant refinement.

The experiment also highlights a broader lesson in AI research: the devil is in the loss function. Even with a sound architectural idea, poor loss design can dominate training and obscure the signal. The next iteration will address these issues and provide a clearer picture of whether JEPA-style language models can compete with autoregressive approaches.


This experiment was conducted on a dual A100 80GB server. The full training code and logs are available in the JEPALM repository. We welcome contributions and ideas for improvement.