The Big Idea: What If LLMs Didn't Generate Token by Token?
Every large language model you've interacted with — GPT-4, Claude, Llama — generates text one token at a time. This autoregressive approach is simple but inherently slow: to generate 1000 tokens, the model must run 1000 forward passes.
Yann LeCun, Chief AI Scientist at Meta, has been advocating for a fundamentally different architecture: JEPA (Joint Embedding Predictive Architecture). Instead of predicting the next token, JEPA predicts the next chunk of meaning — in embedding space. The model generates the entire next sequence in a single forward pass.
We built a prototype called JEPALM (JEPA Language Model) and trained it on Shakespeare. Here's what we learned.
The Architecture
JEPALM has three components:
- TextEncoder: Converts the source text into embeddings (4-layer Transformer)
- Predictor: Maps source embeddings to target embeddings in one shot (4-layer cross-attention Transformer)
- TextDecoder: Converts predicted embeddings back to text tokens (4-layer Transformer)
[Source Text] → Encoder → [Source Embeddings] → Predictor (one-shot!) → [Target Embeddings] → Decoder → [Target Text]
Total parameters: 10.1 million — tiny by modern LLM standards, but sufficient for a proof of concept.
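As a sanity check on that figure, here is a back-of-envelope count. It is only a sketch under stated assumptions: a hidden size of 256 in all three components (the post mentions 256 dimensions only for the predictor), a feed-forward width of 4x the hidden size, standard Transformer layer layouts, and no positional embeddings or length head. None of these details are confirmed by the actual JEPALM code, and all function names are hypothetical.

```python
# Back-of-envelope parameter count. Every size below is an assumption:
# d_model=256 is taken from the remark about the predictor; the 4x
# feed-forward width and the layer layouts are illustrative guesses.

def attention_params(d_model: int) -> int:
    # Q, K, V and output projections, each d_model x d_model plus a bias.
    return 4 * (d_model * d_model + d_model)

def feed_forward_params(d_model: int, d_hidden: int) -> int:
    # Two linear layers with biases: d_model -> d_hidden -> d_model.
    return (d_model * d_hidden + d_hidden) + (d_hidden * d_model + d_model)

def layer_norm_params(d_model: int) -> int:
    # One scale vector and one shift vector.
    return 2 * d_model

def self_attention_layer(d_model: int, d_hidden: int) -> int:
    # Self-attention + feed-forward, with two layer norms.
    return (attention_params(d_model)
            + feed_forward_params(d_model, d_hidden)
            + 2 * layer_norm_params(d_model))

def cross_attention_layer(d_model: int, d_hidden: int) -> int:
    # Self-attention + cross-attention + feed-forward, with three layer norms.
    return (2 * attention_params(d_model)
            + feed_forward_params(d_model, d_hidden)
            + 3 * layer_norm_params(d_model))

def estimate_total(d_model: int = 256, d_hidden: int = 1024,
                   n_layers: int = 4, vocab_size: int = 97) -> int:
    encoder = n_layers * self_attention_layer(d_model, d_hidden)
    predictor = n_layers * cross_attention_layer(d_model, d_hidden)
    decoder = n_layers * self_attention_layer(d_model, d_hidden)
    token_embedding = vocab_size * d_model
    output_head = d_model * vocab_size + vocab_size
    return encoder + predictor + decoder + token_embedding + output_head

total = estimate_total()
print(f"estimated parameters: {total / 1e6:.1f}M")
```

Under these assumptions the estimate lands in the same range as the reported 10.1M, with the predictor as the largest single component because of its extra cross-attention block. The gap to the reported number would come from whatever the real implementation does differently.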
The Training Setup
| Parameter | Value |
|---|---|
| Dataset | Tiny Shakespeare (1.1M characters, 97 unique chars) |
| Model size | 10.1M parameters |
| Training | 10 epochs × 2,500 batches = 25,000 batches |
| GPU | NVIDIA A100 80GB |
| Training time | ~8.6 hours total |
| Learning rate | 1e-3 with cosine decay |
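For readers unfamiliar with cosine decay, the schedule can be sketched in a few lines. This assumes the rate starts at the peak of 1e-3, decays to zero over all 25,000 batches, and uses no warmup. The table does not state those details, so treat `cosine_lr` and its defaults as illustrative.

```python
import math

def cosine_lr(step: int, total_steps: int = 25_000,
              peak_lr: float = 1e-3, min_lr: float = 0.0) -> float:
    """Cosine decay from peak_lr at step 0 to min_lr at total_steps."""
    # Clamp progress to [0, 1] so out-of-range steps stay well defined.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for step in (0, 6_250, 12_500, 18_750, 25_000):
    print(step, f"{cosine_lr(step):.2e}")
```

The rate falls slowly at first, fastest around the midpoint, and flattens out again near the end of training.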
The loss function has three components:
- Cross-Entropy (CE): Measures how well the decoder reconstructs text
- Embedding MSE: Measures how well the predictor maps embeddings
- Length MSE: Measures how well the model predicts target sequence length
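To make the three terms concrete, here is a minimal pure-Python sketch of how they might be combined. It assumes an unweighted sum (all weights 1.0), which the post does not confirm. The function name `jepalm_loss`, its arguments, and the toy inputs are hypothetical.

```python
import math

def cross_entropy(logits, target_ids):
    """Mean token-level cross-entropy; logits holds one score list per position."""
    total = 0.0
    for scores, target in zip(logits, target_ids):
        peak = max(scores)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_norm - scores[target]
    return total / len(target_ids)

def mse(predicted, actual):
    """Mean squared error over two equally sized flat lists."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

def jepalm_loss(logits, target_ids, pred_emb, target_emb, pred_len, target_len,
                w_ce=1.0, w_emb=1.0, w_len=1.0):
    """Weighted sum of the three terms; equal weights are an assumption."""
    parts = {
        "ce": cross_entropy(logits, target_ids),
        "embedding_mse": mse(pred_emb, target_emb),
        "length_mse": mse([pred_len], [target_len]),
    }
    total = (w_ce * parts["ce"]
             + w_emb * parts["embedding_mse"]
             + w_len * parts["length_mse"])
    return total, parts

# Toy inputs, made up purely for illustration.
total, parts = jepalm_loss(
    logits=[[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]],
    target_ids=[0, 1],
    pred_emb=[0.5, -1.0, 2.0],
    target_emb=[0.0, -1.0, 1.0],
    pred_len=12.0,
    target_len=10.0,
)
print(total, parts)
```

Even in this toy case, a length error of two tokens produces a squared-error term larger than the other two combined, which hints at how the scales of the three terms can diverge.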
The Results: Three Surprising Discoveries
Discovery 1: The Model Learned Shakespeare Almost Perfectly
The cross-entropy loss, which measures how well the decoder reproduces the target text, converged to essentially zero within the first 500 batches. This means that, given the right embeddings, the decoder can reconstruct Shakespeare's text with near-perfect accuracy.

CE loss drops from 0.029 to ~0 by batch 500, then stays flat for the remaining 24,500 batches.
Discovery 2: Embedding Prediction Works — But Slowly
The core innovation — predicting target embeddings from source embeddings — showed steady but modest improvement. Over 10 epochs, the embedding MSE decreased by 25.3% (from 242 to 181).

Each line represents one epoch. The clear downward trend across epochs confirms the predictor is learning.
Discovery 3: The Length Prediction Task Is Broken
The length prediction loss was highly volatile, fluctuating between 0 and 1,500+ even in the stable phase. It contributed 42.5% of the total loss — nearly half — while providing minimal benefit to generation quality.

Length MSE spikes unpredictably throughout training. The 20-batch moving average shows a slight upward trend — the opposite of what we want.
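A 20-batch moving average like the one in the caption can be computed as follows. This is a sketch of a trailing window, which may differ from how the plot was actually produced, and the input series is synthetic rather than taken from the training log.

```python
from collections import deque

def moving_average(values, window=20):
    """Trailing mean over the last `window` values (shorter at the start)."""
    recent = deque(maxlen=window)
    running_sum = 0.0
    smoothed = []
    for value in values:
        if len(recent) == window:
            running_sum -= recent[0]  # this value is evicted by the append below
        recent.append(value)
        running_sum += value
        smoothed.append(running_sum / len(recent))
    return smoothed

# Synthetic, spiky series for illustration only (not the real training log).
length_loss = [10.0] * 30
length_loss[15] = 1500.0
smooth = moving_average(length_loss)
print(max(length_loss), max(smooth))
```

A single spike is damped but still lifts the average for the next 20 batches, so an upward drift in the smoothed curve can reflect more frequent spikes rather than a higher baseline.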
Loss Composition: What's Actually Training?
After the first few hundred batches, the loss composition stabilized to:
- Embedding MSE: 57.5%
- Length MSE: 42.5%
- Cross-Entropy: ~0%
This means the model stopped optimizing for text quality very early and spent the remaining ~98% of training (roughly 24,500 of 25,000 batches) on embedding and length prediction. The CE loss essentially became irrelevant.
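The composition figures are plain ratios of each term to the summed loss. Here is a small sketch of the arithmetic. The embedding value of 181 is the final figure reported above, while the length value is a hypothetical number picked so the shares come out near the reported split. It is not a logged result.

```python
def loss_shares(components):
    """Return each component's fraction of the summed loss."""
    total = sum(components.values())
    return {name: value / total for name, value in components.items()}

# 181 is the final embedding MSE reported in the post; the length value
# is a hypothetical number chosen for illustration, not a logged result.
shares = loss_shares({
    "embedding_mse": 181.0,
    "length_mse": 134.0,
    "cross_entropy": 0.0,
})
for name, share in shares.items():
    print(f"{name}: {share:.1%}")

# Fraction of training left once CE has flattened out at batch 500.
remaining = (25_000 - 500) / 25_000
print(f"remaining: {remaining:.0%}")
```

Because the shares are ratios of raw magnitudes, they say which term dominates the gradient scale, not which term matters most for output quality.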
Feasibility Assessment
What Worked
- The three-component architecture is functional — all components produce valid outputs
- Embedding-space prediction is viable — the predictor learns measurable improvements
- Non-autoregressive generation is achievable — the model generates entire sequences in one forward pass
What Didn't Work
- Loss imbalance: CE converged too quickly, leaving the model training on only two tasks
- Length prediction noise: The volatile length loss introduces gradient noise that may slow embedding convergence
- Slow embedding learning: 25.3% improvement over 10 epochs suggests the task is harder than expected
Key Lessons Learned
Multi-task loss weighting matters enormously. When one task (CE) converges much faster than the others, its loss sits near zero and its gradient contribution shrinks accordingly, so the remaining tasks drive almost all of the updates. Dynamic weighting or curriculum learning could help.
Length prediction may be the wrong auxiliary task. Predicting exact token count from embeddings is inherently noisy. A classification-based approach or removing it entirely might be better.
The predictor needs more capacity. With only 4 layers and 256 dimensions, the predictor is likely under-capacity for the complex mapping from source to target embeddings.
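One way to picture the classification-based alternative to length regression mentioned above: bucket the target length and score bucket logits with cross-entropy. This is only a sketch. The bucket edges, the overflow bucket, and the function names are all hypothetical choices, not part of JEPALM.

```python
import bisect
import math

# Hypothetical bucket edges (inclusive upper bounds, in tokens).
BUCKET_EDGES = [8, 16, 32, 64, 128]

def length_to_bucket(length: int) -> int:
    """Index of the first bucket whose upper bound is >= length.

    Lengths above the last edge fall into one extra overflow bucket.
    """
    return bisect.bisect_left(BUCKET_EDGES, length)

def bucket_cross_entropy(logits, length: int) -> float:
    """Cross-entropy of bucket logits against the true length's bucket."""
    target = length_to_bucket(length)
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target]

# One logit per bucket, including the overflow bucket (six in total).
logits = [0.1, 2.5, 0.3, -1.0, -2.0, -3.0]
print(length_to_bucket(12), bucket_cross_entropy(logits, 12))
```

With this formulation the penalty grows with the logit gap rather than with the squared token-count error, so a single badly mispredicted length would be less likely to produce the kind of spikes seen in the length MSE. Whether that helps embedding convergence is untested here.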
What's Next
The next iteration of JEPALM will focus on:
- Dynamic loss weighting using uncertainty-based methods
- Removing or reducing the length loss component
- Scaling up the predictor to 8-12 layers with 512-1024 dimensions
- Two-phase training: pretrain encoder/decoder first, then train the predictor
- Larger datasets: moving from Tiny Shakespeare to WikiText-103 or OpenWebText
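For the first item in the list, one common uncertainty-based formulation gives each task a learnable log-variance s_i and minimizes the sum of exp(-s_i) * L_i + s_i. The sketch below uses that simplified form and is not necessarily the variant the next iteration will adopt. The length value is again a made-up magnitude, and the closed-form optimum is shown only to illustrate the effect. In practice the s_i would be trained by gradient descent alongside the model.

```python
import math

def uncertainty_weighted_loss(losses, log_variances):
    """Sum of exp(-s_i) * L_i + s_i over tasks, with s_i = log(sigma_i^2)."""
    return sum(math.exp(-s) * loss + s for loss, s in zip(losses, log_variances))

def optimal_log_variance(loss: float) -> float:
    """Minimizer of exp(-s) * L + s for a fixed L > 0 (set d/ds to zero)."""
    return math.log(loss)

# Illustrative magnitudes only: the reported embedding MSE and a
# hypothetical length MSE. Neither is fed from a real training run.
losses = [181.0, 134.0]
s_star = [optimal_log_variance(value) for value in losses]
weights = [math.exp(-s) for s in s_star]
print(weights)
print(uncertainty_weighted_loss(losses, s_star))
```

At the optimum each task's effective weight is 1 / L_i, so large-magnitude terms are scaled down automatically. The flip side is that a loss near zero, like the converged CE, pushes its weight very high, so the log-variances would likely need clamping or the CE term would need separate handling.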
Conclusion
JEPALM suggests that embedding-space sequence prediction is a workable alternative to autoregressive generation at small scale, which is consistent with LeCun's core hypothesis. The architecture works, but the loss function design needs significant refinement.
The experiment also highlights a broader lesson in AI research: the devil is in the loss function. Even with a sound architectural idea, poor loss design can dominate training and obscure the signal. The next iteration will address these issues and provide a clearer picture of whether JEPA-style language models can compete with autoregressive approaches.
This experiment was conducted on a dual A100 80GB server. The full training code and logs are available in the JEPALM repository. We welcome contributions and ideas for improvement.