JEPALM: Can a Language Model Predict the Entire Next Sentence in One Shot?

The Big Idea: What If LLMs Didn't Generate Token by Token?

Every large language model you've interacted with (GPT-4, Claude, Llama) generates text one token at a time. This autoregressive approach is simple but inherently sequential: to generate 1,000 tokens, the model must run 1,000 forward passes, one after another.

Yann LeCun, Chief AI Scientist at Meta, has been advocating for a fundamentally different architecture: JEPA (Joint Embedding Predictive Architecture). Instead of predicting the next token, a JEPA-style model predicts the next chunk of meaning in embedding space. Applied to text, the idea is to produce the embeddings for the entire next sequence in a single forward pass and decode them afterwards.

We built a prototype called JEPALM (JEPA Language Model) and trained it on Shakespeare. Here's what we learned.

The Architecture

JEPALM has three components:

TextEncoder: Converts the source text into embeddings (4-layer Transformer)
Predictor: Maps source embeddings to target embeddings in one shot (4-layer cross-attention Transformer)
TextDecoder: Converts predicted embeddings back to text tokens (4-layer Transformer)

[Source Text] → Encoder → [Source Embeddings]
                              |
                        Predictor (one-shot!)
                              |
                        [Target Embeddings]
                              |
                        Decoder → [Target Text]
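
To make the contrast with token-by-token decoding concrete, the sketch below counts model calls for the two generation styles. It is a toy illustration and not the JEPALM code: `CallCounter`, `autoregressive_generate`, `one_shot_generate`, and the lambda stubs are hypothetical names, and the stubs do no real computation. The modulus 97 simply mirrors the character vocabulary size used later in the post.

```python
from typing import Callable, List, Sequence


class CallCounter:
    """Wraps a function and counts how many times it is invoked."""

    def __init__(self, fn: Callable):
        self.fn = fn
        self.calls = 0

    def __call__(self, *args):
        self.calls += 1
        return self.fn(*args)


def autoregressive_generate(next_token: Callable[[List[int]], int],
                            prompt: Sequence[int], n_new: int) -> List[int]:
    """Token-by-token decoding: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(next_token(tokens))
    return tokens[len(prompt):]


def one_shot_generate(encoder, predictor, decoder, source: Sequence[int]) -> List[int]:
    """Encoder -> Predictor -> Decoder, each run once for the whole target."""
    source_emb = encoder(source)
    target_emb = predictor(source_emb)
    return decoder(target_emb)


# Stub components (hypothetical): they only mimic the direction of the data flow.
next_token = CallCounter(lambda toks: (toks[-1] + 1) % 97)
encoder = CallCounter(lambda toks: [[float(t)] for t in toks])       # tokens -> embeddings
predictor = CallCounter(lambda embs: [[e[0] + 1.0] for e in embs])   # source -> target embeddings
decoder = CallCounter(lambda embs: [int(e[0]) % 97 for e in embs])   # embeddings -> tokens

ar_out = autoregressive_generate(next_token, [1, 2, 3], n_new=1000)
os_out = one_shot_generate(encoder, predictor, decoder, [1, 2, 3])

print("autoregressive model calls:", next_token.calls)
print("one-shot component calls:", encoder.calls + predictor.calls + decoder.calls)
```

The call count is only part of the cost picture: each one-shot call processes a whole sequence, so fewer calls does not automatically mean proportionally less compute.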

Total parameters: 10.1 million — tiny by modern LLM standards, but sufficient for a proof of concept.

The Training Setup

Parameter	Value
Dataset	Tiny Shakespeare (1.1M characters, 97 unique chars)
Model size	10.1M parameters
Training	10 epochs × 2,500 batches = 25,000 batches
GPU	NVIDIA A100 80GB
Training time	~8.6 hours total
Learning rate	1e-3 with cosine decay
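
The table gives only "1e-3 with cosine decay", so the exact schedule is not specified. As a minimal sketch, assuming one scheduler step per batch, no warmup, and decay to zero over all 25,000 batches (all three are assumptions), the learning rate at a given step could be computed like this. `cosine_lr` is an illustrative helper, not a function from the JEPALM code.

```python
import math


def cosine_lr(step: int, total_steps: int = 25_000,
              base_lr: float = 1e-3, min_lr: float = 0.0) -> float:
    """Cosine-decayed learning rate at a given optimizer step."""
    # Clamp progress to [0, 1] so out-of-range steps stay well defined.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 6_250, 12_500, 18_750, 25_000):
    print(step, f"{cosine_lr(step):.2e}")
```

Under these assumptions the rate is half the base value at the midpoint of training and reaches zero at the final batch.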

The loss function has three components:

Cross-Entropy (CE): Measures how well the decoder reconstructs text
Embedding MSE: Measures how well the predictor maps embeddings
Length MSE: Measures how well the model predicts target sequence length
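
The three terms above can be sketched as a single objective. The post does not state how the terms are weighted, so the equal weights below are an assumption, as are the function names (`cross_entropy`, `mse`, `combined_loss`) and the toy inputs. Embeddings are flattened to plain lists to keep the example dependency-free.

```python
import math
from typing import Dict, Sequence


def cross_entropy(logits: Sequence[Sequence[float]], targets: Sequence[int]) -> float:
    """Mean per-token cross-entropy computed from raw logits."""
    total = 0.0
    for row, target in zip(logits, targets):
        peak = max(row)  # subtract the max before exponentiating, for stability
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_norm - row[target]
    return total / len(targets)


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error over two equal-length flat vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def combined_loss(logits, target_tokens, pred_emb, target_emb,
                  pred_len: float, target_len: float,
                  w_ce: float = 1.0, w_emb: float = 1.0,
                  w_len: float = 1.0) -> Dict[str, float]:
    """Weighted sum of the three terms (equal weights are an assumption)."""
    parts = {
        "ce": cross_entropy(logits, target_tokens),
        "emb_mse": mse(pred_emb, target_emb),
        "len_mse": (pred_len - target_len) ** 2,
    }
    parts["total"] = (w_ce * parts["ce"] + w_emb * parts["emb_mse"]
                      + w_len * parts["len_mse"])
    return parts


# Toy inputs (made up): two tokens, a 2-symbol vocabulary, a 2-dim embedding.
result = combined_loss(
    logits=[[2.0, 0.0], [0.0, 2.0]], target_tokens=[0, 1],
    pred_emb=[1.0, 2.0], target_emb=[0.0, 4.0],
    pred_len=10.0, target_len=12.0,
)
print({name: round(value, 4) for name, value in result.items()})
```

One thing this form makes visible: if the length target is a raw token count (an assumption), a single miss of 39 tokens already yields a squared error of 39² = 1,521, so an unweighted length term can easily rival or exceed the other terms.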

The Results: Three Surprising Discoveries

Discovery 1: The Decoder Learned to Reconstruct Shakespeare Almost Perfectly

The cross-entropy loss, which measures how well the model reproduces text, converged to essentially zero within the first 500 batches. This means that, given the right embeddings, the decoder can reconstruct Shakespeare's text with near-perfect accuracy. It does not by itself show that the predictor produces the right embeddings; that is what the embedding loss measures.

Figure: CE Convergence. CE loss drops from 0.029 to ~0 by batch 500, then stays flat for the remaining 24,500 batches.

Discovery 2: Embedding Prediction Works — But Slowly

The core innovation, predicting target embeddings from source embeddings, showed steady but modest improvement. Over 10 epochs, the embedding MSE decreased by 25.3% (from roughly 242 to 181).

Figure: Embedding Trend. Each line represents one epoch. The clear downward trend across epochs confirms the predictor is learning.

Discovery 3: The Length Prediction Task Is Broken

The length prediction loss was highly volatile, fluctuating between 0 and 1,500+ even in the stable phase. It contributed 42.5% of the total loss — nearly half — while providing minimal benefit to generation quality.

Figure: Length Volatility. Length MSE spikes unpredictably throughout training. The 20-batch moving average shows a slight upward trend, the opposite of what we want.
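
For readers who want to reproduce this kind of smoothing on their own logs, here is a minimal sketch of a trailing 20-batch moving average. The `moving_average` helper and the synthetic series are illustrative assumptions; the series is not taken from the JEPALM training logs.

```python
from collections import deque
from typing import Iterable, List


def moving_average(values: Iterable[float], window: int = 20) -> List[float]:
    """Trailing moving average; emits one value per full window."""
    if window <= 0:
        raise ValueError("window must be positive")
    buf = deque(maxlen=window)
    running = 0.0
    out: List[float] = []
    for value in values:
        if len(buf) == window:
            running -= buf[0]  # this element is evicted by the append below
        buf.append(value)
        running += value
        if len(buf) == window:
            out.append(running / window)
    return out


# Synthetic series (made up): a single spike of 1,500 in otherwise zero loss.
length_loss = [0.0] * 19 + [1500.0] + [0.0] * 20
smoothed = moving_average(length_loss, window=20)
print(smoothed[0], smoothed[-1], len(smoothed))
```

A single large spike lifts every window that contains it, which is worth keeping in mind when reading a trend off a smoothed curve of a very spiky loss.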

Loss Composition: What's Actually Training?

After the first few hundred batches, the loss composition stabilized to:

Embedding MSE: 57.5%
Length MSE: 42.5%
Cross-Entropy: ~0%

This means the cross-entropy term stopped contributing meaningfully very early (by around batch 500 of 25,000), and the remaining ~98% of training was driven by embedding and length prediction. The CE loss essentially became irrelevant.
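
The percentages above are each term's share of the summed loss. A small sketch of that calculation follows; the component values are illustrative only, chosen so that the shares match the reported split, and are not read from the training logs. `loss_shares` is a hypothetical helper name.

```python
from typing import Dict


def loss_shares(components: Dict[str, float]) -> Dict[str, float]:
    """Percentage of the summed loss contributed by each component."""
    total = sum(components.values())
    if total == 0:
        raise ValueError("total loss is zero; shares are undefined")
    return {name: 100.0 * value / total for name, value in components.items()}


# Illustrative values (made up) that reproduce the reported 57.5 / 42.5 / ~0 split.
example = {"embedding_mse": 115.0, "length_mse": 85.0, "cross_entropy": 0.0}
shares = loss_shares(example)
print({name: round(share, 1) for name, share in shares.items()})
```

A share computed this way describes loss magnitude, not gradient magnitude, so it is only a rough proxy for which term is actually steering the weights.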

Feasibility Assessment

What Worked

The three-component architecture is functional — all components produce valid outputs
Embedding-space prediction is viable — the predictor learns measurable improvements
Non-autoregressive generation is achievable — the model generates entire sequences in one forward pass

What Didn't Work

Loss imbalance: CE converged too quickly, leaving the model training on only two tasks
Length prediction noise: The volatile length loss introduces gradient noise that may slow embedding convergence
Slow embedding learning: 25.3% improvement over 10 epochs suggests the task is harder than expected

Key Lessons Learned

Multi-task loss weighting matters enormously. When one task (CE) converges much faster than others, it stops contributing to gradients. Dynamic weighting or curriculum learning could help.
Length prediction may be the wrong auxiliary task. Predicting exact token count from embeddings is inherently noisy. A classification-based approach or removing it entirely might be better.
The predictor needs more capacity. With only 4 layers and 256 dimensions, the predictor is likely under-capacity for the complex mapping from source to target embeddings.
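
The classification-based alternative for length prediction mentioned above could take several forms. One possibility, sketched here under stated assumptions, is to bucket target lengths and train with cross-entropy over buckets. The bucket edges and the names `BUCKET_EDGES`, `length_to_bucket`, and `bucket_cross_entropy` are hypothetical and not taken from JEPALM.

```python
import bisect
import math
from typing import Sequence

# Hypothetical bucket upper bounds in tokens; lengths above the last edge
# fall into one extra overflow class.
BUCKET_EDGES = [8, 16, 32, 64, 128]


def length_to_bucket(length: int, edges: Sequence[int] = BUCKET_EDGES) -> int:
    """Class index of the first bucket whose upper bound is >= length."""
    return bisect.bisect_left(edges, length)


def bucket_cross_entropy(logits: Sequence[float], target_bucket: int) -> float:
    """Cross-entropy of one prediction over length buckets, from raw logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target_bucket]


n_classes = len(BUCKET_EDGES) + 1
uninformed = bucket_cross_entropy([0.0] * n_classes, length_to_bucket(59))

print(length_to_bucket(8), length_to_bucket(9), length_to_bucket(200))
print(round(uninformed, 4))  # equals ln(6) for uniform logits over 6 classes
```

The appeal of this design is that a completely uninformed prediction costs ln(6) ≈ 1.79 here, whereas a squared error on raw counts grows quadratically with the size of the miss. The trade-off is that the decoder then needs some other way to pick an exact length within the bucket.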

What's Next

The next iteration of JEPALM will focus on:

Dynamic loss weighting using uncertainty-based methods
Removing or reducing the length loss component
Scaling up the predictor to 8-12 layers with 512-1024 dimensions
Two-phase training: pretrain encoder/decoder first, then train the predictor
Larger datasets: moving from Tiny Shakespeare to WikiText-103 or OpenWebText
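
For the first item, one well-known uncertainty-based scheme is the homoscedastic-uncertainty weighting of Kendall, Gal, and Cipolla (2018). The sketch below uses a common simplified form of it, total = Σ exp(-s_i) · L_i + s_i, where each s_i is a learnable log-variance. Whether the next JEPALM iteration will use this exact form is not stated in the post; the function names and loss values here are illustrative assumptions.

```python
import math
from typing import Dict


def uncertainty_weighted_total(losses: Dict[str, float],
                               log_vars: Dict[str, float]) -> float:
    """Sum over tasks of exp(-s_i) * L_i + s_i, with s_i a per-task log-variance."""
    return sum(math.exp(-log_vars[name]) * losses[name] + log_vars[name]
               for name in losses)


def optimal_log_var(loss: float, floor: float = 1e-8) -> float:
    """For a fixed loss L > 0, exp(-s) * L + s is minimized at s = ln(L)."""
    return math.log(max(loss, floor))  # floor avoids log(0) for a converged task


# Illustrative loss values (made up), on the scale discussed in the post.
losses = {"embedding_mse": 180.0, "length_mse": 130.0, "cross_entropy": 0.01}

# With every s_i = 0 the objective reduces to the plain unweighted sum.
equal = uncertainty_weighted_total(losses, {name: 0.0 for name in losses})

# At the per-task optimum each weighted term contributes exactly 1.
balanced = {name: optimal_log_var(value) for name, value in losses.items()}
contributions = {name: math.exp(-balanced[name]) * losses[name] for name in losses}

print(round(equal, 2))
print({name: round(c, 6) for name, c in contributions.items()})
```

In real training the s_i are learned by gradient descent alongside the model rather than solved in closed form. Note that a nearly converged term such as CE receives a very large weight under this scheme, so clamping s_i from below is a sensible precaution.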

Conclusion

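JEPALM suggests that embedding-space sequence prediction is feasible at small scale, which is consistent with LeCun's core hypothesis but does not yet validate it: this experiment tracked training losses rather than comparing generation quality against an autoregressive baseline. The architecture works, but the loss function design needs significant refinement.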

The experiment also highlights a broader lesson in AI research: the devil is in the loss function. Even with a sound architectural idea, poor loss design can dominate training and obscure the signal. The next iteration will address these issues and provide a clearer picture of whether JEPA-style language models can compete with autoregressive approaches.

This experiment was conducted on a dual A100 80GB server. The full training code and logs are available in the JEPALM repository. We welcome contributions and ideas for improvement.