Training Data
Select text corpora to concatenate for training.
| Author/Text | Language | Script | Words | Characters |
|---|---|---|---|---|
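For intuition, here is a minimal sketch of what "concatenate for training" amounts to; the file names and the 90/10 train/validation split are illustrative assumptions, not values taken from the app.

```python
from pathlib import Path

# Hypothetical corpus files; in the app you pick these from the table above.
corpus_files = ["shakespeare.txt", "austen.txt"]

# Concatenate the selected corpora into one long training string.
text = "\n".join(Path(f).read_text(encoding="utf-8") for f in corpus_files)

# Rough counterparts of the Words / Characters columns.
print(f"characters: {len(text):,}")
print(f"words (whitespace-split): {len(text.split()):,}")

# Hold out a slice for validation (the 90/10 split ratio is an assumption).
split = int(0.9 * len(text))
train_text, val_text = text[:split], text[split:]
```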
Architecture
Select a preset and/or go deeper with architecture settings.
Parameters
5.7M
Batch Size
32
Learning Rate
0.001
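A minimal sketch of how a preset like the one above might be captured as a config object; all field names and architecture values here are illustrative assumptions and are not guaranteed to reproduce the 5.7M-parameter preset.

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    # Architecture preset (illustrative values, not necessarily the exact
    # recipe behind the 5.7M-parameter preset shown above)
    n_layer: int = 4
    n_head: int = 4
    n_embd: int = 256
    block_size: int = 128
    vocab_size: int = 96
    # Training settings matching the values shown above
    batch_size: int = 32
    learning_rate: float = 1e-3
    max_iters: int = 5000  # assumption; the app exposes its own value

cfg = TrainConfig()
print(cfg)
```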
Tokenizer
Choose how input text is tokenized.
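For intuition, here is a minimal character-level tokenizer sketch, one of the simplest schemes a tokenizer setting could offer; the actual options in the app may differ.

```python
text = "hello world"

# Build a vocabulary of unique characters and map them to integer IDs.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: "".join(itos[i] for i in ids)

ids = encode("hello")
print(ids)          # [3, 2, 4, 4, 5] for this vocabulary
print(decode(ids))  # "hello"
```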
Hyperparameters
Decide on core settings for training.
Understand
Better understand the model architecture, code, and math.
Train
Train your model.
Demo mode: pre-training disabled.
Metrics
Training loss and evaluation checkpoints.
Progress
0.0%
Elapsed
-
ETA
-
Loss
-
Running Loss
-
Grad Norm
-
What am I looking at?
Progress: How far along the training is (based on max iterations).
Loss (Cross Entropy): The main objective function. Lower is better. It measures how "surprised" the model is by the real next token.
Running Loss: A smoothed average of the loss. This helps you see trends better when the raw loss is jumping around.
Grad Norm: The overall size (L2 norm) of the gradients, which controls how big the update step is. If this explodes (goes huge), training might crash. If it goes to zero, the model has stopped learning.
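A sketch of how these three numbers are typically produced in a PyTorch-style training step; the smoothing factor, the clipping threshold, and the assumption that `model(x)` returns raw logits are illustrative, not the app's exact implementation.

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, x, y, running_loss=None, beta=0.99):
    # Loss (Cross Entropy): how "surprised" the model is by the real next token.
    logits = model(x)  # assumed shape: (batch, seq, vocab)
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))

    optimizer.zero_grad(set_to_none=True)
    loss.backward()

    # Grad Norm: total L2 norm of all gradients; clip_grad_norm_ returns it.
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()

    # Running Loss: exponential moving average that smooths out the noise.
    running_loss = (loss.item() if running_loss is None
                    else beta * running_loss + (1 - beta) * loss.item())
    return loss.item(), running_loss, grad_norm.item()
```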
Dynamics
Visualize gradient norms and optimization trajectory.
What am I looking at?
Gradient Norms per Layer: This shows the magnitude (L2 norm) of the gradients for each part of the model. It tells you "how hard" each layer is trying to change (see the sketch after this list).
- Layers 0-N: The main transformer blocks.
- Embedding: The input token embeddings.
- Head: The final output layer that projects to the vocabulary.
- Norm: The Layer Normalization parameters (scale/bias) that help stabilize training.
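A sketch of how per-layer gradient norms can be collected right after `loss.backward()`; the way parameters are bucketed by name is an assumption about how the chart groups them.

```python
from collections import defaultdict

import torch

def per_layer_grad_norms(model):
    """Group parameters into coarse buckets and compute the L2 norm of each
    bucket's gradients. Call after loss.backward()."""
    sq_sums = defaultdict(float)
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        # Bucket labels are illustrative: "embedding", "head", "norm", blocks.
        if "emb" in name:
            key = "embedding"
        elif "head" in name:
            key = "head"
        elif "ln" in name or "norm" in name:
            key = "norm"
        else:
            key = name.split(".")[0]
        sq_sums[key] += p.grad.detach().pow(2).sum().item()
    return {k: v ** 0.5 for k, v in sq_sums.items()}
```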
Loss Landscape Trajectory: Neural networks live in a massive high-dimensional space (millions of parameters). We can't see that, so we cheat (a sketch follows this list):
- Random Projections: We pick two random, fixed directions (vectors) at the start. Why random? In high-dimensional spaces (millions of parameters), random projections preserve relative distances surprisingly well (Johnson-Lindenstrauss lemma). This allows us to see the "shape" of the optimization without needing to know the future trajectory.
- These vectors become our X and Y axes (unitless dimensions).
- As the model trains, we project its current weights onto this 2D plane.
- In 3D: We add the Loss as the vertical Z-axis. This lets you literally see the "descent" down the loss surface. The line shows the path the model is taking.
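A sketch of the random-projection idea: fix two random direction vectors once at the start, then at each checkpoint project the flattened weights onto them to get (x, y), with the recorded loss as z. The normalization of the directions is an assumption.

```python
import torch

def make_projection_directions(model, seed=0):
    """Two fixed random directions in parameter space, chosen once at the start."""
    n = sum(p.numel() for p in model.parameters())
    g = torch.Generator().manual_seed(seed)
    d1 = torch.randn(n, generator=g)
    d2 = torch.randn(n, generator=g)
    # Normalize so the axes are unitless and comparable.
    return d1 / d1.norm(), d2 / d2.norm()

def project_weights(model, d1, d2, loss):
    """Project the current weights onto the two directions; loss is the z-axis."""
    w = torch.cat([p.detach().flatten().cpu() for p in model.parameters()])
    x = torch.dot(w, d1).item()
    y = torch.dot(w, d2).item()
    return x, y, loss  # one point on the 3D trajectory
```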
Inspect Batch
Peek at tokens, next-token predictions, and attention.
Select a sample to inspect tokens and predictions.
What am I looking at?
Input Tokens: The unique integer IDs (and colored labels) for each part of the text. Models don't see words; they see tokens.
Target: The actual next token in the training data that the model is trying to predict.
Predictions: The top 10 tokens the model thought should come next, and their probabilities.
Attention Heatmap: This grid shows how much the model attends to past tokens when processing the current token.
- Vertical Axis (Query): The token currently being processed.
- Horizontal Axis (Key): The past tokens it is "looking back" at.
- Bright Cells: Strong attention relationships.
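A sketch of how the top-10 predictions and an attention row could be pulled from a forward pass. It assumes `model(x)` returns both logits and a list of per-layer attention tensors, which is an assumption about the model's API; which layer and head the heatmap shows is a UI choice.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def inspect_position(model, x, pos, k=10):
    """Look at one position in a sample: top-k next-token predictions and the
    attention pattern for that query position.

    Assumes model(x) returns (logits, attentions) where attentions is a list
    of (batch, heads, seq, seq) tensors -- an assumption for this sketch."""
    logits, attentions = model(x)              # x: (1, seq) token IDs
    probs = F.softmax(logits[0, pos], dim=-1)  # distribution over the vocabulary
    top_p, top_ids = probs.topk(k)             # the "Predictions" panel

    # One row of the attention heatmap: this query position attending to all
    # past tokens, averaged over heads in the last layer.
    attn_row = attentions[-1][0].mean(dim=0)[pos, : pos + 1]
    return top_ids.tolist(), top_p.tolist(), attn_row.tolist()
```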
Evaluation
Train and validation losses recorded at eval intervals.
What am I looking at?
Train Loss: The error on the data the model is actively learning from.
Val (Validation) Loss: The error on data the model has NEVER seen before.
Why compare them?
- If Val Loss starts going UP while Train Loss keeps going DOWN, the model is overfitting (memorizing the training data rather than learning patterns that generalize).
- Ideally, both go down together.
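A sketch of the periodic evaluation that produces these two curves; `get_batch` is a hypothetical helper, and the number of eval batches is an assumption.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def estimate_losses(model, get_batch, eval_iters=20):
    """Average cross-entropy on held-out batches from the train and val splits.
    `get_batch(split)` is a hypothetical helper returning (x, y) token tensors."""
    model.eval()
    out = {}
    for split in ("train", "val"):
        losses = []
        for _ in range(eval_iters):
            x, y = get_batch(split)
            logits = model(x)
            losses.append(F.cross_entropy(
                logits.view(-1, logits.size(-1)), y.view(-1)).item())
        out[split] = sum(losses) / len(losses)
    model.train()
    # If out["val"] starts rising while out["train"] keeps falling,
    # that is the overfitting signal described above.
    return out
```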