Training Data

Select text corpora to concatenate for training.

Upload Custom Files
Select Sources
Author/Text | Language | Script | Words | Characters

Architecture

Select a preset and/or go deeper with architecture settings.

Architecture
Size
Components
Dimensions
Mixture of Experts

Parameters

5.7M

Batch Size

32

Learning Rate

0.001

Tokenizer

Choose how input text is tokenized.

Hyperparameters

Decide on core settings for training.

Core
Optimization
Evaluation & Checkpointing

Understand

Better understand the model architecture, code, and math.

Architecture Diagram
Equations

Key notation

  • x: input tensor [B, L, d_model]
  • W_Q, W_K, W_V, W_O: attention matrices
  • W_in, W_out: MLP matrices
  • d_model: model dimension
  • n_heads: number of attention heads

Token embedding

E \in \mathbb{R}^{V \times d_{model}}

x_0 = E[\text{tokens}]


Positional encoding

If learned positions: x_0 = x_0 + P[\text{positions}]


Attention

Q = x W_Q, \; K = x W_K, \; V = x W_V

\text{attn}(x) = \text{softmax}(QK^T / \sqrt{d_{head}}) V


MLP

\text{mlp}(x) = W_{out} \, \sigma(W_{in} x)


Residual block

x_{i+1} = x_i + \text{attn}(x_i)

x_{i+2} = x_{i+1} + \text{mlp}(x_{i+1})

Code

No snippets available yet.
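In the meantime, here is a minimal sketch (not the app's own code) of one residual block that follows the equations above, assuming PyTorch; LayerNorm, dropout, and the embedding/output layers are left out, and the 4x MLP widening is an assumption:

import torch
import torch.nn as nn

class Block(nn.Module):
    """One residual block: x -> x + attn(x) -> + mlp(...), as in the equations."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),   # W_in
            nn.GELU(),                         # sigma
            nn.Linear(4 * d_model, d_model),   # W_out
        )

    def forward(self, x):                      # x: [B, L, d_model]
        L = x.size(1)
        # causal mask: each position may only attend to itself and the past
        causal = torch.triu(torch.ones(L, L, dtype=torch.bool, device=x.device), diagonal=1)
        a, _ = self.attn(x, x, x, attn_mask=causal)   # softmax(QK^T / sqrt(d_head)) V
        x = x + a                              # x_{i+1} = x_i + attn(x_i)
        x = x + self.mlp(x)                    # x_{i+2} = x_{i+1} + mlp(x_{i+1})
        return x

x = torch.randn(2, 8, 64)                      # [B, L, d_model]
y = Block(d_model=64, n_heads=4)(x)            # same shape out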

Train

Train your model.

Demo mode: pre-training disabled.

Metrics

Training loss and evaluation checkpoints.

Progress

0.0%

Elapsed

-

ETA

-

Loss

-

Running Loss

-

Grad Norm

-

What am I looking at?

Progress: How far along the training is (based on max iterations).

Loss (Cross Entropy): The main objective function. Lower is better. It measures how "surprised" the model is by the real next token.

Running Loss: A smoothed average of the loss. This helps you see trends better when the raw loss is jumping around.

Grad Norm: The overall size of the gradients, which sets the size of the update step. If it explodes, training can diverge; if it falls to zero, the model has stopped learning.
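A rough sketch of how these three numbers are typically produced in a PyTorch training loop; the toy model and data, the clipping threshold, and the 0.9/0.1 smoothing factor are stand-ins, not the app's actual settings:

import torch
import torch.nn.functional as F

# Toy stand-ins so the sketch runs on its own; the real app trains a transformer.
vocab, d = 64, 32
model = torch.nn.Sequential(torch.nn.Embedding(vocab, d), torch.nn.Linear(d, vocab))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

running_loss = None
for step in range(100):
    inputs = torch.randint(0, vocab, (8, 16))      # [B, L] token ids
    targets = torch.randint(0, vocab, (8, 16))     # next-token ids
    logits = model(inputs)                         # [B, L, vocab]
    # "Loss": cross entropy between predicted distribution and the real next token
    loss = F.cross_entropy(logits.view(-1, vocab), targets.view(-1))
    optimizer.zero_grad()
    loss.backward()
    # "Grad Norm": total L2 norm of all parameter gradients (this call also clips them)
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    # "Running Loss": exponential moving average of the raw loss
    running_loss = loss.item() if running_loss is None else 0.9 * running_loss + 0.1 * loss.item()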

Dynamics

Visualize gradient norms and optimization trajectory.

Gradient Norms per Layer
Start training to see gradients
Loss Landscape in 3D (Loss on Z-axis)
Start training to see 3D landscape
What am I looking at?

Gradient Norms per Layer: This shows the magnitude (L2 norm) of the gradients for each part of the model. It tells you "how hard" each layer is trying to change.

  • Layers 0-N: The main transformer blocks.
  • Embedding: The input token embeddings.
  • Head: The final output layer projecting to vocabulary.
  • Norm: The Layer Normalization parameters (scale/bias) that help stabilize training.

Loss Landscape Trajectory: Neural networks live in a massive multi-dimensional space (millions of parameters). We can't see that, so we cheat:

  • Random Projections: We pick two random, fixed directions (vectors) at the start. Why random? In high-dimensional spaces (millions of parameters), random projections preserve relative distances surprisingly well (Johnson-Lindenstrauss lemma). This allows us to see the "shape" of the optimization without needing to know the future trajectory.
  • These vectors become our X and Y axes (unitless dimensions).
  • As the model trains, we project its current weights onto this 2D plane.
  • In 3D: We add the Loss as the vertical Z-axis. This lets you literally see the "descent" down the loss surface. The line shows the path the model is taking.
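A sketch of the random-projection idea described above; the seeding, normalization, and the toy model are assumptions, not the app's exact procedure:

import torch

# Toy model stand-in; the real app projects the transformer's weights.
model = torch.nn.Linear(10, 10)

# Two fixed random directions chosen once at the start: the X and Y axes.
theta0 = torch.nn.utils.parameters_to_vector(model.parameters()).detach()
d1 = torch.randn(theta0.numel()); d1 /= d1.norm()
d2 = torch.randn(theta0.numel()); d2 /= d2.norm()

def project(model):
    """Coordinates of the current weights on the 2D plane, relative to the start."""
    theta = torch.nn.utils.parameters_to_vector(model.parameters()).detach()
    delta = theta - theta0
    return (delta @ d1).item(), (delta @ d2).item()

# At each logged step: x, y = project(model); z = current loss -> one point on the 3D path.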

Inspect Batch

Peek at tokens, next-token predictions, and attention.


Select a sample to inspect tokens and predictions.

What am I looking at?

Input Tokens: The unique integer IDs (and colored labels) for each part of the text. Models don't see words; they see tokens.

Target: The actual next token in the training data that the model is trying to predict.

Predictions: The top 10 tokens the model thought should come next, and their probabilities.

Attention Heatmap: This grid shows how much the model attends to past tokens when processing the current token.

  • Vertical Axis (Query): The token currently being processed.
  • Horizontal Axis (Key): The past tokens it is "looking back" at.
  • Bright Cells: Strong attention relationships.
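A sketch of how the Predictions list and the heatmap values can be computed; the random logits and scores are stand-ins for the model's outputs, and how the real model exposes its attention weights is an assumption:

import torch
import torch.nn.functional as F

vocab, L = 64, 16
logits = torch.randn(1, L, vocab)           # stand-in for model(input_ids): [B, L, vocab]

# Predictions: top-10 next-token candidates for the last position.
probs = F.softmax(logits[0, -1], dim=-1)
top_p, top_ids = probs.topk(10)             # probabilities and token ids

# Attention heatmap: an [L, L] matrix of query-vs-key weights for one layer/head.
# Here the scores are random; in the app they come from the model's forward pass.
scores = torch.randn(L, L)
future = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)
attn = F.softmax(scores.masked_fill(future, float("-inf")), dim=-1)   # rows = Query, cols = Key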

Evaluation

Train and validation losses recorded at eval intervals.

What am I looking at?

Train Loss: The error on the data the model is actively learning from.

Val (Validation) Loss: The error on data the model has NEVER seen before.

Why compare them?

  • If Val Loss starts going UP while Train Loss goes DOWN, the model is significantly overfitting (memorizing, not learning).
  • Ideally, both go down together.
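A sketch of the usual eval-interval routine behind these two curves; the toy model, toy token streams, batch shapes, and interval are assumptions:

import torch
import torch.nn.functional as F

vocab = 64
model = torch.nn.Sequential(torch.nn.Embedding(vocab, 32), torch.nn.Linear(32, vocab))

@torch.no_grad()
def estimate_loss(split_tokens, batches=8):
    """Average cross-entropy on one split; for val this is data the optimizer never sees."""
    model.eval()
    losses = []
    for _ in range(batches):
        i = torch.randint(0, split_tokens.numel() - 17, (1,)).item()
        x = split_tokens[i:i + 16].unsqueeze(0)       # inputs
        y = split_tokens[i + 1:i + 17].unsqueeze(0)   # targets, shifted by one token
        logits = model(x)
        losses.append(F.cross_entropy(logits.view(-1, vocab), y.view(-1)).item())
    model.train()
    return sum(losses) / len(losses)

train_tokens = torch.randint(0, vocab, (1000,))   # toy stand-ins for the corpus splits
val_tokens = torch.randint(0, vocab, (1000,))
# Every eval interval: log estimate_loss(train_tokens) and estimate_loss(val_tokens).
# A rising val curve while the train curve keeps falling is the overfitting signal above.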

Logs

No logs yet.