Training Data

Select text corpora to concatenate for training.

Upload Custom Files
Select Sources
Author/Text | Language | Script | Words | Characters

Architecture

Select a preset and/or go deeper with architecture settings.

Architecture
Size
Components
Dimensions
Mixture of Experts

Parameters

5.7M

Batch Size

32

Learning Rate

0.001

Tokenizer

Choose how input text is tokenized.

Hyperparameters

Decide on core settings for training.

Core
Optimization
Evaluation & Checkpointing

Understand

Better understand the model architecture, code, and math.

Architecture Diagram
Equations

Key notation

  • x: input tensor [B, L, d_model]
  • W_Q, W_K, W_V, W_O: attention matrices
  • W_in, W_out: MLP matrices
  • d_model: model dimension
  • n_heads: number of attention heads

Token embedding

E \in \mathbb{R}^{V \times d_{model}}

x_0 = E[\text{tokens}]


Positional encoding

If learned positions: x_0 = x_0 + P[\text{positions}]


Attention

Q = x W_Q, \; K = x W_K, \; V = x W_V

\text{attn}(x) = \text{softmax}(QK^T / \sqrt{d_{head}}) V


MLP

\text{mlp}(x) = W_{out} \, \sigma(W_{in} x)


Residual block

x_{i+1} = x_i + \text{attn}(x_i)

x_{i+2} = x_{i+1} + \text{mlp}(x_{i+1})

Code

No snippets available yet.
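In the meantime, here is a minimal sketch (not the app's own code) of one residual block that follows the equations above, assuming PyTorch; LayerNorm, dropout, and the embedding/output layers are left out, and the 4x MLP widening is an assumption:

import torch
import torch.nn as nn

class Block(nn.Module):
    """One residual block: x -> x + attn(x) -> + mlp(...), as in the equations."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),   # W_in
            nn.GELU(),                         # sigma
            nn.Linear(4 * d_model, d_model),   # W_out
        )

    def forward(self, x):                      # x: [B, L, d_model]
        L = x.size(1)
        # causal mask: each position may only attend to itself and the past
        causal = torch.triu(torch.ones(L, L, dtype=torch.bool, device=x.device), diagonal=1)
        a, _ = self.attn(x, x, x, attn_mask=causal)   # softmax(QK^T / sqrt(d_head)) V
        x = x + a                              # x_{i+1} = x_i + attn(x_i)
        x = x + self.mlp(x)                    # x_{i+2} = x_{i+1} + mlp(x_{i+1})
        return x

x = torch.randn(2, 8, 64)                      # [B, L, d_model]
y = Block(d_model=64, n_heads=4)(x)            # same shape out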

Train

Train your model.

Demo mode: pre-training disabled.

Metrics

Training loss and evaluation checkpoints.

Progress

0.0%

Elapsed

-

ETA

-

Loss

-

Running Loss

-

Grad Norm

-

What am I looking at?

Progress: How far along the training is (based on max iterations).

Loss (Cross Entropy): The main objective function. Lower is better. It measures how "surprised" the model is by the real next token.

Running Loss: A smoothed average of the loss. This helps you see trends better when the raw loss is jumping around.

Grad Norm: The overall size of the gradients, which sets the size of the update step. If it explodes, training can diverge; if it falls to zero, the model has stopped learning.
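A rough sketch of how these three numbers are typically produced in a PyTorch training loop; the toy model and data, the clipping threshold, and the 0.9/0.1 smoothing factor are stand-ins, not the app's actual settings:

import torch
import torch.nn.functional as F

# Toy stand-ins so the sketch runs on its own; the real app trains a transformer.
vocab, d = 64, 32
model = torch.nn.Sequential(torch.nn.Embedding(vocab, d), torch.nn.Linear(d, vocab))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

running_loss = None
for step in range(100):
    inputs = torch.randint(0, vocab, (8, 16))      # [B, L] token ids
    targets = torch.randint(0, vocab, (8, 16))     # next-token ids
    logits = model(inputs)                         # [B, L, vocab]
    # "Loss": cross entropy between predicted distribution and the real next token
    loss = F.cross_entropy(logits.view(-1, vocab), targets.view(-1))
    optimizer.zero_grad()
    loss.backward()
    # "Grad Norm": total L2 norm of all parameter gradients (this call also clips them)
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    # "Running Loss": exponential moving average of the raw loss
    running_loss = loss.item() if running_loss is None else 0.9 * running_loss + 0.1 * loss.item()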

Dynamics

Visualize gradient norms and optimization trajectory.

Gradient Norms per Layer
Start training to see gradients
Loss Landscape in 3D (Loss on Z-axis)
Start training to see 3D landscape
What am I looking at?

Gradient Norms per Layer: This shows the magnitude (L2 norm) of the gradients for each part of the model. It tells you "how hard" each layer is trying to change.

  • Layers 0-N: The main transformer blocks.
  • Embedding: The input token embeddings.
  • Head: The final output layer projecting to vocabulary.
  • Norm: The Layer Normalization parameters (scale/bias) that help stabilize training.

Loss Landscape Trajectory: Neural networks live in a massive multi-dimensional space (millions of parameters). We can't see that, so we cheat:

  • Random Projections: We pick two random, fixed directions (vectors) at the start. Why random? In high-dimensional spaces (millions of parameters), random projections preserve relative distances surprisingly well (Johnson-Lindenstrauss lemma). This allows us to see the "shape" of the optimization without needing to know the future trajectory.
  • These vectors become our X and Y axes (unitless dimensions).
  • As the model trains, we project its current weights onto this 2D plane.
  • In 3D: We add the Loss as the vertical Z-axis. This lets you literally see the "descent" down the loss surface. The line shows the path the model is taking.
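A sketch of the random-projection idea described above; the seeding, normalization, and the toy model are assumptions, not the app's exact procedure:

import torch

# Toy model stand-in; the real app projects the transformer's weights.
model = torch.nn.Linear(10, 10)

# Two fixed random directions chosen once at the start: the X and Y axes.
theta0 = torch.nn.utils.parameters_to_vector(model.parameters()).detach()
d1 = torch.randn(theta0.numel()); d1 /= d1.norm()
d2 = torch.randn(theta0.numel()); d2 /= d2.norm()

def project(model):
    """Coordinates of the current weights on the 2D plane, relative to the start."""
    theta = torch.nn.utils.parameters_to_vector(model.parameters()).detach()
    delta = theta - theta0
    return (delta @ d1).item(), (delta @ d2).item()

# At each logged step: x, y = project(model); z = current loss -> one point on the 3D path.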

Inspect Batch

Peek at tokens, next-token predictions, and attention.


Select a sample to inspect tokens and predictions.

What am I looking at?

Input Tokens: The unique integer IDs (and colored labels) for each part of the text. Models don't see words; they see tokens.

Target: The actual next token in the training data that the model is trying to predict.

Predictions: The top 10 tokens the model thought should come next, and their probabilities.

Attention Heatmap: This grid shows how much the model attends to past tokens when processing the current token.

  • Vertical Axis (Query): The token currently being processed.
  • Horizontal Axis (Key): The past tokens it is "looking back" at.
  • Bright Cells: Strong attention relationships.
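A sketch of how the Predictions list and the heatmap values can be computed; the random logits and scores are stand-ins for the model's outputs, and how the real model exposes its attention weights is an assumption:

import torch
import torch.nn.functional as F

vocab, L = 64, 16
logits = torch.randn(1, L, vocab)           # stand-in for model(input_ids): [B, L, vocab]

# Predictions: top-10 next-token candidates for the last position.
probs = F.softmax(logits[0, -1], dim=-1)
top_p, top_ids = probs.topk(10)             # probabilities and token ids

# Attention heatmap: an [L, L] matrix of query-vs-key weights for one layer/head.
# Here the scores are random; in the app they come from the model's forward pass.
scores = torch.randn(L, L)
future = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)
attn = F.softmax(scores.masked_fill(future, float("-inf")), dim=-1)   # rows = Query, cols = Key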

Evaluation

Train and validation losses recorded at eval intervals.

What am I looking at?

Train Loss: The error on the data the model is actively learning from.

Val (Validation) Loss: The error on data the model has NEVER seen before.

Why compare them?

  • If Val Loss starts going UP while Train Loss goes DOWN, the model is significantly overfitting (memorizing, not learning).
  • Ideally, both go down together.
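A sketch of the usual eval-interval routine behind these two curves; the toy model, toy token streams, batch shapes, and interval are assumptions:

import torch
import torch.nn.functional as F

vocab = 64
model = torch.nn.Sequential(torch.nn.Embedding(vocab, 32), torch.nn.Linear(32, vocab))

@torch.no_grad()
def estimate_loss(split_tokens, batches=8):
    """Average cross-entropy on one split; for val this is data the optimizer never sees."""
    model.eval()
    losses = []
    for _ in range(batches):
        i = torch.randint(0, split_tokens.numel() - 17, (1,)).item()
        x = split_tokens[i:i + 16].unsqueeze(0)       # inputs
        y = split_tokens[i + 1:i + 17].unsqueeze(0)   # targets, shifted by one token
        logits = model(x)
        losses.append(F.cross_entropy(logits.view(-1, vocab), y.view(-1)).item())
    model.train()
    return sum(losses) / len(losses)

train_tokens = torch.randint(0, vocab, (1000,))   # toy stand-ins for the corpus splits
val_tokens = torch.randint(0, vocab, (1000,))
# Every eval interval: log estimate_loss(train_tokens) and estimate_loss(val_tokens).
# A rising val curve while the train curve keeps falling is the overfitting signal above.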

Logs

No logs yet.