Training Data

Provide your own text file, or use the default corpus of George Orwell's books.
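
A minimal sketch of loading such a corpus, assuming a plain UTF-8 text file (the name corpus.txt is a placeholder, not the app's actual default):

```python
from pathlib import Path

def load_corpus(path: str = "corpus.txt") -> str:
    # Read the whole training file as one UTF-8 string.
    return Path(path).read_text(encoding="utf-8")

text = load_corpus()
print(f"Corpus length: {len(text):,} characters")
```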

Architecture

Select a preset, or go deeper with the detailed architecture settings, grouped into:

  • Architecture
  • Size
  • Components
  • Dimensions
  • Mixture of Experts

Current values:

  • Parameters: 5.7M
  • Batch Size: 32
  • Learning Rate: 0.001
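
The Parameters figure is the total trainable weight count. A sketch of how such a readout is commonly computed (count_parameters is a hypothetical helper, not this app's code):

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    # Total trainable weights -- the figure behind the "5.7M" readout.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# print(f"{count_parameters(model) / 1e6:.1f}M")  # e.g. "5.7M"
```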

Tokenizer

Choose how input text is tokenized.
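
The app's actual tokenizer options aren't listed here; as one common choice for small models, a minimal character-level tokenizer might look like this sketch:

```python
class CharTokenizer:
    """Maps each distinct character to an integer id."""

    def __init__(self, text: str):
        vocab = sorted(set(text))
        self.stoi = {ch: i for i, ch in enumerate(vocab)}
        self.itos = {i: ch for ch, i in self.stoi.items()}
        self.vocab_size = len(vocab)

    def encode(self, s: str) -> list[int]:
        return [self.stoi[ch] for ch in s]

    def decode(self, ids: list[int]) -> str:
        return "".join(self.itos[i] for i in ids)

tok = CharTokenizer("down with big brother")
assert tok.decode(tok.encode("big brother")) == "big brother"
```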

Hyperparameters

Decide on the core settings for training, grouped into:

  • Core
  • Optimization
  • Evaluation & Checkpointing
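
A sketch of how these settings could be gathered into one config object; only the batch size and learning rate come from the panel above, and the remaining names and defaults are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    batch_size: int = 32          # Core
    learning_rate: float = 1e-3   # Optimization
    eval_interval: int = 500      # Evaluation & Checkpointing (guessed value)
    checkpoint_every: int = 1000  # Evaluation & Checkpointing (guessed value)
```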

Understand

Better understand the model architecture, code, and math.

Architecture Diagram
Equations

Key notation

  • x: input tensor [B, L, d_model]
  • E: token embedding table [V, d_model], V the vocabulary size
  • P: learned positional embedding table
  • W_Q, W_K, W_V, W_O: attention projection matrices
  • W_in, W_out: MLP matrices
  • σ: elementwise nonlinearity
  • d_model: model dimension
  • d_head: per-head dimension (d_model / n_heads)
  • n_heads: number of attention heads

Token embedding

E \in \mathbb{R}^{V \times d_{model}}

x_0 = E[\text{tokens}]
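
In PyTorch terms the embedding is a table lookup; a minimal sketch with illustrative sizes (not this model's preset):

```python
import torch
import torch.nn as nn

V, d_model = 256, 128                  # illustrative sizes, not the preset
E = nn.Embedding(V, d_model)           # E in R^{V x d_model}
tokens = torch.randint(0, V, (4, 16))  # [B, L]
x0 = E(tokens)                         # x_0 = E[tokens], shape [B, L, d_model]
```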


Positional encoding

If learned positions: x_0 = x_0 + P[\text{positions}]
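
A sketch of the learned-positions variant, again with illustrative sizes:

```python
import torch
import torch.nn as nn

L_max, d_model = 256, 128          # illustrative sizes
P = nn.Embedding(L_max, d_model)   # learned position table
x0 = torch.randn(4, 16, d_model)   # token embeddings [B, L, d_model]
positions = torch.arange(16)       # [L]
x0 = x0 + P(positions)             # broadcasts over the batch dimension
```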


Attention

Q = x W_Q, \quad K = x W_K, \quad V = x W_V

\text{attn}(x) = \text{softmax}\left(Q K^\top / \sqrt{d_{head}}\right) V
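
A single-head sketch of the same computation; multi-head attention splits d_model across n_heads such blocks, and the causal mask and output projection W_O are omitted here, as in the equation:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model = d_head = 128  # single head for clarity
W_Q = nn.Linear(d_model, d_head, bias=False)
W_K = nn.Linear(d_model, d_head, bias=False)
W_V = nn.Linear(d_model, d_head, bias=False)

def attn(x: torch.Tensor) -> torch.Tensor:
    Q, K, V = W_Q(x), W_K(x), W_V(x)                      # each [B, L, d_head]
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_head)  # [B, L, L]
    return F.softmax(scores, dim=-1) @ V                  # [B, L, d_head]
```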


MLP

\text{mlp}(x) = W_{out} \, \sigma(W_{in} x)
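
The same in code; the hidden width d_ff and the choice of σ are assumptions (4·d_model and GELU are common defaults):

```python
import torch.nn as nn

d_model, d_ff = 128, 512          # d_ff = 4 * d_model is a common default
W_in = nn.Linear(d_model, d_ff)
W_out = nn.Linear(d_ff, d_model)
sigma = nn.GELU()                 # σ; the app's actual activation isn't stated

def mlp(x):
    return W_out(sigma(W_in(x)))  # mlp(x) = W_out σ(W_in x)
```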


Residual block

x_{i+1} = x_i + \text{attn}(x_i)

x_{i+2} = x_{i+1} + \text{mlp}(x_{i+1})
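
A sketch of one full block combining the two residual steps. The equations above show plain residuals; the LayerNorms and the omission of a causal mask in this sketch are assumptions, not necessarily what this app builds:

```python
import torch.nn as nn

class Block(nn.Module):
    """One residual block: x + attn(x), then + mlp(...)."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # x_{i+1}
        x = x + self.mlp(self.ln2(x))                      # x_{i+2}
        return x
```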

Code

No snippets available yet.

Train

Train your model.

Demo mode: pre-training disabled.

Metrics

Training loss and evaluation checkpoints.

Live readouts during training:

  • Progress (percent of steps completed)
  • Elapsed time and ETA
  • Loss (current batch) and Running Loss (smoothed average)
  • Grad Norm
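
Of these, Grad Norm is the least self-explanatory; one common definition is the global L2 norm over all parameter gradients (a sketch, not necessarily how this app computes it):

```python
import torch

def grad_norm(model: torch.nn.Module) -> float:
    # Global L2 norm over every parameter gradient.
    total = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total += p.grad.detach().norm(2).item() ** 2
    return total ** 0.5
```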

Inspect Batch

Peek at tokens, next-token predictions, and attention.

Select a sample from the batch to inspect its tokens and next-token predictions. The attention heatmap plots Query positions (rows, ↓) against Key positions (columns, →).
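
A sketch of how per-position next-token predictions are typically extracted from the model's logits (top_predictions is a hypothetical helper, not this app's code):

```python
import torch
import torch.nn.functional as F

def top_predictions(logits: torch.Tensor, k: int = 5):
    # logits: [V] for one position; returns (probability, token_id) pairs.
    probs = F.softmax(logits, dim=-1)
    top = torch.topk(probs, k)
    return list(zip(top.values.tolist(), top.indices.tolist()))
```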

Evaluation

Train and validation losses recorded at eval intervals.
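
A sketch of the usual evaluation step behind such curves (all names are illustrative, not this app's code):

```python
import torch

@torch.no_grad()
def evaluate(model, batches, loss_fn) -> float:
    # Mean loss over held-out batches, recorded at each eval interval.
    model.eval()
    losses = [loss_fn(model(x), y).item() for x, y in batches]
    model.train()
    return sum(losses) / len(losses)
```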

Logs

No logs yet.