Deep dive: Llama3 from scratch, LinearBoost, LoRA Learns Less and Forgets Less

LLaMa 3 Implemented From Scratch in Python
A new repository, praised by Andrej Karpathy, provides a detailed implementation of Llama3, breaking down each component from matrix multiplication in attention mechanisms to positional encoding. Users can load tensors directly from Meta's official Llama3 model file after downloading the weights.
Core Components Covered
- Tokenization Process: Uses the tiktoken library for tokenization with Meta-Llama-3-8B's tokenizer model. Special tokens and mergeable ranks are loaded, text is converted into tokens, the tokens are mapped to embeddings, and the embeddings are then RMS-normalized.
- Attention Mechanism: Details weights for queries, keys, values, and outputs. Explains splitting each query vector into pairs, treating each pair as a complex number, and rotating it using RoPE (rotary positional embedding) by an angle that depends on the token's position. The rotation frequencies come from the rope_theta parameter in the model's config, and the result is a rotated query vector.
- Multi-Head Attention Operations: Includes matrix multiplication for query-key scores and masking future tokens during training, so each position only attends to itself and earlier tokens. The attention scores matrix maps token relationships. The scores are then used to weight the value vectors, resulting in a final attention vector.
- Feedforward Network: Uses SwiGLU to process edited embeddings further. Each layer performs these operations, with the final embedding normalized and decoded into token predictions after 32 transformer layers.
- Visualizations and Practical Examples: Offers practical examples and visualizations, such as heatmaps for attention scores, to aid in understanding Llama3's architecture.
This detailed, step-by-step implementation makes Llama3's architecture and functioning accessible for further experimentation and research.
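To make the RoPE and masking steps above more concrete, here is a minimal sketch in pure Python. It is not taken from the repository: the function names `rope_rotate` and `causal_attention` are hypothetical, it handles a single attention head with plain lists instead of tensors, and the default `rope_theta` is only a placeholder for the value that should be read from the model's config.

```python
import cmath
import math

def rope_rotate(vec, position, rope_theta=500000.0):
    """Rotate a query or key vector by its token position (RoPE).

    Consecutive elements are paired and treated as complex numbers;
    each pair is rotated by position * frequency. rope_theta is a
    placeholder default and should come from the model's config.
    """
    dim = len(vec)
    assert dim % 2 == 0, "vector length must be even to form pairs"
    out = []
    for i in range(dim // 2):
        freq = 1.0 / (rope_theta ** (2 * i / dim))
        pair = complex(vec[2 * i], vec[2 * i + 1])
        rotated = pair * cmath.exp(1j * position * freq)
        out.extend([rotated.real, rotated.imag])
    return out

def causal_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    Position t only sees positions 0..t, which has the same effect as
    setting the scores of future tokens to -inf before the softmax.
    """
    dim = len(queries[0])
    outputs = []
    for t, q in enumerate(queries):
        scores = [
            sum(a * b for a, b in zip(q, keys[s])) / math.sqrt(dim)
            for s in range(t + 1)
        ]
        # Numerically stable softmax over the visible positions.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([
            sum(w * values[s][j] for s, w in enumerate(weights))
            for j in range(len(values[0]))
        ])
    return outputs
```

One property worth noticing: because both queries and keys are rotated, the dot product between a rotated query at position m and a rotated key at position n depends only on the offset m - n, which is what lets the attention scores encode relative position.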
LinearBoost outperforms CatBoost, XGBoost, LightGBM on five benchmark datasets
LinearBoost is based on boosting a linear classifier to enhance its performance. The reported tests show it outperforming traditional GBDT algorithms in both accuracy and response time across five well-known datasets. The key to LinearBoost's enhanced performance lies in its approach at each estimator stage. Unlike the decision trees used in GBDTs, which select features sequentially, LinearBoost uses a linear classifier as its building block, considering all available features simultaneously. This comprehensive feature integration allows for more robust decision-making at every step.
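The general idea of boosting a linear base learner can be sketched as follows. This is not the LinearBoost algorithm itself: it is a generic AdaBoost-style loop, and the base learner (a weighted class-centroid classifier) is an assumption chosen only because it is linear and uses every feature at once. All function names are hypothetical, and labels are assumed to be -1 or +1 with both classes present.

```python
import math

def fit_linear(X, y, w):
    """Weighted centroid classifier: a linear rule over all features."""
    n_features = len(X[0])
    pos_w = sum(wi for wi, yi in zip(w, y) if yi == 1)
    neg_w = sum(wi for wi, yi in zip(w, y) if yi == -1)
    mu_pos = [sum(wi * xi[j] for wi, xi, yi in zip(w, X, y) if yi == 1) / pos_w
              for j in range(n_features)]
    mu_neg = [sum(wi * xi[j] for wi, xi, yi in zip(w, X, y) if yi == -1) / neg_w
              for j in range(n_features)]
    # The decision boundary is the hyperplane midway between the centroids.
    coef = [p - n for p, n in zip(mu_pos, mu_neg)]
    bias = -sum(c * (p + n) / 2 for c, p, n in zip(coef, mu_pos, mu_neg))
    return coef, bias

def predict_linear(model, x):
    coef, bias = model
    return 1 if sum(c * v for c, v in zip(coef, x)) + bias >= 0 else -1

def boost_linear(X, y, n_rounds=10):
    """AdaBoost-style loop that reweights samples between linear learners."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(n_rounds):
        model = fit_linear(X, y, w)
        preds = [predict_linear(model, x) for x in X]
        err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
        if err >= 0.5:
            break  # no better than chance on the weighted data
        err = max(err, 1e-10)  # avoid division by zero for a perfect learner
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, model))
        if err <= 1e-10:
            break  # training data already separated
        # Increase the weight of misclassified samples, then renormalize.
        w = [wi * math.exp(-alpha * yi * p) for wi, yi, p in zip(w, y, preds)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict_ensemble(ensemble, x):
    score = sum(alpha * predict_linear(m, x) for alpha, m in ensemble)
    return 1 if score >= 0 else -1
```

Each round fits one hyperplane over all features, in contrast to a tree that splits on one feature at a time. The reweighting step is what lets later linear learners focus on the samples earlier ones got wrong.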
LoRA Learns Less and Forgets Less
Problem: Low-Rank Adaptation (LoRA) is a parameter-efficient finetuning method for large language models, but its performance compared to full finetuning is unclear, particularly in specialized domains like programming and mathematics.
Solution: The study compares LoRA and full finetuning on programming and mathematics, using instruction finetuning (≈100K prompt-response pairs) and continued pretraining (≈10B tokens). It evaluates performance and regularization effects, analyzing perturbation ranks and comparing with techniques like weight decay and dropout.
Results: LoRA underperforms full finetuning in the target domains but better preserves base model performance on tasks outside them. Full finetuning learns weight perturbations with a rank 10-100X higher than typical LoRA configurations, which may explain some of the performance gap. Applying LoRA to all layers yields better results, and LoRA offers stronger regularization than common techniques such as dropout and weight decay.
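The rank gap is easier to see with the low-rank update written out. The sketch below assumes the standard LoRA formulation, where a frozen weight matrix W receives an update (alpha / r) * B @ A with B of shape d_out x r and A of shape r x d_in. The function names and the example dimensions are illustrative and are not taken from the study.

```python
def matmul(left, right):
    """Plain-list matrix product, enough for a small illustration."""
    return [
        [sum(left[i][k] * right[k][j] for k in range(len(right)))
         for j in range(len(right[0]))]
        for i in range(len(left))
    ]

def lora_delta(B, A, alpha, r):
    """Weight update (alpha / r) * B @ A; its rank can never exceed r."""
    scale = alpha / r
    return [[scale * v for v in row] for row in matmul(B, A)]

def trainable_params(d_out, d_in, r):
    """Trainable parameters for full finetuning vs. LoRA on one matrix."""
    full = d_out * d_in
    lora = r * (d_in + d_out)
    return full, lora

# Example with hypothetical sizes: a 4096 x 4096 projection at rank 16.
full, lora = trainable_params(4096, 4096, 16)
print(full, lora, full // lora)  # 16777216 131072 128
```

Because the update is the product of two thin matrices, it is confined to a subspace of rank at most r. That constraint is a plausible reading of both findings: less capacity to learn the new domain, and less room to overwrite what the base model already knows.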