Deep-dive technical articles on Medium, implementing LLM inference primitives from scratch with full math derivations.

LLM Inference Series

Demystifying GPTQ: From Lagrange Multipliers to Vectorized PyTorch

Derives the GPTQ weight-quantization update from scratch with Lagrange multipliers, works through a 2D example by hand, and translates it into vectorized PyTorch. Covers the inverse-Hessian compensation that lets 4-bit LLMs stay close to their full-precision perplexity.
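
As a rough illustration of the compensation step, here is a minimal pure-Python sketch of the single-weight update, delta = -(w_q - quant(w_q)) / [H^-1]_qq * H^-1[:, q]. The 2x2 Hessian, the weights, and the rounding grid are made-up values for illustration, not numbers taken from the article.

```python
def quantize(x, step=1.0):
    """Round to the nearest grid point (a stand-in for a real 4-bit grid)."""
    return round(x / step) * step


def obs_update(w, h_inv, q, step=1.0):
    """Quantize weight q, then shift every weight to compensate.

    delta = -(w_q - quant(w_q)) / [H^-1]_qq * H^-1[:, q]
    """
    scale = (w[q] - quantize(w[q], step)) / h_inv[q][q]
    return [w_i - scale * h_inv[i][q] for i, w_i in enumerate(w)]


def quad_error(delta, h):
    """Quadratic form delta^T H delta, proportional to the squared output error."""
    n = len(delta)
    return sum(delta[i] * h[i][j] * delta[j] for i in range(n) for j in range(n))


# Hypothetical 2D problem: the Hessian is assumed to be [[2, 1], [1, 2]].
h = [[2.0, 1.0], [1.0, 2.0]]
h_inv = [[2 / 3, -1 / 3], [-1 / 3, 2 / 3]]  # exact inverse of h
w = [0.4, 1.3]

w_comp = obs_update(w, h_inv, q=0)   # quantize w[0] and compensate w[1]
w_naive = [quantize(w[0]), w[1]]     # quantize w[0] and leave w[1] alone

err_comp = quad_error([a - b for a, b in zip(w_comp, w)], h)
err_naive = quad_error([a - b for a, b in zip(w_naive, w)], h)
print(w_comp, err_comp, err_naive)
```

In this toy case the compensated weights come out near [0.0, 1.5], and the quadratic error drops from about 0.32 (plain rounding) to about 0.24.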

Minimal LLM Inference: Continuous Batching

Implements the scheduling loop, iteration-level batching, and KV-cache eviction policy that power high-throughput LLM serving. Explains why sequential serving leaves 70%+ of GPU compute idle.
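
To make the idea of iteration-level batching concrete, the following toy simulation counts decode steps under two policies. It is a sketch under strong assumptions: every request is just a number of tokens left to generate, prefill and KV-cache eviction are ignored, and the function names are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each fixed batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Free slots are refilled from the queue before every decode iteration."""
    waiting = deque(lengths)
    running = []  # tokens still to generate for each active request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode iteration produces one token per active request.
        running = [r - 1 for r in running]
        # Finished requests leave the batch immediately.
        running = [r for r in running if r > 0]
        steps += 1
    return steps


lengths = [2, 8, 3, 1]  # hypothetical output lengths, each at least 1
print(static_batching_steps(lengths, batch_size=2))      # 11
print(continuous_batching_steps(lengths, batch_size=2))  # 8
```

With these made-up lengths the static policy needs 11 decode steps and the continuous policy needs 8, because short requests no longer wait for the longest one in their batch.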

Speculative Decoding: The Clever Trick Making LLMs 2x Faster

Derives the draft-verify acceptance probability, sweeps gamma to find the optimal draft length, and reports wall-clock benchmarks comparing DistilGPT2-drafted decoding against standard GPT-2 decoding.
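
The acceptance rule and the gamma trade-off can be sketched in a few lines. This assumes the standard formulation in which a draft token x sampled from the draft distribution q is accepted with probability min(1, p(x)/q(x)) under the target distribution p, and that acceptances are independent with a constant rate alpha. The distributions below are invented for illustration.

```python
def accept_prob(p_x, q_x):
    """Probability of keeping one draft token x: min(1, p(x) / q(x))."""
    return min(1.0, p_x / q_x)


def acceptance_rate(p, q):
    """Overall acceptance rate alpha = sum_x min(p(x), q(x))."""
    return sum(min(p_x, q_x) for p_x, q_x in zip(p, q))


def expected_tokens(alpha, gamma):
    """Expected tokens emitted per target-model pass with draft length gamma.

    Assumes i.i.d. acceptances: (1 - alpha^(gamma + 1)) / (1 - alpha).
    """
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)


p = [0.5, 0.3, 0.2]  # hypothetical target distribution
q = [0.6, 0.3, 0.1]  # hypothetical draft distribution
alpha = acceptance_rate(p, q)

for gamma in (1, 2, 4, 8):
    print(gamma, round(expected_tokens(alpha, gamma), 3))
```

The expected token count grows with gamma but with diminishing returns, which is why a sweep over gamma is needed once the cost of each draft step is taken into account.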

Accelerating Transformer Inference with KV Caching

Derives the exact VRAM consumption of the KV cache given model width, head count, and sequence length. Walks through a from-scratch PyTorch implementation showing precisely where and why memory explodes with context length.
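
A back-of-the-envelope version of that calculation is shown below. It assumes standard multi-head attention (one key and one value vector per head, per layer, per token, with no grouped-query sharing), and the function name and default arguments are illustrative choices.

```python
def kv_cache_bytes(n_layers, n_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Bytes needed to cache keys and values for every layer and token.

    The leading factor 2 accounts for storing both K and V.
    bytes_per_elem=2 corresponds to fp16/bf16 and 4 to fp32.
    """
    return 2 * n_layers * n_heads * head_dim * seq_len * batch_size * bytes_per_elem


# GPT-2 small: 12 layers, 12 heads, model width 768 (head_dim = 768 / 12 = 64).
size = kv_cache_bytes(n_layers=12, n_heads=12, head_dim=64, seq_len=1024)
print(size, size / 2**20)  # 37748736 bytes, i.e. 36.0 MiB in fp16
```

The formula is linear in both sequence length and batch size, so doubling either one doubles the cache.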

LoRA: Building on Fundamental Principles for Low-Rank Adaptation

Implements LoRA rank decomposition without any PEFT library, covering the math behind low-rank approximations and demonstrating significant parameter reduction with minimal quality loss.
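
The core of the method fits in a short dependency-free sketch: the frozen weight W is augmented with a low-rank product, y = W x + (alpha / r) * B (A x). Plain Python lists stand in for tensors here, and the shapes and values are assumptions chosen only to keep the arithmetic easy to follow.

```python
def matvec(m, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def lora_forward(w, a, b, x, alpha=1.0):
    """y = W x + (alpha / r) * B (A x), with A of shape r x d_in and B of shape d_out x r."""
    r = len(a)
    base = matvec(w, x)
    update = matvec(b, matvec(a, x))
    return [y + (alpha / r) * u for y, u in zip(base, update)]


def param_counts(d_in, d_out, r):
    """Trainable parameters: full fine-tuning vs. the LoRA factors A and B."""
    return d_in * d_out, r * (d_in + d_out)


w = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight
a = [[1.0, 1.0]]               # rank r = 1
b_zero = [[0.0], [0.0]]        # B starts at zero, so the adapter is a no-op
b_trained = [[1.0], [0.0]]     # hypothetical values after training
x = [2.0, 3.0]

print(lora_forward(w, a, b_zero, x))     # [2.0, 3.0]
print(lora_forward(w, a, b_trained, x))  # [7.0, 3.0]
print(param_counts(768, 768, 8))         # (589824, 12288)
```

For a 768 x 768 projection at rank 8, the adapter trains 12,288 parameters instead of 589,824, a 48x reduction for that layer.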

Reward Model Training for RLHF

Trains a reward model from paired preference data for an RLHF pipeline, covering the loss functions, the math of the Bradley-Terry model, and training-stability tricks for robust preference learning.
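
Under the Bradley-Terry model, the probability that the chosen response beats the rejected one is sigmoid(r_chosen - r_rejected), so the pairwise loss is the negative log of that quantity. The sketch below computes it for scalar rewards in a numerically stable way; the function name is hypothetical and a real training loop would apply the same formula to batched tensors.

```python
import math


def bradley_terry_loss(r_chosen, r_rejected):
    """Pairwise loss -log(sigmoid(r_chosen - r_rejected))."""
    d = r_chosen - r_rejected
    # -log(sigmoid(d)) equals log(1 + exp(-d)); branch so exp never overflows.
    if d >= 0:
        return math.log1p(math.exp(-d))
    return -d + math.log1p(math.exp(d))


print(bradley_terry_loss(0.0, 0.0))      # log(2): the model cannot tell them apart
print(bradley_terry_loss(2.0, 0.0))      # small loss: correct ranking
print(bradley_terry_loss(0.0, 2.0))      # large loss: wrong ranking
print(bradley_terry_loss(-1000.0, 0.0))  # stays finite for extreme margins
```

The loss depends only on the reward difference, which is why reward models trained this way are identified only up to an additive constant.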

Kaggle Competition Writeup — Bristol-Myers Squibb

Bristol-Myers Squibb Molecular Translation — Part 1: Introduction and EDA

Overview of the Kaggle molecular translation challenge, dataset exploration, and analysis of IUPAC name distributions. Sets up the problem framing for the deep learning approach.

Bristol-Myers Squibb Molecular Translation — Part 2: Deep Learning Modelling with LSTM

Details the EfficientNet + LSTM + Bahdanau attention architecture, the TPU training strategy, teacher forcing, beam search, and the final result: a Levenshtein distance of 8.9 (top 500).
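
For readers unfamiliar with the metric, Levenshtein distance is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. A minimal dynamic-programming sketch follows; the example strings are generic and not taken from the competition data.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b, keeping one DP row at a time."""
    prev = list(range(len(b) + 1))  # distance from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

A leaderboard score such as 8.9 is this distance averaged over all predicted strings, so lower is better.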