Blog | Sudhir Pol

Deep-dive technical articles on Medium, implementing LLM inference primitives from scratch with full math derivations.

LLM Inference Series

Unpacking Speculative Decoding: The Math Behind the Speedup

Derives the speculative decoding acceptance probability and its Total Variation identity, works a 5-token vocabulary example by hand, and builds the speedup objective, showing why a weak early draft token is far more costly than a weak late one.
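
As a rough illustration of the identity mentioned above: if a draft token x is sampled from the draft distribution q and accepted with probability min(1, p(x)/q(x)), the overall acceptance probability works out to the sum of min(p(x), q(x)), which equals 1 minus the total variation distance between p and q. The sketch below checks this on a 5-token vocabulary. The two distributions are made-up values, not the ones used in the article, and the function names are hypothetical.

```python
def acceptance_probability(p, q):
    """Probability that a draft token sampled from q is accepted against target p."""
    # E_{x~q}[min(1, p(x)/q(x))] simplifies to sum_x min(p(x), q(x)).
    return sum(min(pi, qi) for pi, qi in zip(p, q))


def total_variation(p, q):
    """Total variation distance: half the L1 distance between p and q."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


# Illustrative 5-token distributions (assumed values, not taken from the article).
target = [0.40, 0.30, 0.15, 0.10, 0.05]
draft = [0.25, 0.35, 0.20, 0.10, 0.10]

alpha = acceptance_probability(target, draft)
tv = total_variation(target, draft)
print(f"acceptance = {alpha:.2f}, 1 - TV = {1 - tv:.2f}")
```

Both printed quantities agree, which is the identity in miniature: the closer the draft distribution is to the target, the more draft tokens survive verification.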

The Math of AWQ: Protecting Salient Channels from the Inside Out

Builds AWQ from first principles: the anatomy of a quantized weight, why output error is proportional to activation magnitude, and the "tug of war" with the group step size that the per-channel scale search must balance.
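
A minimal sketch of the effect described above, under a simplifying assumption: the quantization step size is held fixed, so only the benefit of scaling is visible and not the opposing pull on the group step size. Scaling a weight up by s before rounding and scaling the activation down by s shrinks the worst-case output error to |x| * step / (2s). All names and numbers here are hypothetical.

```python
def quantize(w, step):
    """Round-to-nearest quantization onto a uniform grid with the given step."""
    return step * round(w / step)


def output_error(w, x, step, scale=1.0):
    """Absolute output error |x * (w_hat - w)| for one weight and one activation.

    The weight is multiplied by `scale` before quantization and divided by it
    afterwards, which is equivalent to dividing the activation by `scale`.
    """
    w_hat = quantize(w * scale, step) / scale
    return abs(x * (w_hat - w))


# Assumed toy values: a salient channel has a large activation magnitude x.
w, x, step = 0.37, 8.0, 0.25
for s in (1.0, 2.0, 4.0):
    bound = abs(x) * step / (2 * s)
    print(f"scale={s}: error={output_error(w, x, step, s):.4f}, bound={bound:.4f}")
```

In a real group, a larger scale can also raise the group maximum and therefore the step size for every weight in the group, which is the trade-off the scale search has to balance.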

Demystifying GPTQ: From Lagrange Multipliers to Vectorized PyTorch

Derives the GPTQ weight-quantization update from scratch with Lagrange multipliers, works through a 2D example by hand, and translates it into vectorized PyTorch. Covers the inverse-Hessian compensation that lets 4-bit LLMs keep their perplexity.
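
The compensation step can be sketched in two dimensions without any libraries. Minimizing the quadratic loss 0.5 * d^T H d subject to weight q landing exactly on its quantized value gives the update d = -((w_q - quant(w_q)) / [H^-1]_qq) * H^-1[:, q]. The Hessian, weights, and step size below are assumed toy values, and the function names are hypothetical; this is not the article's code.

```python
def inverse_2x2(h):
    """Closed-form inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = h
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def quantize_with_compensation(w, h, q, step):
    """Quantize weight q and shift the other weight to absorb the error."""
    h_inv = inverse_2x2(h)
    target = step * round(w[q] / step)
    err = (w[q] - target) / h_inv[q][q]
    # Subtracting err * H^-1[:, q] puts w[q] exactly on the grid point.
    return [w[i] - err * h_inv[i][q] for i in range(2)]


def quadratic_loss(w_old, w_new, h):
    """0.5 * d^T H d for the weight change d = w_new - w_old."""
    d = [n - o for n, o in zip(w_new, w_old)]
    return 0.5 * sum(d[i] * h[i][j] * d[j] for i in range(2) for j in range(2))


w = [0.3, 0.8]
h = [[2.0, 1.0], [1.0, 2.0]]  # assumed positive-definite Hessian
compensated = quantize_with_compensation(w, h, q=0, step=0.5)
naive = [0.5, 0.8]  # round w[0] to the grid and leave w[1] alone
print(compensated, quadratic_loss(w, compensated, h), quadratic_loss(w, naive, h))
```

With these values the compensated update costs less than plain rounding, because the second weight moves to cancel part of the error introduced in the first.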

Minimal LLM Inference: Continuous Batching

Implements the scheduling loop, iteration-level batching, and KV-cache eviction policy that power high-throughput LLM serving. Explains why sequential serving leaves 70%+ of GPU compute idle.
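
To make the idea of iteration-level batching concrete, here is a toy simulation that only counts decode iterations. It assumes every iteration costs the same regardless of batch occupancy and ignores prefill and KV-cache eviction, so it is a sketch of the scheduling idea rather than of the article's implementation. Function names and request lengths are hypothetical.

```python
from collections import deque


def continuous_batching_steps(lengths, max_batch):
    """Decode iterations when a finished sequence's slot is refilled immediately."""
    waiting = deque(lengths)
    active = []  # tokens still to generate for each running sequence
    steps = 0
    while waiting or active:
        # Admit waiting requests into any free slots before each iteration.
        while waiting and len(active) < max_batch:
            active.append(waiting.popleft())
        # One iteration emits one token per active sequence; drop finished ones.
        active = [r - 1 for r in active if r - 1 > 0]
        steps += 1
    return steps


def static_batching_steps(lengths, max_batch):
    """Decode iterations when each batch runs until its longest sequence ends."""
    return sum(
        max(lengths[i:i + max_batch]) for i in range(0, len(lengths), max_batch)
    )


requests = [2, 8, 3, 5]  # assumed output lengths in tokens
print(continuous_batching_steps(requests, max_batch=2))  # 10
print(static_batching_steps(requests, max_batch=2))      # 13
```

The gap comes from short sequences: under static batching their slot sits empty until the longest sequence in the batch finishes.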

Speculative Decoding: The Clever Trick Making LLMs 2x Faster

Full derivation of the draft-verify acceptance probability, gamma sweep analysis for optimal draft length, and wall-clock benchmarks comparing DistilGPT2-drafted vs. standard GPT-2 decoding.
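
A gamma sweep can be sketched with the standard simplified model from the speculative decoding literature, which may differ from the article's exact setup: each draft token is accepted independently with rate alpha, so one verify step yields (1 - alpha^(gamma+1)) / (1 - alpha) tokens in expectation, at a cost of gamma draft steps plus one target step. The acceptance rate and cost ratio below are assumed values, not benchmark results.

```python
def expected_tokens(alpha, gamma):
    """Expected tokens emitted per verify step with i.i.d. acceptance rate alpha."""
    if alpha == 1.0:
        return float(gamma + 1)
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)


def speedup(alpha, gamma, cost_ratio):
    """Expected wall-clock improvement over plain decoding.

    cost_ratio is the time of one draft step divided by one target step.
    """
    return expected_tokens(alpha, gamma) / (gamma * cost_ratio + 1)


# Sweep the draft length for assumed alpha and cost_ratio values.
for gamma in range(1, 9):
    print(gamma, round(speedup(alpha=0.8, gamma=gamma, cost_ratio=0.1), 3))
```

The sweep rises and then falls: past some draft length, the extra draft steps cost more than the rarely accepted late tokens return.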

Accelerating Transformer Inference with KV Caching

Derives exact VRAM consumption for KV cache given model width, heads, and sequence length. Walks through a from-scratch PyTorch implementation showing precisely where and why memory explodes with context length.
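
As a back-of-the-envelope version of that calculation: the cache stores one key and one value vector per layer, per head, per token, so its size is 2 * layers * heads * head_dim * seq_len * batch * bytes per element. The sketch below plugs in the GPT-2 small shape (12 layers, 12 heads, head dimension 64) at fp16. The function name is hypothetical, and the formula ignores allocator overhead and any grouped-query sharing of heads.

```python
def kv_cache_bytes(n_layers, n_heads, head_dim, seq_len, batch_size=1, bytes_per_elem=2):
    """Bytes needed to hold keys and values for every layer, head, and token."""
    # Leading factor of 2: one tensor for keys and one for values.
    return 2 * n_layers * n_heads * head_dim * seq_len * batch_size * bytes_per_elem


# GPT-2 small shape at its full 1024-token context, fp16 (2 bytes per element).
size = kv_cache_bytes(n_layers=12, n_heads=12, head_dim=64, seq_len=1024)
print(f"{size / 2**20:.0f} MiB per sequence")  # 36 MiB per sequence
```

The size is linear in both sequence length and batch size, which is why long contexts and large batches compete for the same memory.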

LoRA: Building on Fundamental Principles for Low-Rank Adaptation

Implements LoRA rank decomposition without any PEFT library, covering the math behind low-rank approximations and demonstrating significant parameter reduction with minimal quality loss.
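
The parameter saving follows directly from the shapes: a full d_out x d_in weight update has d_out * d_in entries, while the low-rank pair B (d_out x r) and A (r x d_in) has r * (d_out + d_in). The sketch below counts both for an assumed 768 x 768 projection at rank 8; the layer size, rank, and function name are illustrative choices, not figures from the article.

```python
def lora_param_counts(d_in, d_out, rank):
    """Return (full_update_params, lora_params) for one linear layer."""
    full = d_in * d_out
    low_rank = rank * (d_in + d_out)  # A is rank x d_in, B is d_out x rank
    return full, low_rank


full, low_rank = lora_param_counts(d_in=768, d_out=768, rank=8)
print(full, low_rank, f"{100 * low_rank / full:.2f}%")  # 589824 12288 2.08%
```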

Reward Model Training for RLHF

Trains a reward model from paired preference data for an RLHF pipeline, covering loss functions, the Bradley-Terry model math, and training stability tricks for robust preference learning.
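
Under the Bradley-Terry model, the probability that the chosen response beats the rejected one is sigmoid(r_chosen - r_rejected), so the pairwise loss is the negative log of that quantity. The sketch below is a scalar, library-free version written in a numerically stable form; the function name is hypothetical, and a real training loop would apply the same expression to batches of model outputs.

```python
import math


def bradley_terry_loss(r_chosen, r_rejected):
    """Negative log-likelihood that the chosen response is preferred."""
    margin = r_chosen - r_rejected
    # -log(sigmoid(margin)) = log(1 + exp(-margin)), computed without overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


print(bradley_terry_loss(2.0, 0.0))   # small loss: correct ordering
print(bradley_terry_loss(0.0, 2.0))   # large loss: wrong ordering
print(bradley_terry_loss(1.0, 1.0))   # log(2): the model cannot tell them apart
```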

Kaggle Competition Writeup — Bristol-Myers Squibb

Bristol-Myers Squibb Molecular Translation — Part 1: Introduction and EDA

Overview of the Kaggle molecular translation challenge, dataset exploration, and analysis of IUPAC name distributions. Sets up the problem framing for the deep learning approach.

Bristol-Myers Squibb Molecular Translation — Part 2: Deep Learning Modelling with LSTM

Details the EfficientNet + LSTM + Bahdanau attention architecture, TPU training strategy, teacher forcing, beam search, and the final result: a Levenshtein distance of 8.9 (top 500).
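
For readers unfamiliar with the metric, Levenshtein distance is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. The sketch below is the textbook dynamic-programming version in plain Python, not code from the writeup; the function name is hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b using a rolling DP row."""
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, free on a match
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```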