Projects | Sudhir Pol

Continuous Batching Inference Server

Python · LLM Inference · vLLM

Minimal LLM inference server from scratch featuring a continuous batching engine with dynamic request scheduling — the same core mechanism behind vLLM's throughput advantage over naive sequential serving. Handles variable-length requests, manages a KV-cache budget, and preempts lower-priority sequences under VRAM pressure.

→ GitHub
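The scheduling idea can be sketched as a small simulation. This is a minimal illustration and not the repository's code: the `Request` class, `run_continuous_batching`, and the token-count `kv_budget` are hypothetical names, and the sketch reserves each request's worst-case KV footprint at admission, which sidesteps the preemption logic described above.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical request record: sizes are measured in tokens.
    rid: str
    prompt_len: int
    max_new_tokens: int
    generated: int = 0

    @property
    def reserved(self) -> int:
        # Worst-case KV-cache footprint (assumption: reserve up front, no preemption).
        return self.prompt_len + self.max_new_tokens

    @property
    def done(self) -> bool:
        return self.generated >= self.max_new_tokens


def run_continuous_batching(requests, kv_budget):
    """Simulate iteration-level scheduling; return {rid: decode step at which it finished}."""
    waiting = deque(requests)
    running = []
    finished_at = {}
    step = 0
    while waiting or running:
        # Admit queued requests (FIFO) while their reservation fits in the budget.
        used = sum(r.reserved for r in running)
        while waiting:
            need = waiting[0].reserved
            if need > kv_budget:
                raise ValueError(f"request {waiting[0].rid} can never fit in the KV budget")
            if used + need > kv_budget:
                break
            used += need
            running.append(waiting.popleft())
        # One decode iteration: every running sequence produces one token.
        step += 1
        for r in running:
            r.generated += 1
            if r.done:
                finished_at[r.rid] = step
        # Finished sequences leave immediately, freeing budget for the next iteration.
        running = [r for r in running if not r.done]
    return finished_at
```

With a budget of 16 tokens and three requests `A` (prompt 4, 2 new tokens), `B` (prompt 4, 5 new tokens), and `C` (prompt 4, 1 new token), `C` is admitted as soon as `A` finishes instead of waiting for the whole batch to drain, which is the gain over static batching.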

Speculative Decoding: DistilGPT2 + GPT-2

PyTorch · Speculative Decoding · LLM

Full draft-verify loop from scratch using DistilGPT2 as the draft model and GPT-2 as the target. Derives acceptance probability from first principles, runs gamma sweep benchmarks to find the optimal draft length, and measures wall-clock speedup. Companion to the Medium article.

→ GitHub
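The acceptance step at the heart of the draft-verify loop can be sketched in plain Python. This follows the standard speculative sampling rule (accept a draft token with probability min(1, p/q), otherwise resample from the normalized residual max(0, p − q)); the function names and list-based distributions are illustrative assumptions, not the repository's PyTorch implementation.

```python
import random


def speculative_accept(draft_token, p_target, q_draft, rng=random):
    """Decide one position of the draft-verify loop.

    p_target, q_draft: probability lists over the same vocabulary.
    draft_token: index sampled from q_draft (so q_draft[draft_token] > 0).
    Returns (emitted_token, accepted).
    """
    p = p_target[draft_token]
    q = q_draft[draft_token]
    # Accept the draft token with probability min(1, p / q).
    if rng.random() < min(1.0, p / q):
        return draft_token, True
    # Rejected: resample from the residual distribution max(0, p - q).
    # rng.choices normalizes the weights, so no explicit division is needed.
    residual = [max(0.0, pt - qd) for pt, qd in zip(p_target, q_draft)]
    token = rng.choices(range(len(residual)), weights=residual, k=1)[0]
    return token, False


def expected_acceptance(p_target, q_draft):
    """Probability that a token drawn from q_draft is accepted: sum_x min(p(x), q(x))."""
    return sum(min(pt, qd) for pt, qd in zip(p_target, q_draft))
```

The closer the draft distribution is to the target, the closer `expected_acceptance` gets to 1, which is why the best draft length in a gamma sweep depends on how well the two models agree.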

Multi-Document RAG System

RAG · FAISS · Python · MLflow

Production-grade multi-document RAG pipeline: semantic chunking, FAISS dense retrieval, cross-encoder reranking, AWQ/GPTQ quantization, and MLflow experiment tracking. Demonstrates the full retrieval stack at the fidelity expected in a production MLE role.

→ GitHub

CUDA Kernels for LLM Inference

CUDA C++ · GPU Computing · LLM Inference

In-progress repository building CUDA kernels for LLM inference from the ground up. Walks through thread and memory hierarchy, memory coalescing, and reduction patterns, then applies them to deep-learning kernels — softmax and a worked-out flash attention. Includes register tiling, vectorized float4 loads, and a profiling guide using CUDA events and Nsight.

→ GitHub

Blog Writing Agent

Python · Agentic AI · LangGraph

Autonomous multi-step agent that researches a topic, outlines structure, drafts sections, and self-reviews content before producing a finished blog post. Built with a LangGraph workflow and tool-use loop — demonstrates end-to-end agentic pipeline design with human-in-the-loop checkpoints.

→ GitHub

Research Agent

Python · Agentic AI · Tool Use

Agentic research assistant that autonomously queries multiple sources, synthesizes information, and produces structured summaries. Uses a ReAct-style reasoning loop with web search and document retrieval tools, illustrating how to compose reliable agent workflows for knowledge-intensive tasks.

→ GitHub

Bristol-Myers Squibb Molecular Translation — Kaggle

Python · TensorFlow · EfficientNet · LSTM · TPU | Jan 2021

Image captioning model that predicts InChI (IUPAC International Chemical Identifier) strings from molecular structure images, built as an encoder-decoder architecture on Kaggle TPUs. Trained with teacher forcing and decoded with beam search, using an EfficientNet V3 encoder and an LSTM decoder with Bahdanau attention. Achieved a mean Levenshtein distance of 8.9 and ranked in the top 500 globally. Cut training time by 1.5 days via efficient TPU preprocessing.

→ GitHub → Blog Part 1 → Blog Part 2
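For readers unfamiliar with the metric: Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn a predicted string into the ground truth, averaged over the test set. A minimal sketch of that computation, with illustrative function names that are not taken from the project code:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b using a two-row dynamic program."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # deleting the first i characters of a
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def mean_levenshtein(predictions, targets):
    """Average edit distance over paired predictions and ground-truth strings."""
    pairs = list(zip(predictions, targets))
    return sum(levenshtein(p, t) for p, t in pairs) / len(pairs)
```

Lower is better: a score of 0 means every predicted string matched its target exactly.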

NFL Big Data Bowl 2020 — Kaggle

Python · Scikit-learn · Flask · Pandas | Nov 2020

Developed machine learning models for predicting NFL rushing play outcomes. Engineered spatiotemporal features from player tracking data and applied isotonic regression for probability calibration, achieving a CRPS of 0.017 and ranking 237th globally. Deployed the model as a Flask application on Heroku.

→ GitHub

Neural Machine Translation — Italian to English

Python · TensorFlow · Seq2Seq · Attention

Sequence-to-sequence neural machine translation system for Italian to English. Uses a bidirectional LSTM encoder with Bahdanau attention and beam search decoding, achieving a BLEU score of 0.83.
