Projects
Some of my work
Continuous Batching Inference Server
Minimal LLM inference server built from scratch around a continuous batching engine with dynamic request scheduling, one of the core mechanisms behind vLLM's throughput advantage over naive sequential serving. Handles variable-length requests, manages a KV-cache budget, and preempts lower-priority sequences under VRAM pressure.
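The scheduling idea can be sketched in a few lines of Python. This is a toy simulation, not the server's actual code: `Request`, `run_continuous_batching`, and the worst-case KV reservation policy are assumptions made for illustration, and reserving each sequence's full footprint up front sidesteps preemption entirely.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    rid: str
    prompt_len: int       # tokens in the prompt
    max_new_tokens: int   # tokens to generate
    generated: int = 0

    @property
    def footprint(self) -> int:
        # Worst-case number of KV-cache slots this sequence can occupy
        return self.prompt_len + self.max_new_tokens

    @property
    def done(self) -> bool:
        return self.generated >= self.max_new_tokens


def run_continuous_batching(requests, kv_budget):
    """Simulate decoding and return {request id: step at which it finished}."""
    waiting = deque(requests)
    running = []
    finished_at = {}
    step = 0
    while waiting or running:
        # Admit queued requests, in arrival order, while they fit the budget
        reserved = sum(r.footprint for r in running)
        while waiting and reserved + waiting[0].footprint <= kv_budget:
            req = waiting.popleft()
            running.append(req)
            reserved += req.footprint
        if not running:
            raise ValueError("head-of-queue request exceeds the KV budget")

        # One decode step: every running sequence produces one token
        step += 1
        for r in running:
            r.generated += 1

        # Retire finished sequences right away so their slots free up
        for r in running:
            if r.done:
                finished_at[r.rid] = step
        running = [r for r in running if not r.done]
    return finished_at
```

With a budget of 20 tokens, requests A (4 prompt + 2 new), B (4 + 6), and C (4 + 2) finish at steps 2, 6, and 4. C takes over A's slots as soon as A retires instead of waiting for B, which is where the gain over static batching comes from.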
Speculative Decoding: DistilGPT2 + GPT-2
Full draft-verify loop from scratch using DistilGPT2 as the draft model and GPT-2 as the target. Derives acceptance probability from first principles, runs gamma sweep benchmarks to find the optimal draft length, and measures wall-clock speedup. Companion to the Medium article.
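For reference, the standard acceptance rule can be sketched as follows. Given a target distribution p and a draft distribution q over the same vocabulary, a drafted token x is kept with probability min(1, p(x)/q(x)); on rejection, a replacement is drawn from the normalized positive part of p - q, so the emitted token is still distributed as p. The function names are hypothetical, the distributions are plain lists rather than model outputs, and `expected_tokens` assumes a constant, independent per-token acceptance rate.

```python
import random


def acceptance_rate(p, q):
    """Probability that a token drafted from q is accepted against target p."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))


def residual_distribution(p, q):
    """Distribution to resample from after a rejection: norm(max(0, p - q))."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    if total == 0.0:
        # p equals q, so a rejection cannot occur; return p as a safe default
        return list(p)
    return [d / total for d in diff]


def speculative_step(p, q, rng):
    """Draft one token from q, then accept it or resample from the residual."""
    vocab = range(len(p))
    x = rng.choices(vocab, weights=q)[0]
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True
    fallback = rng.choices(vocab, weights=residual_distribution(p, q))[0]
    return fallback, False


def expected_tokens(alpha, gamma):
    """Expected tokens emitted per target forward pass for draft length gamma."""
    if alpha == 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)
```

Sweeping `gamma` through `expected_tokens` for a measured acceptance rate gives a first estimate of where longer drafts stop paying off, before the draft model's own cost is factored in.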
Multi-Document RAG System
Multi-document RAG pipeline built to production standards: semantic chunking, FAISS dense retrieval, cross-encoder reranking, AWQ/GPTQ quantization, and MLflow experiment tracking. Covers the full retrieval stack at the level of rigor expected in a production ML engineering role.
CUDA Kernels for LLM Inference
In-progress repository building CUDA kernels for LLM inference from the ground up. Walks through the thread and memory hierarchies, memory coalescing, and reduction patterns, then applies them to deep-learning kernels: softmax and a worked-through flash attention implementation. Includes register tiling, vectorized float4 loads, and a profiling guide using CUDA events and Nsight.
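The rescaling trick at the heart of flash attention is easiest to see outside CUDA. The Python sketch below is an illustration and is not taken from the repository: it computes softmax with a single streaming pass that keeps a running maximum and a running sum, rescaling the sum whenever the maximum grows. Flash attention applies the same update tile by tile so the full score matrix never has to be materialized. `online_softmax` is a hypothetical name, and inputs are assumed to be finite.

```python
import math


def online_softmax(xs):
    """Numerically stable softmax using one streaming pass for the statistics."""
    running_max = float("-inf")
    running_sum = 0.0
    for x in xs:
        new_max = max(running_max, x)
        # Rescale the old sum to the new maximum, then add the current term
        running_sum = running_sum * math.exp(running_max - new_max) + math.exp(x - new_max)
        running_max = new_max
    # Second pass only normalizes; the statistics are already final
    return [math.exp(x - running_max) / running_sum for x in xs]
```

Because every exponent is taken relative to the current maximum, inputs such as `[1000.0, 1001.0, 1002.0]` do not overflow, which a naive `exp(x) / sum(exp(x))` would.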
Blog Writing Agent
Autonomous multi-step agent that researches a topic, outlines structure, drafts sections, and self-reviews content before producing a finished blog post. Built with a LangGraph workflow and tool-use loop — demonstrates end-to-end agentic pipeline design with human-in-the-loop checkpoints.
Research Agent
Agentic research assistant that autonomously queries multiple sources, synthesizes information, and produces structured summaries. Uses a ReAct-style reasoning loop with web search and document retrieval tools, illustrating how to compose reliable agent workflows for knowledge-intensive tasks.
Bristol-Myers Squibb Molecular Translation — Kaggle
Image captioning model that predicts a molecule's IUPAC InChI string from its structure image, using an encoder-decoder architecture trained on Kaggle TPUs. Trained with teacher forcing and decoded with beam search, using an EfficientNet V3 encoder and an LSTM decoder with Bahdanau attention; achieved a mean Levenshtein distance of 8.9 and ranked in the top 500 globally. Cut training time by 1.5 days via efficient TPU preprocessing.
NFL Big Data Bowl 2020 — Kaggle
Developed machine learning models for predicting NFL rushing play outcomes. Engineered spatiotemporal features from player tracking data and applied isotonic regression for probability calibration, achieving a CRPS of 0.017 and ranking 237th globally. Deployed the model as a Flask application on Heroku.
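For context on the metric, here is a minimal sketch of the CRPS computation, assuming the competition's definition: each prediction is a cumulative distribution over the 199 yardage values from -99 to 99, scored against a step function at the true yardage. The function name and the list-based inputs are hypothetical.

```python
def crps(cdfs, yards):
    """Mean squared gap between predicted CDFs and the true step functions.

    cdfs  : list of 199-element lists, cdfs[m][i] = P(yards <= -99 + i) for play m
    yards : list of true yardage values, one per play
    """
    total = 0.0
    for cdf, y in zip(cdfs, yards):
        for i, n in enumerate(range(-99, 100)):
            step = 1.0 if n >= y else 0.0  # Heaviside step at the true yardage
            total += (cdf[i] - step) ** 2
    return total / (199 * len(yards))
```

A prediction that puts all its mass on the true yardage scores 0, and one that is off by three yards scores 3/199, so lower is better and small, well-calibrated shifts in the CDF matter.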
Neural Machine Translation — Italian to English
Sequence-to-sequence Italian-to-English translation system with an attention mechanism. Achieved a BLEU score of 0.83 using a bidirectional LSTM encoder with Bahdanau attention and beam search decoding.