
Production LLM agent runtime with multi-step tool use, rate limiting, idempotency, audit logging, token metering, and OpenTelemetry tracing.
Production RAG pipeline with BM25 + dense retrieval, cross-encoder reranking, RAGAS evaluation suite, and Prometheus metrics for faithfulness and latency monitoring.
Systematic benchmark comparing Full FT, LoRA, and QLoRA across GPU memory, training cost, inference latency, and task performance on instruction-following datasets.
End-to-end MLOps platform with experiment tracking, registry-based model promotion, canary serving with traffic splitting, statistical drift detection, and automated retraining triggers.
Point-in-time consistent feature store with online/offline parity, sub-10ms serving from Redis, Kafka-based feature ingestion, and training-serving skew detection.
Two-tower retrieval model with FAISS approximate nearest neighbor search, LambdaMART reranking optimized on NDCG, and a feature pipeline serving sub-50ms p99 latency.
Centralized LLM gateway with per-tenant policy enforcement, semantic caching, prompt injection detection, cost tracking per model and team, and unified observability across OpenAI and Anthropic.
More projects available on GitHub