Home | Raghav Kachroo

I'm a software engineer drawn to operational problems — the kind where something is slow, manual, or breaking under load. Most of my work has been finding those points and building systems that fix them structurally: faster log search, automated on-call workflows, distributed ingestion pipelines. Lately that's been pulling me toward AI systems specifically — where the same reliability and latency problems show up, but the infrastructure is less mature.

Most of my recent work follows the same pattern: profile, find the friction point, ship a measured fix. The friction is usually a deferred cost attributed to the wrong stage, a hidden default nobody benchmarked, or a tool whose own findings are wrong. Recent examples from the Hao AI Lab on FastVideo, an open-source video-diffusion framework: a 15.6% cut in Cosmos 2.5 inference latency (a "slow stage" was absorbing a deferred half-gigabyte GPU→CPU transfer), a 22–47% cut on Wan2.1 from a model-agnostic adaptive caching module that needs no per-model adapters, and the first software port of the framework to NVIDIA's DGX Spark (GB10/Blackwell). The same instinct carries outside the lab: a Go log indexing engine sustaining 42M rows/hour with sub-microsecond lookups, and an LLM-powered anomaly detection system that scores its own outputs against labeled HDFS data.

Work Experience

01 Hao AI Lab, UC San Diego | Student Researcher
Jan 2026 – Present
- Built a clean-room, model-agnostic adaptive caching module for video-diffusion inference (residual-skip heuristic, no per-model adapters): 22–47% latency reduction on Wan2.1 across an SSIM 0.875–0.946 quality frontier, and it runs on Wan2.2 MoE, where the per-model baseline crashes.
- Audited an LLM-agent GPU profiler’s 34 skills against ground-truth Nsight Systems traces (H100 + L40S workloads), landing 2 maintainer-confirmed upstream bug fixes.
- Completed the first software port of FastVideo to NVIDIA’s DGX Spark (GB10/Blackwell): 4 models running and FlashAttention-2 built from source for sm_121; shipped a Cosmos 2.5 sampling-preset fix (sharpness 92→431) and a Wan VAE-precision flip (1.3× faster decode) upstream.
- Cut Cosmos 2.5 inference latency 15.6% on A100 by quantizing decoded frames on-GPU before a 500MB device-to-host transfer; output-identical (SSIM=1.0), cross-validated on H100/Wan.
02 Amazon | Software Development Engineer Intern
Jun 2025 – Sep 2025
- Built a distributed log indexing and query service handling 42M+ log entries/hour using parallelized binary search, reducing incident triage latency from 15+ minutes to under 45 seconds.
- Automated on-call SOPs using AWS Step Functions and Lambda, saving 12+ engineer-hours per week.
- Integrated the log query service with internal diagnostic tooling via MCP (Model Context Protocol), adding caching and query batching to keep response times under 2s during concurrent incident response.
03 Aark Global | Software Developer, AI/ML
Apr 2023 – Sep 2024
- Developed an async document ingestion pipeline processing 18,000+ pages/day, distributing tasks via Azure Queue Storage to a VM worker pool.
- Implemented a read/write routing layer during a datastore migration (reads from Cosmos DB replicas, writes to the MongoDB primary), maintaining sub-100ms P95 latency throughout the cutover.
- Designed a full-text search pipeline ingesting scanned PDFs through OCR into indexed Elasticsearch documents, enabling sub-180ms query latency over previously unsearchable content.
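
The read/write split from the migration bullet can be sketched with a small interface. Everything here is an assumption for illustration: `Store` and `Router` are hypothetical and do not mirror the Cosmos DB or MongoDB driver APIs, and falling back to the primary on a replica miss (to tolerate replication lag) is one possible policy, not necessarily the one used.

```go
package main

import "fmt"

// Store is a hypothetical minimal key-value interface standing in for a real
// datastore client; it is not the Cosmos DB or MongoDB driver API.
type Store interface {
	Get(key string) (string, bool)
	Put(key, value string)
}

// memStore is an in-memory Store used only for the demo.
type memStore map[string]string

func (m memStore) Get(k string) (string, bool) {
	v, ok := m[k]
	return v, ok
}

func (m memStore) Put(k, v string) {
	m[k] = v
}

// Router sends every write to the primary and serves reads from the replica.
// On a replica miss it falls back to the primary, so a caller can read its own
// recent write before replication catches up (an assumed policy).
type Router struct {
	Primary Store // write target (e.g. the MongoDB primary)
	Replica Store // read target (e.g. a Cosmos DB replica)
}

func (r Router) Put(k, v string) {
	r.Primary.Put(k, v)
}

func (r Router) Get(k string) (string, bool) {
	if v, ok := r.Replica.Get(k); ok {
		return v, true
	}
	return r.Primary.Get(k)
}

func main() {
	primary, replica := memStore{}, memStore{}
	var s Store = Router{Primary: primary, Replica: replica}

	s.Put("doc-1", "v1")    // lands on the primary only
	v, ok := s.Get("doc-1") // replica misses, primary answers
	fmt.Println(v, ok)

	replica.Put("doc-1", "v1") // simulate replication catching up
	v, ok = s.Get("doc-1")     // now served by the replica
	fmt.Println(v, ok)
}
```

Because `Router` itself satisfies `Store`, callers need no changes during cutover; which store backs each role becomes a configuration decision.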
04 Concentrix | Data Engineer
Jun 2022 – Mar 2023
- Replaced sequential scrapers with Airflow-orchestrated distributed ingestion jobs streamed through Kafka, increasing throughput by 60%.
- Migrated aggregations from batch to Kafka streaming, reducing data freshness lag from 3 days to 6 hours.

Recent Projects

01 JobHunter

Automated job scraper that monitors 40+ companies across 8 ATS platforms every 30 minutes, deduplicates postings in SQLite, and emails new listings via the Resend API. Deployed on a DigitalOcean VPS with a live dashboard.
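
Deduplication across 30-minute scrape runs comes down to a stable key per posting. The sketch below keeps seen keys in memory; `Posting`, `key`, and `Seen` are hypothetical, and keying on company plus a lightly normalized URL is an assumption. A SQLite-backed version could store the same key in a column with a `UNIQUE` constraint and insert with `INSERT OR IGNORE`, treating zero affected rows as a duplicate.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// Posting is a hypothetical scraped job listing.
type Posting struct {
	Company string
	Title   string
	URL     string
}

// key builds a fixed-size dedup key from company + URL. The URL is only
// trimmed (whitespace, trailing slash), since URL paths can be case-sensitive.
func key(p Posting) string {
	url := strings.TrimSuffix(strings.TrimSpace(p.URL), "/")
	sum := sha256.Sum256([]byte(strings.ToLower(p.Company) + "\x00" + url))
	return hex.EncodeToString(sum[:])
}

// Seen remembers the keys of postings already reported.
type Seen map[string]struct{}

// FilterNew returns only postings not seen before (including earlier in the
// same batch) and records their keys.
func (s Seen) FilterNew(batch []Posting) []Posting {
	var fresh []Posting
	for _, p := range batch {
		k := key(p)
		if _, dup := s[k]; dup {
			continue
		}
		s[k] = struct{}{}
		fresh = append(fresh, p)
	}
	return fresh
}

func main() {
	seen := Seen{}
	run1 := []Posting{
		{"Acme", "Software Engineer", "https://jobs.example.com/acme/101"},
		{"Acme", "SRE", "https://jobs.example.com/acme/102"},
	}
	run2 := []Posting{
		{"Acme", "Software Engineer", "https://jobs.example.com/acme/101/"}, // same posting, trailing slash
		{"Acme", "ML Engineer", "https://jobs.example.com/acme/103"},
	}
	fmt.Println(len(seen.FilterNew(run1)), "new in run 1")
	fmt.Println(len(seen.FilterNew(run2)), "new in run 2")
}
```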
02 Log Triage v2: Sub-Millisecond Incident Log Search

Go-based log triage system that finds the log entry closest to an incident timestamp in sub-millisecond time across millions of log lines, reimagined from a production system at Amazon.
03 Flare: LLM-Powered Log Anomaly Detection

End-to-end log anomaly detection pipeline with multi-model comparison (Isolation Forest, Local Outlier Factor, One-Class SVM), MLflow experiment tracking, and LLM summarization on real HDFS log data.
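
Scoring a detector against labeled data reduces to counting true positives, false positives, and false negatives. A minimal sketch, assuming one boolean prediction and one boolean label per item (for example, per HDFS block); `Scores` is a hypothetical helper, not the project's code:

```go
package main

import "fmt"

// Scores (hypothetical) computes precision, recall, and F1 for binary anomaly
// predictions against ground-truth labels (true = anomalous).
func Scores(pred, label []bool) (precision, recall, f1 float64) {
	if len(pred) != len(label) {
		panic("pred and label must have the same length")
	}
	var tp, fp, fn int
	for i := range pred {
		switch {
		case pred[i] && label[i]:
			tp++ // flagged and truly anomalous
		case pred[i] && !label[i]:
			fp++ // flagged but normal
		case !pred[i] && label[i]:
			fn++ // missed anomaly
		}
	}
	// Guard each ratio so an all-negative run yields 0 instead of NaN.
	if tp+fp > 0 {
		precision = float64(tp) / float64(tp+fp)
	}
	if tp+fn > 0 {
		recall = float64(tp) / float64(tp+fn)
	}
	if precision+recall > 0 {
		f1 = 2 * precision * recall / (precision + recall)
	}
	return precision, recall, f1
}

func main() {
	pred := []bool{true, true, false, false, true}
	label := []bool{true, false, true, false, true}
	p, r, f := Scores(pred, label)
	fmt.Printf("precision=%.2f recall=%.2f f1=%.2f\n", p, r, f)
}
```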

Latest Posts
