I'm a software engineer drawn to operational problems — the kind where something is slow, manual, or breaking under load. Most of my work has been finding those points and building systems that fix them structurally: faster log search, automated on-call workflows, distributed ingestion pipelines. Lately that's been pulling me toward AI systems specifically — where the same reliability and latency problems show up, but the infrastructure is less mature.

My recent work follows one pattern: profile, find the friction point, ship a measured fix. The friction is usually a deferred cost being misattributed, a hidden default nobody benchmarked, or a tool whose own findings are wrong. Recent examples at the Hao AI Lab on FastVideo (an open-source video-diffusion framework): 15.6% off Cosmos 2.5 inference latency (a "slow stage" was absorbing a deferred half-gigabyte GPU→CPU transfer), 22-47% off Wan2.1 via a model-agnostic adaptive caching module with no per-model adapters, and the first software port of the framework to NVIDIA's DGX Spark (GB10/Blackwell). The same instinct applies outside the lab: a Go log indexing engine sustaining 42M rows/hour with sub-microsecond lookups, and an LLM-powered anomaly detection system that scores its own outputs against labeled HDFS data.

  • 01 Hao AI Lab, UC San Diego, Student Researcher
    Jan 2026 - Present
    • Built a clean-room model-agnostic adaptive caching module for video-diffusion inference (residual-skip heuristic, no per-model adapters) — 22-47% latency reduction on Wan2.1 across an SSIM 0.875–0.946 quality frontier; runs on Wan2.2 MoE where the per-model baseline crashes.
    • Audited an LLM-agent GPU profiler’s 34 skills against ground-truth Nsight Systems traces (H100 + L40S workloads), landing 2 maintainer-confirmed upstream bug fixes.
    • Delivered the first software port of FastVideo to NVIDIA’s DGX Spark (GB10/Blackwell): 4 models running, FlashAttention-2 built from source for sm_121, and two fixes shipped upstream: a Cosmos 2.5 sampling-preset fix (sharpness 92→431) and a Wan VAE-precision flip (1.3× decode).
    • Cut Cosmos 2.5 inference latency 15.6% on A100 by quantizing decoded frames on-GPU before a 500MB device-to-host transfer; output-identical (SSIM=1.0), cross-validated on H100/Wan.
  • 02 Amazon, Software Development Engineer Intern
    Jun 2025 - Sep 2025
    • Built a distributed log indexing and query service handling 42M+ log entries/hour using parallelized binary search, reducing incident triage latency from 15+ minutes to under 45 seconds.
    • Automated on-call SOPs using AWS Step Functions and Lambda, saving 12+ engineer-hours per week.
    • Integrated the log query service with internal diagnostic tooling via MCP, adding caching and query batching to maintain sub-2s response times under concurrent incident response load.
  • 03 Aark Global, Software Developer, AI/ML
    Apr 2023 - Sep 2024
    • Developed an async document ingestion pipeline processing 18,000+ pages/day, distributing tasks via Azure Queue Storage to a VM worker pool.
    • Implemented a read/write routing layer during datastore migration — reads from Cosmos DB replicas, writes to MongoDB primary — maintaining sub-100ms P95 latency throughout cutover.
    • Designed a full-text search pipeline ingesting scanned PDFs through OCR into indexed Elasticsearch documents, enabling sub-180ms query latency over previously unsearchable content.
  • 04 Concentrix, Data Engineer
    Jun 2022 - Mar 2023
    • Replaced sequential scrapers with Airflow-orchestrated distributed ingestion jobs streamed through Kafka, increasing throughput by 60%.
    • Migrated aggregations from batch to Kafka streaming, reducing data freshness lag from 3 days to 6 hours.