GPU Rental AI Agent | Raghav Kachroo

Overview

An autonomous LLM agent that rents cloud GPUs across multiple vendors and orchestrates the full workflow: vendor selection, budgeting, payment processing, job submission, and failure recovery. Workflow state is persisted durably, so the system survives crashes and serverless restarts.
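To illustrate the durable-state idea, here is a minimal sketch of a checkpointed workflow. It is not the project's actual code: the step names, the `StateStore` interface, and `runStep` are hypothetical, and an in-memory store stands in for the database so the example runs on its own.

```typescript
// Hypothetical step names; the write-up does not list the real ones.
type Step = "select_vendor" | "pay" | "submit_job" | "monitor" | "done";

interface WorkflowState {
  id: string;
  step: Step;
  history: string[]; // event log the agent can reload after a restart
}

// Persistence boundary. Synchronous here for brevity; a real database
// client (for example a MongoDB collection) would be asynchronous.
interface StateStore {
  load(id: string): WorkflowState | undefined;
  save(state: WorkflowState): void;
}

class InMemoryStore implements StateStore {
  private docs = new Map<string, string>();

  load(id: string): WorkflowState | undefined {
    const raw = this.docs.get(id);
    return raw === undefined ? undefined : (JSON.parse(raw) as WorkflowState);
  }

  save(state: WorkflowState): void {
    // Serialize so callers never share a live object with the store.
    this.docs.set(state.id, JSON.stringify(state));
  }
}

const NEXT: Record<Step, Step> = {
  select_vendor: "pay",
  pay: "submit_job",
  submit_job: "monitor",
  monitor: "done",
  done: "done",
};

// Load the checkpoint, advance one step, and persist before returning.
// A crash between calls loses at most the step that was in flight.
function runStep(store: StateStore, id: string): WorkflowState {
  const state: WorkflowState =
    store.load(id) ?? { id, step: "select_vendor", history: [] };
  if (state.step === "done") return state;
  state.history.push(`completed ${state.step}`);
  state.step = NEXT[state.step];
  store.save(state);
  return state;
}
```

Because each call starts from whatever the store holds, a fresh serverless invocation can call `runStep` with only the workflow id and continue where the previous invocation stopped.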

Key Achievements

Autonomous Multi-Step Orchestration: The agent handles the entire GPU rental lifecycle, from vendor selection through job completion
Intelligent Planning: Structured JSON planning with Llama 3 via Fireworks AI, weighing cost, reliability, and historical failures
Durable State Machine: MongoDB-backed persistence supports multi-hour workflows across restarts and serverless boundaries
Automatic Failure Recovery: When a vendor fails mid-job, the agent reloads its reasoning history, replans, and switches vendors without losing progress
x402 Payment Integration: Machine-to-machine payments via Coinbase CDP with automatic 402-challenge handling
Robust Testing: Mock multi-vendor ecosystem with simulated job logs, payment errors, and failure injection
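
The planning and failure-recovery items above can be sketched as a replanning loop. This is an illustration under stated assumptions, not the project's implementation: in the real agent the plan comes from the LLM, whereas here a simple deterministic score (price divided by reliability) stands in for it. The `Vendor` fields, `planVendor`, and `runWithRecovery` are hypothetical names.

```typescript
// Hypothetical vendor record; field names are illustrative.
interface Vendor {
  name: string;
  pricePerHour: number; // USD per GPU-hour
  reliability: number;  // 0..1, share of past jobs that completed
}

interface Plan {
  vendor: string;
  estimatedCost: number;
}

interface Outcome {
  vendor: string | null; // vendor that completed the job, or null if none did
  tried: string[];       // vendors attempted, in order
}

// Choose the in-budget vendor with the lowest price per unit of reliability,
// skipping any vendor that has already failed in this workflow.
function planVendor(
  vendors: readonly Vendor[],
  hours: number,
  budget: number,
  failed: ReadonlySet<string>,
): Plan | null {
  let best: Vendor | null = null;
  for (const v of vendors) {
    if (failed.has(v.name) || v.reliability <= 0) continue;
    if (v.pricePerHour * hours > budget) continue;
    const better =
      best === null ||
      v.pricePerHour / v.reliability < best.pricePerHour / best.reliability;
    if (better) best = v;
  }
  return best === null
    ? null
    : { vendor: best.name, estimatedCost: best.pricePerHour * hours };
}

// Attempt the job; on failure, record the vendor and replan with it excluded.
// `attempt` is a stand-in for submitting and monitoring a job on a vendor.
function runWithRecovery(
  vendors: readonly Vendor[],
  hours: number,
  budget: number,
  attempt: (vendor: string) => boolean,
): Outcome {
  const failed = new Set<string>();
  const tried: string[] = [];
  for (;;) {
    const plan = planVendor(vendors, hours, budget, failed);
    if (plan === null) return { vendor: null, tried };
    tried.push(plan.vendor);
    if (attempt(plan.vendor)) return { vendor: plan.vendor, tried };
    failed.add(plan.vendor);
  }
}
```

The loop terminates because every failed attempt removes one vendor from consideration. For simplicity the sketch keeps the budget fixed across attempts; a fuller version would subtract whatever was already spent on the failed vendor before replanning.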

Why I Built It

GPU access is one of the most manual parts of ML workflows: checking availability across vendors, babysitting jobs, and handling failures mid-run. I wanted to see how far an agent with durable state could go in handling the full lifecycle autonomously, including replanning when a vendor fails partway through.

Technologies

TypeScript · MongoDB · Fireworks AI (Llama 3) · x402 / Coinbase CDP · Vercel