Asynchronous End-to-End Post-Training Pipeline
Active
Architected an asynchronous post-training pipeline combining SFT and GRPO reinforcement learning on Modal serverless 8×H200 compute, training open-source models to closed-source-level performance for binary and firmware vulnerability research.
Designed and built a fully asynchronous pipeline that decouples trajectory generation from policy training: independent workers communicate through a shared volume, so each stage scales independently on Modal serverless 8×H200 nodes. The pipeline is paired with a complete data and evaluation environment: data pipelines for firmware vulnerability datasets, reverse-engineering tasks, and ASan-verified binary targets; a custom agentic harness with real exploitation sandboxes (gdb, AFL++, ASan) deployed via AWS ECR; and a comprehensive benchmarking suite spanning agentic, long-context, and security tasks. Developed under a SOCOM CRADA.
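
As a rough illustration of how generation and training workers can hand off work through a shared volume, the sketch below uses a plain directory as a stand-in for the volume. It is not the project's actual code: the function names (`publish_trajectory`, `claim_batch`), the `ready/` and `claimed/` layout, and the JSON trajectory format are all hypothetical, and it assumes the underlying filesystem supports atomic renames within a directory tree. No Modal-specific API is used.

```python
import json
import os
import tempfile
from pathlib import Path
from uuid import uuid4


def publish_trajectory(volume: Path, trajectory: dict) -> Path:
    """Generation worker: publish one finished trajectory to the shared volume."""
    ready = volume / "ready"
    ready.mkdir(parents=True, exist_ok=True)
    final_path = ready / f"{uuid4().hex}.json"
    # Write to a temp file first, then rename, so a reader never sees
    # a half-written trajectory (assumes atomic rename on this filesystem).
    fd, tmp_name = tempfile.mkstemp(dir=ready, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(trajectory, f)
    os.replace(tmp_name, final_path)
    return final_path


def claim_batch(volume: Path, batch_size: int) -> list[dict]:
    """Training worker: claim up to batch_size published trajectories."""
    ready = volume / "ready"
    claimed = volume / "claimed"
    ready.mkdir(parents=True, exist_ok=True)
    claimed.mkdir(parents=True, exist_ok=True)
    batch = []
    for path in sorted(ready.glob("*.json")):
        if len(batch) >= batch_size:
            break
        target = claimed / path.name
        try:
            # Moving the file out of ready/ is what claims it.
            os.replace(path, target)
        except FileNotFoundError:
            continue  # another training worker claimed it first
        batch.append(json.loads(target.read_text()))
    return batch
```

Because neither function waits on the other, generation workers can keep publishing while a trainer consumes whatever is available, which is the property that lets each stage scale on its own. A production version would also need to handle policy staleness and volume synchronization, which this sketch leaves out.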