Flexible LLM Inference with Multi-Model Prefill and Decode
Georgia Institute of Technology
Implemented a stitched LLM architecture that uses separate models for the prefill and decode phases, improving both time-to-first-token (TTFT) and time-between-tokens (TBT) while maintaining model accuracy. Achieved 5% lower latency than the baseline, which uses the same larger model for both phases. Evaluated the architecture's generalization across tasks: question answering, summarization, code completion, and math problems.
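A minimal sketch of the two-phase handoff, assuming both checkpoints share the same KV-cache layout (layer count, head count, hidden size) so the prefill cache can be passed directly to the decode model. The checkpoint names are placeholders, and the direct cache handoff is an illustrative assumption, not the project's actual stitching method.

```python
# Sketch: prefill with a larger model, decode with a smaller one.
# Assumes compatible KV-cache shapes; a real stitched architecture
# may instead insert a learned adapter between the two models.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

PREFILL_CKPT = "org/big-model"    # hypothetical checkpoint names
DECODE_CKPT = "org/small-model"

tokenizer = AutoTokenizer.from_pretrained(PREFILL_CKPT)
prefill_model = AutoModelForCausalLM.from_pretrained(PREFILL_CKPT)
decode_model = AutoModelForCausalLM.from_pretrained(DECODE_CKPT)

@torch.no_grad()
def generate(prompt: str, max_new_tokens: int = 64) -> str:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    # Prefill phase: the larger model processes the full prompt once,
    # producing the KV cache and the first token (this drives TTFT).
    out = prefill_model(input_ids=input_ids, use_cache=True)
    past = out.past_key_values
    next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
    generated = [next_id]

    # Decode phase: the smaller model extends the handed-off cache one
    # token at a time (this drives TBT).
    for _ in range(max_new_tokens - 1):
        out = decode_model(input_ids=next_id, past_key_values=past,
                           use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
        generated.append(next_id)
        if next_id.item() == tokenizer.eos_token_id:
            break

    tokens = torch.cat(generated, dim=-1)[0]
    return tokenizer.decode(tokens, skip_special_tokens=True)
```

Greedy decoding is used here only to keep the sketch short; the split itself is orthogonal to the sampling strategy.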