Publications
Publications by category in reverse chronological order. 1 represents co-first author.
2026
- [NSDI] FlexLLM: Token-Level Co-Serving of LLM Inference and Fine-Tuning with SLO Guarantees. Proceedings of NSDI Conference 2026
- [EuroSys] AdaServe: Accelerating Multi-SLO LLM Serving with SLO-Customized Speculative Decoding. Proceedings of EuroSys Conference 2026
2025
- [OSDI] Mirage: A Multi-Level Superoptimizer for Tensor Programs. Proceedings of OSDI Conference 2025
- [ASPLOS] Helix: Distributed Serving of Large Language Models via Max-Flow on Heterogeneous GPUs. Proceedings of ASPLOS Conference 2025
- [SIGMOD] PQCache: Product Quantization-based KVCache for Long Context LLM Inference. Proceedings of SIGMOD Conference 2025
- [ICLR] NetMoE: Accelerating MoE Training through Dynamic Sample Placement (Spotlight). Proceedings of ICLR Conference 2025
2024
- [SOSP] Enabling Parallelism Hot Switching for Efficient Training of Large Language Models. Proceedings of SOSP Conference 2024
- [ASPLOS] SpotServe: Serving Generative Large Language Models on Preemptible Instances (Distinguished Artifact Award, IEEE Micro Top Picks Honorable Mention). Proceedings of ASPLOS Conference 2024
- [ASPLOS] SpecInfer: Accelerating Generative Large Language Model Serving with Speculative Inference and Token Tree Verification. Proceedings of ASPLOS Conference 2024
- [ASPLOS] Optimal Kernel Orchestration for Tensor Programs with Korch. Proceedings of ASPLOS Conference 2024
2023
- [OSDI] EinNet: Optimizing Tensor Programs with Derivation-Based Transformations. Proceedings of OSDI Conference 2023