Shang Yang

Ph.D. Student
MIT EECS
Cambridge, MA
shangy [at] mit [dot] edu


I am a second-year Ph.D. student in the MIT HAN Lab (EECS), advised by Prof. Song Han. My long-term goal is to build efficient machine learning systems for applications at different scales, especially large language models (LLMs). I am currently working on efficient inference systems for LLMs and VLMs.

News

Selected Publications

  1. MLSys
    LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention
    Shang Yang*, Junxian Guo*, Haotian Tang, Qinghao Hu, Guangxuan Xiao, Jiaming Tang, Yujun Lin, Zhijian Liu, Yao Lu, Song Han.
    The Eighth Annual Conference on Machine Learning and Systems (MLSys), 2025.

  2. MLSys
    QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving
    Yujun Lin*, Haotian Tang*, Shang Yang*, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, Song Han.
    The Eighth Annual Conference on Machine Learning and Systems (MLSys), 2025.

  3. MLSys
    AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration
    Ji Lin*, Jiaming Tang*, Haotian Tang†, Shang Yang†, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, Song Han.
    The Seventh Annual Conference on Machine Learning and Systems (MLSys), 2024.

  4. MICRO
    TorchSparse++: Efficient Training and Inference Framework for Sparse Convolution on GPUs
    Haotian Tang*, Shang Yang*, Zhijian Liu, Ke Hong, Zhongming Yu, Xiuyu Li, Guohao Dai, Yu Wang, Song Han.
    56th IEEE/ACM International Symposium on Microarchitecture (MICRO), 2023.

Blogs

  1. Explore the latest advancement in TinyChat: version 2.0, which brings significant improvements in prefilling speed for edge LLMs and VLMs. On top of the 3-4x decoding speedups achieved with AWQ quantization, TinyChat 2.0 now delivers state-of-the-art time-to-first-token, 1.5-1.7x faster than the legacy version of TinyChat.

  2. Explore the latest advancement in TinyChat and AWQ: the integration of visual language models (VLMs) on the edge! These exciting advances allow LLMs to comprehend visual inputs, enabling image-understanding tasks such as caption generation and question answering. With the latest release, TinyChat supports leading VLMs such as VILA, which can be easily quantized with AWQ, empowering users with a seamless experience for image-understanding tasks.

  3. Running large language models (LLMs) on the edge is of great importance. In this blog, we introduce TinyChat, an efficient and lightweight system for LLM deployment on the edge. It runs Meta's latest LLaMA-2 model at 30 tokens/second on NVIDIA Jetson Orin and can easily support different models and hardware.


© Copyright 2024 Shang Yang. Powered by Jekyll and Minimal Light theme.