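We are thrilled to announce that Mooncake has officially joined the PyTorch Ecosystem! By integrating Mooncake's high-performance KVCache transfer and storage capabilities with PyTorch-native inference engines such as SGLang, vLLM, and TensorRT-LLM, we are unlocking higher levels of throughput and scalability for large language model deployments.
To explore the PyTorch Ecosystem, visit the PyTorch Landscape. Learn more about how projects can join the PyTorch Ecosystem.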
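About Mooncake
Mooncake is designed to solve the "memory wall" problem in LLM inference. As context lengths grow and models scale up, statically binding the key-value cache (KV Cache) to a specific GPU worker has become a major bottleneck.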
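To give a sense of scale, here is a minimal back-of-the-envelope sketch of how KV cache size grows with context length under standard multi-head or grouped-query attention. The model configuration (32 layers, 8 KV heads, head dimension 128, 2-byte elements) is a hypothetical example chosen for illustration, not a figure from the Mooncake project.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_elem=2):
    """Bytes of KV cache for one sequence: a key and a value vector
    per KV head, per layer, per token."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens


# Hypothetical model configuration, for illustration only.
config = dict(num_layers=32, num_kv_heads=8, head_dim=128)

per_token = kv_cache_bytes(num_tokens=1, **config)
long_context = kv_cache_bytes(num_tokens=128 * 1024, **config)

print(per_token // 1024, "KiB per token")
print(long_context / 2**30, "GiB for a single 128K-token sequence")
```

Under these assumptions a single long sequence already occupies a large share of one GPU's memory, which is why pinning the cache to one worker limits both batch size and reuse.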
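Mooncake enables inference engines to break this binding, unlocking five key capabilities:
- (Encoder) Prefill-Decode disaggregation: Mooncake's high-performance Transfer Engine separates heavy computation (prefill/encoding) from latency-sensitive generation (decoding) into distinct clusters.
- Global KVCache reuse: Acting as distributed shared memory for KV blocks, Mooncake Store lets valid cache entries be reused globally across requests and engine instances.
- Elastic Expert Parallelism: By decoupling experts from specific workers, Mooncake-EP enables elastic, fault-tolerant serving in which the experts of Mixture-of-Experts (MoE) models can be dynamically routed or recovered, ensuring high availability even when some nodes fail.
- PyTorch distributed backend: Mooncake Backend serves as a fault-tolerant PyTorch distributed backend. It provides robust collective communication primitives that keep running seamlessly when ranks fail.
- Weight updates: Mooncake Store keeps model weights in the store itself, enabling fast weight updates for reinforcement learning (RL) and checkpointing scenarios. It offers tensor-native and zero-copy APIs.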
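The idea behind global KVCache reuse can be illustrated with a toy, in-process model. The sketch below is not Mooncake Store's API: `ToyKVStore`, `block_keys`, and the block size are hypothetical names and values, and the chained-hash keying is a common prefix-caching technique assumed here for illustration. It only shows how two requests that share a prefix can find the same stored blocks.

```python
import hashlib

BLOCK_TOKENS = 4  # hypothetical block size, kept tiny for readability


def block_keys(token_ids):
    """Yield one key per full block; each key hashes the block together
    with the key of the preceding prefix, so equal keys imply equal prefixes."""
    prefix_hash = b""
    for start in range(0, len(token_ids) - BLOCK_TOKENS + 1, BLOCK_TOKENS):
        block = token_ids[start:start + BLOCK_TOKENS]
        prefix_hash = hashlib.sha256(prefix_hash + repr(block).encode()).digest()
        yield prefix_hash.hex()


class ToyKVStore:
    """In-process stand-in for a shared KV block store (illustration only)."""

    def __init__(self):
        self._blocks = {}

    def put(self, key, kv_block):
        self._blocks.setdefault(key, kv_block)

    def match_prefix(self, token_ids):
        """Return how many leading tokens already have stored KV blocks."""
        matched = 0
        for key in block_keys(token_ids):
            if key not in self._blocks:
                break
            matched += BLOCK_TOKENS
        return matched


store = ToyKVStore()

# Request A is prefilled somewhere and publishes its full blocks.
request_a = [1, 2, 3, 4, 5, 6, 7, 8, 9]
for key in block_keys(request_a):
    store.put(key, "kv-bytes-placeholder")

# Request B, possibly on another engine instance, shares the first 8 tokens.
request_b = [1, 2, 3, 4, 5, 6, 7, 8, 42, 43, 44, 45]
reused_tokens = store.match_prefix(request_b)
print(reused_tokens)
```

In this toy model only the unmatched tail of request B would need to be prefilled; a real deployment replaces the dictionary with a distributed store and the placeholder strings with actual KV tensors.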
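Wide Industry Adoption
Mooncake originated from a research collaboration between Moonshot AI and Tsinghua University. It was born from the urgent need to overcome the "memory wall" when serving large-scale models such as Kimi. Since being open-sourced, it has grown into a thriving community-driven project.
Mooncake's architecture has been battle-tested in some of the world's most demanding production environments. Its ability to decouple compute from memory has led to wide adoption by leading organizations, including Moonshot AI (Kimi), Alibaba Cloud, Ant Group, JD.com, Tencent, Meituan, Approaching.AI, and LightSeek Foundation.
These organizations use Mooncake to maximize GPU utilization and to deliver a smooth serving experience to millions of concurrent users.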
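In Practice: A Joint Solution
To demonstrate the full potential of this architecture, we present a joint solution that combines Mooncake with the ecosystem's leading inference engines and orchestration tools.
In this architecture, RoleBasedGroup (RBG, https://github.com/sgl-project/rbg) orchestrates the entire topology, defining the relationships and startup order of the components in the cluster. It deploys Shepherd Model Gateway (SMG, https://github.com/lightseekorg/smg) as the critical routing layer, which directs incoming requests to the appropriate workers based on cache locality and system load. The heavy computation is done by SGLang (https://github.com/sgl-project/sglang) or vLLM (https://github.com/vllm-project/vllm) instances acting as compute workers, while Mooncake serves as the high-speed data plane: its Transfer Engine pushes prefilled KV Cache over RDMA/NVLink, and its Store persists that cache for global reuse by decode nodes.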
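Routing on cache locality and load can be sketched as a simple scoring function. The following is a toy illustration, not SMG's actual algorithm: the `Worker` class, the character-level prefix match, and the `load_weight` value are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    name: str
    active_requests: int = 0
    cached_prefixes: set = field(default_factory=set)


def common_prefix_len(a, b):
    """Length of the shared leading run of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pick_worker(workers, prompt, load_weight=2.0):
    """Score each worker as (longest cached prefix) - load_weight * (active requests)
    and return the highest-scoring one. Ties go to the first worker listed."""
    def score(worker):
        best = max(
            (common_prefix_len(p, prompt) for p in worker.cached_prefixes),
            default=0,
        )
        return best - load_weight * worker.active_requests

    return max(workers, key=score)


system_prompt = "You are a helpful assistant."
warm = Worker("prefill-0", active_requests=1, cached_prefixes={system_prompt})
cold = Worker("prefill-1", active_requests=0)

chosen = pick_worker([warm, cold], system_prompt + " Summarize this article.")
print(chosen.name)  # the warm worker wins while its load is low
```

When the warm worker becomes overloaded, the load penalty outweighs the cache hit and the request goes to the idle worker instead; that trade-off between locality and load is the point of the sketch.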
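1. Deployment with SGLang + Mooncake + SMG
Below is a ready-to-deploy RBG configuration for the complete SGLang architecture. Both Prefill-Decode disaggregation and global KVCache reuse are enabled. The prefill instances use Mooncake TE to transfer KVCache to the decode instances, while Mooncake Store enables KVCache reuse across requests within the prefill instances (for more details, see KEP-74 Mooncake Integration and pd-disaggregated-with-mooncake.yaml).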
```yaml
# Joint Solution: RBG + SMG + SGLang + Mooncake (Production Ready)
apiVersion: workloads.x-k8s.io/v1alpha1
kind: RoleBasedGroup
metadata:
  name: sglang-mooncake-smg-v2
spec:
  roles:
    # 1. Mooncake Master: Centralized Metadata Server for TE and Store
    - name: mooncake-master
      replicas: 1
      template:
        spec:
          containers:
            - name: master
              image: lmsysorg/sglang:latest
              env:
                - name: POD_IP
                  valueFrom:
                    fieldRef:
                      fieldPath: status.podIP
              command: ["mooncake_master"]
              args:
                - --enable_http_metadata_server=true
                - --rpc_address=$(POD_IP)
                - --rpc_port=50051
                - --http_metadata_server_host=$(POD_IP)
                - --http_metadata_server_port=8080
                - --metrics_port=9003

    # 2. Mooncake Store: Distributed KVCache Storage Nodes
    - name: mooncake-store
      replicas: 3
      dependencies: ["mooncake-master"]
      template:
        spec:
          containers:
            - name: store-node
              image: lmsysorg/sglang:latest
              env:
                - name: MOONCAKE_MASTER
                  value: "s-sglang-mooncake-smg-v2-mooncake-master:50051"
                - name: MOONCAKE_TE_META_DATA_SERVER
                  value: "http://s-sglang-mooncake-smg-v2-mooncake-master:8080/metadata"
                - name: MOONCAKE_GLOBAL_SEGMENT_SIZE
                  value: "45gb"
                - name: MOONCAKE_PROTOCOL
                  value: "rdma" # Use RDMA for zero-copy KVCache transfer
              command: ["python3", "-m", "mooncake.mooncake_store_service"]
              resources:
                limits:
                  memory: "50Gi"
                  rdma/hca: 1 # Required for high-speed TE transfer
                requests:
                  memory: "50Gi"
                  rdma/hca: 1

    # 3. Prefill Worker (SGLang): High-throughput Prefill with Mooncake Push
    - name: prefill-worker
      replicas: 1
      dependencies: ["mooncake-master", "mooncake-store"]
      template:
        spec:
          containers:
            - name: sglang-prefill
              image: lmsysorg/sglang:latest
              env:
                - name: MOONCAKE_MASTER
                  value: "s-sglang-mooncake-smg-v2-mooncake-master:50051"
                - name: MOONCAKE_TE_META_DATA_SERVER
                  value: "http://s-sglang-mooncake-smg-v2-mooncake-master:8080/metadata"
                - name: MOONCAKE_PROTOCOL
                  value: "rdma"
              # Each flag and its value is a separate list item, because every
              # item is passed to the process as exactly one argv entry.
              command:
                - python3
                - -m
                - sglang.launch_server
                - --model-path
                - /models/Qwen3
                - --tp
                - "4"
                - --disaggregation-mode
                - prefill
                # Activates Mooncake TE for KVCache Push
                - --disaggregation-transfer-backend
                - mooncake
                # Enables KVCache offloading
                - --enable-hierarchical-cache
                # Uses Mooncake as the L2/L3 cache backend
                - --hicache-storage-backend
                - mooncake
              resources:
                limits:
                  nvidia.com/gpu: "4"
                  rdma/hca: 1

    # 4. Decode Worker (SGLang): Low-latency Generation with Mooncake Pull
    - name: decode-worker
      replicas: 2
      dependencies: ["mooncake-master", "prefill-worker"]
      template:
        spec:
          containers:
            - name: sglang-decode
              image: lmsysorg/sglang:latest
              command:
                - python3
                - -m
                - sglang.launch_server
                - --model-path
                - /models/Qwen3
                - --tp
                - "4"
                # Pulls shared KVCache from Mooncake Store
                - --disaggregation-mode
                - decode
              resources:
                limits:
                  nvidia.com/gpu: "4"
                  rdma/hca: 1

    # 5. Shepherd Model Gateway (SMG): Intelligent PD-Disaggregation Router
    - name: smg-router
      replicas: 1
      dependencies: ["prefill-worker", "decode-worker"]
      template:
        spec:
          containers:
            - name: router
              image: lightseekorg/smg:latest
              command:
                - smg
                - --pd-disaggregation
                - --prefill
                - http://s-sglang-mooncake-smg-v2-prefill-worker:8000
                - --decode
                - http://s-sglang-mooncake-smg-v2-decode-worker:8000
                - --host
                - 0.0.0.0
                - --port
                - "8000"
```
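The Mooncake environment values in the manifest above must agree with the master's `--rpc_port` and `--http_metadata_server_port` flags. As a small sketch, the helper below derives those values from the group name. The `s-<group>-<role>` Service naming pattern is inferred from the addresses used in the manifest rather than from RBG documentation, and `role_service` and `mooncake_env` are hypothetical helper names.

```python
def role_service(group: str, role: str) -> str:
    """Service name pattern inferred from the manifest: s-<group>-<role>."""
    return f"s-{group}-{role}"


def mooncake_env(group: str, master_role: str = "mooncake-master",
                 rpc_port: int = 50051, http_port: int = 8080) -> dict:
    """Build the Mooncake env values that point workers at the master role."""
    host = role_service(group, master_role)
    return {
        "MOONCAKE_MASTER": f"{host}:{rpc_port}",
        "MOONCAKE_TE_META_DATA_SERVER": f"http://{host}:{http_port}/metadata",
    }


env = mooncake_env("sglang-mooncake-smg-v2")
for name, value in env.items():
    print(f"{name}={value}")
```

Generating the values from one place makes it harder for a renamed group or a changed port to leave the store and prefill roles pointing at a stale address.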
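2. Deployment with vLLM + Mooncake
vLLM has also integrated Mooncake support, letting users rely on the Mooncake connector for seamless KV transfer. Below is the equivalent RBG solution for deploying vLLM in a disaggregated setup with the Mooncake connector.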
```yaml
# Joint Solution: RBG + vLLM + Mooncake Connector
apiVersion: workloads.x-k8s.io/v1alpha1
kind: RoleBasedGroup
metadata:
  name: vllm-pd-with-mooncake-demo
spec:
  roles:
    # 1. Gateway: Routing to vLLM instances (SMG or vLLM Proxy)
    - name: proxy
      dependencies: [ "prefill", "decode" ]
      replicas: 1
      template:
        spec:
          containers:
            - name: proxy
              image: lightseekorg/smg:latest
              command:
                - smg
                - --prefiller-host
                - http://vllm-pd-with-mooncake-demo-prefill-0.s-vllm-pd-with-mooncake-demo-prefill
                - --prefiller-port
                - "8000"
                - --decoder-host
                - http://vllm-pd-with-mooncake-demo-decode-0.s-vllm-pd-with-mooncake-demo-decode
                - --decoder-port
                - "8000"

    # 2. Prefill Worker (vLLM): Producer role
    - name: prefill
      replicas: 1
      template:
        spec:
          volumes:
            - name: model
              persistentVolumeClaim:
                claimName: qwen2.5-7b
            - name: dshm
              emptyDir:
                medium: Memory
                sizeLimit: 30Gi
          containers:
            - name: prefill
              image: vllm/vllm-openai:latest
              command:
                - sh
                - -c
                - |
                  pip install mooncake-transfer-engine && \
                  vllm serve /models/Qwen2.5-7B-Instruct \
                  --port 8000 \
                  --tensor-parallel-size 4 \
                  --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer"}'
              ports:
                - containerPort: 8000
                  name: http
              readinessProbe:
                initialDelaySeconds: 30
                periodSeconds: 10
                tcpSocket:
                  port: 8000
              resources:
                limits:
                  nvidia.com/gpu: "4"
                  rdma/hca: 1
                  memory: "100Gi"
                requests:
                  nvidia.com/gpu: "4"
                  rdma/hca: 1
                  memory: "100Gi"
              volumeMounts:
                - mountPath: /models/Qwen2.5-7B-Instruct
                  name: model
                - mountPath: /dev/shm
                  name: dshm

    # 3. Decode Worker (vLLM): Consumer role
    - name: decode
      replicas: 1
      template:
        spec:
          volumes:
            - name: model
              persistentVolumeClaim:
                claimName: qwen2.5-7b
            - name: dshm
              emptyDir:
                medium: Memory
                sizeLimit: 30Gi
          containers:
            - name: decode
              image: vllm/vllm-openai:latest
              command:
                - sh
                - -c
                - |
                  pip install mooncake-transfer-engine && \
                  vllm serve /models/Qwen2.5-7B-Instruct \
                  --port 8000 \
                  --tensor-parallel-size 4 \
                  --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_consumer"}'
              ports:
                - containerPort: 8000
                  name: http
              readinessProbe:
                initialDelaySeconds: 30
                periodSeconds: 10
                tcpSocket:
                  port: 8000
              resources:
                limits:
                  nvidia.com/gpu: "4"
                  rdma/hca: 1
                  memory: "100Gi"
                requests:
                  nvidia.com/gpu: "4"
                  rdma/hca: 1
                  memory: "100Gi"
              volumeMounts:
                - mountPath: /models/Qwen2.5-7B-Instruct
                  name: model
                - mountPath: /dev/shm
                  name: dshm
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: vllm-pd-with-mooncake-demo
  name: vllm-pd-with-mooncake-demo
  namespace: default
spec:
  ports:
    - name: http
      port: 8000
      protocol: TCP
      targetPort: 8000
  selector:
    rolebasedgroup.workloads.x-k8s.io/name: vllm-pd-with-mooncake-demo
    rolebasedgroup.workloads.x-k8s.io/role: proxy
  type: ClusterIP
```
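The prefill and decode roles differ only in the `kv_role` field of the `--kv-transfer-config` JSON. Hand-editing JSON inside a shell string is error-prone, so the sketch below generates the argument and the full command line with the standard library. `kv_transfer_arg` and `serve_command` are hypothetical helper names, and only the two roles that appear in the manifest are accepted; vLLM may support other values.

```python
import json
import shlex

# The two roles used in the manifest above.
KNOWN_ROLES = ("kv_producer", "kv_consumer")


def kv_transfer_arg(role: str) -> str:
    """Return the compact JSON string passed to --kv-transfer-config."""
    if role not in KNOWN_ROLES:
        raise ValueError(f"unexpected kv_role: {role!r}")
    config = {"kv_connector": "MooncakeConnector", "kv_role": role}
    return json.dumps(config, separators=(",", ":"))


def serve_command(model_path: str, role: str, port: int = 8000, tp: int = 4) -> str:
    """Build a shell-safe `vllm serve` command line mirroring the manifest."""
    args = [
        "vllm", "serve", model_path,
        "--port", str(port),
        "--tensor-parallel-size", str(tp),
        "--kv-transfer-config", kv_transfer_arg(role),
    ]
    # shlex.quote wraps the JSON in single quotes so the shell passes it intact.
    return " ".join(shlex.quote(a) for a in args)


prefill_cmd = serve_command("/models/Qwen2.5-7B-Instruct", "kv_producer")
decode_cmd = serve_command("/models/Qwen2.5-7B-Instruct", "kv_consumer")
print(prefill_cmd)
print(decode_cmd)
```

The printed commands can be pasted after the `pip install` step in the `sh -c` script of each role.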
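Conclusion
Mooncake adds a vital layer of memory virtualization to the open-source AI stack. By enabling PyTorch engines (whether SGLang, vLLM, or TensorRT-LLM) to adopt KVCache-centric architectures, we are paving the way for more efficient, scalable, and lower-latency LLM serving.
We invite you to explore the project and start building: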
- Mooncake GitHub: https://github.com/kvcache-ai/Mooncake
- Mooncake documentation: https://docs.kvcache.com.cn/Mooncake/