DeepSeek Open Infra: Open-Sourcing 5 AI Repos in 5 Days

Hello, DeepSeek Open Infra!

202502 Open-Source Week

We’re a tiny team @deepseek-ai pushing our limits in AGI exploration.

Starting this week, Feb 24, 2025, we will open-source 5 repos – one daily drop – not because we have made grand claims, but simply as developers sharing our small-but-sincere progress with full transparency.

These are humble building blocks of our online service: documented, deployed, and battle-tested in production. No vaporware, just sincere code that moved our tiny yet ambitious dream forward.

Why? Because every line shared becomes collective momentum that accelerates the journey. Daily unlocks are coming soon. No ivory towers – just pure garage-energy and community-driven innovation 🔧

Stay tuned – let’s geek out in the open together.

Day 1 – FlashMLA

Efficient MLA decoding kernel for Hopper GPUs
Optimized for variable-length sequences, battle-tested in production (a usage sketch follows the feature list below)

🔗 FlashMLA GitHub Repo
✅ BF16 support
✅ Paged KV cache (block size 64)
⚡ Performance: 3000 GB/s memory-bound | BF16 580 TFLOPS compute-bound on H800
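For a feel of the interface, here is a minimal decoding-step sketch modeled on the usage pattern in the FlashMLA README. The shapes (one query token per step, 128 query heads, a single 576/512-dim latent KV head, block size 64) and the exact `get_mla_metadata` / `flash_mla_with_kvcache` signatures are assumptions taken from that README and may lag the current API:

```python
import torch
from flash_mla import get_mla_metadata, flash_mla_with_kvcache  # names per the repo README

batch, s_q = 4, 1                 # decoding: one new query token per step
h_q, h_kv = 128, 1                # MLA attends through a single latent KV head
d, dv = 576, 512                  # QK head dim (incl. RoPE part) and V head dim
block_size, blocks_per_seq = 64, 32

cache_seqlens = torch.full((batch,), 1024, dtype=torch.int32, device="cuda")
block_table = torch.arange(batch * blocks_per_seq, dtype=torch.int32,
                           device="cuda").view(batch, blocks_per_seq)
q = torch.randn(batch, s_q, h_q, d, dtype=torch.bfloat16, device="cuda")
kvcache = torch.randn(batch * blocks_per_seq, block_size, h_kv, d,
                      dtype=torch.bfloat16, device="cuda")

# Plan the split-KV schedule once per decoding step, then reuse it across layers.
tile_scheduler_metadata, num_splits = get_mla_metadata(
    cache_seqlens, s_q * h_q // h_kv, h_kv)

out, lse = flash_mla_with_kvcache(
    q, kvcache, block_table, cache_seqlens, dv,
    tile_scheduler_metadata, num_splits, causal=True)
```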

Day 2 – DeepEP

Excited to introduce DeepEP – the first open-source EP communication library for MoE model training and inference. (A conceptual sketch of the all-to-all pattern it optimizes follows the feature list.)

🔗 DeepEP GitHub Repo
✅ Efficient and optimized all-to-all communication
✅ Both intranode and internode support with NVLink and RDMA
✅ High-throughput kernels for training and inference prefilling
✅ Low-latency kernels for inference decoding
✅ Native FP8 dispatch support
✅ Flexible GPU resource control for computation-communication overlap
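DeepEP's real interface is richer than fits here, so the sketch below is deliberately not its API: it uses plain `torch.distributed.all_to_all_single` to illustrate the dispatch/combine token exchange that DeepEP replaces with NVLink/RDMA-optimized kernels.

```python
import torch
import torch.distributed as dist

def ep_dispatch(tokens_by_dest: torch.Tensor) -> torch.Tensor:
    """Generic stand-in for an EP dispatch: tokens are pre-sorted into
    equal-sized chunks by destination rank (the rank hosting the chosen
    expert) and exchanged all-to-all. The combine step after the expert
    MLPs is the mirror-image call."""
    received = torch.empty_like(tokens_by_dest)
    dist.all_to_all_single(received, tokens_by_dest)
    return received

if __name__ == "__main__":  # run under torchrun with one GPU per rank
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
    hidden = 7168  # illustrative hidden size
    x = torch.randn(dist.get_world_size() * 16, hidden, device="cuda")
    y = ep_dispatch(x)  # combine is the same call with roles reversed
```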

Day 3 – DeepGEMM

Introducing DeepGEMM – an FP8 GEMM library that supports both dense and MoE GEMMs, powering V3/R1 training and inference. (A call sketch follows the list below.)

🔗 DeepGEMM GitHub Repo
⚡ Up to 1350+ FP8 TFLOPS on Hopper GPUs
✅ No heavy dependencies, as clean as a tutorial
✅ Fully Just-In-Time compiled
✅ Core logic at ~300 lines – yet outperforms expert-tuned kernels across most matrix sizes
✅ Supports dense layout and two MoE layouts
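Here is a sketch of a dense FP8 call. The function name `gemm_fp8_fp8_bf16_nt` and the fine-grained scale layout (per 1x128 tile on the left operand, per 128x128 block on the right) are assumptions based on the repo's README; alignment requirements on the scale tensors are glossed over:

```python
import torch
import deep_gemm  # package name per the DeepGEMM repo

m, k, n = 128, 7168, 4096
x = torch.randn(m, k, device="cuda").to(torch.float8_e4m3fn)  # activations
y = torch.randn(n, k, device="cuda").to(torch.float8_e4m3fn)  # weights
x_scales = torch.ones(m, k // 128, dtype=torch.float32, device="cuda")
y_scales = torch.ones(n // 128, k // 128, dtype=torch.float32, device="cuda")
out = torch.empty(m, n, dtype=torch.bfloat16, device="cuda")

# D = X @ Y^T ("nt": lhs normal, rhs transposed), FP8 inputs with
# fine-grained scales, accumulated in higher precision, written as BF16.
deep_gemm.gemm_fp8_fp8_bf16_nt((x, x_scales), (y, y_scales), out)
```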

Day 4 – Optimized Parallelism Strategies

DualPipe – a bidirectional pipeline parallelism algorithm for computation-communication overlap in V3/R1 training.
🔗 GitHub Repo

EPLB – an expert-parallel load balancer for V3/R1 (a usage sketch follows these links).
🔗 GitHub Repo

📊 Analyze computation-communication overlap in V3/R1.
🔗 GitHub Repo
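EPLB's entry point, going by its README, is a single `rebalance_experts` call that maps logical experts (replicating hot ones) onto physical GPU slots; the numbers below are illustrative, not taken from the repo:

```python
import torch
import eplb  # rebalance_experts is the entry point named in the EPLB repo

# Estimated per-expert load (e.g., routed token counts) for 2 MoE layers
# with 12 logical experts each.
weight = torch.randint(1, 100, (2, 12))

# Pack 16 physical expert slots (heavy experts get replicas) onto
# 2 nodes x 2 GPUs, keeping the 4 expert groups node-local where possible.
phy2log, log2phy, logcnt = eplb.rebalance_experts(
    weight, num_replicas=16, num_groups=4, num_nodes=2, num_gpus=4)
print(phy2log)  # per layer: physical slot -> logical expert id
```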

Day 5 – 3FS, Thruster for All DeepSeek Data Access

Fire-Flyer File System (3FS) – a parallel file system that utilizes the full bandwidth of modern SSDs and RDMA networks.

⚡ 6.6 TiB/s aggregate read throughput in a 180-node cluster
⚡ 3.66 TiB/min throughput on GraySort benchmark in a 25-node cluster
⚡ 40+ GiB/s peak throughput per client node for KVCache lookup
🧬 Disaggregated architecture with strong consistency semantics
✅ Training data preprocessing, dataset loading, checkpoint saving/reloading, embedding vector search & KVCache lookups for inference in V3/R1

📥 3FS → 🔗 GitHub Repo
Smallpond – a data processing framework on 3FS → 🔗 GitHub Repo (a quick-start sketch follows below)
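A quick-start sketch for Smallpond, adapted from the pattern in its README (file paths and the SQL are placeholders, and the exact API may have evolved):

```python
import smallpond

sp = smallpond.init()

# Load a Parquet dataset from 3FS (or any filesystem path), shard it by key,
# and run partitioned SQL over the shards (Smallpond builds on DuckDB).
df = sp.read_parquet("prices.parquet")
df = df.repartition(3, hash_by="ticker")
df = sp.partial_sql(
    "SELECT ticker, min(price), max(price) FROM {0} GROUP BY ticker", df)
df.write_parquet("output/")
```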

Day 6 – One More Thing: DeepSeek-V3/R1 Inference System Overview

Optimized throughput and latency through:
🔧 Cross-node EP-powered batch scaling
🔄 Computation-communication overlap
⚖️ Load balancing

Production data of V3/R1 online services:
⚡ 73k/14.8k input/output tokens per second per H800 node
🚀 Cost profit margin 545%

[Figure: Cost and theoretical income]
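For clarity on the headline number: assuming "cost profit margin" carries its usual meaning of (theoretical revenue − cost) / cost, a 545% margin says theoretical revenue is roughly 6.45× the operating cost:

```python
def cost_profit_margin(revenue: float, cost: float) -> float:
    # Standard definition assumed: profit relative to cost.
    return (revenue - cost) / cost

# Revenue at ~6.45x cost gives the quoted 545% margin.
assert round(cost_profit_margin(6.45, 1.0), 2) == 5.45
```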

💡 We hope this week’s insights provide value to the community and contribute to our shared AGI goals.

📖 Deep Dive: 🔗 Day 6 – One More Thing: DeepSeek-V3/R1 Inference System Overview
📖 Chinese version: 🔗 DeepSeek-V3/R1 Inference System Overview (https://zhuanlan.zhihu.com/p/27181462601)

2024 AI Infrastructure Paper (SC24)

Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning

📄 Paper Link
📄 arXiv Paper Link

