DeepSeek's stroke of genius that makes it 57X more efficient?? [MLA]
"MLA/DeepSeek Poster at 17:12 (Free shipping for a limited time with code DEEPSEEK): https://www.welchlabs.com/resources/m... Limited edition MLA Poster and Signed Book: https://www.welchlabs.com/resources/d... Imaginary Numbers book is back in stock! https://www.welchlabs.com/resources/i... Special Thanks to Patrons / welchlabs Juan Benet, Ross Hanson, Yan Babitski, AJ Englehardt, Alvin Khaled, Eduardo Barraza, Hitoshi Yamauchi, Jaewon Jung, Mrgoodlight, Shinichi Hayashi, Sid Sarasvati, Dominic Beaumont, Shannon Prater, Ubiquity Ventures, Matias Forti, Brian Henry, Tim Palade, Petar Vecutin, Nicolas baumann, Jason Singh, Robert Riley, vornska, Barry Silverman, Jake Ehrlich References DeepSeek-V2 paper: https://arxiv.org/pdf/2405.04434 DeepSeek-R1 paper: https://arxiv.org/abs/2501.12948 Great Article by Ege Erdil: https://epoch.ai/gradient-updates/how... GPT-2 Visualizaiton: https://github.com/TransformerLensOrg... Manim Animations: https://github.com/stephencwelch/mani... Technical Notes 1. Note that DeepSeek-V2 paper claims a KV cache size reduction of 93.3%. They don’t exactly publish their methodology, but as far as I can tell it’s something likes this: start with Deepseek-v2 hyperparameters here: https://huggingface.co/deepseek-ai/De.... num_hidden_layers=30, num_attention_heads=32, v_head_dim = 128. If DeepSeek-v2 was implemented with traditional MHA, then KV cache size would be 2*32*128*30*2=491,520 B/token. With MLA with a KV cache size of 576, we get a total cache size of 576*30=34,560 B/token. The percent reduction in KV cache size is then equal to (491,520-34,560)/492,520=92.8%. The numbers I present in this video follow the same approach but are for DeepSeek-v3/R1 architecture: https://huggingface.co/deepseek-ai/De.... num_hidden_layers=61, num_attention_heads=128, v_head_dim = 128. So traditional MHA cache would be 2*128*128*61*2 = 3,997,696 B/token. MLA reduces this to 576*61*2=70,272 B/token. Tor the DeepSeek-V3/R1 architecture, MLA reduces the KV cache size by a factor of 3,997,696/70,272 =56.9X. 2. I claim a couple times that MLA allows DeepSeek to generate tokens more than 6x faster than a vanilla transformer. The DeepSeek-V2 paper claims a slightly less than 6x throughput improvement with MLA, but since the V3/R1 architecture is heavier, we expect a larger lift, which is why i claim “more than 6x faster than a vanilla transformer” - in reality it’s probably significantly more than 6x for the V3/R1 architecture. 3. In all attention patterns and walkthroughs, we’re ignoring the |beginning of sentence| token. “The American flag is red, white, and” actually maps to 10 tokens if we include this starting token, and may attention patterns do assign high values to this token. 4. We’re ignoring bias terms matrix equations. 5. We’re ignoring positional embeddings. These are fascinating. See DeepSeek papers and ROPE."Voir également :