DeepSeek's stroke of genius that makes it 57X more efficient?? [MLA]
"MLA/DeepSeek Poster at 17:12 (Free shipping for a limited time with code DEEPSEEK): https://www.welchlabs.com/resources/m... Limited edition MLA Poster and Signed Book: https://www.welchlabs.com/resources/d... Imaginary Numbers book is back in stock! https://www.welchlabs.com/resources/i... Special Thanks to Patrons / welchlabs Juan Benet, Ross Hanson, Yan Babitski, AJ Englehardt, Alvin Khaled, Eduardo Barraza, Hitoshi Yamauchi, Jaewon Jung, Mrgoodlight, Shinichi Hayashi, Sid Sarasvati, Dominic Beaumont, Shannon Prater, Ubiquity Ventures, Matias Forti, Brian Henry, Tim Palade, Petar Vecutin, Nicolas baumann, Jason Singh, Robert Riley, vornska, Barry Silverman, Jake Ehrlich References DeepSeek-V2 paper: https://arxiv.org/pdf/2405.04434 DeepSeek-R1 paper: https://arxiv.org/abs/2501.12948 Great Article by Ege Erdil: https://epoch.ai/gradient-updates/how... GPT-2 Visualizaiton: https://github.com/TransformerLensOrg... Manim Animations: https://github.com/stephencwelch/mani... Technical Notes 1. Note that DeepSeek-V2 paper claims a KV cache size reduction of 93.3%. They don’t exactly publish their methodology, but as far as I can tell it’s something likes this: start with Deepseek-v2 hyperparameters here: https://huggingface.co/deepseek-ai/De.... num_hidden_layers=30, num_attention_heads=32, v_head_dim = 128. If DeepSeek-v2 was implemented with traditional MHA, then KV cache size would be 2*32*128*30*2=491,520 B/token. With MLA with a KV cache size of 576, we get a total cache size of 576*30=34,560 B/token. The percent reduction in KV cache size is then equal to (491,520-34,560)/492,520=92.8%. The numbers I present in this video follow the same approach but are for DeepSeek-v3/R1 architecture: https://huggingface.co/deepseek-ai/De.... num_hidden_layers=61, num_attention_heads=128, v_head_dim = 128. So traditional MHA cache would be 2*128*128*61*2 = 3,997,696 B/token. MLA reduces this to 576*61*2=70,272 B/token. Tor the DeepSeek-V3/R1 architecture, MLA reduces the KV cache size by a factor of 3,997,696/70,272 =56.9X. 2. I claim a couple times that MLA allows DeepSeek to generate tokens more than 6x faster than a vanilla transformer. The DeepSeek-V2 paper claims a slightly less than 6x throughput improvement with MLA, but since the V3/R1 architecture is heavier, we expect a larger lift, which is why i claim “more than 6x faster than a vanilla transformer” - in reality it’s probably significantly more than 6x for the V3/R1 architecture. 3. In all attention patterns and walkthroughs, we’re ignoring the |beginning of sentence| token. “The American flag is red, white, and” actually maps to 10 tokens if we include this starting token, and may attention patterns do assign high values to this token. 4. We’re ignoring bias terms matrix equations. 5. We’re ignoring positional embeddings. These are fascinating. See DeepSeek papers and ROPE."Voir également :