DeepSeek's stroke of genius that boosts its efficiency 57X?? [MLA]

"MLA/DeepSeek Poster at 17:12 (Free shipping for a limited time with code DEEPSEEK): https://www.welchlabs.com/resources/m... Limited edition MLA Poster and Signed Book: https://www.welchlabs.com/resources/d... Imaginary Numbers book is back in stock! https://www.welchlabs.com/resources/i... Special Thanks to Patrons / welchlabs Juan Benet, Ross Hanson, Yan Babitski, AJ Englehardt, Alvin Khaled, Eduardo Barraza, Hitoshi Yamauchi, Jaewon Jung, Mrgoodlight, Shinichi Hayashi, Sid Sarasvati, Dominic Beaumont, Shannon Prater, Ubiquity Ventures, Matias Forti, Brian Henry, Tim Palade, Petar Vecutin, Nicolas baumann, Jason Singh, Robert Riley, vornska, Barry Silverman, Jake Ehrlich References DeepSeek-V2 paper: https://arxiv.org/pdf/2405.04434 DeepSeek-R1 paper: https://arxiv.org/abs/2501.12948 Great Article by Ege Erdil: https://epoch.ai/gradient-updates/how... GPT-2 Visualizaiton: https://github.com/TransformerLensOrg... Manim Animations: https://github.com/stephencwelch/mani... Technical Notes 1. Note that DeepSeek-V2 paper claims a KV cache size reduction of 93.3%. They don’t exactly publish their methodology, but as far as I can tell it’s something likes this: start with Deepseek-v2 hyperparameters here: https://huggingface.co/deepseek-ai/De.... num_hidden_layers=30, num_attention_heads=32, v_head_dim = 128. If DeepSeek-v2 was implemented with traditional MHA, then KV cache size would be 2*32*128*30*2=491,520 B/token. With MLA with a KV cache size of 576, we get a total cache size of 576*30=34,560 B/token. The percent reduction in KV cache size is then equal to (491,520-34,560)/492,520=92.8%. The numbers I present in this video follow the same approach but are for DeepSeek-v3/R1 architecture: https://huggingface.co/deepseek-ai/De.... num_hidden_layers=61, num_attention_heads=128, v_head_dim = 128. So traditional MHA cache would be 2*128*128*61*2 = 3,997,696 B/token. MLA reduces this to 576*61*2=70,272 B/token. Tor the DeepSeek-V3/R1 architecture, MLA reduces the KV cache size by a factor of 3,997,696/70,272 =56.9X. 2. I claim a couple times that MLA allows DeepSeek to generate tokens more than 6x faster than a vanilla transformer. The DeepSeek-V2 paper claims a slightly less than 6x throughput improvement with MLA, but since the V3/R1 architecture is heavier, we expect a larger lift, which is why i claim “more than 6x faster than a vanilla transformer” - in reality it’s probably significantly more than 6x for the V3/R1 architecture. 3. In all attention patterns and walkthroughs, we’re ignoring the |beginning of sentence| token. “The American flag is red, white, and” actually maps to 10 tokens if we include this starting token, and may attention patterns do assign high values to this token. 4. We’re ignoring bias terms matrix equations. 5. We’re ignoring positional embeddings. These are fascinating. See DeepSeek papers and ROPE."
