CRAY Chapel Aggregation Library (CAL) November 12, 2018
Louis Jenkins, Marcin Zalewski (Pacific Northwest National Lab.), Michael Ferguson (Cray Inc.)

The Problem
• Accessing remote data is slow
  ▪ Remote memory is multiple orders of magnitude slower to access than local memory
[Figures: (1) a task on Node #0 reaches local RAM with Load/Store through the L1/L2 caches and reaches data on Node #1 with GET/PUT, with a latency annotation on each path; (2) the task migrates from Node #0 to Node #1, where the data resides.]
• “Moving the computation to the data” is not always the best solution
  ▪ Using an "on"
statement requires migrating tasks to another locale
    ✓ Can become a bottleneck if fine-grained
    ✓ Task creation is relatively expensive
• Tasks are too large to spawn in a fire-and-forget manner (issue #9984)
• Migrating tasks requires individual active messages (issue #9727)
[Figure: the heap of Node #1 filled with task stacks, one per task.]

A Solution
• Coarsen the granularity of the data
  ▪ Buffer units of data to be sent to a locale in destination buffers
  ▪ When a buffer is full, it can be flushed to be handled by the user
  ▪ The user can perform coalescing to combine aggregated data
[Figure: a task on Locale #0 fills a destination buffer (From: Locale #0, To: Locale #1); the full buffer, optionally coalesced, is sent to a task on Locale #1.]

Chapel’s Multiresolution Design Philosophy
[Figure: layer diagram showing User Program; begin, on, forall, and coforall statements; Tasking Layer, Communications Layer, Memory Layer, and Atomics Implementation.]
• Higher levels are composed of lower-level abstractions, features, and language constructs
  ▪ Changes to a lower level propagate up to the higher levels
  ▪ The user is free to use either
    ✓ High level for convenience
    ✓ Low level for performance

Global-View Programming
• Abstracts locality for the user
  ▪ No need to think: “What portion of the array does this task own?”
  ▪ An array can
be accessed from any locale, even if it is not distributed over that locale…
    ✓ Remote references are resolved into remote PUT/GET implicitly

MPI:

    float globalSum = 0;
    float localSum = 0;
    for (int i = localStart; i < localEnd; i++) {
        localSum += arr[i];
    }
    MPI_REDUCE(localSum, globalSum, ...);

Chapel:

    var sum : real;
    forall a in arr with (+ reduce sum) {
      sum += a;
    }

• Multiresolution: More Abstraction

Chapel:

    var sum = + reduce arr;

• Multiresolution: Less Abstraction

Chapel:

    var sum : real;
    coforall loc in Locales with (+ reduce sum) do on loc {
      coforall tid in 0..#here.maxTaskPar with
          (+ reduce sum) {
        for i in computeRange(arr.domain.localSubdomain(), tid) {
          sum += arr[i];
        }
      }
    }

Chapel Aggregation Library (CAL)
• Written in Chapel, for Chapel
  ▪ Minimal and User-Friendly
    ✓ Unassuming of how data is handled
    ✓ Designed specifically for Chapel
  ▪ Distributed, Scalable, and Parallel-Safe
    ✓ Supports Global-View Programming
    ✓ Usable with Chapel’s parallel and locality constructs
  ▪ Modular, Reusable, and Generic
    ✓ Generic on user-defined type
    ✓ Easy to use and “plug in”

    const msg = "From Locale#0 to Locale#1";
    const loc = Locales[1];
    var aggregator = new Aggregator(string);
    var buffer = aggregator.aggregate(msg, loc);
    if buffer != nil then handleBuffer(buffer);
    [(buf, loc) in aggregator.flush()] on loc do handleBuffer(buf);

Minimalism
• CAL is an aggregation library
  ▪ Processing of the aggregated data is deferred to the user
  ▪ A buffer is returned to the last task that filled it

Distributed Object Pattern
• Use privatization to enable global-view programming
  ▪ GlobalClass forwards access to per-locale LocalClass privatized instances
  ▪ Each privatized instance can communicate and coordinate with others
[Figure: Distributed Object Pattern across Locale#0 … Locale#N.]

Aggregator
• Aggregator forwards all accesses to per-locale privatized instances
• Distributed and parallel access is abstracted
  ▪ Supports global-view programming
[Figure: Aggregator instances on Locale#0 … Locale#N.]

Aggregator - Performance
• 10x – 100x speedup at 32 nodes
  ▪ Histogram
  ▪ Hypergraph Generation

Distributed - Example
• Aggregator is allocated on Locale#0, but accessible from Locale#1
  ▪ Accesses are forwarded to Locale#1’s privatized instance
  ▪ Global-View Programming
• Implicit parallelism (line 9) vs Explicit parallelism (line 11)

     1  var aggregator = new Aggregator(int);
     2  // Migrate to Locale #1 from Locale #0
     3  on Locales[1] {
     4    // Aggregate single value to Locale #0
     5    var buffer = aggregator.aggregate(0, Locales[0]);
     6    // If non-nil, then handle buffer.
     7    if buffer != nil then handleBuffer(buffer);
     8    // Aggregate multiple units of data via Chapel's implicit parallelism
     9    var buffers = aggregator.aggregate(1..1024, Locales[0]);
    10    // Check if any of the buffers are nil
    11    [buf in buffers] if buf != nil then handleBuffer(buf);
    12  }

Modularity
• Composition of Distributed Objects
  ▪ Aggregator can be used within other global-view data structures
  ▪ Future of Distributed Object Oriented Programming (?)
[Figure: per-locale instances of a distributed object on Locale#0 … Locale#N.]

Future Works
• Software release of CAL
  ▪ Currently only available as a module under the Chapel HyperGraph Library (CHGL)
    ✓ github.com/pnnl/chgl
  ▪ Independent release coming soon (?)
• Integration into Chapel
  ▪ Mason package or Standard Module (?)
  ▪ Run-time integration
• Aggregation handlers as first-class functions
  ▪ Once Chapel has better first-class function support

Potential Application: Light Weight Tasks (LWT)
• Chapel tasks are infeasible to use in a fire-and-forget manner
  ▪ The stack size of tasks in Chapel is static and large (8MB default)
  ▪ Task migration can be made asynchronous but is not aggregated
• Solution – Make a library for LWT
  ▪ Use the Distributed Object pattern for global-view programming
  ▪ Use Aggregator for aggregation
  ▪ Use first-class functions (once improved) to represent a lightweight task

    var lwt = new LWT(visit);
    proc visit(v : Vertex) {
      for vv in neighbors(v) {
        if hasProperty(vv) {
          lwt.spawn(vv, vv.locale);
        }
      }
    }
    forall v in vertices {
      if hasProperty(v) {
        lwt.spawn(v);
      }
    }

Vertex Degree Distribution

    use CyclicDist;  // provides the Cyclic distribution used below

    // Find largest degree of all vertices in distributed graph
    var N = max reduce [v in graph.getVertices()] graph.degree(v);
    // Histogram is cyclically distributed over all locales
    var histogramDomain = {1..N} dmapped Cyclic(startIdx=1);
    var histogram : [histogramDomain] atomic int;

    // Aggregate increments to histogram
    var aggregator = new Aggregator(int);
    forall v in graph.getVertices() {
      const deg = graph.degree(v);
      const loc = histogram[deg].locale;
      var buffer = aggregator.aggregate(deg, loc);
      if buffer != nil {
        on loc do [deg in buffer] histogram[deg].add(1);
        buffer.done();
      }
    }

    // Flush the partially filled buffers
    forall (buf, loc) in aggregator.flush() {
      on loc do [deg in buf] histogram[deg].add(1);
      buf.done();  // release the flushed buffer (the slide wrote buffer.done() here)
    }
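
The destination-buffer idea from "A Solution" can be tried without CAL. The following is a minimal single-locale sketch, not CAL's implementation or API: destinations are plain integers standing in for locales, and every name in it (bufferCapacity, numDestinations, buffers, filled, flushes, and the local aggregate and handleBuffer procedures) is hypothetical and chosen only for illustration. It assumes serial execution, so it does not address the parallel safety that CAL provides.

```chapel
// Hypothetical single-locale sketch of destination buffering.
// None of these names are CAL's API; destinations are plain integers
// standing in for locales.
config const bufferCapacity = 4;   // units of data held per destination
config const numDestinations = 2;  // stand-in for numLocales

// One fixed-size buffer per destination, plus its current fill count.
var buffers: [0..#numDestinations, 0..#bufferCapacity] int;
var filled: [0..#numDestinations] int;
var flushes = 0;  // number of bulk transfers performed

// Stand-in for the user's handler: one bulk transfer per buffer.
proc handleBuffer(dest: int, count: int) {
  flushes += 1;
  writeln("flush to destination ", dest, ": ", buffers[dest, 0..#count]);
  filled[dest] = 0;  // the buffer can be reused
}

// Append one unit of data; hand the buffer over once it is full.
proc aggregate(value: int, dest: int) {
  buffers[dest, filled[dest]] = value;
  filled[dest] += 1;
  if filled[dest] == bufferCapacity then
    handleBuffer(dest, bufferCapacity);
}

// Send ten values, alternating between the two destinations.
for i in 1..10 do
  aggregate(i, i % numDestinations);

// Final flush of the partially filled buffers.
for dest in 0..#numDestinations do
  if filled[dest] > 0 then
    handleBuffer(dest, filled[dest]);

writeln(flushes, " bulk transfers for 10 values");
```

With the default settings each destination receives five values, so each buffer is handed over once when full and once more in the final flush: four bulk transfers instead of ten individual ones. Raising bufferCapacity trades memory and latency for fewer transfers.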