# Caching Puts and Gets in a PGAS Language Runtime Michael Ferguson Cray Inc. Daniel Buettner Laboratory for Telecommunication Sciences **September 17, 2015** ### Safe Harbor Statement This presentation may contain forward-looking statements that are based on our current expectations. Forward looking statements may include statements about our financial guidance and expected operating results, our opportunities and future potential, our product development and new product introduction plans, our ability to expand and penetrate our addressable markets and other statements that are not historical facts. These statements are only predictions and actual results may materially vary from those projected. Please refer to Cray's documents filed with the SEC from time to time concerning factors that could affect the Company and these forward-looking statements. # TRAIN LATENCY (8 HOURTRIP, 60 TON CARS, 60 SEC/CAR) # TRAIN BANDWIDTH # INFINIBAND (IB) LATENCY Request Size (bytes) ### INFINIBAND (IB) BANDWIDTH \* with small 10-node cluster, QDR IB 3500 Max BW: 5000 MB/s 2625 Bandwidth MB/s 1750 875 16 32 128 256 512 8 64 ### AGGREGATION ### **OVERLAP** # CACHE HELPS WITH BOTH! # BACKGROUND: MEMORY MODEL ALLOWS PREFETCH AND WRITE-BEHIND # Memory model for C11, C++11, Chapel: data race free programs are sequentially consistent See Adve, S.V., Boehm, H.-J. 2010. Memory models: a case for rethinking parallel languages and hardware. Communications of the ACM 53(8): 90–101. <a href="http://cacm.acm.org/magazines/2010/8/96610-memory-models-a-case-for-rethinking-parallel-languages-and-hardware/fulltext">http://cacm.acm.org/magazines/2010/8/96610-memory-models-a-case-for-rethinking-parallel-languages-and-hardware/fulltext</a> ANALYZE # A RACY PROGRAM ### Thread 1 x = 42; notify = 1; ### Thread 2 while 0 == notify { /\* wait \*/ } compute\_with(x); ## A RACY PROGRAM #### Thread 1 x = 42; notify = 1; ### Thread 2 while 0 == notify { /\* wait \*/ } compute\_with(x); ${\rm compiler}\ or\ {\rm processor}$ #### Thread 1 r1 = 42; notify = 1; x = r1; ### Thread 2 r2 = notify; while 0 == r2 { /\* wait \*/ } compute\_with(x); Compiler and processor would like to start loads earlier in order to hide memory latency. We'll call that prefetch. Compiler and processor would like to complete stores later in order to hide memory latency. We'll call that write behind. # REMEMBER THE RACY PROGRAM? COMPUTE #### Thread 1 x = 42; notify = true; ### Thread 2 while 0 == notify { /\* wait \*/ } compute\_with(x); compiler *or* processor #### Thread 1 r1 = 42; notify = 1; x = r1; ### Thread 2 r2 = notify; while 0 == r2 { /\* wait \*/ } compute\_with(x); # COMMUNICATION OPTIMIZATION # CACHE FOR REMOTE DATA - Goal: communication aggregation and overlap - Bonus points: avoiding repeated communication - Software cache in Chapel's runtime - One cache per pthread - Write-back cache with dirty bits ## CACHE COHERENCY - Simple, local coherency - Discard all cached data on acquire - Wait for pending operations on a *release* Strategy used in related work with UPC # CACHE FEATURES | | Overlap | | Aggregation | | |---------------------------------------------|---------|-----|-------------|-----| | | GET | PUT | GET | PUT | | Do PUTs in background | | X | | | | Start one PUT per contiguous written region | | | | X | | Round GETs up to 64-byte cache lines | | | X | | | Sequential read-ahead | X | | X | | | Programmer-provided prefetch hints* | X | | | | # WEAK MEMORY CONSISTENCY? ``` 1 x starts at 0; if someOption then 2 x = 2; if someOtherOption then 3 \times = 3; 4 return x; ``` # WEAK MEMORY CONSISTENCY? ``` 1 x starts at 0; 2 PUT 2 into x; 3 PUT 3 into x; 4 GET x; ``` Chapel **OpenSHMEM** result must be 3 result could be 0, 2, or 3 ### COMPILER OPTIMIZATION? ``` CRAY ``` ``` for i in 1..100 { // PUT into B B[f(i)] = i; } ``` Can the compiler prove these PUTs do not overlap? PUT 3 into B[f(3)] # COMPILER OPTIMIZATION? ``` for i in 1..100 { // PUT into B B[f(i)] = i; } PUT 1 i PUT 2 I PUT 3 ``` PUT 1 into B[f(1)] PUT 2 into B[f(2)] PUT 2 into B[f(2)] I With a cache, conflicting access is handled at runtime. ### OVERLAPPING GETS ``` var A:[1..n] int; on Locales[1] { var sum:int; for i in 1..n do sum += A[f(i)] } ``` We would like to overlap the GETs for A[f(i)] with each other ``` var A: [1..n] int; on Locales[1] { var sum:int; var h: [0..k] handles; var bufs: [0..k] int; // Warm up loop for i in 1..k { // nonblocking GET A[f(i) into bufs[i%k] h[i%k] = get_nb(bufs[i%k], A[f(i)]) for i in 1...n { wait (h[i%k]); sum += bufs[i%k]; if i+k<=n { // nonblocking GET A[f(i+k)] into bufs[(i+k)%k] h[(i+k)%k] = get_nb(bufs[(i+k)%k],A[f(i+k)]) ``` **Explicit overlap is messy!** ``` var A:[1..n] int; on Locales[1] { var sum:int; // Optional warm up for i in 1..k do prefetch(A[f(i)]); for i in 1..n { if i+k <= n then prefetch(A[f(i+k)]);</pre> sum += A[f(i)] ``` Much better! # COMMUNICATION AGGREGATION ``` CRAY ``` ``` for i in 1..n do B[i] = compute(i); ``` ``` var localB:[1..n] int; for i in 1..n do localB[i] = compute(i); B = localB; ``` ``` for i in 1..n do consume(A[i]); ``` ``` var localA:[1..n] int = A; for i in 1..n do consume(localA[i]); ``` # Simple, cache aggregates Manual optimization reduces portability ### TEST CONFIGURATIONS - Cray XC30<sup>™</sup> system with 50 nodes, Aries network - GASNet Aries: GASNet with the aries conduit - Cray uGNI: native uGNI support for Chapel - Cray CS400<sup>™</sup> system with 200 nodes, FDR InfiniBand - GASNet ibv: GASNet with the InfinBand Verbs conduit - v1.9+ is revision 5ba6639 - v1.11+ is revision 6c635a1 ## SYNTHETIC BENCHMARKS # APPLICATION BENCHMARKS ## PREFETCH EXAMPLE ``` var A:[1..n] int; on Locales[1] { var sum:int; // Optional warm up for i in 1..k do prefetch(A[f(i)]); for i in 1..n { if i+k <= n then prefetch(A[f(i+k)]);</pre> sum += A[f(i)] ``` ## PREFETCH EXAMPLE ## VS OPTIMIZATION \*for GASNet/Aries, lulesh improved 3.2x between v1.9+ and v1.11+ # VS OPTIMIZATION #### **Legal Disclaimer** Information in this document is provided in connection with Cray Inc. products. No license, express or implied, to any intellectual property rights is granted by this document. Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates and figures specified are preliminary based on current expectations, and are subject to change without notice. Cray hardware and software products may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request. Cray uses codenames internally to identify products that are in development and not yet publically announced for release. Customers and other third parties are not authorized by Cray Inc. to use codenames in advertising, promotion or marketing and any use of Cray Inc. internal codenames is at the sole risk of the user. Performance tests and ratings are measured using specific systems and/or components and reflect the approximate performance of Cray Inc. products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. The following are trademarks of Cray Inc. and are registered in the United States and other countries: CRAY and design, SONEXION, URIKA, and YARCDATA. The following are trademarks of Cray Inc.: ACE, APPRENTICE2, CHAPEL, CLUSTER CONNECT, CRAYPAT, CRAYPORT, ECOPHLEX, LIBSCI, NODEKARE, THREADSTORM. The following system family marks, and associated model number marks, are trademarks of Cray Inc.: CS, CX, XC, XE, XK, XMT, and XT. The registered trademark LINUX is used pursuant to a sublicense from LMI, the exclusive licensee of Linus Torvalds, owner of the mark on a worldwide basis. Other trademarks used in this document are the property of their respective owners. Copyright 2014 Cray Inc. ### **Backup Slides** # APPLICATION BENCHMARKS ## ADDING IMPLIED FENCES acquire and release triggered by task or on statement spawn, join, start, and finish ``` release on { acquire release acquire ``` ``` sync { release begin { acquire release acquire ``` ## CACHE DATA STRUCTURES perations Queue UTE Pointer Tree per task: last acquire sequence number ## WRITE BEHIND Write Recorded in Dirty Bits, Page added to Dirty Queue Flushed on *release* or when there are too many dirty pages GET with 2 earlier valid lines triggers synchronous readahead ra skip=1 pg len = 1 pg ## INFINIBAND (IB) LATENCY \* with small 10-node cluster, QDR IB 3600 2700 \_atency (ns) 1800 GASNET get **IB** Benchmark **GASNET** put 32 64 128 256 512 16 ### INFINIBAND (IB) BANDWIDTH \* with small 10-node cluster, QDR IB 3000 **IB** Benchmark Max BW: **GASNET** put 5000 MB/s GASNET get 2250 Bandwidth MB/s 1500 750 32 8 16 64 128 256 512 Request Size (bytes) #### **Legal Disclaimer** Information in this document is provided in connection with Cray Inc. products. No license, express or implied, to any intellectual property rights is granted by this document. Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates and figures specified are preliminary based on current expectations, and are subject to change without notice. Cray hardware and software products may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request. Cray uses codenames internally to identify products that are in development and not yet publically announced for release. Customers and other third parties are not authorized by Cray Inc. to use codenames in advertising, promotion or marketing and any use of Cray Inc. internal codenames is at the sole risk of the user. Performance tests and ratings are measured using specific systems and/or components and reflect the approximate performance of Cray Inc. products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. The following are trademarks of Cray Inc. and are registered in the United States and other countries: CRAY and design, SONEXION, URIKA, and YARCDATA. The following are trademarks of Cray Inc.: ACE, APPRENTICE2, CHAPEL, CLUSTER CONNECT, CRAYPAT, CRAYPORT, ECOPHLEX, LIBSCI, NODEKARE, THREADSTORM. The following system family marks, and associated model number marks, are trademarks of Cray Inc.: CS, CX, XC, XE, XK, XMT, and XT. The registered trademark LINUX is used pursuant to a sublicense from LMI, the exclusive licensee of Linus Torvalds, owner of the mark on a worldwide basis. Other trademarks used in this document are the property of their respective owners. Copyright 2014 Cray Inc.