Cray Documentation

User's Guide

barrier
Product: Cray XMT

In code, a barrier is used after a phase. The barrier delays the streams that were executing parallel operations in the phase until all the streams from the phase reach the barrier. Once all the streams reach the barrier, the streams begin work on the next phase.
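
The behavior described above can be imitated with ordinary threads. The following is only a minimal sketch in Python, not Cray XMT code: the function `run_phases`, the worker count, and the use of `threading.Barrier` are all assumptions made for illustration.

```python
import threading

def run_phases(num_streams=4):
    """Run two phases; no worker starts phase 2 until all have finished phase 1."""
    barrier = threading.Barrier(num_streams)   # stands in for the compiler-generated barrier
    log = []                                   # (phase, worker id) records
    lock = threading.Lock()                    # protects the shared log

    def worker(wid):
        with lock:
            log.append((1, wid))               # phase 1 work
        barrier.wait()                         # block until every worker reaches this point
        with lock:
            log.append((2, wid))               # phase 2 work

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_streams)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

In the returned log, every phase 1 record precedes every phase 2 record, whatever order the workers run in.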

blade
Product: Cray XMT

1) A field-replaceable physical entity. A Cray XMT service blade consists of AMD Opteron sockets, memory, Cray SeaStar chips, PCI-X or PCIe cards, and a blade control processor. A Cray XMT compute blade consists of Threadstorm processors, memory, Cray SeaStar chips, and a blade control processor. 2) From a system management perspective, a logical grouping of nodes and the blade control processor that monitors the nodes on that blade.

blade control processor
Product: Cray X2, Cray XMT, Cray XT series, Cray XE series

A microprocessor on a blade that communicates with a cabinet control processor through the HSS network to monitor and control the nodes on the blade. See also blade, L0 controller, Hardware Supervisory System (HSS).

block scheduling
Product: Cray XMT

Method of thread execution used by the compiler where contiguous blocks of loop iterations are divided equally and assigned to available streams. For example, if there are 100 loop iterations and 10 streams, the compiler assigns 10 iterations to each stream. The advantage of this method is that data in registers can be reused across adjacent iterations, because a stream is not released after each iteration.
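
As a rough illustration of the division described above, the sketch below computes which iterations each stream would receive. It is written in Python purely for illustration; the function name `block_schedule` and the way a remainder is spread over the first streams are assumptions, not the compiler's documented behavior.

```python
def block_schedule(num_iterations, num_streams):
    """Return one contiguous range of iteration indices per stream."""
    base, extra = divmod(num_iterations, num_streams)
    blocks = []
    start = 0
    for s in range(num_streams):
        # Assumption: any remainder is handed out one iteration at a time
        # to the first `extra` streams.
        size = base + (1 if s < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks

blocks = block_schedule(100, 10)   # 10 streams, 10 adjacent iterations each
```

With 100 iterations and 10 streams, each stream receives 10 adjacent iterations, matching the example in the definition.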

cabinet control processor
Product: Cray X2, Cray XE series, Cray XMT, Cray XT series

A microprocessor in the cabinet that communicates with the HSS via the HSS network to monitor and control the devices in a system cabinet. See also Hardware Supervisory System (HSS).

cage
Product: Cray XMT

A chassis on a Cray XMT series system. See chassis.

chassis
Product: Cray XMT

The hardware component of a Cray XMT cabinet that houses blades. Each cabinet contains three vertically stacked chassis, and each chassis contains eight vertically mounted blades. See also cage.

Cray SeaStar chip
Product: Cray XMT

The component of the system interconnection network that provides message routing and communication services. See also system interconnection network.

dependence analysis
Product: Cray XMT

A technique used by the compiler to determine whether any iteration of a loop depends on any other iteration (this is known as a loop-carried dependence).

dynamic scheduling
Product: Cray XMT

In a dynamic schedule, the compiler does not bind iterations to streams at loop startup. Instead, streams compete for each iteration using a shared counter.
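
The shared-counter idea can be sketched with ordinary threads. This is an illustrative Python analogy only: `dynamic_schedule`, the lock, and the `claimed` bookkeeping are assumptions for the example and do not reflect how the Cray XMT compiler or hardware implements the counter.

```python
import threading

def dynamic_schedule(num_iterations, num_streams, body):
    """Run body(i) for each i; workers claim iterations from a shared counter."""
    next_iteration = 0                          # the shared counter
    lock = threading.Lock()                     # makes "read and increment" atomic
    claimed = [[] for _ in range(num_streams)]  # which worker ran which iteration

    def stream(sid):
        nonlocal next_iteration
        while True:
            with lock:
                i = next_iteration
                next_iteration += 1
            if i >= num_iterations:
                return                          # no iterations left to claim
            claimed[sid].append(i)
            body(i)

    workers = [threading.Thread(target=stream, args=(s,)) for s in range(num_streams)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return claimed
```

Which worker ends up with which iteration varies from run to run, but every iteration is executed exactly once.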

future
Product: Cray XMT

Implements user-specified or explicit parallelism by starting new threads. A future is a sequence of code that can be executed by a newly created thread that is running concurrently with other threads in the program. Futures delay the execution of code if the code is using a value that is computed by a future, until the future completes. The thread that spawns the future uses parameters to pass information from the future to the waiting thread, which then executes. In a program, the term future is used as a type qualifier for a synchronization variable or as a keyword for a future statement.
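
For readers unfamiliar with the concept, the sketch below shows the general idea using Python's standard `concurrent.futures` module. It is an analogy only and does not show the Cray XMT `future` keyword or its syntax; `slow_square` and `main` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

def slow_square(x):
    """Stand-in for the code sequence executed by the newly created thread."""
    return x * x

def main():
    with ThreadPoolExecutor(max_workers=2) as pool:
        fut = pool.submit(slow_square, 7)   # starts running concurrently
        other = sum(range(10))              # the spawning thread keeps working
        # Code that needs the value waits here until the future completes.
        return other + fut.result()
```

The call to `fut.result()` plays the role of the delayed code: it cannot proceed until the value computed by the future is available.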

induction variable
Product: Cray XMT

A variable that is increased or decreased by a fixed amount on each iteration of a loop.

inductive loop
Product: Cray XMT

An inductive loop is one that contains no loop-carried dependences and has the following characteristics: it has a single entrance at the top of the loop; it is controlled by an induction variable; and it has a single exit that is controlled by comparing the induction variable against an invariant.

interleaved scheduling
Product: Cray XMT

Method of executing loop iterations used by the compiler where contiguous iterations are assigned to distinct streams. For example, for a loop with 100 iterations and 10 streams, one stream performs iterations 1, 11, 21, ..., while another stream performs iterations 2, 12, 22, ..., and so on. This method is typically used for triangular loops because it reduces imbalances. One disadvantage of this method is that data reuse between loop iterations is lost, because the stream is released at the end of each iteration.
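
The assignment pattern, and why it helps with triangular loops, can be sketched as follows. This is an illustrative Python model: `interleaved_schedule` and `triangular_work` are hypothetical names, and the cost model (iteration i performs i units of inner work) is an assumption chosen to represent a triangular loop.

```python
def interleaved_schedule(num_iterations, num_streams):
    """Iterations are numbered from 1; stream s gets s+1, s+1+num_streams, ..."""
    return [list(range(s + 1, num_iterations + 1, num_streams))
            for s in range(num_streams)]

def triangular_work(iterations):
    """Assumed cost model: iteration i of a triangular loop does i units of work."""
    return sum(iterations)

schedule = interleaved_schedule(100, 10)
work = [triangular_work(its) for its in schedule]   # work per stream
```

Under this cost model the lightest stream does 460 units of work and the heaviest does 550. If the same loop were split into contiguous blocks of 10 iterations, the first block (iterations 1 to 10) would cost 55 units and the last (iterations 91 to 100) would cost 955.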

L0 processor
Product: Cray XMT

See blade control processor.

linear recurrence
Product: Cray XMT

A special type of recurrence that can be parallelized.
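
The glossary does not give the form of the recurrence. The sketch below assumes the common first-order case x[i] = a[i] * x[i-1] + b[i] and shows why it can be split into independent pieces: each step is an affine map, and composing affine maps is associative. The function names and the chunking strategy are illustrative assumptions, not the compiler's method.

```python
from functools import reduce

def compose(f, g):
    """Combine two affine maps (a, b), meaning x -> a*x + b: apply f, then g."""
    a1, b1 = f
    a2, b2 = g
    return (a2 * a1, a2 * b1 + b2)

def solve_sequential(a, b, x0):
    """Reference: evaluate the recurrence one step at a time."""
    x = x0
    for ai, bi in zip(a, b):
        x = ai * x + bi
    return x

def solve_by_chunks(a, b, x0, num_chunks):
    """Reduce each chunk to one affine map, then combine the chunk results."""
    steps = list(zip(a, b))
    size = max(1, -(-len(steps) // num_chunks))          # ceiling division
    chunks = [steps[i:i + size] for i in range(0, len(steps), size)]
    # Each chunk depends only on its own steps, so chunks could run in parallel
    # (here they are processed in a plain loop).
    partials = [reduce(compose, chunk) for chunk in chunks]
    total = reduce(compose, partials, (1, 0))            # (1, 0) is the identity map
    return total[0] * x0 + total[1]
```

Both functions return the same final value; only the chunked version exposes work that does not depend on the previous iteration's result.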

logical machine
Product: Cray XMT

An administrator-defined portion of a physical Cray XMT system, operating as an independent computing resource.

loop-carried dependences
Product: Cray XMT

A dependence in which a value computed in one iteration of a loop is used during a subsequent iteration of the loop. The compiler generally cannot parallelize a loop of this type.

multicore
Product: Cascade, Cray X2, Cray XMT, Cray XT series

A processor that combines multiple independent execution engines ("cores"), each with its own cache and cache controller.

multiprocessor mode
Product: Cray XMT

A mode that can be set at compile time that ensures that when the compiled application is run, iterations of a loop are run on multiple processors.

node
Product: Cray XT series, Cray XMT, Cray XE series, Cray X2

For CLE systems, the logical group of processor(s), memory, and network components that acts as a network end point on the system interconnection network.

phase
Product: Cray XMT

A set of one or more sections of code that streams execute in parallel. Each section contains an iteration of a loop. Phases and sections are contained in control flow code generated by the compiler to control the parallel execution of a function.

recurrence
Product: Cray XMT

A recurrence occurs when a loop uses values computed in one iteration in subsequent iterations. These subsequent uses of the value imply loop-carried dependences and thus usually prevent parallelization. To increase parallelization, use linear recurrence.

reduction
Product: Cray XMT

A simple form of recurrence that reduces a large amount of data to a single value. It is commonly used to find the minimum and maximum elements of a vector. Although similar to a recurrence, it is easier to parallelize and uses less memory.
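
A minimal sketch of the minimum/maximum case follows, written in Python for illustration. The function `chunked_min_max` and its chunking strategy are assumptions; the point is only that partial results from independent chunks can be combined at the end.

```python
def chunked_min_max(values, num_chunks):
    """Find (min, max) of a non-empty sequence via independent partial results."""
    size = max(1, -(-len(values) // num_chunks))          # ceiling division
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    # Each partial result depends only on its own chunk, so the chunks
    # could be processed by different streams (here, a plain loop).
    partial = [(min(c), max(c)) for c in chunks]
    # Combining the partial results needs only one pair of values per chunk.
    return min(p[0] for p in partial), max(p[1] for p in partial)
```

For example, `chunked_min_max([5, 3, 9, 1, 7, 2], 3)` yields the pair `(1, 9)`.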

region
Product: Cray XMT

A region is an area in code where threads are forked in order to perform a parallel operation. The region ends at the point where the threads join back together at the end of the parallel operation.

service node
Product: Cray XMT

Performs support functions for applications and system services such as login, network, I/O, boot, and service database (SDB). Service nodes run a version of CLE.

single-processor mode
Product: Cray XMT

A mode that can be set at compile time that ensures that when the compiled application is run, iterations of a loop are run on a single processor.

Source:

http://docs.cray.com/cgi-bin/craydoc.cgi?mode=Glossary;q=product%3dxmt

Knowledge Base

http://docs.cray.com/kbase/plat.html

Overview of Gemini Hardware Counters

http://docs.cray.com/books/S-0025-10//S-0025-10.pdf

TotalView

New Feature http://docs.cray.com/books/S-6503-65/S-6503-65.pdf

PGI® User’s Guide Parallel Fortran, C and C++ for Scientists and Engineers : http://docs.cray.com/books/S-6516-71/S-6516-71-apr08.pdf

About the guide:

http://docs.cray.com/books/004-2182-003/03preface.pdf

Scientific Libraries User’s Guide 004–2151–002

http://docs.cray.com/books/004-2151-002//004-2151-002-manual.pdf

PGI® User’s Guide Parallel Fortran, C and C++ for Scientists and Engineers

http://docs.cray.com/books/S-6516-61/pgi61ug.pdf

PGI® User’s Guide Parallel Fortran, C and C++ for Scientists and Engineers

http://docs.cray.com/books/S-6516-70/S-6516-70-mar07.pdf

PAPI User’s Guide

http://docs.cray.com/books/S-6515-35/S-6515-35.pdf

SuperLU Users' Guide, James W. Demmel, John R. Gilbert, Xiaoye S. Li, September 1999 (last update: October 2003)

http://docs.cray.com/books/S-6532-10/ug.pdf

SuperLU Users’ Guide, James W. Demmel, John R. Gilbert, Xiaoye S. Li, September 1999 (last update: June 2009)

http://docs.cray.com/books/S-6532-20/6532-20.pdf

February 2011 Programming Environments Release Announcement

http://docs.cray.com/books/S-9401-1102//S-9401-1102.pdf

Guide to Parallel Vector Applications 004–2182–003

http://docs.cray.com/books/004-2182-003/004-2182-003-manual.pdf

CrayDoc™ Installation and Administration Guide S–2340–21

http://docs.cray.com/books/S-2340-21/S-2340-21-manual.pdf

Comparing Binaries Between Cray Linux Environment (CLE) Systems, Standalone Whiteboxes, and ESLogin Nodes

http://docs.cray.com/books/S-0019-10//S-0019-10.pdf

Cray Application Developer's Environment User's Guide

http://docs.cray.com/books/S-2396-601/S-2396-601.pdf

Cray Application Developer's Environment User's Guide

http://docs.cray.com/books/S-2396-60/S-2396-60.pdf

Cray Application Developer's Environment User's Guide

http://docs.cray.com/books/S-2396-50/S-2396-50.pdf

AMD Core Math Library (ACML) Version 4.3.0

http://docs.cray.com/books/S-6511-43/S-6511-43.pdf

AMD Core Math Library (ACML) Version 4.0.0

http://docs.cray.com/books/S-6511-40/acml_400_userguide.pdf

Cray Fortran Reference Manual

http://docs.cray.com/books/S-3901-80/S-3901-80.pdf

Cray C and C++ Reference Manual

http://docs.cray.com/books/S-2179-80/S-2179-80.pdf

Lustre File System Operations Manual - Version 1.8

http://docs.cray.com/books/S-6540-1815/S-6540-1815.pdf

Cray Linux Environment™ (CLE) 4.0 Software Release Overview

http://docs.cray.com/books/S-2425-40/S-2425-40.pdf

Cray XT™ System Overview

http://docs.cray.com/books/S-2423-22/S-2423-22.pdf

Cray X1™ Series System Overview S–2346–25

http://docs.cray.com/books/S-2346-25/S-2346-25.pdf

Migrating Applications to the Cray X1™ Series Systems S–2378–54

http://docs.cray.com/books/S-2378-54/S-2378-54.pdf

intro_biolib(3)

http://docs.cray.com/cgi-bin/craydoc.cgi?idx=man_search;q=id%3dintro_biolib.3;mode=Show;f=man/biolibm/30/cat3/intro_biolib.3.html

Getting Started on Cray X2™ Systems S–2471–60

http://docs.cray.com/books/S-2471-60/S-2471-60.pdf

Cray XT5h™ System Overview S–2472–21

http://docs.cray.com/books/S-2472-21/S-2472-21.pdf

Cray® Programming Environment 6.0 Releases Overview and Installation Guide S–5212–60

http://docs.cray.com/books/S-5212-60/S-5212-60.pdf

Cray® Fortran Reference Manual S–3901–60

http://docs.cray.com/books/S-3901-60/S-3901-60.pdf

Cray® C and C++ Reference Manual S–2179–60

http://docs.cray.com/books/S-2179-60/S-2179-60.pdf

Cray Performance Analysis Tools 5.3 Release Overview and Installation Guide

http://docs.cray.com/books/S-2474-53/S-2474-53.pdf

Cray XMT™ System Overview

http://docs.cray.com/books/S-2466-20/S-2466-20.pdf

Cray XMT™ Programming Environment User's Guide

http://docs.cray.com/books/S-2479-20/S-2479-20.pdf

Cray XMT™ Programming Model

http://docs.cray.com/books/S-2367-20/S-2367-20.pdf

Cray XMT™ Debugger Reference Guide

http://docs.cray.com/books/S-2467-20/S-2467-20.pdf

Cray XMT™ Performance Tools User's Guide

http://docs.cray.com/books/S-2462-20/S-2462-20.pdf

Optimizing Loop-Level Parallelism in Cray XMT™ Applications

http://docs.cray.com/books/S-2487-14/S-2487-14.pdf

Limiting Loop Parallelism in Cray XMT™ Applications, June 21, 2010

http://docs.cray.com/books/S-0027-14/S-0027-14.pdf

Cray DVS Installation and Configuration Private S–0005–10

http://docs.cray.com/books/S-0005-10//S-0005-10.pdf

Application Cleanup by ALPS and Node Health Monitoring

http://docs.cray.com/books/S-0014-22/S-0014-22.pdf

Application Programmer’s I/O Guide S–3695–36

http://docs.cray.com/books/S-3695-36/S-3695-36-manual.pdf

Overview of Gemini Hardware Counters

This document describes the Gemini Performance Counters and how to use them to optimize individual applications and system traffic. Send e-mail to docs@cray.com with any comments that will help us to improve the accuracy and usability of this document. Be sure to include the title and number of the document with your comments. We value your comments and will respond to them promptly.

Accessing network performance counters is desirable for application developers, system library developers (e.g. MPI), and system administrators. Application developers want to improve their application run-times or measure what effect other traffic on the system has on their application. System library developers want to optimize their collective operations. System administrators want to observe the system, looking for hotspots.

Effective with the CrayPat (Cray performance analysis tool) version 5.1 and Cray Linux Environment (CLE) version 3.1 software releases for the Cray XE platform, users can monitor many of the performance counters that reside on the Gemini networking chip.

There are two categories of Gemini performance counters available to users. NIC performance counters record information about the data moving through the Network Interface Controller (NIC). On the Gemini ASIC there are two NICs, each attached to a compute node. Thus, the data from the NIC performance counters reflects network transfers beginning and ending on the node.
These performance counters are read-only.

Network router tile counters are available on a per-Gemini basis. There are both read-only and read/write tile counters. Each chip has 48 router tiles, arranged in a 6x8 grid. Eight processor tiles connect to each of the two Gemini NICs. Each NIC connects to a different node, running separate Linux instances.

If collection at other points of the application is desired, use the CrayPat API to insert regions as described in the pat_build man page. It is recommended that you do not collect any other performance data when collecting network counters. Data collection of network counters is much more expensive than other performance data collection, and will skew other results.

At the time the instrumented executable program is launched with the aprun command, a set of environment variables, PAT_RT_NWPC_*, provide access to the Gemini network performance counters. These environment variables are described in the intro_craypat man page.

Using the Cray Gemini Hardware Counters (S–0025–10)

1.1 Using CrayPat to Monitor Gemini Counters

The CrayPat utility pat_build instruments an executable file. One aspect of the instrumentation includes intercepting entries into and returns out of a function. This is known formally as tracing. Information such as time stamps and performance counter values are recorded at this time.

CrayPat supports instrumentation of an application binary for collection of Gemini counters. Counter values are recorded at application runtime, and are presented to the user through a table generated by pat_report. The CrayPat user interface to request instrumentation is similar to that for processor performance counters. There is no Gemini counter display available in Cray Apprentice2 at this time. A new display will be available in a subsequent release of the Cray Apprentice2 software.
Although the user interface to request network counters is similar to processor counters, there are some significant differences that must be understood. Depending on the type of counters requested, some are shared across all processors within a node, some are shared between two nodes, and some are shared across all applications passing through a chip. Some counters monitor all traffic for your application, even on nodes that are not reserved for your application, and some monitor locally; that is, they monitor only traffic associated with nodes assigned to a Gemini chip and no other traffic from the network. Users should also be aware that access to the network counters is more resource-intensive than access to the processor performance counters.

Because Gemini counters are a shared resource, the system software is designed to provide dedicated access whenever possible. This is done through the Application Level Placement Scheduler (ALPS) by ensuring that an application collecting counters is not placed on the same Gemini chip as another application collecting performance counters. It does not, however, prevent a second application that is not collecting counters from being placed on the same Gemini chip. This compromise assures better system utilization because compute nodes are not left unavailable for use by another application.

The CrayPat 5.1 release focuses on the use of the NIC and ORB counters available within the Gemini chip. The values collected from these counters are local to a node and therefore specific to an application. Traffic between MPI ranks cannot be distinguished through the counters. The event names that CrayPat supports are listed at the end of this document.

Network counters are only collected for the MAIN thread. Values are collected at the beginning and end of the instrumented application. Instrumentation overhead is minimal. This gives a high-level view of the program's use of the networking router in terms of the counters specified.
Currently the time to access counter data is too expensive to collect more frequently. A future release of CLE will address these performance limitations.

Before attempting the following examples, verify that your system has a Gemini network:

    $ module list xtpe-network-gemini

Attempting to collect Gemini performance counters on a system that does not have the Gemini network will result in a fatal error:

    $ aprun -n 16 my_program+pat
    CrayPat/X: Version 5.1 Revision 3329 05/20/10 11:26:16
    pat[FATAL][0]: initialization of NW performance counter API failed [No such file or directory]

Example 1. Collect stalls associated with node traffic to and from the network

This example enables tracing of MAIN.

    $ pat_build -w my_program
    $ export PAT_RT_NWPC=GM_ORB_PERF_VC0_STALLED,GM_ORB_PERF_VC1_STALLED
    $ aprun my_program+pat

Example 2. Display network counter data

    $ pat_report my_program+pat+11171-41tdot.xf > counter_rpt

Example output from pat_report:

    NWPC Data by Function Group and Function

    Group / Function / Node Id=0='HIDE'
    =====================================================================
    Total
    ---------------------------------------------------------------------
    Time%                                  100.0%
    Time                                 2.476423 secs
    GM_ORB_PERF_VC1_STALLED                     0
    GM_ORB_PERF_VC1_BLOCKED                     0
    GM_ORB_PERF_VC1_BLOCKED_PKT_GEN             0
    GM_ORB_PERF_VC1_PKTS                       48
    GM_ORB_PERF_VC1_FLITS                      48
    GM_ORB_PERF_VC0_STALLED                   111
    GM_ORB_PERF_VC0_PKTS                       48
    GM_ORB_PERF_VC0_FLITS                     201
    =====================================================================

Example 3. Collect data for a custom group of network counters

In this example a user creates two groups of network events in a file called my_nwpc_groups, one called 1 and the other called CQ_AMO:

    $ cat my_nwpc_groups
    # Group 1: Outstanding Request Buffer
    1 = GM_ORB_PERF_VC1_STALLED, GM_ORB_PERF_VC1_BLOCKED, GM_ORB_PERF_VC1_BLOCKED_PKT_GEN, GM_ORB_PERF_VC1_PKTS, GM_ORB_PERF_VC1_FLITS, GM_ORB_PERF_VC0_STALLED, GM_ORB_PERF_VC0_PKTS, GM_ORB_PERF_VC0_FLITS
    # Group CQ_AMO:
    CQ_AMO = GM_AMO_PERF_COUNTER_EN, GM_AMO_PERF_CQ_FLIT_CNTR, GM_AMO_PERF_CQ_PKT_CNTR, GM_AMO_PERF_CQ_STALLED_CNTR, GM_AMO_PERF_CQ_BLOCKED_CNTR
    $ pat_build -w my_program
    $ export PAT_RT_NWPC_FILE=my_nwpc_groups
    $ export PAT_RT_NWPC=1,CQ_AMO
    $ aprun -n16 my_program+pat

Example output from pat_report:

    NWPC Data by Function Group and Function

    Group / Function / Node Id=0='HIDE'
    =====================================================================
    Total
    ---------------------------------------------------------------------
    Time%                                  100.0%
    Time                                 2.639046 secs
    GM_ORB_PERF_VC1_STALLED                 72525
    GM_ORB_PERF_VC1_PKTS                    50457
    GM_AMO_PERF_COUNTER_EN                      0
    GM_AMO_PERF_CQ_FLIT_CNTR                11752
    GM_AMO_PERF_CQ_PKT_CNTR                  5876
    GM_AMO_PERF_CQ_STALLED_CNTR              5092
    GM_AMO_PERF_CQ_BLOCKED_CNTR                29
    =====================================================================

Example 4. Suppress instrumented entry points from recording performance data to reduce overhead

This example assumes a NWPC group FMAS exists and is available for use. Because the program is traced, the PAT_RT_TRACE_FUNCTION_NAME environment variable is set to suppress any data collection by already instrumented entry points in my_program+pat. This means that NWPC values will only be recorded for the MAIN thread at the start and the end of the instrumented program. Instrumentation overhead is minimal.
    $ pat_build -u -g mpi my_program
    $ export PAT_RT_NWPC=FMAS
    $ export PAT_RT_TRACE_FUNCTION_NAME=*:0
    $ aprun -n32 my_program+pat

This gives a high-level view of the program's use of the networking router in terms of what the FMAS group describes. If more details about NWPC use during execution of the program are desired, the PAT_RT_TRACE_FUNCTION_NAME environment variable need not be set, but the significant overhead injected by reading the NWPCs may make the resulting performance data inaccurate.

To selectively collect NWPCs and the other performance data for traced functions, add them to the end of PAT_RT_TRACE_FUNCTION_NAME:

    $ export PAT_RT_TRACE_FUNCTION_NAME=0:*,mxm,MPI_Bcast

1.2 Gemini NIC Counters

To better understand how to use the NIC counters, you need to understand some of the terminology specific to the Gemini network architecture.

The Block Transfer Engine (BTE)

A Gemini network packet typically consists of one or more flits, which are the units of flow control for the network. Because flits are usually larger than the physical datapath, they are divided into phits, which are the units of data that the network can handle physically. A packet must contain at least two phits, one for the header and one for the cyclical redundancy check (CRC).

The V0 counters support the request channel and the V1 counters support the response channel. A flit/pkt ratio can tell the user whether the data entering the network was aligned; for example, a ratio greater than 1 indicates that misaligned data is being sent across the network. Because there is a bandwidth/pipe size difference between outgoing and incoming traffic (outgoing is smaller), in general you will notice more stalls on the V0 (request) channel.

The following counters are recommended as a way to begin using the Gemini NWPC:

    GM_ORB_PERF_VC0_STALLED
    GM_ORB_PERF_VC1_STALLED
    GM_ORB_PERF_VC0_PKTS
    GM_ORB_PERF_VC1_PKTS
    GM_ORB_PERF_VC0_FLITS
    GM_ORB_PERF_VC1_FLITS

Table 1. Atomic Memory Operations Performance Counters

Each entry gives the counter name followed by its description.

GM_AMO_PERF_ACP_COMP_CNTR
Number of Atomic Memory Operation (AMO) computations that have occurred.

GM_AMO_PERF_ACP_MEM_UPDATE_CNTR
Number of AMO logic cache write-throughs that have occurred.

GM_AMO_PERF_ACP_STALL_CNTR
Number of AMO logic pipeline stalls that have occurred.

GM_AMO_PERF_AMO_HEADER_CNTR
Number of request headers processed by the Decode Logic that have had an AMO computation. Error packets are not counted.

GM_AMO_PERF_COUNTER_EN
When set, counting is enabled. When cleared, counting is disabled.

GM_AMO_PERF_CQ_BLOCKED_CNTR
Number of cycles the CQ FIFO is blocked.

GM_AMO_PERF_CQ_FLIT_CNTR
Number of flits (network flow control units) that are read from the CQ FIFO.

GM_AMO_PERF_CQ_PKT_CNTR
Number of packets that are read from the CQ FIFO.

GM_AMO_PERF_CQ_STALLED_CNTR
Number of cycles the CQ FIFO is stalled.

GM_AMO_PERF_DONE_INV_CNTR
Number of times a valid cache entry was invalidated because there were no more outstanding AMO requests targeting it and the last request did not have the cacheable bit set.

GM_AMO_PERF_ERROR_HEADER_CNTR
Number of request headers processed by the Decode Logic that have had errors.

GM_AMO_PERF_FLUSH_HEADER_CNTR
Number of request headers processed by the Decode Logic that have had a Flush command. Error packets are not counted.

GM_AMO_PERF_FULL_INV_CNTR
Number of times a valid but inactive cache entry was invalidated to make room for a new AMO address. A high value in this counter indicates that there are too many cacheable AMO addresses and that the cache is being thrashed.

GM_AMO_PERF_GET_HEADER_CNTR
Number of request headers processed by the Decode Logic that have had a GET command. Error packets are not counted.

GM_AMO_PERF_MSGCOMP_HEADER_CNTR
Number of request headers processed by the Decode Logic that have had a MsgComplete command. Error packets are not counted.

GM_AMO_PERF_PUT_HEADER_CNTR
Number of request headers processed by the Decode Logic that have had a PUT command. Error packets are not counted.

GM_AMO_PERF_REQLIST_FULL_STALL_CNTR
Number of times an AMO request causes the NRP to stall waiting for a Request List entry to become free.

GM_AMO_PERF_RMT_BLOCKED_CNTR
Number of cycles the RMT FIFO is blocked.

GM_AMO_PERF_RMT_FLIT_CNTR
Number of flits that are read from the RMT FIFO.

GM_AMO_PERF_RMT_PKT_CNTR
Number of packets that are read from the RMT FIFO.

GM_AMO_PERF_RMT_STALLED_CNTR
Number of cycles the RMT FIFO is stalled.

GM_AMO_PERF_TAG_HIT_CNTR
Number of AMO requests that have been processed in the Tag Store and have resulted in a cache hit.

GM_AMO_PERF_TAG_MISS_CNTR
Number of AMO requests that have been processed in the Tag Store and have resulted in a cache miss.

GM_AMO_PERF_TAG_STALL_CNTR
Number of times a GET/PUT request hits in the cache and causes the NRP to stall.

Table 2. Fast Memory Access Performance Counters

Each entry gives the counter name followed by its description.

GM_FMA_PERF_CQ_PKT_CNT
Number of packets from Fast Memory Access (FMA) to CQ.

GM_FMA_PERF_CQ_STALLED_CNT
Number of clock cycles FMA_CQ was stalled due to lack of credits.

GM_FMA_PERF_HT_NP_REQ_FLIT_CNT
Number of HT NP request flits to FMA.

GM_FMA_PERF_HT_NP_REQ_PKT_CNT
Number of HT NP request packets to FMA.

GM_FMA_PERF_HT_P_REQ_FLIT_CNT
Number of HT P request flits to FMA.

GM_FMA_PERF_HT_P_REQ_PKT_CNT
Number of HT P request packets to FMA.

GM_FMA_PERF_HT_RSP_PKT_CNT
Number of HT response packets from FMA to HT.

GM_FMA_PERF_HT_RSP_STALLED_CNT
Number of clock cycles FMA_HT_RSP was stalled due to lack of credits.

GM_FMA_PERF_TARB_FLIT_CNT
Number of flits from FMA to TARB.

GM_FMA_PERF_TARB_PKT_CNT
Number of packets from FMA to TARB.

GM_FMA_PERF_TARB_STALLED_CNT
Number of clock cycles FMA_TARB was stalled due to lack of credits.

Table 3.
Hyper-transport Arbiter Performance Counters Name Description GM_HARB_PERF_AMO_NP_BLOCKED Number of times AMO Non-Posted Queue has an entry, but is blocked from using the Non-Posted Initiator Request output channel by the BTE Non-Posted Queue. The Local Block has read/write access to the full counter. Bits 63:48 of this MMR are unimplemented and always return zero. This MMR is reset to all zeros by the chip reset (i_reset), but not by HT reset (i_ht_reset). GM_HARB_PERF_AMO_NP_FLITS Number of ?its coming out of the AMO Non-Posted Queue. The Local Block has read/write access to the full counter. Bits 63:48 of this MMR are unimplemented and always return zero. This MMR is reset to all zeros by the chip reset (i_reset), but not by HT reset (i_ht_reset). GM_HARB_PERF_AMO_NP_PKTS Number of packets coming out of the AMO Non-Posted Queue. The Local Block has read/write access to the full counter. Bits 63:48 of this MMR are unimplemented and always return zero. This MMR is reset to all zeros by the chip reset (i_reset), but not by HT reset (i_ht_reset). GM_HARB_PERF_AMO_NP_STALLED Number of cycles the AMO Non-Posted Queue is stalled due to a lack credits on the Non-Posted Initiator Request channel. The Local Block has read/write access to the full counter. Bits 63:48 of this MMR are unimplemented and always return zero. This MMR is reset to all zeros by the chip reset (i_reset), but not by HT reset (i_ht_reset). GM_HARB_PERF_AMO_P_ACP_BLOCKED Number of times AMO Posted AMO Computation Pipe Queue has an entry, but is blocked from using the Posted Initiator Request output channel by another Posted Queue. The Local Block has read/write access to the full counter. Bits 63:48 of this MMR are unimplemented and always return zero. This MMR is reset to all zeros by the chip reset (i_reset), but not by HT reset (i_ht_reset). GM_HARB_PERF_AMO_P_ACP_FLITS Number of ?its coming out of the AMO Posted AMO Computation Pipe Queue. The Local Block has read/write access to the full counter. 
Bits 63:48 of this MMR are unimplemented and always return zero. This MMR is reset to all zeros by the chip reset (i_reset), but not by HT reset (i_ht_reset). S–0025–10 9Using the Cray Gemini Hardware Counters Name Description GM_HARB_PERF_AMO_P_ACP_PKTS Number of packets coming out of the AMO Posted AMO Computation Pipe Queue. The Local Block has read/write access to the full counter. Bits 63:48 of this MMR are unimplemented and always return zero. This MMR is reset to all zeros by the chip reset (i_reset), but not by HT reset (i_ht_reset). GM_HARB_PERF_AMO_P_ACP_STALLED Number of cycles the AMO Posted AMO Computation Pipe Queue is stalled due to a lack credits on the Posted Initiator Request channel. The Local Block has read/write access to the full counter. Bits 63:48 of this MMR are unimplemented and always return zero. This MMR is reset to all zeros by the chip reset (i_reset), but not by HT reset (i_ht_reset). GM_HARB_PERF_AMO_P_NRP_BLOCKED Number of times AMO Posted New Request Pipe Queue has an entry, but is blocked from using the Posted Initiator Request output channel by another Posted Queue. The Local Block has read/write access to the full counter. Bits 63:48 of this MMR are unimplemented and always return zero. This MMR is reset to all zeros by the chip reset (i_reset), but not by HT reset (i_ht_reset). GM_HARB_PERF_AMO_P_NRP_FLITS Number of ?its coming out of the AMO Posted New Request Pipe Queue. The Local Block has read/write access to the full counter. Bits 63:48 of this MMR are unimplemented and always return zero. This MMR is reset to all zeros by the chip reset (i_reset), but not by HT reset (i_ht_reset). GM_HARB_PERF_AMO_P_NRP_PKTS Number of packets coming out of the AMO Posted New Request Pipe Queue. The Local Block has read/write access to the full counter. Bits 63:48 of this MMR are unimplemented and always return zero. This MMR is reset to all zeros by the chip reset (i_reset), but not by HT reset (i_ht_reset). 
GM_HARB_PERF_AMO_P_NRP_STALLED Number of cycles the AMO Posted New Request Pipe Queue is stalled due to a lack credits on the Posted Initiator Request channel. The Local Block has read/write access to the full counter. Bits 63:48 of this MMR are unimplemented and always return zero. This MMR is reset to all zeros by the chip reset (i_reset), but not by HT reset (i_ht_reset). 10 S–0025–10Overview of Gemini Hardware Counters Name Description GM_HARB_PERF_BTE_NP_BLOCKED Number of times AMO Non-Posted BTE Queue has an entry, but is blocked from using the Non-Posted Initiator Request output channel by another Non-Posted Queue. The Local Block has read/write access to the full counter. Bits 63:48 of this MMR are unimplemented and always return zero. This MMR is reset to all zeros by the chip reset (i_reset), but not by HT reset (i_ht_reset). GM_HARB_PERF_BTE_NP_FLITS Number of ?its coming out of the AMO Non-Posted BTE Queue. The Local Block has read/write access to the full counter. Bits 63:48 of this MMR are unimplemented and always return zero. This MMR is reset to all zeros by the chip reset (i_reset), but not by HT reset (i_ht_reset). GM_HARB_PERF_BTE_NP_PKTS Number of packets coming out of the AMO Non-Posted BTE Queue. The Local Block has read/write access to the full counter. Bits 63:48 of this MMR are unimplemented and always return zero. This MMR is reset to all zeros by the chip reset (i_reset), but not by HT reset (i_ht_reset). GM_HARB_PERF_BTE_NP_STALLED Number of cycles the AMO Non-Posted BTE Queue is stalled due to a lack credits on the Posted Initiator Request channel. The Local Block has read/write access to the full counter. Bits 63:48 of this MMR are unimplemented and always return zero. This MMR is reset to all zeros by the chip reset (i_reset), but not by HT reset (i_ht_reset). GM_HARB_PERF_BTE_P_BLOCKED Number of times AMO Posted BTE Queue has an entry, but is blocked from using the Posted Initiator Request output channel by another Posted Queue. 
The Local Block has read/write access to the full counter. Bits 63:48 of this MMR are unimplemented and always return zero. This MMR is reset to all zeros by the chip reset (i_reset), but not by HT reset (i_ht_reset).
GM_HARB_PERF_BTE_P_FLITS  Number of flits coming out of the AMO Posted BTE Queue. The Local Block has read/write access to the full counter. Bits 63:48 of this MMR are unimplemented and always return zero. This MMR is reset to all zeros by the chip reset (i_reset), but not by HT reset (i_ht_reset).
GM_HARB_PERF_BTE_P_PKTS  Number of packets coming out of the AMO Posted BTE Queue. The Local Block has read/write access to the full counter. Bits 63:48 of this MMR are unimplemented and always return zero. This MMR is reset to all zeros by the chip reset (i_reset), but not by HT reset (i_ht_reset).
GM_HARB_PERF_BTE_P_STALLED  Number of cycles the AMO Posted BTE Queue is stalled due to a lack of credits on the Posted Initiator Request channel. The Local Block has read/write access to the full counter. Bits 63:48 of this MMR are unimplemented and always return zero. This MMR is reset to all zeros by the chip reset (i_reset), but not by HT reset (i_ht_reset).
GM_HARB_PERF_COUNTER_EN  When set, counting is enabled. When clear, counting is disabled. This MMR is reset by the chip reset (i_reset), but not by HT reset (i_ht_reset).
GM_HARB_PERF_IREQ_NP_FLITS  Number of flits on the non-posted initiator request output of the HARB block. The Local Block has read/write access to the full counter. Bits 63:48 of this MMR are unimplemented and always return zero. This MMR is reset to all zeros by the chip reset (i_reset), but not by HT reset (i_ht_reset).
GM_HARB_PERF_IREQ_NP_PKTS  Number of packets on the non-posted initiator request output of the HARB block. The Local Block has read/write access to the full counter.
Bits 63:48 of this MMR are unimplemented and always return zero. This MMR is reset to all zeros by the chip reset (i_reset), but not by HT reset (i_ht_reset).
GM_HARB_PERF_IREQ_NP_STALLED  Number of cycles the non-posted initiator request output of the HARB is stalled due to a lack of credits on the Non-Posted Initiator Request channel. The Local Block has read/write access to the full counter. Bits 63:48 of this MMR are unimplemented and always return zero. This MMR is reset to all zeros by the chip reset (i_reset), but not by HT reset (i_ht_reset).
GM_HARB_PERF_IREQ_P_FLITS  Number of flits on the posted initiator request output of the HARB block. The Local Block has read/write access to the full counter. Bits 63:48 of this MMR are unimplemented and always return zero. This MMR is reset to all zeros by the chip reset (i_reset), but not by HT reset (i_ht_reset).
GM_HARB_PERF_IREQ_P_PKTS  Number of packets on the posted initiator request output of the HARB block. The Local Block has read/write access to the full counter. Bits 63:48 of this MMR are unimplemented and always return zero. This MMR is reset to all zeros by the chip reset (i_reset), but not by HT reset (i_ht_reset).
GM_HARB_PERF_IREQ_P_STALLED  Number of cycles the posted initiator request output of the HARB is stalled due to a lack of credits on the Posted Initiator Request channel. The Local Block has read/write access to the full counter. Bits 63:48 of this MMR are unimplemented and always return zero. This MMR is reset to all zeros by the chip reset (i_reset), but not by HT reset (i_ht_reset).
GM_HARB_PERF_RAT_P_BLOCKED  Number of times the AMO Posted RAT Queue has an entry, but is blocked from using the Posted Initiator Request output channel by another Posted Queue. The Local Block has read/write access to the full counter.
Bits 63:48 of this MMR are unimplemented and always return zero. This MMR is reset to all zeros by the chip reset (i_reset), but not by HT reset (i_ht_reset).
GM_HARB_PERF_RAT_P_FLITS  Number of flits coming out of the AMO Posted RAT Queue. The Local Block has read/write access to the full counter. Bits 63:48 of this MMR are unimplemented and always return zero. This MMR is reset to all zeros by the chip reset (i_reset), but not by HT reset (i_ht_reset).
GM_HARB_PERF_RAT_P_PKTS  Number of packets coming out of the AMO Posted RAT Queue. The Local Block has read/write access to the full counter. Bits 63:48 of this MMR are unimplemented and always return zero. This MMR is reset to all zeros by the chip reset (i_reset), but not by HT reset (i_ht_reset).
GM_HARB_PERF_RAT_P_STALLED  Number of cycles the AMO Posted RAT Queue is stalled due to a lack of credits on the Posted Initiator Request channel. The Local Block has read/write access to the full counter. Bits 63:48 of this MMR are unimplemented and always return zero. This MMR is reset to all zeros by the chip reset (i_reset), but not by HT reset (i_ht_reset).

Table 4. Network Address Translation Performance Counters

Name  Description
GM_NAT_PERF_BTE_BLOCKED  Number of cycles a BTE translation is blocked due to arbitration loss.
GM_NAT_PERF_BTE_STALLED  Number of cycles a BTE translation is stalled due to MMR access.
GM_NAT_PERF_BTE_TRANSLATIONS  Number of translations performed for the BTE interface.
GM_NAT_PERF_COUNTER_EN  When set, counting is enabled. When cleared, counting is disabled.
GM_NAT_PERF_REQ_BLOCKED  Number of cycles a REQ translation is blocked due to arbitration loss.
GM_NAT_PERF_REQ_STALLED  Number of cycles a REQ translation is stalled due to MMR access.
GM_NAT_PERF_REQ_TRANSLATIONS  Number of translations performed for the REQ interface.
GM_NAT_PERF_RSP_BLOCKED  Number of cycles a RSP translation is blocked due to arbitration loss.
GM_NAT_PERF_RSP_STALLED  Number of cycles a RSP translation is stalled due to MMR access.
GM_NAT_PERF_RSP_TRANSLATIONS  Number of translations performed for the RSP interface.
GM_NAT_PERF_TRANS_ERROR0  Number of translations that failed due to error 0 (Uncorrectable error in translation).
GM_NAT_PERF_TRANS_ERROR1  Number of translations that failed due to error 1 (VMDH table invalid entry).
GM_NAT_PERF_TRANS_ERROR2  Number of translations that failed due to error 2 (MDDT/MRT invalid or illegal entry).
GM_NAT_PERF_TRANS_ERROR3  Number of translations that failed due to error 3 (Protection tag violation).
GM_NAT_PERF_TRANS_ERROR4  Number of translations that failed due to error 4 (memory bounds error).
GM_NAT_PERF_TRANS_ERROR5  Number of translations that failed due to error 5 (write permission error).

Table 5. Netlink Performance Counters

Name  Description
GM_NL_PERF_ALL_LCBS_REQS_TO_NIC_0_STALLED  Number of ticks all LCBs requests have stalled to NIC 0.
GM_NL_PERF_ALL_LCBS_REQS_TO_NIC_1_STALLED  Number of ticks all LCBs requests have stalled to NIC 1.
GM_NL_PERF_ALL_LCBS_RSP_TO_NIC_0_STALLED  Number of ticks all LCBs responses have stalled to NIC 0.
GM_NL_PERF_ALL_LCBS_RSP_TO_NIC_1_STALLED  Number of ticks all LCBs responses have stalled to NIC 1.
GM_NL_PERF_CNTRL  Controls the performance counters. Writing a 1 to the Start field starts the counters. Writing a 1 to the Stop field stops the counters. Writing a 1 to the Clear field clears the counters.
GM_NL_PERF_LCB_n_REQ_CMP_22  Decompressed request data to two phit LCB_n, where n is a value from 0 to 7 that specifies the LCB.
GM_NL_PERF_LCB_n_REQ_CMP_44  Decompressed request data to one phit LCB_n, where n is a value from 0 to 7 that specifies the LCB.
GM_NL_PERF_LCB_n_REQ_TO_NIC_0  Number of requests from LCB_n to NIC 0.
GM_NL_PERF_LCB_n_REQ_TO_NIC_0_STALLED  Number of ticks LCB_n requests are blocked to NIC 0.
GM_NL_PERF_LCB_n_REQ_TO_NIC_1  Number of requests from LCB_n to NIC 1.
GM_NL_PERF_LCB_n_REQ_TO_NIC_1_STALLED  Number of ticks LCB_n requests are blocked to NIC 1.
GM_NL_PERF_LCB_n_REQ_TO_PHITS  Number of request phits received on LCB_n.
GM_NL_PERF_LCB_n_REQ_TO_PKTS  Number of request packets received on LCB_n.
GM_NL_PERF_LCB_n_RSP_CMP_22  Decompressed response data to two phit LCB_n.
GM_NL_PERF_LCB_n_RSP_TO_NIC_1  Number of responses from LCB_n to NIC 1.
GM_NL_PERF_LCB_n_RSP_TO_NIC_1_STALLED  Number of ticks LCB_n responses are blocked to NIC 1.
GM_NL_PERF_NIC_0_REQ_STALLED_TO_ALL_LCBS  Number of ticks NIC_0 requests are blocked to all LCBs.
GM_NL_PERF_NIC_0_REQ_TO_LCB_n  Number of requests from NIC_0 to LCB_n.
GM_NL_PERF_NIC_0_REQ_TO_LCB_n_STALLED  Number of ticks NIC_0 requests are blocked to LCB_n.
GM_NL_PERF_NIC_0_RSP_STALLED_TO_ALL_LCBS  Number of ticks NIC_0 responses are blocked to all LCBs.
GM_NL_PERF_NIC_0_RSP_TO_LCB_n  Number of responses from NIC_0 to LCB_n.
GM_NL_PERF_NIC_0_RSP_TO_LCB_n_STALLED  Number of ticks NIC_0 responses are blocked to LCB_n.
GM_NL_PERF_NIC_1_REQ_STALLED_TO_ALL_LCBS  Number of ticks NIC_1 requests are blocked to all LCBs.
GM_NL_PERF_NIC_1_REQ_TO_LCB_n  Number of requests from NIC_1 to LCB_n.
GM_NL_PERF_NIC_1_REQ_TO_LCBn_STALLED  Number of ticks NIC_1 requests are blocked to LCB_n.
GM_NL_PERF_NIC_1_RSP_STALLED_TO_ALL_LCBS  Number of ticks NIC_1 responses are blocked to all LCBs.
GM_NL_PERF_NIC_1_RSP_TO_LCB_n  Number of responses from NIC_1 to LCB_n.
GM_NL_PERF_NIC_1_RSP_TO_LCB_n_STALLED  Number of ticks NIC_1 responses are blocked to LCB_n.

Table 6. NPT Performance Counters

Name  Description
GM_NPT_PERF_ACP_BLOCKED_CNTR  Number of cycles the ACP FIFO is blocked.
GM_NPT_PERF_ACP_FLIT_CNTR  Number of flits that are read from the ACP FIFO.
GM_NPT_PERF_ACP_PKT_CNTR  Number of packets that are read from the ACP FIFO.
GM_NPT_PERF_ACP_STALLED_CNTR  Number of cycles the ACP FIFO is stalled.
GM_NPT_PERF_BTE_RSP_PKT_CNTR  Number of packets that are sent to the Netlink as Get or Flush responses.
GM_NPT_PERF_COUNTER_EN  Provides the count enable.
GM_NPT_PERF_FILL_RSP_PKT_CNTR  Number of packets that are sent to the AMO block as fill responses.
GM_NPT_PERF_HTIRSP_ERR_CNTR  Number of packets that are received from the HT cave and have an error status.
GM_NPT_PERF_HTIRSP_FLIT_CNTR  Number of flits that are received from the HT cave.
GM_NPT_PERF_HTIRSP_PKT_CNTR  Number of packets that are received from the HT cave.
GM_NPT_PERF_LB_BLOCKED_CNTR  Number of cycles the LB FIFO is blocked.
GM_NPT_PERF_LB_FLIT_CNTR  Number of flits that are read from the LB FIFO.
GM_NPT_PERF_LB_PKT_CNTR  Number of packets that are read from the LB FIFO.
GM_NPT_PERF_LB_STALLED_CNTR  Number of cycles the LB FIFO is stalled.
GM_NPT_PERF_NL_RSP_PKT_CNTR  Number of packets that are sent to the AMO block as fill responses.
GM_NPT_PERF_NPT_BLOCKED_CNTR  Number of cycles the NPT FIFO is blocked.
GM_NPT_PERF_NPT_FLIT_CNTR  Number of flits that are read from the NPT FIFO.
GM_NPT_PERF_NPT_PKT_CNTR  Number of packets that are read from the NPT FIFO.
GM_NPT_PERF_NPT_STALLED_CNTR  Number of cycles the NPT FIFO is stalled.
GM_NPT_PERF_NRP_BLOCKED_CNTR  Number of cycles the NRP FIFO is blocked.
GM_NPT_PERF_NRP_FLIT_CNTR  Number of flits that are read from the NRP FIFO.
GM_NPT_PERF_NRP_PKT_CNTR  Number of packets that are read from the NRP FIFO.
GM_NPT_PERF_NRP_STALLED_CNTR  Number of cycles the NRP FIFO is stalled.

Table 7. ORB Performance Counters

Name  Description
GM_ORB_PERF_VC0_FLITS  Number of flits to come into the TX Input Queue from the SSID.
GM_ORB_PERF_VC0_PKTS  Number of packets to come into the TX Input Queue from the SSID.
GM_ORB_PERF_VC0_STALLED  Number of packets not given access to the TX Control Logic because there are not enough credits available from the NL Block, or there are no available memory locations from the ORD RAM, or a tail flit has not been received in the ORB Input Queue when performing store-and-forward.
GM_ORB_PERF_VC1_BLOCKED  Number of packets not given access to the RX Control Logic because the read address and write address into the ORD RAM are attempting to access the same bank of the ORD RAM or because there is a read access to the ORD RAM from the Local Block.
GM_ORB_PERF_VC1_BLOCKED_PKT_GEN  Number of times the RX Response FIFO is blocked because a packet in the RX Control Logic is being translated into the format used by the rest of the NIC.
GM_ORB_PERF_VC1_FLITS  Number of flits to come into the Receive Response FIFO from the network.
GM_ORB_PERF_VC1_PKTS  Number of packets to come into the Receive Response FIFO from the network.
GM_ORB_PERF_VC1_STALLED  Number of packets not given access to the RX Control Logic because there are not enough credits available from the RAT.

Table 8. RAT Performance Counters

Name  Description
GM_RAT_PERF_COUNTER_EN  Enables the performance counters.
GM_RAT_PERF_DATA_FLITS_VC0  Number of data flits received on VC0 (request pipeline).
GM_RAT_PERF_DATA_FLITS_VC1  Number of data flits received on VC1 (request pipeline).
GM_RAT_PERF_HEADER_FLITS_VC0  Number of header flits received on VC0 (request pipeline).
GM_RAT_PERF_HEADER_FLITS_VC1  Number of header flits received on VC1 (request pipeline).
GM_RAT_PERF_STALLED_CREDITS_VC0  Number of cycles VC0 (request pipeline) is stalled due to insufficient credits.
GM_RAT_PERF_STALLED_CREDITS_VC1  Number of cycles VC1 (request pipeline) is stalled due to insufficient credits.
GM_RAT_PERF_STALLED_TRANSLATION_VC0  Number of cycles VC0 (request pipeline) is stalled due to unavailable translation data.
GM_RAT_PERF_STALLED_TRANSLATION_VC1  Number of cycles VC1 (request pipeline) is stalled due to unavailable translation data.
GM_RAT_PERF_TRANSLATION_ERRORS_VC0  Number of translation errors seen on VC0 (request pipeline).
GM_RAT_PERF_TRANSLATION_ERRORS_VC1  Number of translation errors seen on VC1 (request pipeline).
GM_RAT_PERF_TRANSLATIONS_VC0  Number of translations requested on VC0 (request pipeline).
GM_RAT_PERF_TRANSLATIONS_VC1  Number of translations requested on VC1 (request pipeline).

Table 9. RMT Performance Counters

Name  Description
GM_RMT_PERF_PUT_BYTES_RX  Tally of bytes received in all PUT packets that had the RMT Enable field set and that entered and exited the RMT with OK status.
GM_RMT_PERF_PUT_CAM_EVIT  PUT sequences evicted from the CAM.
GM_RMT_PERF_PUT_CAM_FILL  New PUT sequence packet arrived and successfully allocated in the CAM.
GM_RMT_PERF_PUT_CAM_HITS  Packet for PUT sequence currently stored in RMT arrived and successfully located entry in CAM.
GM_RMT_PERF_PUT_CAM_MISS  New PUT sequence packet arrived, but did not allocate because CAM was full.
GM_RMT_PERF_PUT_PARITY  Number of sequences evicted from CAM due to uncorrectable parity errors.
GM_RMT_PERF_PUT_RECV_COMPLETE  Number of MsgRcvComplete packets received which evicted a CAM entry.
GM_RMT_PERF_PUT_TIMEOUTS  Number of sequences evicted from CAM due to timeout.
GM_RMT_PERF_SEND_BYTES_RX  Tally of bytes received in all SEND packets that had the RMT Enable field set and that entered and exited the RMT with OK status.
GM_RMT_PERF_SEND_CAM_EVIT  SEND sequences evicted from the CAM.
GM_RMT_PERF_SEND_CAM_FILL  New SEND sequence packet arrived and successfully allocated in the CAM.
GM_RMT_PERF_SEND_CAM_HITS  Packet for SEND sequence currently stored in RMT arrived and successfully located entry in CAM.
GM_RMT_PERF_SEND_CAM_MISS  New SEND sequence packet arrived, but did not allocate because CAM was full.
GM_RMT_PERF_SEND_PARITY  Number of sequences evicted from CAM due to uncorrectable parity errors.
GM_RMT_PERF_SEND_ABORTS  Number of SEND sequences that were aborted.
GM_RMT_PERF_SEND_TIMEOUTS  Number of sequences evicted from CAM due to timeout.

Table 10. SSID Performance Counters

Name  Description
GM_SSID_PERF_COMPLETION_COUNT_1  Provides a count of completed request packet sequences. The type of sequence completions counted by this register is controlled by the SSID Performance – Completion Count Selector Register.
GM_SSID_PERF_COMPLETION_COUNT_2  Provides a count of completed request packet sequences. The type of sequence completions counted by this register is controlled by the SSID Performance – Completion Count Selector Register.
GM_SSID_PERF_COMPLETION_COUNT_SELECTOR  Specifies the types of completion events that are counted in the SSID Performance – Completion Count 1 Register (bits 3-0) and the SSID Performance – Completion Count 2 Register (bits 11-8). See the table of SSID_PerfCompletionCountSelect Encoding values for encoding of these fields.
GM_SSID_PERF_OUT_STALLED_DURATION  The accumulated number of cycles of cclk for which the SSID had a valid flit available to send to the ORB but sending of the flit had to be stalled while waiting for a credit from the ORB. This value is cleared by writing any value to this register.
GM_SSID_PERF_OUTOFSSIDS_COUNT  The number of Allocate SSID requests that have been received for which processing of the request had to be stalled for one or more clock cycles because a free SSID was not immediately available to service the request. This value is cleared by writing any value to this register.
GM_SSID_PERF_OUTOFSSIDS_DURATION  The accumulated number of cycles of cclk for which processing of Allocate SSID requests has been stalled because a free SSID is not available to service the request. This value is cleared by writing any value to this register.
GM_SSID_PERF_SSID_ALLOCATE_COUNT  The total number of Allocate SSID requests that have been received, across all channels (all FMA descriptors and all BTE VCs), since this register was last cleared, and that resulted in an SSID actually being allocated. Allocate SSID requests that do not result in an SSID being allocated (i.e., redundant Allocate requests) are not counted. This value is cleared by writing any value to this register.
GM_SSID_PERF_SSIDS_IN_USE  Bits 7-0 specify the number of SSIDs currently in use across all Request Channels. This value is not affected by writes to this register. This field is initialized to its reset value by a full reset and by an HT reset. Bits 23-16 specify the maximum number of SSIDs that have been in use simultaneously, across all channels (all FMA descriptors and all BTE VCs), since this register was last initialized. This value is initialized to CurrentSSIDsInUse by writing any value to this register. This field is initialized to its reset value by a full reset.

Table 11.
Transmit Arbiter Performance Counters

Name  Description
GM_TARB_PERF_BTE_BLOCKED  Transmit Arbiter (TARB) Performance BTE Blocked Count
GM_TARB_PERF_BTE_FLITS  TARB Performance BTE Flit Count
GM_TARB_PERF_BTE_PKTS  TARB Performance BTE Packet Count
GM_TARB_PERF_BTE_STALLED  TARB Performance BTE Stalled Count
GM_TARB_PERF_FMA_BLOCKED  TARB Performance FMA Blocked Count
GM_TARB_PERF_FMA_FLITS  TARB Performance FMA Flit Count
GM_TARB_PERF_FMA_PKTS  TARB Performance FMA Packet Count
GM_TARB_PERF_FMA_STALLED  TARB Performance FMA Stalled Count
GM_TARB_PERF_LB_BLOCKED  TARB Performance LB Blocked Count
GM_TARB_PERF_LB_FLITS  TARB Performance LB Flit Count
GM_TARB_PERF_LB_PKTS  TARB Performance LB Packet Count
GM_TARB_PERF_LB_STALLED  TARB Performance LB Stalled Count
GM_TARB_PERF_OUT_FLITS  TARB Performance Output Flit Count
GM_TARB_PERF_OUT_PKTS  TARB Performance Output Packet Count
GM_TARB_PERF_OUT_STALLED  TARB Performance Output Stalled Count

1.3 Gemini Tile MMRs

The Gemini network consists of 48 tiles, arranged in 6 rows of 8 columns. Within each tile there are memory-mapped registers associated with the LCB and with the rest of the tile. The local block has shared connections to each row of tiles. By default, when only the name of the MMR is used, an event is counted on all 48 tiles. To address an individual tile, append the row (0-5) and column (0-7) to the name, as shown in the table.

Table 12. Description of Gemini Tile MMRs

Name  Description
GM_TILE_PERF_VC0_PHIT_CNT:n:m  Number of vc0 phits read from inq buffer
GM_TILE_PERF_VC1_PHIT_CNT:n:m  Number of vc1 phits read from inq buffer
GM_TILE_PERF_VC0_PKT_CNT:n:m  Number of vc0 packets read from inq buffer
GM_TILE_PERF_VC10_PKT_CNT:n:m  Number of vc1 packets read from inq buffer
GM_TILE_PERF_INQ_STALL:n:m  Number of clock periods a valid reference is blocked from the routing pipeline.
GM_TILE_PERF_CREDIT_STALL:n:m  Number of clock periods a valid reference is stalled in the column buffers, waiting on transmission credits.

Using the Cray Gemini Hardware Counters
S–0025–10

© 2010 Cray Inc. All Rights Reserved. This document or parts thereof may not be reproduced in any form unless permitted by contract or by written permission of Cray Inc.

Cray, LibSci, PathScale, and UNICOS are federally registered trademarks and Active Manager, Baker, Cascade, Cray Apprentice2, Cray Apprentice2 Desktop, Cray C++ Compiling System, Cray CX, Cray CX1, Cray CX1-iWS, Cray CX1-LC, Cray CX1000, Cray CX1000-C, Cray CX1000-G, Cray CX1000-S, Cray CX1000-SC, Cray CX1000-SM, Cray CX1000-HN, Cray Fortran Compiler, Cray Linux Environment, Cray SHMEM, Cray X1, Cray X1E, Cray X2, Cray XD1, Cray XE, Cray XE6, Cray XMT, Cray XR1, Cray XT, Cray XTm, Cray XT3, Cray XT4, Cray XT5, Cray XT5h, Cray XT5m, Cray XT6, Cray XT6m, CrayDoc, CrayPort, CRInform, ECOphlex, Gemini, Libsci, NodeKARE, RapidArray, SeaStar, SeaStar2, SeaStar2+, Threadstorm, UNICOS/lc, UNICOS/mk, and UNICOS/mp are trademarks of Cray Inc.

Version 1.0 Published July 2010 Supports CrayPat release 5.1 and CLE release 3.1 running on Cray XT systems.

Overview of Gemini Hardware Counters

This document describes the Gemini Performance Counters and how to use them to optimize individual applications and system traffic. Send e-mail to docs@cray.com with any comments that will help us to improve the accuracy and usability of this document. Be sure to include the title and number of the document with your comments. We value your comments and will respond to them promptly.

Accessing network performance counters is desirable for application developers, system library developers (e.g., MPI), and system administrators. Application developers want to improve their application run-times or measure what effect other traffic on the system has on their application.
System library developers want to optimize their collective operations. System administrators want to observe the system, looking for hotspots.

Effective with the CrayPat (Cray performance analysis tool) version 5.1 and Cray Linux Environment (CLE) version 3.1 software releases for the Cray XE platform, users can monitor many of the performance counters that reside on the Gemini networking chip.

There are two categories of Gemini performance counters available to users.

NIC performance counters record information about the data moving through the Network Interface Controller (NIC). On the Gemini ASIC there are two NICs, each attached to a compute node. Thus, the data from the NIC performance counters reflects network transfers beginning and ending on the node. These performance counters are read-only.

Network router tile counters are available on a per-Gemini basis. There are both read-only and read/write tile counters. Each chip has 48 router tiles, arranged in a 6x8 grid. Eight processor tiles connect to each of the two Gemini NICs. Each NIC connects to a different node, running separate Linux instances.

If collection at other points of the application is desired, use the CrayPat API to insert regions as described in the pat_build man page.

It is recommended that you do not collect any other performance data when collecting network counters. Data collection of network counters is much more expensive than other performance data collection, and will skew other results.

At the time the instrumented executable program is launched with the aprun command, a set of environment variables, PAT_RT_NWPC_*, provide access to the Gemini network performance counters. These environment variables are described in the intro_craypat man page.

1.1 Using CrayPat to Monitor Gemini Counters

The CrayPat utility pat_build instruments an executable file.
One aspect of the instrumentation includes intercepting entries into and returns out of a function. This is known formally as tracing. Information such as time stamps and performance counter values is recorded at this time.

CrayPat supports instrumentation of an application binary for collection of Gemini counters. Counter values are recorded at application runtime, and are presented to the user through a table generated by pat_report. The CrayPat user interface to request instrumentation is similar to that for processor performance counters. There is no Gemini counter display available in Cray Apprentice2 at this time. A new display will be available in a subsequent release of the Cray Apprentice2 software.

Although the user interface to request network counters is similar to processor counters, there are some significant differences that must be understood. Depending on the type of counters requested, some are shared across all processors within a node, some are shared between two nodes, and some are shared across all applications passing through a chip. Some counters monitor all traffic for your application, even on nodes that are not reserved for your application, and some monitor locally; that is, they monitor only traffic associated with nodes assigned to a Gemini chip and no other traffic from the network. Users should also be aware that access to the network counters is more resource-intensive than access to the processor performance counters.

Because Gemini counters are a shared resource, the system software is designed to provide dedicated access whenever possible. This is done through the Application Level Placement Scheduler (ALPS) by ensuring that an application collecting counters is not placed on the same Gemini chip as another application collecting performance counters. It does not, however, prevent a second application that is not collecting counters from being placed on the same Gemini chip.
This compromise assures better system utilization because compute nodes are not left unavailable for use by another application.

The CrayPat 5.1 release focuses on the use of the NIC and ORB counters available within the Gemini chip. The values collected from these counters are local to a node and therefore specific to an application. Traffic between MPI ranks cannot be distinguished through the counters. The event names that CrayPat supports are listed at the end of this document.

Network counters are only collected for the MAIN thread. Values are collected at the beginning and end of the instrumented application. Instrumentation overhead is minimal. This gives a high-level view of the program's use of the networking router in terms of the counters specified. Currently the time to access counter data is too expensive to collect more frequently. A future release of CLE will address these performance limitations.

Before attempting the following examples verify that your system has a Gemini network:

$ module list xtpe-network-gemini

Attempting to collect Gemini performance counters on a system that does not have the Gemini network will result in a fatal error:

$ aprun -n 16 my_program+pat
CrayPat/X: Version 5.1 Revision 3329 05/20/10 11:26:16
pat[FATAL][0]: initialization of NW performance counter API failed [No such file or directory]

Example 1. Collect stalls associated with node traffic to and from the network

This example enables tracing of MAIN.

$ pat_build -w my_program
$ export PAT_RT_NWPC=GM_ORB_PERF_VC0_STALLED,GM_ORB_PERF_VC1_STALLED
$ aprun my_program+pat

Example 2.
Display network counter data

$ pat_report my_program+pat+11171-41tdot.xf > counter_rpt

Example output from pat_report:

NWPC Data by Function Group and Function
Group / Function / Node Id=0='HIDE'
=====================================================================
Total
---------------------------------------------------------------------
Time%                               100.0%
Time                              2.476423 secs
GM_ORB_PERF_VC1_STALLED                  0
GM_ORB_PERF_VC1_BLOCKED                  0
GM_ORB_PERF_VC1_BLOCKED_PKT_GEN          0
GM_ORB_PERF_VC1_PKTS                    48
GM_ORB_PERF_VC1_FLITS                   48
GM_ORB_PERF_VC0_STALLED                111
GM_ORB_PERF_VC0_PKTS                    48
GM_ORB_PERF_VC0_FLITS                  201
=====================================================================

Example 3. Collect data for a custom group of network counters

In this example a user creates two groups of network events in a file called my_nwpc_groups, one called 1 and the other called CQ_AMO:

$ cat my_nwpc_groups
# Group 1: Outstanding Request Buffer
1 = GM_ORB_PERF_VC1_STALLED, GM_ORB_PERF_VC1_BLOCKED, GM_ORB_PERF_VC1_BLOCKED_PKT_GEN, GM_ORB_PERF_VC1_PKTS, GM_ORB_PERF_VC1_FLITS, GM_ORB_PERF_VC0_STALLED, GM_ORB_PERF_VC0_PKTS, GM_ORB_PERF_VC0_FLITS
# Group CQ_AMO:
CQ_AMO = GM_AMO_PERF_COUNTER_EN, GM_AMO_PERF_CQ_FLIT_CNTR, GM_AMO_PERF_CQ_PKT_CNTR, GM_AMO_PERF_CQ_STALLED_CNTR, GM_AMO_PERF_CQ_BLOCKED_CNTR

$ pat_build -w my_program
$ export PAT_RT_NWPC_FILE=my_nwpc_groups
$ export PAT_RT_NWPC=1,CQ_AMO
$ aprun -n16 my_program+pat

Example output from pat_report:

NWPC Data by Function Group and Function
Group / Function / Node Id=0='HIDE'
=====================================================================
Total
---------------------------------------------------------------------
Time%                               100.0%
Time                              2.639046 secs
GM_ORB_PERF_VC1_STALLED              72525
GM_ORB_PERF_VC1_PKTS                 50457
GM_AMO_PERF_COUNTER_EN                   0
GM_AMO_PERF_CQ_FLIT_CNTR             11752
GM_AMO_PERF_CQ_PKT_CNTR               5876
GM_AMO_PERF_CQ_STALLED_CNTR           5092
GM_AMO_PERF_CQ_BLOCKED_CNTR             29
=====================================================================

Example 4. Suppress instrumented entry points from recording performance data to reduce overhead

This example assumes a NWPC group FMAS exists and is available for use. Because the program is traced, the PAT_RT_TRACE_FUNCTION_NAME environment variable is set to suppress any data collection by already instrumented entry points in my_program+pat. This means that NWPC values will only be recorded for the MAIN thread at the start and the end of the instrumented program. Instrumentation overhead is minimal.

$ pat_build -u -g mpi my_program
$ export PAT_RT_NWPC=FMAS
$ export PAT_RT_TRACE_FUNCTION_NAME=*:0
$ aprun -n32 my_program+pat

This gives a high-level view of the program's use of the networking router in terms of what the FMAS group describes. If more details about NWPC use during execution of the program are desired, the PAT_RT_TRACE_FUNCTION_NAME environment variable need not be set, but the significant overhead injected by reading the NWPCs may make the resulting performance data inaccurate.

To selectively collect NWPCs and the other performance data for traced functions, add them to the end of PAT_RT_TRACE_FUNCTION_NAME:

$ export PAT_RT_TRACE_FUNCTION_NAME=0:*,mxm,MPI_Bcast

1.2 Gemini NIC Counters

To better understand how to use the NIC counters, you need to understand some of the terminology specific to the Gemini network architecture.

The Block Transfer Engine (BTE)

A Gemini network packet typically consists of one or more flits, which are the units of flow control for the network. Because flits are usually larger than the physical datapath, they are divided into phits, which are the units of data that the network can handle physically. A packet must contain at least two phits, one for the header and one for the cyclic redundancy check (CRC).

The V0 counters support the request channel and the V1 counters support the response channel.
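As a rough illustration of how the flit and packet counters relate, the following Python sketch computes the number of flits per packet on each channel. The values are copied from the Example 2 pat_report output above; the dictionary layout and the helper name flits_per_packet are assumptions made for this sketch and are not part of CrayPat.

```python
# Counter values copied from the Example 2 pat_report output.
# The dictionary layout is an assumption for illustration only.
counters = {
    "GM_ORB_PERF_VC0_PKTS": 48,
    "GM_ORB_PERF_VC0_FLITS": 201,
    "GM_ORB_PERF_VC1_PKTS": 48,
    "GM_ORB_PERF_VC1_FLITS": 48,
}


def flits_per_packet(values, channel):
    """Return flits divided by packets for 'VC0' or 'VC1'.

    Returns None when no packets were counted, to avoid dividing by zero.
    """
    pkts = values[f"GM_ORB_PERF_{channel}_PKTS"]
    flits = values[f"GM_ORB_PERF_{channel}_FLITS"]
    return flits / pkts if pkts else None


# VC0 is the request channel and VC1 is the response channel.
for channel, role in (("VC0", "request"), ("VC1", "response")):
    print(channel, role, flits_per_packet(counters, channel))
```

With the Example 2 values, the request channel averages a little over four flits per packet, while the response channel carries exactly one flit per packet.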
A flit/pkt ratio can tell the user if the data entering the network was not aligned; e.g., a ratio greater than 1 indicates that misaligned data is being sent across the network. Because there is a bandwidth/pipe size difference between outgoing and incoming (outgoing is smaller), in general you will notice more stalls on the V0 (request) channel.

The following counters are recommended as a way to begin using the Gemini NWPC:

GM_ORB_PERF_VC0_STALLED
GM_ORB_PERF_VC1_STALLED
GM_ORB_PERF_VC0_PKTS
GM_ORB_PERF_VC1_PKTS
GM_ORB_PERF_VC0_FLITS
GM_ORB_PERF_VC1_FLITS

Table 1. Atomic Memory Operations Performance Counters

Name  Description
GM_AMO_PERF_ACP_COMP_CNTR  Number of Atomic Memory Operation (AMO) computations that have occurred.
GM_AMO_PERF_ACP_MEM_UPDATE_CNTR  Number of AMO logic cache write-throughs that have occurred.
GM_AMO_PERF_ACP_STALL_CNTR  Number of AMO logic pipeline stalls that have occurred.
GM_AMO_PERF_AMO_HEADER_CNTR  Number of request headers processed by the Decode Logic that have had an AMO computation. Error packets are not counted.
GM_AMO_PERF_COUNTER_EN  When set, counting is enabled. When cleared, counting is disabled.
GM_AMO_PERF_CQ_BLOCKED_CNTR  Number of cycles the CQ FIFO is blocked.
GM_AMO_PERF_CQ_FLIT_CNTR  Number of flits (network flow control units) that are read from the CQ FIFO.
GM_AMO_PERF_CQ_PKT_CNTR  Number of packets that are read from the CQ FIFO.
GM_AMO_PERF_CQ_STALLED_CNTR  Number of cycles the CQ FIFO is stalled.
GM_AMO_PERF_DONE_INV_CNTR  Number of times a valid cache entry was invalidated because there were no more outstanding AMO requests targeting it and the last request did not have the cacheable bit set.
GM_AMO_PERF_ERROR_HEADER_CNTR  Number of request headers processed by the Decode Logic that have had errors.
GM_AMO_PERF_FLUSH_HEADER_CNTR  Number of request headers processed by the Decode Logic that have had a Flush command. Error packets are not counted.
GM_AMO_PERF_FULL_INV_CNTR  Number of times a valid but inactive cache entry was invalidated to make room for a new AMO address. A high value in this counter indicates that there are too many cacheable AMO addresses and that the cache is being thrashed.
GM_AMO_PERF_GET_HEADER_CNTR  Number of request headers processed by the Decode Logic that have had a GET command. Error packets are not counted.
GM_AMO_PERF_MSGCOMP_HEADER_CNTR  Number of request headers processed by the Decode Logic that have had a MsgComplete command. Error packets are not counted.
GM_AMO_PERF_PUT_HEADER_CNTR  Number of request headers processed by the Decode Logic that have had a PUT command. Error packets are not counted.
GM_AMO_PERF_REQLIST_FULL_STALL_CNTR  Number of times an AMO request causes the NRP to stall waiting for a Request List entry to become free.
GM_AMO_PERF_RMT_BLOCKED_CNTR  Number of cycles the RMT FIFO is blocked.
GM_AMO_PERF_RMT_FLIT_CNTR  Number of flits that are read from the RMT FIFO.
GM_AMO_PERF_RMT_PKT_CNTR  Number of packets that are read from the RMT FIFO.
GM_AMO_PERF_RMT_STALLED_CNTR  Number of cycles the RMT FIFO is stalled.
GM_AMO_PERF_TAG_HIT_CNTR  Number of AMO requests that have been processed in the Tag Store and have resulted in a cache hit.
GM_AMO_PERF_TAG_MISS_CNTR  Number of AMO requests that have been processed in the Tag Store and have resulted in a cache miss.
GM_AMO_PERF_TAG_STALL_CNTR  Number of times a GET/PUT request hits in the cache and causes the NRP to stall.

Table 2. Fast Memory Access Performance Counters

Name  Description
GM_FMA_PERF_CQ_PKT_CNT  Number of packets from Fast Memory Access (FMA) to CQ.
GM_FMA_PERF_CQ_STALLED_CNT  Number of clock cycles FMA_CQ was stalled due to lack of credits.
GM_FMA_PERF_HT_NP_REQ_FLIT_CNT  Number of HT NP request flits to FMA.
GM_FMA_PERF_HT_NP_REQ_PKT_CNT  Number of HT NP request packets to FMA.
GM_FMA_PERF_HT_P_REQ_FLIT_CNT  Number of HT P request flits to FMA.
GM_FMA_PERF_HT_P_REQ_PKT_CNT  Number of HT P request packets to FMA.
GM_FMA_PERF_HT_RSP_PKT_CNT  Number of HT response packets from FMA to HT.
GM_FMA_PERF_HT_RSP_STALLED_CNT  Number of clock cycles FMA_HT_RSP was stalled due to lack of credits.
GM_FMA_PERF_TARB_FLIT_CNT  Number of flits from FMA to TARB.
GM_FMA_PERF_TARB_PKT_CNT  Number of packets from FMA to TARB.
GM_FMA_PERF_TARB_STALLED_CNT  Number of clock cycles FMA_TARB was stalled due to lack of credits.

Table 3. HyperTransport Arbiter Performance Counters

The following applies to every counter in this table except GM_HARB_PERF_COUNTER_EN: the Local Block has read/write access to the full counter; bits 63:48 of the MMR are unimplemented and always return zero; and the MMR is reset to all zeros by the chip reset (i_reset), but not by the HT reset (i_ht_reset).

Name  Description
GM_HARB_PERF_AMO_NP_BLOCKED  Number of times the AMO Non-Posted Queue has an entry, but is blocked from using the Non-Posted Initiator Request output channel by the BTE Non-Posted Queue.
GM_HARB_PERF_AMO_NP_FLITS  Number of flits coming out of the AMO Non-Posted Queue.
GM_HARB_PERF_AMO_NP_PKTS  Number of packets coming out of the AMO Non-Posted Queue.
GM_HARB_PERF_AMO_NP_STALLED  Number of cycles the AMO Non-Posted Queue is stalled due to a lack of credits on the Non-Posted Initiator Request channel.
GM_HARB_PERF_AMO_P_ACP_BLOCKED  Number of times the AMO Posted AMO Computation Pipe Queue has an entry, but is blocked from using the Posted Initiator Request output channel by another Posted Queue.
GM_HARB_PERF_AMO_P_ACP_FLITS  Number of flits coming out of the AMO Posted AMO Computation Pipe Queue.
GM_HARB_PERF_AMO_P_ACP_PKTS  Number of packets coming out of the AMO Posted AMO Computation Pipe Queue.
GM_HARB_PERF_AMO_P_ACP_STALLED  Number of cycles the AMO Posted AMO Computation Pipe Queue is stalled due to a lack of credits on the Posted Initiator Request channel.
GM_HARB_PERF_AMO_P_NRP_BLOCKED  Number of times the AMO Posted New Request Pipe Queue has an entry, but is blocked from using the Posted Initiator Request output channel by another Posted Queue.
GM_HARB_PERF_AMO_P_NRP_FLITS  Number of flits coming out of the AMO Posted New Request Pipe Queue.
GM_HARB_PERF_AMO_P_NRP_PKTS  Number of packets coming out of the AMO Posted New Request Pipe Queue.
GM_HARB_PERF_AMO_P_NRP_STALLED  Number of cycles the AMO Posted New Request Pipe Queue is stalled due to a lack of credits on the Posted Initiator Request channel.
GM_HARB_PERF_BTE_NP_BLOCKED  Number of times the AMO Non-Posted BTE Queue has an entry, but is blocked from using the Non-Posted Initiator Request output channel by another Non-Posted Queue.
GM_HARB_PERF_BTE_NP_FLITS  Number of flits coming out of the AMO Non-Posted BTE Queue.
GM_HARB_PERF_BTE_NP_PKTS  Number of packets coming out of the AMO Non-Posted BTE Queue.
GM_HARB_PERF_BTE_NP_STALLED  Number of cycles the AMO Non-Posted BTE Queue is stalled due to a lack of credits on the Posted Initiator Request channel.
GM_HARB_PERF_BTE_P_BLOCKED  Number of times the AMO Posted BTE Queue has an entry, but is blocked from using the Posted Initiator Request output channel by another Posted Queue.
GM_HARB_PERF_BTE_P_FLITS  Number of flits coming out of the AMO Posted BTE Queue.
GM_HARB_PERF_BTE_P_PKTS  Number of packets coming out of the AMO Posted BTE Queue.
GM_HARB_PERF_BTE_P_STALLED  Number of cycles the AMO Posted BTE Queue is stalled due to a lack of credits on the Posted Initiator Request channel.
GM_HARB_PERF_COUNTER_EN  When set, counting is enabled. When clear, counting is disabled. This MMR is reset by the chip reset (i_reset), but not by HT reset (i_ht_reset).
GM_HARB_PERF_IREQ_NP_FLITS  Number of flits on the non-posted initiator request output of the HARB block.
GM_HARB_PERF_IREQ_NP_PKTS  Number of packets on the non-posted initiator request output of the HARB block.
GM_HARB_PERF_IREQ_NP_STALLED  Number of cycles the non-posted initiator request output of the HARB is stalled due to a lack of credits on the Non-Posted Initiator Request channel.
GM_HARB_PERF_IREQ_P_FLITS  Number of flits on the posted initiator request output of the HARB block.
GM_HARB_PERF_IREQ_P_PKTS  Number of packets on the posted initiator request output of the HARB block.
GM_HARB_PERF_IREQ_P_STALLED  Number of cycles the posted initiator request output of the HARB is stalled due to a lack of credits on the Posted Initiator Request channel.
GM_HARB_PERF_RAT_P_BLOCKED  Number of times the AMO Posted RAT Queue has an entry, but is blocked from using the Posted Initiator Request output channel by another Posted Queue.
GM_HARB_PERF_RAT_P_FLITS  Number of flits coming out of the AMO Posted RAT Queue.
GM_HARB_PERF_RAT_P_PKTS  Number of packets coming out of the AMO Posted RAT Queue.
GM_HARB_PERF_RAT_P_STALLED  Number of cycles the AMO Posted RAT Queue is stalled due to a lack of credits on the Posted Initiator Request channel.

Table 4. Network Address Translation Performance Counters

Name  Description
GM_NAT_PERF_BTE_BLOCKED  Number of cycles a BTE translation is blocked due to arbitration loss.
GM_NAT_PERF_BTE_STALLED  Number of cycles a BTE translation is stalled due to MMR access.
GM_NAT_PERF_BTE_TRANSLATIONS  Number of translations performed for the BTE interface.
GM_NAT_PERF_COUNTER_EN  When set, counting is enabled. When cleared, counting is disabled.
GM_NAT_PERF_REQ_BLOCKED  Number of cycles a REQ translation is blocked due to arbitration loss.
GM_NAT_PERF_REQ_STALLED  Number of cycles a REQ translation is stalled due to MMR access.
GM_NAT_PERF_REQ_TRANSLATIONS  Number of translations performed for the REQ interface.
GM_NAT_PERF_RSP_BLOCKED  Number of cycles a RSP translation is blocked due to arbitration loss.
GM_NAT_PERF_RSP_STALLED  Number of cycles a RSP translation is stalled due to MMR access.
GM_NAT_PERF_RSP_TRANSLATIONS  Number of translations performed for the RSP interface.
GM_NAT_PERF_TRANS_ERROR0  Number of translations that failed due to error 0 (uncorrectable error in translation).
GM_NAT_PERF_TRANS_ERROR1  Number of translations that failed due to error 1 (VMDH table invalid entry).
GM_NAT_PERF_TRANS_ERROR2  Number of translations that failed due to error 2 (MDDT/MRT invalid or illegal entry).
GM_NAT_PERF_TRANS_ERROR3  Number of translations that failed due to error 3 (protection tag violation).
GM_NAT_PERF_TRANS_ERROR4  Number of translations that failed due to error 4 (memory bounds error).
GM_NAT_PERF_TRANS_ERROR5  Number of translations that failed due to error 5 (write permission error).

Table 5. Netlink Performance Counters

Name  Description
GM_NL_PERF_ALL_LCBS_REQS_TO_NIC_0_STALLED  Number of ticks all LCB requests have stalled to NIC 0.
GM_NL_PERF_ALL_LCBS_REQS_TO_NIC_1_STALLED  Number of ticks all LCB requests have stalled to NIC 1.
GM_NL_PERF_ALL_LCBS_RSP_TO_NIC_0_STALLED  Number of ticks all LCB responses have stalled to NIC 0.
GM_NL_PERF_ALL_LCBS_RSP_TO_NIC_1_STALLED  Number of ticks all LCB responses have stalled to NIC 1.
GM_NL_PERF_CNTRL  Controls the performance counters. Writing a 1 to the Start field starts the counters. Writing a 1 to the Stop field stops the counters. Writing a 1 to the Clear field clears the counters.
GM_NL_PERF_LCB_n_REQ_CMP_22  Decompressed request data to two phit LCB_n, where n is a value from 0 to 7 that specifies the LCB.
GM_NL_PERF_LCB_n_REQ_CMP_44  Decompressed request data to one phit LCB_n, where n is a value from 0 to 7 that specifies the LCB.
GM_NL_PERF_LCB_n_REQ_TO_NIC_0  Number of requests from LCB_n to NIC 0.
GM_NL_PERF_LCB_n_REQ_TO_NIC_0_STALLED  Number of ticks LCB_n requests are blocked to NIC 0.
GM_NL_PERF_LCB_n_REQ_TO_NIC_1  Number of requests from LCB_n to NIC 1.
GM_NL_PERF_LCB_n_REQ_TO_NIC_1_STALLED  Number of ticks LCB_n requests are blocked to NIC 1.
GM_NL_PERF_LCB_n_REQ_TO_PHITS  Number of request phits received on LCB_n.
GM_NL_PERF_LCB_n_REQ_TO_PKTS  Number of request packets received on LCB_n.
GM_NL_PERF_LCB_n_RSP_CMP_22  Decompressed response data to two phit LCB_n.
GM_NL_PERF_LCB_n_RSP_TO_NIC_1  Number of responses from LCB_n to NIC 1.
GM_NL_PERF_LCB_n_RSP_TO_NIC_1_STALLED  Number of ticks LCB_n responses are blocked to NIC 1.
GM_NL_PERF_NIC_0_REQ_STALLED_TO_ALL_LCBS  Number of ticks NIC_0 requests are blocked to all LCBs.
GM_NL_PERF_NIC_0_REQ_TO_LCB_n  Number of requests from NIC_0 to LCB_n.
GM_NL_PERF_NIC_0_REQ_TO_LCB_n_STALLED  Number of ticks NIC_0 requests are blocked to LCB_n.
GM_NL_PERF_NIC_0_RSP_STALLED_TO_ALL_LCBS  Number of ticks NIC_0 responses are blocked to all LCBs.
GM_NL_PERF_NIC_0_RSP_TO_LCB_n  Number of responses from NIC_0 to LCB_n.
GM_NL_PERF_NIC_0_RSP_TO_LCB_n_STALLED  Number of ticks NIC_0 responses are blocked to LCB_n.
GM_NL_PERF_NIC_1_REQ_STALLED_TO_ALL_LCBS  Number of ticks NIC_1 requests are blocked to all LCBs.
GM_NL_PERF_NIC_1_REQ_TO_LCB_n  Number of requests from NIC_1 to LCB_n.
GM_NL_PERF_NIC_1_REQ_TO_LCB_n_STALLED  Number of ticks NIC_1 requests are blocked to LCB_n.
GM_NL_PERF_NIC_1_RSP_STALLED_TO_ALL_LCBS  Number of ticks NIC_1 responses are blocked to all LCBs.
GM_NL_PERF_NIC_1_RSP_TO_LCB_n  Number of responses from NIC_1 to LCB_n.
GM_NL_PERF_NIC_1_RSP_TO_LCB_n_STALLED  Number of ticks NIC_1 responses are blocked to LCB_n.

Table 6. NPT Performance Counters

Name  Description
GM_NPT_PERF_ACP_BLOCKED_CNTR  Number of cycles the ACP FIFO is blocked.
GM_NPT_PERF_ACP_FLIT_CNTR  Number of flits that are read from the ACP FIFO.
GM_NPT_PERF_ACP_PKT_CNTR  Number of packets that are read from the ACP FIFO.
GM_NPT_PERF_ACP_STALLED_CNTR  Number of cycles the ACP FIFO is stalled.
GM_NPT_PERF_BTE_RSP_PKT_CNTR  Number of packets that are sent to the Netlink as Get or Flush responses.
GM_NPT_PERF_COUNTER_EN  Provides the count enable.
GM_NPT_PERF_FILL_RSP_PKT_CNTR  Number of packets that are sent to the AMO block as fill responses.
GM_NPT_PERF_HTIRSP_ERR_CNTR  Number of packets that are received from the HT cave and have an error status.
GM_NPT_PERF_HTIRSP_FLIT_CNTR  Number of flits that are received from the HT cave.
GM_NPT_PERF_HTIRSP_PKT_CNTR  Number of packets that are received from the HT cave.
GM_NPT_PERF_LB_BLOCKED_CNTR  Number of cycles the LB FIFO is blocked.
GM_NPT_PERF_LB_FLIT_CNTR  Number of flits that are read from the LB FIFO.
GM_NPT_PERF_LB_PKT_CNTR  Number of packets that are read from the LB FIFO.
GM_NPT_PERF_LB_STALLED_CNTR  Number of cycles the LB FIFO is stalled.
GM_NPT_PERF_NL_RSP_PKT_CNTR  Number of packets that are sent to the AMO block as fill responses.
GM_NPT_PERF_NPT_BLOCKED_CNTR  Number of cycles the NPT FIFO is blocked.
GM_NPT_PERF_NPT_FLIT_CNTR  Number of flits that are read from the NPT FIFO.
GM_NPT_PERF_NPT_PKT_CNTR  Number of packets that are read from the NPT FIFO.
GM_NPT_PERF_NPT_STALLED_CNTR  Number of cycles the NPT FIFO is stalled.
GM_NPT_PERF_NRP_BLOCKED_CNTR  Number of cycles the NRP FIFO is blocked.
GM_NPT_PERF_NRP_FLIT_CNTR  Number of flits that are read from the NRP FIFO.
GM_NPT_PERF_NRP_PKT_CNTR  Number of packets that are read from the NRP FIFO.
GM_NPT_PERF_NRP_STALLED_CNTR  Number of cycles the NRP FIFO is stalled.

Table 7. ORB Performance Counters

Name  Description
GM_ORB_PERF_VC0_FLITS  Number of flits to come into the TX Input Queue from the SSID.
GM_ORB_PERF_VC0_PKTS  Number of packets to come into the TX Input Queue from the SSID.
GM_ORB_PERF_VC0_STALLED  Number of packets not given access to the TX Control Logic because there are not enough credits available from the NL Block, or there are no available memory locations from the ORD RAM, or a tail flit has not been received in the ORB Input Queue when performing store-and-forward.
GM_ORB_PERF_VC1_BLOCKED  Number of packets not given access to the RX Control Logic because the read address and write address into the ORD RAM are attempting to access the same bank of the ORD RAM or because there is a read access to the ORD RAM from the Local Block.
GM_ORB_PERF_VC1_BLOCKED_PKT_GEN  Number of times the RX Response FIFO is blocked because a packet in the RX Control Logic is being translated into the format used by the rest of the NIC.
GM_ORB_PERF_VC1_FLITS  Number of flits to come into the Receive Response FIFO from the network.
GM_ORB_PERF_VC1_PKTS  Number of packets to come into the Receive Response FIFO from the network.
GM_ORB_PERF_VC1_STALLED  Number of packets not given access to the RX Control Logic because there are not enough credits available from the RAT.

Table 8. RAT Performance Counters

Name  Description
GM_RAT_PERF_COUNTER_EN  Enables the performance counters.
GM_RAT_PERF_DATA_FLITS_VC0  Number of data flits received on VC0 (request pipeline).
GM_RAT_PERF_DATA_FLITS_VC1  Number of data flits received on VC1 (response pipeline).
GM_RAT_PERF_HEADER_FLITS_VC0  Number of header flits received on VC0 (request pipeline).
GM_RAT_PERF_HEADER_FLITS_VC1  Number of header flits received on VC1 (response pipeline).
GM_RAT_PERF_STALLED_CREDITS_VC0  Number of cycles VC0 (request pipeline) is stalled due to insufficient credits.
GM_RAT_PERF_STALLED_CREDITS_VC1  Number of cycles VC1 (response pipeline) is stalled due to insufficient credits.
GM_RAT_PERF_STALLED_TRANSLATION_VC0  Number of cycles VC0 (request pipeline) is stalled due to unavailable translation data.
GM_RAT_PERF_STALLED_TRANSLATION_VC1  Number of cycles VC1 (response pipeline) is stalled due to unavailable translation data.
GM_RAT_PERF_TRANSLATION_ERRORS_VC0  Number of translation errors seen on VC0 (request pipeline).
GM_RAT_PERF_TRANSLATION_ERRORS_VC1  Number of translation errors seen on VC1 (response pipeline).
GM_RAT_PERF_TRANSLATIONS_VC0  Number of translations requested on VC0 (request pipeline).
GM_RAT_PERF_TRANSLATIONS_VC1  Number of translations requested on VC1 (response pipeline).

Table 9. RMT Performance Counters

Name  Description
GM_RMT_PERF_PUT_BYTES_RX  Tally of bytes received in all PUT packets that had the RMT Enable field set and that entered and exited the RMT with OK status.
GM_RMT_PERF_PUT_CAM_EVIT  PUT sequences evicted from the CAM.
GM_RMT_PERF_PUT_CAM_FILL  New PUT sequence packet arrived and successfully allocated in the CAM.
GM_RMT_PERF_PUT_CAM_HITS  Packet for PUT sequence currently stored in RMT arrived and successfully located entry in CAM.
GM_RMT_PERF_PUT_CAM_MISS  New PUT sequence packet arrived, but did not allocate because CAM was full.
GM_RMT_PERF_PUT_PARITY  Number of sequences evicted from CAM due to uncorrectable parity errors.
GM_RMT_PERF_PUT_RECV_COMPLETE  Number of MsgRcvComplete packets received which evicted a CAM entry.
GM_RMT_PERF_PUT_TIMEOUTS  Number of sequences evicted from CAM due to timeout.
GM_RMT_PERF_SEND_BYTES_RX  Tally of bytes received in all SEND packets that had the RMT Enable field set and that entered and exited the RMT with OK status.
GM_RMT_PERF_SEND_CAM_EVIT  SEND sequences evicted from the CAM.
GM_RMT_PERF_SEND_CAM_FILL  New SEND sequence packet arrived and successfully allocated in the CAM.
GM_RMT_PERF_SEND_CAM_HITS  Packet for SEND sequence currently stored in RMT arrived and successfully located entry in CAM.
GM_RMT_PERF_SEND_CAM_MISS  New SEND sequence packet arrived, but did not allocate because CAM was full.
GM_RMT_PERF_SEND_PARITY  Number of sequences evicted from CAM due to uncorrectable parity errors.
GM_RMT_PERF_SEND_ABORTS  Number of SEND sequences that were aborted.
GM_RMT_PERF_SEND_TIMEOUTS  Number of sequences evicted from CAM due to timeout.

Table 10. SSID Performance Counters

Name  Description
GM_SSID_PERF_COMPLETION_COUNT_1  Provides a count of completed request packet sequences. The type of sequence completions counted by this register is controlled by the SSID Performance – Completion Count Selector Register.
GM_SSID_PERF_COMPLETION_COUNT_2  Provides a count of completed request packet sequences. The type of sequence completions counted by this register is controlled by the SSID Performance – Completion Count Selector Register.
GM_SSID_PERF_COMPLETION_COUNT_SELECTOR  Specifies the types of completion events that are counted in the SSID Performance – Completion Count 1 Register (bits 3-0) and the SSID Performance – Completion Count 2 Register (bits 11-8). See the table of SSID_PerfCompletionCountSelect Encoding values for encoding of these fields.
GM_SSID_PERF_OUT_STALLED_DURATION  The accumulated number of cycles of cclk for which the SSID had a valid flit available to send to the ORB but sending of the flit had to be stalled while waiting for a credit from the ORB. This value is cleared by writing any value to this register.
GM_SSID_PERF_OUTOFSSIDS_COUNT  The number of Allocate SSID requests that have been received for which processing of the request had to be stalled for one or more clock cycles because a free SSID was not immediately available to service the request. This value is cleared by writing any value to this register.
GM_SSID_PERF_OUTOFSSIDS_DURATION  The accumulated number of cycles of cclk for which processing of Allocate SSID requests has been stalled because a free SSID is not available to service the request. This value is cleared by writing any value to this register.
GM_SSID_PERF_SSID_ALLOCATE_COUNT  The total number of Allocate SSID requests that have been received, across all channels (all FMA descriptors and all BTE VCs), since this register was last cleared, and that resulted in an SSID actually being allocated. Allocate SSID requests that do not result in an SSID being allocated (i.e., redundant Allocate requests) are not counted. This value is cleared by writing any value to this register.
GM_SSID_PERF_SSIDS_IN_USE  Bits 7-0 specify the number of SSIDs currently in use across all Request Channels. This value is not affected by writes to this register. This field is initialized to its reset value by a full reset and by an HT reset. Bits 23-16 specify the maximum number of SSIDs that have been in use simultaneously, across all channels (all FMA descriptors and all BTE VCs), since this register was last initialized. This value is initialized to CurrentSSIDsInUse by writing any value to this register. This field is initialized to its reset value by a full reset.

Table 11. Transmit Arbiter Performance Counters

Name  Description
GM_TARB_PERF_BTE_BLOCKED  Transmit Arbiter (TARB) Performance BTE Blocked Count
GM_TARB_PERF_BTE_FLITS  TARB Performance BTE Flit Count
GM_TARB_PERF_BTE_PKTS  TARB Performance BTE Packet Count
GM_TARB_PERF_BTE_STALLED  TARB Performance BTE Stalled Count
GM_TARB_PERF_FMA_BLOCKED  TARB Performance FMA Blocked Count
GM_TARB_PERF_FMA_FLITS  TARB Performance FMA Flit Count
GM_TARB_PERF_FMA_PKTS  TARB Performance FMA Packet Count
GM_TARB_PERF_FMA_STALLED  TARB Performance FMA Stalled Count
GM_TARB_PERF_LB_BLOCKED  TARB Performance LB Blocked Count
GM_TARB_PERF_LB_FLITS  TARB Performance LB Flit Count
GM_TARB_PERF_LB_PKTS  TARB Performance LB Packet Count
GM_TARB_PERF_LB_STALLED  TARB Performance LB Stalled Count
GM_TARB_PERF_OUT_FLITS  TARB Performance Output Flit Count
GM_TARB_PERF_OUT_PKTS  TARB Performance Output Packet Count
GM_TARB_PERF_OUT_STALLED  TARB Performance Output Stalled Count

1.3 Gemini Tile MMRs

The Gemini network consists of 48 tiles, arranged in 6 rows of 8 columns. Within each tile there are memory-mapped registers (MMRs) associated with the LCB and with the rest of the tile. The local block has shared connections to each row of tiles. By default, when only the name of the MMR is used, an event is counted on all 48 tiles. To address an individual tile, append the row n (0-5) and the column m (0-7) to the name, as shown in the table.

Table 12. Description of Gemini Tile MMRs

Name  Description
GM_TILE_PERF_VC0_PHIT_CNT:n:m  Number of vc0 phits read from inq buffer.
GM_TILE_PERF_VC1_PHIT_CNT:n:m  Number of vc1 phits read from inq buffer.
GM_TILE_PERF_VC0_PKT_CNT:n:m  Number of vc0 packets read from inq buffer.
GM_TILE_PERF_VC1_PKT_CNT:n:m  Number of vc1 packets read from inq buffer.
GM_TILE_PERF_INQ_STALL:n:m  Number of clock periods a valid reference is blocked from the routing pipeline.
GM_TILE_PERF_CREDIT_STALL:n:m  Number of clock periods a valid reference is stalled in the column buffers, waiting on transmission credits.

© 2010 Cray Inc. All Rights Reserved. This document or parts thereof may not be reproduced in any form unless permitted by contract or by written permission of Cray Inc.

Cray, LibSci, PathScale, and UNICOS are federally registered trademarks and Active Manager, Baker, Cascade, Cray Apprentice2, Cray Apprentice2 Desktop, Cray C++ Compiling System, Cray CX, Cray CX1, Cray CX1-iWS, Cray CX1-LC, Cray CX1000, Cray CX1000-C, Cray CX1000-G, Cray CX1000-S, Cray CX1000-SC, Cray CX1000-SM, Cray CX1000-HN, Cray Fortran Compiler, Cray Linux Environment, Cray SHMEM, Cray X1, Cray X1E, Cray X2, Cray XD1, Cray XE, Cray XE6, Cray XMT, Cray XR1, Cray XT, Cray XTm, Cray XT3, Cray XT4, Cray XT5, Cray XT5h, Cray XT5m, Cray XT6, Cray XT6m, CrayDoc, CrayPort, CRInform, ECOphlex, Gemini, Libsci, NodeKARE, RapidArray, SeaStar, SeaStar2, SeaStar2+, Threadstorm, UNICOS/lc, UNICOS/mk, and UNICOS/mp are trademarks of Cray Inc.

Version 1.0 Published July 2010 Supports CrayPat release 5.1 and CLE release 3.1 running on Cray XT systems.

S–0025–10

Using the Cray XMT™ for all streams Pragmas

Abstract

This document describes the for all streams compiler directives and how to use them to execute a block of code on multiple streams.

© 2010 Cray Inc. All Rights Reserved. This document or parts thereof may not be reproduced in any form unless permitted by contract or by written permission of Cray Inc.
Cray, LibSci, and PathScale are federally registered trademarks and Active Manager, Baker, Cascade, Cray Apprentice2, Cray Apprentice2 Desktop, Cray C++ Compiling System, Cray CX, Cray CX1, Cray CX1-iWS, Cray CX1-LC, Cray CX1000, Cray CX1000-C, Cray CX1000-G, Cray CX1000-S, Cray CX1000-SC, Cray CX1000-SM, Cray CX1000-HN, Cray Fortran Compiler, Cray Linux Environment, Cray SHMEM, Cray X1, Cray X1E, Cray X2, Cray XD1, Cray XE, Cray XE6, Cray XMT, Cray XR1, Cray XT, Cray XTm, Cray XT3, Cray XT4, Cray XT5, Cray XT5h, Cray XT5m, Cray XT6, Cray XT6m, CrayDoc, CrayPort, CRInform, ECOphlex, Gemini, Libsci, NodeKARE, RapidArray, SeaStar, SeaStar2, SeaStar2+, Threadstorm, and UNICOS/lc are trademarks of Cray Inc. UNIX, the “X device,” X Window System, and X/Open are trademarks of The Open Group in the United States and other countries. All other trademarks are the property of their respective owners.

RECORD OF REVISION

S–0038–14 Published October 2010 Supports 1.4 and later releases running on the Cray XMT hardware.

Using the Cray XMT for all streams Pragmas

Overview

In some programming situations it is useful to specify that a block of code should execute exactly once on each stream of a parallel region, allowing the application to manage data on a per-thread basis. Effective with the 1.4 release, two pragma compiler directives were added to support this.

Description

The syntax of the for all streams pragmas is as follows:

#pragma mta for all streams

This directive starts up a parallel region (if the code is not already in a parallel region) and causes the next statement or block of statements to be executed exactly once on every stream allocated to the region. If the pragmas appear in code that would otherwise not be parallel, they cause it to go parallel. For example,

#pragma mta for all streams
printf("Stream checking in\n");

would cause every stream to print the phrase "Stream checking in" once.
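As a minimal sketch, the single-statement form above can be wrapped in a function so that the same source file also builds with compilers that do not recognize the pragma. The function name check_in is hypothetical, and the use of the __MTA__ macro to detect the Cray XMT compiler is an assumption that should be checked against the compiler documentation for your release.

```c
#include <stdio.h>

/* Hypothetical wrapper around the single-statement form of the pragma.
 * Assumption: the Cray XMT compiler predefines __MTA__. When the macro
 * is defined, the printf statement is subject to the for all streams
 * pragma; otherwise the pragma is compiled out and the statement runs
 * once on the calling thread. */
static void check_in(void)
{
#ifdef __MTA__
#pragma mta for all streams
#endif
    printf("Stream checking in\n");
}
```

Guarding the pragma this way keeps builds on other systems free of unknown-pragma warnings. It does not emulate the per-stream behavior there.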
In this example the pragma executes a block of code that increments a counter before printing the phrase:

    int counter = 0;
    #pragma mta for all streams
    {
        counter++;
        printf("%d streams checked in \n", counter);
    }

    #pragma mta for all streams i of n

This directive is similar to the for all streams pragma except that it also sets the variable n to the total number of streams executing the region, and the variable i to a unique per-stream identifier between 0 and n-1. For example:

    int i, n;
    int check_in_array[MAX_PROCESSORS * MAX_STREAMS_PER_PROCESSOR];
    for (int i = 0; i < MAX_PROCESSORS * MAX_STREAMS_PER_PROCESSOR; i++)
        check_in_array[i] = 0;
    #pragma mta for all streams i of n
    {
        check_in_array[i] = 1;
        printf("Stream %d of %d checked in.\n", i, n);
    }

Note that the integer variables i and n must be declared separately from the pragma.

You can use the for all streams pragmas in conjunction with the use n streams pragma to ask the compiler to allocate a certain number of streams per processor to the parallel region executing the for all streams block.

    #pragma mta use 100 streams
    #pragma mta for all streams
    {
        // do something
    }

Be aware, however, that there is no guarantee that the runtime will grant the requested number of streams. For example, sufficient streams may not be available due to other jobs, the OS, or other simultaneous parallel regions in the current job.

Examples

In the following example, taken from a breadth-first search procedure, the for all streams pragma is used to divide a data structure between threads.
    int processQueue(int *Q,unsigned &head, unsigned &tail, unsigned qcap,
                     const Neighbor neighbors[], const int numNeighbors[],
                     sync int *Marked) {
    #pragma mta trace "process"
    #pragma mta noalias *Q, *Marked, *neighbors, *numNeighbors
      // elements [head,tail) are readonly
      // we can write to other elements of Q
      const unsigned oldtail = tail;
      const unsigned oldhead = head;
      unsigned newhead = head;
      unsigned stubbed = 0;
    #pragma mta use 100 streams
    #pragma mta for all streams
      {
        unsigned outhead = 0, outtail = 0;
        for(;;) {
          // grab INBLOCK nodes (& stubs) from the input
          unsigned inhead = int_fetch_add(&newhead, INBLOCK);
          // avoid overrun
          unsigned intail = std::min(inhead + INBLOCK, oldtail);
          if (inhead>=intail) break; // stop if we ran out of work
    #pragma mta assert nodep *Q,*numNeighbors,*neighbors
          for(int i=inhead; i=0) {
            int begin = numNeighbors[u]; // |N|
            int end = numNeighbors[u+1]; // |N|
    #pragma mta assert nodep *Q, *neighbors, *Marked
            for(int j=begin;j=outtail) {
              outhead = int_fetch_add(&tail, OUTBLOCK);
              outtail = outhead+OUTBLOCK;
            }
            Q[(outhead++)%qcap] = v; // |N|
            }else {
              Marked[v] = mark; // unlock & keep mark
            }
          }
        }
      }
    }
    #ifdef PHASES
      stubbed += outtail-outhead;
    #endif
      // stub-out the rest of reserved space
      //

Limiting Loop Parallelism in Cray XMT™ Application
S–0027–14 Cray Inc.

), where is the number of streams the compiler requests.

- Limits the number of processors used by a multiprocessor parallel loop to max(1, c divided by the number of streams the compiler requests for each processor used by the parallel loop).
  - If c is larger than or equal to that per-processor stream count, the total number of streams used by the parallel loop will be at most c.
  - If c is less than that per-processor stream count, one processor will be used and streams will be requested by the compiler.
- Limits the number of futures created for a loop that uses loop future parallelism to c.
- If multiple max concurrency c pragmas are specified on one loop, the value of c specified by the last pragma will be used.
- For collapsible loop nests, the max concurrency value specified by the outer loop (if any) will be used for the collapsed loop.
- The max concurrency c pragma is not allowed to be used on a loop that also uses the use n streams pragma.

Examples

The following example illustrates using the max concurrency c pragma on a single processor parallel loop.

    /* Use at most 95 streams. */
    #pragma mta loop single processor
    #pragma mta max concurrency 95
    for(i = 0; i < size; i++) {
        array[i] += array[i] + (size + i);
    }

The following example illustrates using the max concurrency c pragma on a multiprocessor parallel loop.

    /* Use at most 512 streams across all processors. */
    #pragma mta max concurrency 512
    for(i = 0; i < size; i++) {
        array[i] += array[i] + (size + i);
    }

The following example illustrates using the max concurrency c pragma on a loop that uses loop future parallelism.

    /* Create at most 512 futures. */
    #pragma mta loop future
    #pragma mta max concurrency 512
    for(i = 0; i < size; i++) {
        array[i] += array[i] + (size + i);
    }

Multiprocessor parallel loops are allowed to use both the max n processors and max concurrency c pragmas on a single loop. When both pragmas are used, each implies an estimate of the number of processors, and the lower of the two estimates is the limit used on the loop. For example, the following code illustrates the use of both pragmas on one multiprocessor parallel loop.

    /* Use at most 512 streams across all processors or
     * at most 8 processors, whichever is smaller. */
    #pragma mta max concurrency 512
    #pragma mta max 8 processors
    for(i = 0; i < size; i++) {
        array[i] += array[i] + (size + i);
    }

In the above example, if the compiler were to request 64 streams per processor, then the max concurrency 512 would estimate that 8 processors should be used for the loop (i.e., 512/64).
The max 8 processors has the same limit on the number of processors so the loop would be limited to 8 processors. If the compiler instead requested 32 streams per processor, then the max concurrency 512 would estimate that 16 processors should be used, which is more than the limit of 8 specified by the max 8 processors, so the loop would be limited to 8 processors.

Because the use n streams pragma cannot be used on the same loop as a max concurrency c pragma, the loop will use the default number of streams determined by the compiler. The user will need to look at the canal details for a loop to determine the default number of streams being requested by the compiler.

Effect of Pragmas on Loop Fusion and Parallel Region Merging

The new pragmas can prevent the compiler from fusing loops if the loops involved do not have the same limits for the max processors and max concurrency. This is because the compiler will need to put the loops into different parallel regions in order to limit the processors and/or concurrency as requested by the user. This could potentially have a negative impact on the performance of a user's application, so users may need to look at the canal output to see what loops the compiler fused.

The pragmas could also prevent the compiler from merging the parallel regions for different loops into a single parallel region. The limitation for concurrency or processors specified by the new pragmas applies to the current parallel region that contains the loop with the pragmas. The compiler must ensure that all loops in a parallel region have the same limits for max processors and max concurrency. If the loops do not have matching limits, the compiler will put them in different parallel regions to ensure the user's limits on processors and/or concurrency can be correctly applied.
This could potentially have a negative impact on the performance of a user's application because more time will be spent tearing down and starting new parallel regions.

In the case of nested parallel regions, any limitations for concurrency or processors specified with the pragmas on either region do not affect the other region. For example, if the outer parallel region has a max 8 processors, that pragma will not affect the inner parallel region because the pragmas apply to the current parallel region only.

The user can determine what loops the compiler placed in a parallel region by looking at the canal output. The “Additional Loop Details” shows which parallel region a loop is in, and the details for parallel regions state what limits for processors or concurrency (if any) are being applied to the region.

The following is an example of two loops that have matching limits for max n processors that could be fused and placed into one parallel region by the compiler.

    #pragma mta max 64 processors
    for(i = 0; i < size; i++)
        array[i] = i;

    #pragma mta max 64 processors
    for(i = 0; i < size; i++) {
        array[i] += array[i] + (size + i);
    }

The following is an example of two loops that cannot be fused or put into one parallel region because the loops specify different limits for the max processors.

    #pragma mta max 256 processors
    for(i = 0; i < size; i++)
        array[i] = i;

    #pragma mta max 512 processors
    for(i = 0; i < size; i++) {
        array[i] += array[i] + (size + i);
    }

The following is another example of two loops that cannot be fused or put into one parallel region because the loops specify different limits for the max processors. The first loop does not use the max n processors pragma, which implies there is no user specified limit.

    for(i = 0; i < size; i++)
        array[i] = i;

    #pragma mta max 512 processors
    for(i = 0; i < size; i++) {
        array[i] += array[i] + (size + i);
    }

Use Case: Applying Max Processors Pragma to GraphCT

An example application that uses nested parallelism to improve system utilization and reduce contention on shared data structures is GraphCT (Graph Characterization Toolkit) [1]. GraphCT consists of multiple kernels that perform operations on a graph and the kernel focused on in this example is betweenness centrality. The betweenness centrality kernel of GraphCT is executed concurrently by a small number of threads using loop future parallelism, and each thread uses multiprocessor parallelism to compute the betweenness centrality of a node. The betweenness centrality kernel of GraphCT can see significant variance in performance due to issues with load balancing across the threads. The max n processors pragma can be used to help improve load balancing and increase utilization by evenly distributing the processors across the threads.

The betweenness centrality kernel of GraphCT consists of two functions, kcentrality and kcent_core. The kcentrality function creates a small number of threads using loop future parallelism, and each of those threads calls kcent_core to compute the betweenness centrality for the nodes in the graph. Both of these functions were updated to make use of the new max n processors pragma.

The changes to kcent_core are limited to applying the max n processors pragma to each parallel loop in the function. The limit for the number of processors to use per thread was determined experimentally based on the default number of threads created in kcentrality in the release version 0.4 of GraphCT, which is 20. This would give each thread approximately 6 processors on a 128P XMT system if each thread got the same number of processors. This led to trying a limit of 8 processors per thread in kcent_core.
Experiments showed that using 8 processors per thread performed better than the release version of GraphCT with 20 threads and no max n processors pragmas. A power of two was chosen so the number of processors in the system could be easily divided by the number of processors used per thread. A limit of 16 processors per thread was also tested and was shown to have reasonable performance that could be very similar to the performance with a limit of 8, especially for larger graphs (scale >= 28).

The following code snippets show how the max n processors pragma was used for each loop in kcent_core. In these examples, MAX_PROCS is a preprocessor macro that has been defined as 8.

    <...>
    #pragma mta max MAX_PROCS processors
    #pragma mta assert nodep
    for (j = 0; j < NV; j++) {marks[j] = sigma[NV*(K+1) + j] = 0;}
    <...>
    #pragma mta max MAX_PROCS processors
    #pragma mta assert nodep
    for (j = 0; j < (K+1)*NV; j++) {
        dist[j] = -1;
        sigma[j] = child_count[j] = 0;
    }
    <...>
    #pragma mta max MAX_PROCS processors
    #pragma mta assert no dependence
    #pragma mta block dynamic schedule
    #pragma mta use 100 streams
    for (j = Qstart; j < Qend; j++) {
    <...>
    #pragma mta max MAX_PROCS processors
    #pragma mta assert nodep
    #pragma mta assert no alias *sigma *Q *child *start *QHead
    #pragma mta use 100 streams
    for (n = QHead[p]; n < QHead[p+1]; n++) {
    <...>
    #pragma mta max MAX_PROCS processors
    for (j=0; j<(K+1)*NV; j++) delta[j] = 0.0;
    <...>
    #pragma mta max MAX_PROCS processors
    #pragma mta assert nodep
    #pragma mta block dynamic schedule
    #pragma mta assert no alias *sigma *Q *BC *delta *child *start *QHead
    #pragma mta use 100 streams
    for (n = Qstart; n < Qend; n++) {
    <...>

The pragma was used on all parallel loops in the function to ensure that each thread that calls kcent_core is limited to the desired number of processors, which is 8 in this case.
Also, because all of the parallel loops in kcent_core have the same limit for the max processors, the compiler will not need to put the loops into different parallel regions because of a mismatch in limits. Grouping the loops into one region can help reduce the cost of going parallel and improve performance by avoiding starting and tearing down multiple parallel regions.

The kcentrality function was modified to compute the number of threads at runtime based on the number of processors used by the application and the number of processors used per thread in kcent_core. The number of threads, INC, is a preprocessor macro in version 0.4 of GraphCT. However, the modifications to kcentrality changed INC to a variable that is computed at runtime. The following code snippet shows the changes made to kcentrality. Again, MAX_PROCS used in the example below has been defined as 8.

    <...>
    /* Compute INC based on the number of processors we're using and
       limiting each thread to MAX_PROCS processors (in kcent_core()). */
    int INC;
    INC = mta_get_max_teams();
    INC = INC / MAX_PROCS;
    INC = MTA_INT_MAX(1, INC);
    <...>
    #pragma mta loop future
    for(x=0; x
      for (int claimedk = int_fetch_add (&k, 1); claimedk < Vs; claimedk = int_fetch_add (&k, 1)) {
        <...>
        kcent_core(G, BC, K, s, Q, dist, sigma, marks, QHead, child, child_count);
        <...>
      }
    }
    <...>

These changes to GraphCT helped the betweenness centrality kernel have better load balancing across the threads and achieve higher system utilization, which improved the performance and scalability of the kernel.

References

[1] “GraphCT – Streaming Graph Analysis”, http://trac.research.cc.gatech.edu/graphs/wiki/GraphCT, May 4, 2010.

TotalView New Features
June 2004, version 6.5

Copyright © 1999–2004 by Etnus LLC. All rights reserved.
Copyright © 1996–1998 by Dolphin Interconnect Solutions, Inc.
Copyright © 1993–1996 by BBN Systems and Technologies, a division of BBN Corporation.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise without the prior written permission of Etnus LLC. (Etnus).

Use, duplication, or disclosure by the Government is subject to restrictions as set forth in subparagraph (c)(1)(ii) of the Rights in Technical Data and Computer Software clause at DFARS 252.227-7013.

Etnus has prepared this manual for the exclusive use of its customers, personnel, and licensees. The information in this manual is subject to change without notice, and should not be construed as a commitment by Etnus. Etnus assumes no responsibility for any errors that appear in this document.

TotalView and Etnus are registered trademarks of Etnus LLC.

TotalView uses a modified version of the Microline widget library. Under the terms of its license, you are entitled to use these modifications. The source code is available at http://www.etnus.com/Products/TotalView/developers.

All other brand names are the trademarks of their respective holders.

Contents

New Features
    New Platforms and Compilers
    New and Changed GUI Features
        Tools > Memory Debugging Command Added
        Node Display in the Variable Window
        STL String data types Transformed
        Type Transformations

New Features

This booklet contains information about changes made to TotalView for version 6.5.
The information in this document is to let you know what changes have occurred. You’ll find descriptions for all changes within the TotalView Users Guide.

TotalView has many features and it gives you a great number of tools for finding your program’s problems. An easy way to get acquainted with these features is to subscribe to the “Tip of the Week”. If you subscribe to this mailing list, you’ll receive an email message every week that tells you something about TotalView.

- All of the tips are archived on our web site at http://www.etnus.com/Support/Tips/index.html.
- If you like what you see, you can subscribe at http://www.etnus.com/mojo/mojo.cgi.

New Platforms and Compilers

TotalView now supports the following operating system versions:

- Red Hat Fedora Core 1 on x86 architectures.
- SuSE Linux Professional 9.0 and SuSE Linux Personal on x86 and x86-64 architectures.

TotalView now supports the following compilers:

- gcc 3.4.0 for C and C++ on most platforms.
- gcc 3.4.0 for Fortran 77 on x86, x86-64, and ia64 Linux.
- Intel C and C++ 8.0.066 on x86 and ia64 Linux.
- Intel Fortran 8.0.046 on x86 and ia64 Linux.
- Portland Group C and C++ 5.1 on x86 and x86-64 Linux.

For complete information, see the TotalView Platforms Guide.

New and Changed GUI Features

Tools > Memory Debugging Command Added

This release of TotalView adds to the memory debugging features that previously existed within TotalView. It also consolidates memory debugging interactions within one window.

The TotalView Memory Debugger can help you locate many of your program’s memory problems. For example, you can:

- Stop execution when free(), realloc(), and other heap API problems occur. If your program tries to free memory that it can’t or shouldn’t free, the Memory Debugger can stop execution. This lets you identify the statement that caused the problem.
- List leaks. The Memory Debugger can display your program’s leaks.
  (Leaks are memory blocks that are allocated, but which are no longer referenced.) When your program allocates a memory block, the Memory Debugger creates a backtrace. When it makes a list of your leaks, it includes this backtrace in the list. This lets you see the place where your program allocated the memory block.
- Paint allocated and deallocated blocks. When your program’s memory manager allocates or deallocates memory, the Memory Debugger can write a bit pattern into it. Writing this bit pattern is called painting. When you see this bit pattern in a Variable or Expression List Window, you can tell that you are using memory before your program initializes it or after your program deallocates it. Depending upon the architecture, you might even be able to force an exception when your program accesses this memory.
- Identify dangling pointers. A dangling pointer is a pointer that points into deallocated memory. If the pointer being displayed in a Variable Window is dangling, TotalView adds information to the data element so that you know about the problem.
- Hold onto deallocated memory. When trying to identify memory problems, holding onto memory after your program releases it can sometimes help locate problems. Holding onto freed memory is called hoarding. For example, retaining a block can sometimes force a memory error to occur. Or, when coupled with painting, you’ll be able to tell when your program is trying to access deallocated memory.

After you select the Tools > Memory Debugging command, TotalView displays the following window:

If memory debugging is enabled, you can tell the Memory Debugger to display information whenever execution stops. For example, here is a window showing leak information:

The Backtrace Pane shows the stack frames that existed when your program allocated a memory block. The Source Pane shows the line where it made the allocation.
For more information, see the Debugging Memory Problems Using TotalView document.

Node Display in the Variable Window

The View > Nodes command was removed. This command was only used when viewing UPC variables. You can see the nodes upon which a variable resides by right-clicking on the column headers and selecting Node.

STL String data types Transformed

STLView now transforms String data types.

Type Transformations

The way in which you create type transformations has been simplified. While older methods still work, the new methods are more direct. For information, see the “Creating Type Transformations” chapter of the TotalView Reference Guide.

The Type Transformations Guide has been archived on our web site. It will no longer be updated. However, it may be useful if you are attempting to transform a very difficult data structure or class.

PGI® User’s Guide
Parallel Fortran, C and C++ for Scientists and Engineers

The Portland Group™
STMicroelectronics
Two Centerpointe Drive
Lake Oswego, OR 97035

While every precaution has been taken in the preparation of this document, The Portland Group™, a wholly-owned subsidiary of STMicroelectronics, makes no warranty for the use of its products and assumes no responsibility for any errors that may appear, or for damages resulting from the use of the information contained herein. The Portland Group retains the right to make changes to this information at any time, without notice. The software described in this document is distributed under license from STMicroelectronics and may be used or copied only in accordance with the terms of the license agreement. No part of this document may be reproduced or transmitted in any form or by any means, for any purpose other than the purchaser's personal use without the express written permission of The Portland Group.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks.
Where those designations appear in this manual, The Portland Group was aware of a trademark claim. The designations have been printed in caps or initial caps. Thanks is given to the Parallel Tools Consortium and, in particular, to the High Performance Debugging Forum for their efforts. PGF95, PGF90, PGC++, Cluster Development Kit, CDK, PGI Unified Binary, PGI Visual Fortran, PVF and The Portland Group are trademarks and PGI, PGHPF, PGF77, PGCC, PGPROF, and PGDBG are registered trademarks of STMicroelectronics, Inc. Other brands and names are the property of their respective owners. The use of STLport, a C++ Library, is licensed separately and license, distribution and copyright notice can be found in the online documentation for a given release of the PGI compilers and tools. PGI ® User’s Guide Copyright © 1998 – 2000 The Portland Group, Inc. Copyright © 2000 – 2006 STMicroelectronics, Inc. All rights reserved. Printed in the United States of America First Printing: Release 1.7, Jun 1998 Second Printing: Release 3.0, Jan 1999 Third Printing: Release 3.1, Sep 1999 Fourth Printing: Release 3.2, Sep 2000 Fifth Printing: Release 4.0, May 2002 Sixth Printing: Release 5.0, Jun 2003 Seventh Printing: Release 5.1, Nov 2003 Eight Printing: Release 5.2, Jun 2004 Ninth Printing: Release 6.0, Mar 2005 Tenth Printing: Release 6.1, Dec 2005 Eleventh Printing: Release 6.2, Aug 2006 Twelfth printing: Release 7.0-1, December, 2006 Thirteenth printing: Release 7.1, October, 2007 Technical support: trs@pgroup.com Sales: sales@pgroup.com Web: www.pgroup.com/iii Contents Preface .................................................................................................................................... xix Audience Description ............................................................................................................ xix Compatibility and Conformance to Standards ............................................................................ 
xix Organization ......................................................................................................................... xx Hardware and Software Constraints ........................................................................................ xxii Conventions ........................................................................................................................ xxii Related Publications ........................................................................................................... xxvii 1. Getting Started .................................................................................................................... 1 Overview ................................................................................................................................ 1 Invoking the Command-level PGI Compilers ............................................................................... 1 Command-line Syntax ...................................................................................................... 2 Command-line Options .................................................................................................... 3 Fortran Directives and C/C++ Pragmas .............................................................................. 3 Filename Conventions .............................................................................................................. 3 Input Files ..................................................................................................................... 3 Output Files ................................................................................................................... 5 Fortran, C, and C++ Data Types ............................................................................................... 6 Parallel Programming Using the PGI Compilers ........................................................................... 
7 Running SMP Parallel Programs ...................................................................................... 7 Running Data Parallel HPF Programs ................................................................................. 8 Platform-specific considerations ................................................................................................ 8 Using the PGI Compilers on Linux .................................................................................... 9 Using the PGI Compilers on Windows .............................................................................. 10 Using the PGI Compilers on SUA and SFU ........................................................................ 11 Using the PGI Compilers on Mac OS X ............................................................................. 11 Site-specific Customization of the Compilers .............................................................................. 12 Using siterc Files ........................................................................................................... 12 Using User rc Files ........................................................................................................ 12 Common Development Tasks .................................................................................................. 13 2. Using Command Line Options ....................................................................................... 15PGI® User’s Guide iv Command Line Option Overview ............................................................................................. 15 Command-line Options Syntax ......................................................................................... 15 Command-line Suboptions .............................................................................................. 16 Command-line Conflicting Options ................................................................................... 
16 Help with Command-line Options ............................................................................................ 16 Getting Started with Performance ............................................................................................ 18 Using –fast and –fastsse Options ..................................................................................... 18 Other Performance-related Options ................................................................................. 19 Targeting Multiple Systems; Using the -tp Option ....................................................................... 19 Frequently-used Options ......................................................................................................... 19 3. Using Optimization & Parallelization .......................................................................... 21 Overview of Optimization ....................................................................................................... 21 Local Optimization ........................................................................................................ 22 Global Optimization ....................................................................................................... 22 Loop Optimization: Unrolling, Vectorization, and Parallelization ........................................... 22 Interprocedural Analysis (IPA) and Optimization .............................................................. 22 Function Inlining ........................................................................................................... 22 Profile-Feedback Optimization (PFO) .............................................................................. 22 Getting Started with Optimizations ........................................................................................... 23 Local and Global Optimization using -O .................................................................................. 
24 Scalar SSE Code Generation ............................................................................................ 26 Loop Unrolling using –Munroll ............................................................................................... 27 Vectorization using –Mvect ..................................................................................................... 28 Vectorization Sub-options ............................................................................................... 28 Vectorization Example Using SSE/SSE2 Instructions ............................................................ 30 Auto-Parallelization using -Mconcur ......................................................................................... 32 Auto-parallelization Sub-options ...................................................................................... 33 Loops That Fail to Parallelize ......................................................................................... 34 Processor-Specific Optimization and the Unified Binary .............................................................. 36 Interprocedural Analysis and Optimization using –Mipa .............................................................. 37 Building a Program Without IPA – Single Step ................................................................... 37 Building a Program Without IPA - Several Steps ................................................................. 38 Building a Program Without IPA Using Make .................................................................... 38 Building a Program with IPA .......................................................................................... 38 Building a Program with IPA - Single Step ........................................................................ 39 Building a Program with IPA - Several Steps ..................................................................... 
    Building a Program with IPA Using Make
    Questions about IPA
  Profile-Feedback Optimization using -Mpfi/-Mpfo
  Default Optimization Levels
  Local Optimization Using Directives and Pragmas
  Execution Timing and Instruction Counting
  Portability of Multi-Threaded Programs on Linux
    libpgbind
    libnuma

4. Using Function Inlining
  Invoking Function Inlining
    Using an Inline Library
  Creating an Inline Library
    Working with Inline Libraries
    Updating Inline Libraries - Makefiles
  Error Detection during Inlining
  Examples
  Restrictions on Inlining

5. Using OpenMP
  Fortran Parallelization Directives
  C/C++ Parallelization Pragmas
  Directive and Pragma Recognition
  Directive and Pragma Summary Table
  Directive and Pragma Clauses
  Run-time Library Routines
  Environment Variables
    OMP_DYNAMIC
    OMP_NESTED
    OMP_NUM_THREADS
    OMP_SCHEDULE
    OMP_STACK_SIZE
    OMP_WAIT_POLICY

6. Using Directives and Pragmas
  PGI Proprietary Fortran Directives
  PGI Proprietary C and C++ Pragmas
  PGI Proprietary Optimization Fortran Directive and C/C++ Pragma Summary
  Scope of Fortran Directives and Command-Line Options
  Scope of C/C++ Pragmas and Command-Line Options
  Prefetch Directives
    Format Requirements
    Sample Usage
  !DEC$ Directive
    Format Requirements
    ALIAS Directive
    ATTRIBUTES Directive
    DISTRIBUTE Directive
    ALIAS Directive
  C$PRAGMA C

7. Creating and Using Libraries
  Using builtin Math Functions in C/C++
  Creating and Using Shared Object Files on Linux
  Creating and Using Shared Object Files in SFU and 32-bit SUA
    Shared Object Error Message
    Shared Object-Related Compiler Switches
  PGI Runtime Libraries on Windows
  Creating and Using Static Libraries on Windows
    ar command
    ranlib command
  Creating and Using Dynamic-Link Libraries on Windows
  Using LIB3F
  LAPACK, BLAS and FFTs
  The C++ Standard Template Library

8. Using Environment Variables
  Setting Environment Variables
    Setting Environment Variables on Linux
    Setting Environment Variables on Windows
    Setting Environment Variables on Mac OSX
  PGI-Related Environment Variables
  PGI Environment Variables
    FLEXLM_BATCH
    FORTRAN_OPT
    GMON_OUT_PREFIX
    LD_LIBRARY_PATH
    LM_LICENSE_FILE
    MANPATH
    MPSTKZ
    MP_BIND
    MP_BLIST
    MP_SPIN
    MP_WARN
    NCPUS
    NCPUS_MAX
    NO_STOP_MESSAGE
    PATH
    PGI
    PGI_CONTINUE
    PGI_OBJSUFFIX
    PGI_STACK_USAGE
    PGI_TERM
    PGI_TERM_DEBUG
    PWD
    STATIC_RANDOM_SEED
    TMP
    TMPDIR
  Using Environment Modules
  Stack Traceback and JIT Debugging

9. Distributing Files - Deployment
  Deploying Applications on Linux
    Runtime Library Considerations
    64-bit Linux Considerations
    Linux Redistributable Files
    Restrictions on Linux Portability
    Installing the Linux Portability Package
    Licensing for Redistributable Files
  Deploying Applications on Windows
    PGI Redistributables
    Microsoft Redistributables
  Code Generation and Processor Architecture
    Generating Generic x86 Code
    Generating Code for a Specific Processor
  Generating Code for Multiple Types of Processors in One Executable
    Unified Binary Command-line Switches
    Unified Binary Directives and Pragma

10. Inter-language Calling
  Overview of Calling Conventions
  Inter-language Calling Considerations
  Functions and Subroutines
  Upper and Lower Case Conventions, Underscores
  Compatible Data Types
    Fortran Named Common Blocks
  Argument Passing and Return Values
    Passing by Value (%VAL)
    Character Return Values
    Complex Return Values
  Array Indices
  Examples
    Example - Fortran Calling C
    Example - C Calling Fortran
    Example - C++ Calling C
    Example - C Calling C++
    Example - Fortran Calling C++
    Example - C++ Calling Fortran
  Win32 Calling Conventions
    Win32 Fortran Calling Conventions
    Symbol Name Construction and Calling Example
    Using the Default Calling Convention
    Using the STDCALL Calling Convention
    Using the C Calling Convention
    Using the UNIX Calling Convention

11. Programming Considerations for 64-Bit Environments
  Data Types in the 64-Bit Environment
    C/C++ Data Types
    Fortran Data Types
  Large Static Data in Linux
  Large Dynamically Allocated Data
  64-Bit Array Indexing
  Compiler Options for 64-bit Programming
  Practical Limitations of Large Array Programming
  Example: Medium Memory Model and Large Array in C
  Example: Medium Memory Model and Large Array in Fortran
  Example: Large Array and Small Memory Model in Fortran

12. C/C++ Inline Assembly and Intrinsics
  Inline Assembly
  Extended Inline Assembly
    Output Operands
    Input Operands
    Clobber List
    Additional Constraints
    Operand Aliases
    Assembly String Modifiers
    Extended Asm Macros
  Intrinsics

13. Fortran, C and C++ Data Types
  Fortran Data Types
    Fortran Scalars
    FORTRAN 77 Aggregate Data Type Extensions
    Fortran 90 Aggregate Data Types (Derived Types)
  C and C++ Data Types
    C and C++ Scalars
    C and C++ Aggregate Data Types
    Class and Object Data Layout
    Aggregate Alignment
    Bit-field Alignment
    Other Type Keywords in C and C++

14. C++ Name Mangling
  Types of Mangling
  Mangling Summary
    Type Name Mangling
    Nested Class Name Mangling
    Local Class Name Mangling
    Template Class Name Mangling

15. Command-Line Options Reference
  PGI Compiler Option Summary
    Build-Related PGI Options
    PGI Debug-Related Compiler Options
    PGI Optimization-Related Compiler Options
    PGI Linking and Runtime-Related Compiler Options
  C and C++ Compiler Options
  Generic PGI Compiler Options
  C and C++-specific Compiler Options
  -M Options by Category
    -M Code Generation Controls
    -M C/C++ Language Controls
    -M Environment Controls
    -M Fortran Language Controls
    -M Inlining Controls
    -M Optimization Controls
    -M Miscellaneous Controls

16. OpenMP Reference Information
  Parallelization Directives and Pragmas
  ATOMIC
  BARRIER
  CRITICAL ... END CRITICAL and omp critical
  C$DOACROSS
  DO ... END DO and omp for
  FLUSH and omp flush pragma
  MASTER ... END MASTER and omp master pragma
  ORDERED
  PARALLEL ... END PARALLEL and omp parallel
  PARALLEL DO
  PARALLEL SECTIONS
  PARALLEL WORKSHARE
  SECTIONS ... END SECTIONS
  SINGLE ... END SINGLE
  THREADPRIVATE
  WORKSHARE ... END WORKSHARE
  Directive and Pragma Clauses
    Schedule Clause

17. Directives and Pragmas Reference
  PGI Proprietary Fortran Directive and C/C++ Pragma Summary
  altcode (noaltcode)
  assoc (noassoc)
  bounds (nobounds)
  cncall (nocncall)
  concur (noconcur)
  depchk (nodepchk)
  eqvchk (noeqvchk)
  fcon (nofcon)
  invarif (noinvarif)
  ivdep
  lstval (nolstval)
  opt
  safe (nosafe)
  safe_lastval
  safeptr (nosafeptr)
  single (nosingle)
  tp
  unroll (nounroll)
  vector (novector)
  vintr (novintr)

18. Run-time Environment
  Linux86 and Win32 Programming Model
    Function Calling Sequence
    Function Return Values
    Argument Passing
  Linux86-64 Programming Model
    Function Calling Sequence
    Function Return Values
    Argument Passing
    Linux86-64 Fortran Supplement
  Win64 Programming Model
    Function Calling Sequence
    Function Return Values
    Argument Passing
    Win64/SUA64 Fortran Supplement

19. C++ Dialect Supported
  Extensions Accepted in Normal C++ Mode
  cfront 2.1 Compatibility Mode
  cfront 2.1/3.0 Compatibility Mode

20. C/C++ MMX/SSE Inline Intrinsics
  Using Intrinsic functions
    Required Header File
    Intrinsic Data Types
    Intrinsic Example
304 MMX Intrinsics ................................................................................................................... 305 SSE Intrinsics ...................................................................................................................... 306 ABM Intrinsics .................................................................................................................... 309 21. Fortran Module/Library Interfaces ........................................................................... 311PGI ® User’s Guide xi Data Types ......................................................................................................................... 311 Using DFLIB and DFPORT .................................................................................................... 312 DFLIB ........................................................................................................................ 312 DFPORT ..................................................................................................................... 312 Using the DFWIN module ..................................................................................................... 312 Supported Libraries and Modules .......................................................................................... 313 advapi32 .................................................................................................................... 313 comdlg32 ................................................................................................................... 315 dfwbase ..................................................................................................................... 315 dfwinty ....................................................................................................................... 315 gdi32 ......................................................................................................................... 
316 kernel32 .................................................................................................................... 319 shell32 ....................................................................................................................... 327 user32 ....................................................................................................................... 327 winver ....................................................................................................................... 331 wsock32 .................................................................................................................... 332 22. Messages ........................................................................................................................ 333 Diagnostic Messages ............................................................................................................ 333 Phase Invocation Messages ................................................................................................... 334 Fortran Compiler Error Messages .......................................................................................... 334 Message Format .......................................................................................................... 334 Message List ............................................................................................................... 334 Fortran Runtime Error Messages ........................................................................................... 360 Message Format .......................................................................................................... 360 Message List ............................................................................................................... 360 Index ...................................................................................................................................... 363xiixiii Figures 13.1. 
Internal Padding in a Structure ............................................................................................. 157 13.2. Tail Padding in a Structure ................................................................................................... 158xivxv Tables 1. PGI Compilers and Commands .................................................................................................. xxvi 2. Processor Options ................................................................................................................... xxvi 1.1. Stop-after Options, Inputs and Outputs ........................................................................................ 5 1.2. Examples of Using siterc and User rc Files ................................................................................. 13 2.1. Commonly Used Command Line Options .................................................................................... 20 3.1. Optimization and –O, –g and –M Options ........................................................................ 42 5.1. Directive and Pragma Summary Table ....................................................................................... 53 5.2. Run-time Library Call Summary ................................................................................................ 55 5.3. OpenMP-related Environment Variable Summary Table ................................................................ 59 6.1. Proprietary Optimization-Related Fortran Directive and C/C++ Pragma Summary ............................. 65 8.1. PGI-related Environment Variable Summary Table ....................................................................... 91 8.2. Supported PGI_TERM Values ................................................................................................... 98 10.1. Fortran and C/C++ Data Type Compatibility ............................................................................ 111 10.2. 
Fortran and C/C++ Representation of the COMPLEX Type ......................................................... 112 10.3. Calling Conventions Supported by the PGI Fortran Compilers ..................................................... 120 11.1. 64-bit Compiler Options ....................................................................................................... 127 11.2. Effects of Options on Memory and Array Sizes ......................................................................... 127 11.3. 64-Bit Limitations ................................................................................................................ 128 12.1. Simple Constraints ............................................................................................................... 139 12.2. x86/x86_64 Machine Constraints .......................................................................................... 141 12.3. Multiple Alternative Constraints ............................................................................................. 143 12.4. Constraint Modifier Characters .............................................................................................. 144 12.5. Assembly String Modifier Characters ...................................................................................... 145 12.6. Intrinsic Header File Organization ......................................................................................... 148 13.1. Representation of Fortran Data Types ..................................................................................... 151 13.2. Real Data Type Ranges ........................................................................................................ 152 13.3. Scalar Type Alignment ......................................................................................................... 152 13.4. 
C/C++ Scalar Data Types ..................................................................................................... 154 13.5. Scalar Alignment ................................................................................................................. 155 15.1. PGI Build-Related Compiler Options ...................................................................................... 164 15.2. PGI Debug-Related Compiler Options ..................................................................................... 166 15.3. Optimization-Related PGI Compiler Options ............................................................................ 167 15.4. Linking and Runtime-Related PGI Compiler Options ................................................................. 167PGI® User’s Guide xvi 15.5. C and C++ -specific Compiler Options ................................................................................... 168 15.6. Subgroups for –help Option ................................................................................................. 179 15.7. –M Options Summary .......................................................................................................... 185 15.8. Optimization and –O, –g, –Mvect, and –Mconcur Options ........................................................ 193 16.1. Initialization of REDUCTION Variables .................................................................................... 253 16.2. Directive and Pragma Clauses .............................................................................................. 260 18.1. Register Allocation .............................................................................................................. 271 18.2. Standard Stack Frame .......................................................................................................... 272 18.3. 
Stack Contents for Functions Returning struct/union ................................................................. 274 18.4. Integral and Pointer Arguments ............................................................................................. 275 18.5. Floating-point Arguments ...................................................................................................... 275 18.6. Structure and Union Arguments ............................................................................................ 276 18.7. Register Allocation .............................................................................................................. 278 18.8. Standard Stack Frame .......................................................................................................... 278 18.9. Register Allocation for Example A-2 ....................................................................................... 282 18.10. Linux86-64 Fortran Fundamental Types ................................................................................ 284 18.11. Fortran and C/C++ Data Type Compatibility .......................................................................... 285 18.12. Fortran and C/C++ Representation of the COMPLEX Type ....................................................... 286 18.13. Register Allocation ............................................................................................................. 288 18.14. Standard Stack Frame ........................................................................................................ 288 18.15. Register Allocation for Example A-4 ..................................................................................... 292 18.16. Win64 Fortran Fundamental Types ....................................................................................... 293 18.17. Fortran and C/C++ Data Type Compatibility .......................................................................... 295 18.18. 
Fortran and C/C++ Representation of the COMPLEX Type ....................................................... 296 20.1. MMX Intrinsics (mmintrin.h) ................................................................................................ 305 20.2. SSE Intrinsics (xmmintrin.h) ................................................................................................ 306 20.3. SSE2 Intrinsics (emmintrin.h) ............................................................................................. 307 20.4. SSE3 Intrinsics (pmmintrin.h) .............................................................................................. 309 20.5. SSSE3 Intrinsics (tmmintrin.h) .............................................................................................. 309 20.6. SSE4a Intrinsics (ammintrin.h) ............................................................................................. 309 20.7. SSE4a Intrinsics (intrin.h) .................................................................................................... 310 21.1. Fortran Data Type Mappings ................................................................................................. 311xvii Examples 1.1. Hello program ......................................................................................................................... 2 2.1. Makefiles with Options ............................................................................................................ 16 3.1. Dot Product Code ................................................................................................................... 27 3.2. Unrolled Dot Product Code ...................................................................................................... 27 3.3. Vector operation using SSE instructions ..................................................................................... 31 3.4. 
Using SYSTEM_CLOCK code fragment ........................................................................................ 43 4.1. Sample Makefile ..................................................................................................................... 48 6.1. Prefetch Directive Use ............................................................................................................. 70 7.1. Build a DLL: Fortran ............................................................................................................... 82 7.2. Build a DLL: C ....................................................................................................................... 83 7.3. Build DLLs Containing Circular Mutual Imports: C ....................................................................... 84 7.4. Build DLLs Containing Circular Mutual Imports: Fortran ............................................................... 86 7.5. Import a Fortran module from a DLL ........................................................................................ 87 10.1. Character Return Parameters ................................................................................................ 114 10.2. COMPLEX Return Values ...................................................................................................... 114 10.3. Fortran Main Program fmain.f .............................................................................................. 115 10.4. C function cfunc_ ............................................................................................................... 115 10.5. Fortran Subroutine forts.f ..................................................................................................... 116 10.6. C Main Program cmain.c ..................................................................................................... 116 10.7. 
Simple C Function cfunc.c .................................................................................................... 116 10.8. C++ Main Program cpmain.C Calling a C Function .................................................................. 117 10.9. Simple C++ Function cpfunc.C with Extern C .......................................................................... 117 10.10. C Main Program cmain.c Calling a C++ Function .................................................................. 117 10.11. Fortran Main Program fmain.f calling a C++ function ............................................................ 118 10.12. C++ function cpfunc.C ...................................................................................................... 118 10.13. Fortran Subroutine forts.f ................................................................................................... 119 10.14. C++ main program cpmain.C ............................................................................................. 119 18.1. C Program Calling an Assembly-language Routine .................................................................... 277 18.2. Parameter Passing ............................................................................................................... 282 18.3. C Program Calling an Assembly-language Routine .................................................................... 283 18.4. Parameter Passing ............................................................................................................... 291 18.5. C Program Calling an Assembly-language Routine .................................................................... 293xviiixix Preface This guide is part of a set of manuals that describe how to use The Portland Group (PGI) Fortran, C, and C++ compilers and program development tools. These compilers and tools include the PGF77, PGF95, PGHPF, PGC++, and PGCC ANSI C compilers, the PGPROF profiler, and the PGDBG debugger. 
They work in conjunction with an x86 or x64 assembler and linker. You can use the PGI compilers and tools to compile, debug, optimize, and profile serial and parallel applications for x86 (Intel Pentium II/III/4/M, Intel Centrino, Intel Xeon, AMD Athlon XP/MP) or x64 (AMD Athlon64/Opteron/Turion, Intel EM64T, Intel Core Duo, Intel Core 2 Duo) processor-based systems.

The PGI User's Guide provides operating instructions for the PGI command-level development environment. It also contains details concerning the PGI compilers' interpretation of the Fortran language, implementation of Fortran language extensions, and command-level compilation. Users are expected to have previous experience with or knowledge of the Fortran programming language.

Audience Description

This manual is intended for scientists and engineers using the PGI compilers. To use these compilers, you should be aware of the role of high-level languages, such as Fortran, C, and C++, as well as assembly language, in the software development process, and you should have some level of understanding of programming. The PGI compilers are available on a variety of x86 or x64 hardware platforms and operating systems. You need to be familiar with the basic commands available on your system.

Compatibility and Conformance to Standards

Your system needs to be running a properly installed and configured version of the compilers. For information on installing PGI compilers and tools, refer to the Release and Installation notes included with your software.

For further information, refer to the following:

• American National Standard Programming Language FORTRAN, ANSI X3. -1978 (1978).
• ISO/IEC 1539-1 : 1991, Information technology – Programming Languages – Fortran, Geneva, 1991 (Fortran 90).
• ISO/IEC 1539-1 : 1997, Information technology – Programming Languages – Fortran, Geneva, 1997 (Fortran 95).
• Fortran 95 Handbook Complete ISO/ANSI Reference, Adams et al, The MIT Press, Cambridge, Mass, 1997.
• High Performance Fortran Language Specification, Revision 1.0, Rice University, Houston, Texas (1993), http://www.crpc.rice.edu/HPFF.
• High Performance Fortran Language Specification, Revision 2.0, Rice University, Houston, Texas (1997), http://www.crpc.rice.edu/HPFF.
• OpenMP Application Program Interface, Version 2.5, May 2005, http://www.openmp.org.
• Programming in VAX Fortran, Version 4.0, Digital Equipment Corporation (September, 1984).
• IBM VS Fortran, IBM Corporation, Rev. GC26-4119.
• Military Standard, Fortran, DOD Supplement to American National Standard Programming Language Fortran, ANSI x.3-1978, MIL-STD-1753 (November 9, 1978).
• American National Standard Programming Language C, ANSI X3.159-1989.
• ISO/IEC 9899:1999, Information technology – Programming Languages – C, Geneva, 1999 (C99).

Organization

Users typically begin by wanting to know how to use a product, and then find that they need more information about specific areas of it. Knowing how, as well as why, you might use certain options or perform certain tasks is key to using the PGI compilers and tools effectively and efficiently. Once you have this understanding, you are likely to want more detail on specific topics. Consequently, this manual is divided into the following two parts:

• Part I, Compiler Usage, contains the essential information on how to use the compiler.
• Part II, Reference Information, contains more detailed reference information about specific aspects of the compiler, such as the details of compiler options, directives, and more.

Part I, Compiler Usage, contains these chapters:

Chapter 1, “Getting Started” provides an introduction to the PGI compilers and describes their use and overall features.

Chapter 2, “Using Command Line Options” provides an overview of the command-line options as well as task-related lists of options.

Chapter 3, “Using Optimization & Parallelization” describes standard optimization techniques that, with little effort, allow users to significantly improve the performance of programs.

Chapter 4, “Using Function Inlining” describes how to use function inlining and shows how to create an inline library.

Chapter 5, “Using OpenMP” provides a description of the OpenMP Fortran parallelization directives and of the OpenMP C and C++ parallelization pragmas and shows examples of their use.

Chapter 6, “Using Directives and Pragmas” provides a description of each Fortran optimization directive and C/C++ optimization pragma, and shows examples of their use.

Chapter 7, “Creating and Using Libraries” discusses PGI support libraries, shared object files, and environment variables that affect the behavior of the PGI compilers.

Chapter 8, “Using Environment Variables” describes the environment variables that affect the behavior of the PGI compilers.

Chapter 9, “Distributing Files - Deployment” describes the deployment of your files once you have built, debugged and compiled them successfully.

Chapter 10, “Inter-language Calling” provides examples showing how to place C Language calls in a Fortran program and Fortran Language calls in a C program.

Chapter 11, “Programming Considerations for 64-Bit Environments” discusses issues of which programmers should be aware when targeting 64-bit processors.

Chapter 12, “C/C++ Inline Assembly and Intrinsics” describes how to use inline assembly code in C and C++ programs, as well as how to use intrinsic functions that map directly to x86 and x64 machine instructions.

Part II, Reference Information, contains these chapters:

Chapter 13, “Fortran, C and C++ Data Types” describes the data types that are supported by the PGI Fortran, C, and C++ compilers.

Chapter 14, “C++ Name Mangling” describes the name mangling facility and explains the transformations of names of entities to names that include information on aspects of the entity’s type and a fully qualified name.

Chapter 15, “Command-Line Options Reference” provides a detailed description of each command-line option.

Chapter 16, “OpenMP Reference Information” contains detailed descriptions of each of the OpenMP directives and pragmas that PGI supports.

Chapter 17, “Directives and Pragmas Reference” contains detailed descriptions of PGI’s proprietary directives and pragmas.

Chapter 18, “Run-time Environment” describes the assembly language calling conventions and examples of assembly language calls.

Chapter 19, “C++ Dialect Supported” lists more details of the version of the C++ language that PGC++ supports.

Chapter 20, “C/C++ MMX/SSE Inline Intrinsics” provides tables that list the MMX Inline Intrinsics (mmintrin.h), the SSE1 inline intrinsics (xmmintrin.h), and SSE2 inline intrinsics (emmintrin.h).

Chapter 21, “Fortran Module/Library Interfaces” provides a description of the Fortran module library interfaces that PVF supports, describing each property available.

Chapter 22, “Messages” provides a list of compiler error messages.

Hardware and Software Constraints

This guide describes versions of the PGI compilers that produce assembly code for x86 and x64 processor-based systems. Details concerning environment-specific values and defaults and system-specific features or limitations are presented in the release notes delivered with the PGI compilers.

Conventions

The PGI User's Guide uses the following conventions:

italic
    Italic font is for commands, filenames, directories, arguments, options and for emphasis.

Constant Width
    Constant width font is for examples and for language statements in the text, including assembly language statements.

[ item1 ]
    Square brackets indicate optional items. In this case item1 is optional.

{ item2 | item3 }
    Braces indicate that a selection is required. In this case, you must select either item2 or item3.

filename...
    An ellipsis indicates repetition. Zero or more of the preceding item may occur. In this example, multiple filenames are allowed.

FORTRAN
    Fortran language statements are shown in the text of this guide using upper-case characters and a reduced point size.

The PGI compilers and tools are supported on both 32-bit and 64-bit variants of Linux, Windows, and Mac OS operating systems on a variety of x86-compatible processors. There is a wide variety of releases and distributions of each of these types of operating systems. The PGI User’s Guide defines the following terms with respect to these platforms:

AMD64
    a 64-bit processor from AMD, designed to be binary compatible with IA32 processors, and incorporating new features such as additional registers and 64-bit addressing support for improved performance and greatly increased memory range.

Barcelona
    the Quad-Core AMD Opteron(TM) Processor, that is, Opteron Rev x10

DLL
    a dynamic linked library on Win32 or Win64 platforms of the form xxx.dll containing objects that are dynamically linked into a program at the time of execution.

driver
    the compiler driver controls the compiler, linker, and assembler, and adds objects and libraries to create an executable. The -dryrun option illustrates operation of the driver. pgf77, pgf95, pghpf, pgcc, pgCC (Linux), and pgcpp are drivers for the PGI compilers. A pgf90 driver is retained for compatibility with existing makefiles, even though pgf90 and pgf95 drivers are identical.

Dual-core
    Dual-, Quad-, or Multi-core: some x64 CPUs incorporate two or four complete processor cores (functional units, registers, level 1 cache, level 2 cache, etc.) on a single silicon die. These are referred to as Dual-core or Quad-core (in general, Multi-core) processors. For purposes of OpenMP or auto-parallel threads, or MPI process parallelism, these cores function as distinct processors. However, the processing cores are on a single chip occupying a single socket on the system motherboard. In PGI 7.1, there are no longer software licensing limits on OpenMP threads for Multi-core.

EM64T
    a 64-bit IA32 processor with Extended Memory 64-bit Technology extensions that are binary compatible with AMD64 processors. This includes Intel Pentium 4, Intel Xeon, and Intel Core 2 processors.

hyperthreading (HT)
    some IA32 CPUs incorporate extra registers that allow 2 threads to run on a single CPU with improved performance for some tasks. This is called hyperthreading and abbreviated HT. Some linux86 and linux86-64 environments treat IA32 CPUs with HT as though there were a second pseudo CPU, even though there is only one physical CPU. Unless the Linux kernel is hyperthread-aware, the second thread of an OpenMP program will be assigned to the pseudo CPU, rather than a real second physical processor (if one exists in the system). OpenMP programs can run very slowly if the second thread is not properly assigned.

IA32
    an Intel Architecture 32-bit processor, designed to be binary compatible with x86 processors, and incorporating new features such as streaming SIMD extensions (SSE) for improved performance.

Large Arrays
    arrays with aggregate size larger than 2GB, which require 64-bit index arithmetic for accesses to elements of arrays. If -Mlarge_arrays is specified and -mcmodel=medium is not specified, the default small memory model is used, and all index arithmetic is performed in 64 bits.
    This can be a useful mode of execution for certain existing 64-bit applications that use the small memory model but allocate and manage a single contiguous data space larger than 2GB.

linux86
    32-bit Linux operating system running on an x86 or x64 processor-based system, with 32-bit GNU tools, utilities and libraries used by the PGI compilers to assemble and link for 32-bit execution.

linux86-64
    64-bit Linux operating system running on an x64 processor-based system, with 64-bit and 32-bit GNU tools, utilities and libraries used by the PGI compilers to assemble and link for execution in either linux86 or linux86-64 environments. The 32-bit development tools and execution environment under linux86-64 are considered a cross development environment for x86 processor-based applications.

Mac OS X
    collectively, all osx86 and osx86-64 platforms supported by the PGI compilers.

-mcmodel=small
    compiler/linker switch to produce small memory model format objects/executables in which both code (.text) and data (.bss) sections are limited to less than 2GB. This switch is the default and only possible format for linux86 32-bit executables. This switch is the default format for linux86-64 executables. Maximum address offset range is 32 bits, and total memory used for OS+Code+Data must be less than 2GB.

-mcmodel=medium
    compiler/linker switch to produce medium memory model format objects/executables in which code sections are limited to less than 2GB, but data sections can be greater than 2GB. This option is supported only in linux86-64 environments. It must be used to compile any program unit that will be linked in to a 64-bit executable that will use aggregate data sets larger than 2GB and will access data requiring address offsets greater than 2GB. This option must be used to link any 64-bit executable that will use aggregate data sets greater than 2GB in size.
    Executables linked using -mcmodel=medium can incorporate objects compiled using -mcmodel=small as long as the small objects are from a shared library.

NUMA
    A type of multi-processor system architecture in which the memory latency from a given processor to a given portion of memory can vary, resulting in the possibility for compiler or programming optimizations to ensure frequently accessed data is "close" to a given processor as determined by memory latency.

osx86
    32-bit Apple Mac OS Operating Systems running on an x86 Core 2 or Core 2 Duo processor-based system with the 32-bit Apple and GNU tools, utilities, and libraries used by the PGI compilers to assemble and link for 32-bit execution. The PGI Workstation preview supports Mac OS 10.4.9 only.

osx86-64
    64-bit Apple Mac OS Operating Systems running on an x64 Core 2 Duo processor-based system with the 64-bit and 32-bit Apple and GNU tools, utilities, and libraries used by the PGI compilers to assemble and link for either 64- or 32-bit execution. The PGI Workstation preview supports Mac OS 10.4.9 only.

SFU
    Windows Services for Unix, a 32-bit-only predecessor of SUA, the Subsystem for Unix Applications. See SUA.

Shared library
    a Linux library of the form libxxx.so containing objects that are dynamically linked into a program at the time of execution.

SSE
    collectively, all SSE extensions supported by the PGI compilers.

SSE1
    32-bit IEEE 754 FPU and associated streaming SIMD extensions (SSE) instructions on Pentium III, AthlonXP* and later 32-bit x86, AMD64 and EM64T compatible CPUs, enabling scalar and packed vector arithmetic on single-precision floating-point data.

SSE2
    64-bit IEEE 754 FPU and associated SSE instructions on P4/Xeon and later 32-bit x86, AMD64 and EM64T compatible CPUs. SSE2 enables scalar and packed vector arithmetic on double-precision floating-point data.
SSE3
Additional 32-bit and 64-bit SSE instructions to enable more efficient support of arithmetic on complex floating-point data on 32-bit x86, AMD64 and EM64T compatible CPUs with so-called Prescott New Instructions (PNI), such as Intel IA32 processors with EM64T extensions and newer generation (Revision E and beyond) AMD64 processors.

SSE4A and ABM
AMD Instruction Set enhancements for the Quad-Core AMD Opteron Processor. Support for these instructions is enabled by the -tp barcelona or -tp barcelona-64 switch.

SSSE3
An extension of the SSE3 instruction set found on the Intel Core 2.

Static linking
A method of linking. On Linux, use - to ensure all objects are included in a generated executable at link time. Static linking causes objects from static library archives of the form libxxx.a to be linked in to your executable, rather than dynamically linking the corresponding libxxx.so shared library. Static linking of executables linked using the -mcmodel=medium option is supported. On Windows, the Windows linker links statically or dynamically depending on whether the libraries on the link line are DLL import libraries or static libraries. By default, the static PGI libraries are included on the link line. To link with DLL versions of the PGI libraries instead of static libraries, use the -Mdll option.

SUA
Subsystem for UNIX-based Applications (SUA) is a source-compatibility subsystem for compiling and running custom UNIX-based applications on a computer running a 32-bit or 64-bit Windows server-class operating system. It provides an operating system for Portable Operating System Interface (POSIX) processes. SUA supports a package of support utilities (including shells and more than 300 Unix commands), case-sensitive file names, and job control. The subsystem installs separately from the Windows kernel to support UNIX functionality without any emulation.
Win32
Any of the 32-bit Microsoft Windows Operating Systems (XP/2000/Server 2003) running on an x86 or x64 processor-based system. On these targets, the PGI compiler products include all of the tools and libraries needed to build executables for 32-bit Windows systems.

Win64
Any of the 64-bit Microsoft Windows Operating Systems (XP Professional/Windows Server 2003 x64 Editions) running on an x64 processor-based system. On these targets, the PGI compiler products include all of the tools and libraries needed to build executables for 32-bit Windows systems.

Windows
Collectively, all Win32 and Win64 platforms supported by the PGI compilers.

x64
Collectively, all AMD64 and EM64T processors supported by the PGI compilers.

x86
A processor designed to be binary compatible with i386/i486 and previous generation processors from Intel* Corporation. Refers collectively to such processors up to and including 32-bit variants.

x87
80-bit IEEE stack-based floating-point unit (FPU) and associated instructions on x86-compatible CPUs.

The following table lists the PGI compilers and tools and their corresponding commands:

Table 1. PGI Compilers and Commands

Compiler or Tool   Language or Function             Command
PGF77              FORTRAN 77                       pgf77
PGF95              Fortran 90/95                    pgf95
PGHPF              High Performance Fortran         pghpf
PGCC C             ANSI C99 and K&R C               pgcc
PGC++              ANSI C++ with cfront features    pgcpp (pgCC)
PGDBG              Source code debugger             pgdbg
PGPROF             Performance profiler             pgprof

In general, the designation PGF95 is used to refer to The Portland Group’s Fortran 90/95 compiler, and pgf95 is used to refer to the command that invokes the compiler. A similar convention is used for each of the PGI compilers and tools.

For simplicity, examples of command-line invocation of the compilers generally reference the pgf95 command, and most source code examples are written in Fortran. Usage of the PGF77 compiler, whose features are a subset of PGF95, is similar.
Usage of PGHPF, PGC++, and PGCC ANSI C99 is consistent with PGF95 and PGF77, but there are command-line options and features of these compilers that do not apply to PGF95 and PGF77 and vice versa.

There are a wide variety of x86-compatible processors in use. All are supported by the PGI compilers and tools. Most of these processors are forward-compatible, but not backward-compatible, meaning that code compiled to target a given processor will not necessarily execute correctly on a previous-generation processor.

The following table provides a partial list, including the most important processor types, along with the features utilized by the PGI compilers that distinguish them from a compatibility standpoint:

Table 2. Processor Options

Processor                Prefetch  SSE1  SSE2  SSE3  32-bit  64-bit  Scalar FP Default
AMD Athlon               N         N     N     N     Y       N       x87
AMD Athlon XP/MP         Y         Y     N     N     Y       N       x87
AMD Athlon64             Y         Y     Y     N     Y       Y       SSE
AMD Opteron              Y         Y     Y     N     Y       Y       SSE
AMD Opteron Rev E        Y         Y     Y     Y     Y       Y       SSE
AMD Opteron Rev F        Y         Y     Y     Y     Y       Y       SSE
AMD Turion               Y         Y     Y     Y     Y       Y       SSE
Intel Celeron            N         N     N     N     Y       N       x87
Intel Pentium II         N         N     N     N     Y       N       x87
Intel Pentium III        Y         Y     N     N     Y       N       x87
Intel Pentium 4          Y         Y     Y     N     Y       N       SSE
Intel Pentium M          Y         Y     Y     N     Y       N       SSE
Intel Centrino           Y         Y     Y     N     Y       N       SSE
Intel Pentium 4 EM64T    Y         Y     Y     Y     Y       Y       SSE
Intel Xeon EM64T         Y         Y     Y     Y     Y       Y       SSE
Intel Core Duo EM64T     Y         Y     Y     Y     Y       Y       SSE
Intel Core 2 Duo EM64T   Y         Y     Y     Y     Y       Y       SSE

In this manual, the convention is to use “x86” to specify the group of processors in the previous table that are listed as “32-bit” but not “64-bit.” The convention is to use “x64” to specify the group of processors that are listed as both “32-bit” and “64-bit.” x86 processor-based systems can run only 32-bit operating systems. x64 processor-based systems can run either 32-bit or 64-bit operating systems, and can execute all 32-bit x86 binaries in either case.
x64 processors have additional registers and 64-bit addressing capabilities that are utilized by the PGI compilers and tools when running on a 64-bit operating system. The prefetch, SSE1, SSE2 and SSE3 processor features further distinguish the various processors. Where such distinctions are important with respect to a given compiler option or feature, it is explicitly noted in this manual.

Note that the default for performing scalar floating-point arithmetic is to use SSE instructions on targets that support SSE1 and SSE2. See section 2.3.1, Scalar SSE Code Generation, for a detailed discussion of this topic.

Related Publications

The following documents contain additional information related to the x86 and x64 architectures, and the compilers and tools available from The Portland Group.

• PGI Fortran Reference manual describes the FORTRAN 77, Fortran 90/95, and HPF statements, data types, input/output format specifiers, and additional reference material related to use of the PGI Fortran compilers.
• System V Application Binary Interface Processor Supplement by AT&T UNIX System Laboratories, Inc. (Prentice Hall, Inc.).
• System V Application Binary Interface X86-64 Architecture Processor Supplement, http://www.x86-64.org/abi.pdf.
• Fortran 95 Handbook Complete ISO/ANSI Reference, Adams et al, The MIT Press, Cambridge, Mass, 1997.
• Programming in VAX Fortran, Version 4.0, Digital Equipment Corporation (September, 1984).
• IBM VS Fortran, IBM Corporation, Rev. GC26-4119.
• The C Programming Language by Kernighan and Ritchie (Prentice Hall).
• C: A Reference Manual by Samuel P. Harbison and Guy L. Steele Jr. (Prentice Hall, 1987).
• The Annotated C++ Reference Manual by Margaret Ellis and Bjarne Stroustrup, AT&T Bell Laboratories, Inc. (Addison-Wesley Publishing Co., 1990).
• OpenMP Application Program Interface, Version 2.5 May 2005 (OpenMP Architecture Review Board, 1997-2005).

Chapter 1.
Getting Started

This chapter describes how to use the PGI compilers. The command used to invoke a compiler, such as the pgf95 command, is called a compiler driver. The compiler driver controls the following phases of compilation: preprocessing, compiling, assembling, and linking. Once a file is compiled and an executable file is produced, you can execute, debug, or profile the program on your system. Executables produced by the PGI compilers are unconstrained, meaning they can be executed on any compatible x86 or x64 processor-based system, regardless of whether the PGI compilers are installed on that system.

Overview

In general, using a PGI compiler involves three steps:

1. Produce program source code in a file with a .f extension or another appropriate extension, as described in “Input Files.” This program may be one that you have written or one that you are modifying.
2. Compile the program using the appropriate compiler command.
3. Execute, debug, or profile the executable file on your system.

You might also want to deploy your application, though this is not a required step.

The PGI compilers allow many variations on these general program development steps. These variations include the following:

• Stop the compilation after preprocessing, compiling or assembling to save and examine intermediate results.
• Provide options to the driver that control compiler optimization or that specify various features or limitations.
• Include as input intermediate files such as preprocessor output, compiler output, or assembler output.

Invoking the Command-level PGI Compilers

To translate and link a Fortran, C, or C++ program, the pgf77, pgf95, pghpf, pgcc, and pgcpp commands do the following:

1. Preprocess the source text file.
2. Check the syntax of the source text.
3. Generate an assembly language file.
4. Pass control to the subsequent assembly and linking steps.

Example 1.1.
Hello program

Let’s look at a simple example of using the PGI compiler to create, compile, and execute a program that prints hello.

Step 1: Create your program.

For this example, suppose you enter the following simple Fortran program in the file hello.f:

      print *, "hello"
      end

Step 2: Compile the program.

When you created your program, you called it hello.f. In this example, we compile it from a shell command prompt using the default pgf95 driver option. Use the following syntax:

    PGI$ pgf95 hello.f
    PGI$

By default, the executable output is placed in the file a.out, or, on Windows platforms, in a filename based on the name of the first source or object file on the command line. However, you can use the –o option to specify an output file name. To place the executable output in the file hello, use this command:

    PGI$ pgf95 -o hello hello.f
    PGI$

Step 3: Execute the program.

To execute the resulting hello program, simply type the filename at the command prompt and press the Return or Enter key on your keyboard:

    PGI$ hello
    hello
    PGI$

Command-line Syntax

The compiler command-line syntax, using pgf95 as an example, is:

    pgf95 [options] [path]filename [...]

Where:

options
is one or more command-line options, all of which are described in detail in Chapter 2, “Using Command Line Options”.

path
is the pathname to the directory containing the file named by filename. If you do not specify the path for a filename, the compiler uses the current directory. You must specify the path separately for each filename not in the current directory.

filename
is the name of a source file, preprocessed source file, assembly-language file, object file, or library to be processed by the compilation system. You can specify more than one [path]filename.

Command-line Options

The command-line options control various aspects of the compilation process.
For a complete alphabetical listing and a description of all the command-line options, refer to Chapter 2, “Using Command Line Options”.

The following list provides important information about proper use of command-line options.

• Case is significant for options and their arguments.
• The compiler drivers recognize characters preceded by a hyphen (-) as command-line options. For example, the –Mlist option specifies that the compiler creates a listing file.
  Note: The convention for the text of this manual is to show command-line options using a dash instead of a hyphen; for example, you see –Mlist.
• The pgcpp command recognizes a group of characters preceded by a plus sign (+) as command-line options.
• The order of options and the filename is not fixed. That is, you can place options before and after the filename argument on the command line. However, the placement of some options is significant, such as the –l option, in which the order of the filenames determines the search order.
  Note: If two or more options contradict each other, the last one in the command line takes precedence.

Fortran Directives and C/C++ Pragmas

You can insert Fortran directives and C/C++ pragmas in program source code to alter the effects of certain command-line options and to control various aspects of the compilation process for a specific routine or a specific program loop. For more information on Fortran directives and C/C++ pragmas, refer to Chapter 5, “Using OpenMP” and Chapter 6, “Using Directives and Pragmas”.

Filename Conventions

The PGI compilers use the filenames that you specify on the command line to find and to create input and output files. This section describes the input and output filename conventions for the phases of the compilation process.

Input Files

You can specify assembly-language files, preprocessed source files, Fortran/C/C++ source files, object files, and libraries as inputs on the command line.
The compiler driver determines the type of each input file by examining the filename extensions. The drivers use the following conventions:

filename.f indicates a Fortran source file.
filename.F indicates a Fortran source file that can contain macros and preprocessor directives (to be preprocessed).
filename.FOR indicates a Fortran source file that can contain macros and preprocessor directives (to be preprocessed).
filename.F95 indicates a Fortran 90/95 source file that can contain macros and preprocessor directives (to be preprocessed).
filename.f90 indicates a Fortran 90/95 source file that is in freeform format.
filename.f95 indicates a Fortran 90/95 source file that is in freeform format.
filename.hpf indicates an HPF source file.
filename.c indicates a C source file that can contain macros and preprocessor directives (to be preprocessed).
filename.i indicates a preprocessed C or C++ source file.
filename.C indicates a C++ source file that can contain macros and preprocessor directives (to be preprocessed).
filename.cc indicates a C++ source file that can contain macros and preprocessor directives (to be preprocessed).
filename.s indicates an assembly-language file.
filename.o (Linux, Apple, SFU, SUA) indicates an object file.
filename.obj (Windows systems only) indicates an object file.
filename.a (Linux, Apple, SFU, SUA) indicates a library of object files.
filename.lib (Windows systems only) indicates a statically-linked library of object files.
filename.so (Linux and SFU systems only) indicates a library of shared object files.
filename.dll (Windows systems only) indicates a dynamically-linked library.
filename.dylib (Apple systems only) indicates a dynamically-linked library.

The driver passes files with .s extensions to the assembler and files with .o, .obj, .so, .dll, .a and .lib extensions to the linker. Input files with unrecognized extensions, or no extension, are also passed to the linker.
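The extension conventions above can be sketched as a small shell function. This is only an illustration of the routing rules as stated in the text, not the drivers' actual logic; the function name first_phase is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not part of the PGI drivers): print the first phase
# that would handle an input file, judged only by its filename extension.
first_phase() {
    case "$1" in
        *.F|*.FOR|*.F95)            echo preprocess ;;  # Fortran with macros/directives
        *.c|*.C|*.cc)               echo preprocess ;;  # C and C++ source
        *.f|*.f90|*.f95|*.hpf|*.i)  echo compile ;;     # no preprocessing needed
        *.s)                        echo assemble ;;    # assembly-language file
        *)                          echo link ;;        # objects, libraries, unknown or no extension
    esac
}

for f in hello.f proto1.F util.s main.o README; do
    printf '%s -> %s\n' "$f" "$(first_phase "$f")"
done
```

Because shell case patterns are case-sensitive, hello.f and hello.F are routed differently, which mirrors the distinction the drivers draw between the two extensions.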
Files with a .F (capital F) or .FOR suffix are first preprocessed by the Fortran compilers and the output is passed to the compilation phase. The Fortran preprocessor functions similarly to cpp for C/C++ programs, but is built in to the Fortran compilers rather than implemented through an invocation of cpp. This design ensures consistency in the preprocessing step regardless of the type or revision of operating system under which you’re compiling.

Any input files not needed for a particular phase of processing are not processed. For example, if on the command line you specify an assembly-language file (filename.s) and the –S option to stop before the assembly phase, the compiler takes no action on the assembly language file. Processing stops after compilation and the assembler does not run. In this scenario, the compilation must have been completed in a previous pass which created the .s file. For a complete description of the –S option, refer to the following section, “Output Files”.

In addition to specifying primary input files on the command line, code within other files can be compiled as part of include files using the INCLUDE statement in a Fortran source file or the preprocessor #include directive in Fortran source files that use a .F extension or C and C++ source files.

When linking a program with a library, the linker extracts only those library components that the program needs. The compiler drivers link in several libraries by default. For more information about libraries, refer to Chapter 7, “Creating and Using Libraries”.

Output Files

By default, an executable output file produced by one of the PGI compilers is placed in the file a.out, or, on Windows, in a filename based on the name of the first source or object file on the command line. As the example in the preceding section shows, you can use the –o option to specify the output file name.
If you use one of the options –F (Fortran only), –P (C/C++ only), –S or –c, the compiler produces a file containing the output of the last completed phase for each input file, as specified by the option supplied. The output file will be a preprocessed source file, an assembly-language file, or an unlinked object file respectively. Similarly, the –E option does not produce a file, but displays the preprocessed source file on the standard output. Using any of these options, the –o option is valid only if you specify a single input file. If no errors occur during processing, you can use the files created by these options as input to a future invocation of any of the PGI compiler drivers.

The following table lists the stop-after options and the output files that the compilers create when you use these options. It also describes the accepted input files.

Table 1.1. Stop-after Options, Inputs and Outputs

–E
    Stop after: preprocessing
    Input: Source files. For Fortran, must have .F extension.
    Output: preprocessed file to standard output

–F
    Stop after: preprocessing
    Input: Source files. Must have .F extension. This option is not valid for pgcc or pgcpp.
    Output: preprocessed file (.f)

–P
    Stop after: preprocessing
    Input: Source files. This option is not valid for pgf77, pgf95 or pghpf.
    Output: preprocessed file (.i)

–S
    Stop after: compilation
    Input: Source files or preprocessed files
    Output: assembly-language file (.s)

–c
    Stop after: assembly
    Input: Source files, preprocessed files or assembly-language files
    Output: unlinked object file (.o or .obj)

none
    Stop after: linking
    Input: Source files, preprocessed files, assembly-language files, object files or libraries
    Output: executable file (a.out or .exe)

If you specify multiple input files or do not specify an object filename, the compiler uses the input filenames to derive corresponding default output filenames of the following form, where filename is the input filename without its extension:

filename.f indicates a preprocessed file, if you compiled a Fortran file using the –F option.
filename.i indicates a preprocessed file, if you compiled using the –P option.
filename.lst indicates a listing file from the –Mlist option.
filename.o or filename.obj indicates an object file from the –c option.
filename.s indicates an assembly-language file from the –S option.

Note: Unless you specify otherwise, the destination directory for any output file is the current working directory. If the file exists in the destination directory, the compiler overwrites it.

The following example demonstrates the use of output filename extensions.

    $ pgf95 -c proto.f proto1.F

This produces the output files proto.o and proto1.o, or, on Windows, proto.obj and proto1.obj, all of which are binary object files. Prior to compilation, the file proto1.F is preprocessed because it has a .F filename extension.

Fortran, C, and C++ Data Types

The PGI Fortran, C, and C++ compilers recognize scalar and aggregate data types. A scalar data type holds a single value, such as the integer value 42 or the real value 112.6. An aggregate data type consists of one or more scalar data type objects, such as an array of integer values.

For information about the format and alignment of each data type in memory, and the range of values each type can have on x86 or x64 processor-based systems running a 32-bit operating system, refer to Chapter 13, “Fortran, C and C++ Data Types”.

For more information on x86-specific data representation, refer to the System V Application Binary Interface Processor Supplement by AT&T UNIX System Laboratories, Inc. (Prentice Hall, Inc.).

This manual specifically does not address x64 processor-based systems running a 64-bit operating system, because the application binary interface (ABI) for those systems is still evolving. For the latest version of this ABI, see www.x86-64.org/abi.pdf.
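Returning to the default output filenames described under “Output Files”, the derivation rule (input filename without its extension, plus a suffix chosen by the stop-after option, placed in the current directory) can be sketched in shell. The function name default_output and its optional third argument are hypothetical and exist only for this illustration; the drivers themselves perform this derivation internally.

```shell
#!/bin/sh
# Hypothetical helper: derive the default output filename for one input file
# and one stop-after option. The optional third argument selects the object
# suffix (o by default, obj for Windows).
default_output() {
    opt=$1
    input=${2##*/}      # output lands in the current directory: drop any path
    base=${input%.*}    # input filename without its extension
    objext=${3:-o}
    case "$opt" in
        -F) echo "$base.f" ;;        # preprocessed Fortran
        -P) echo "$base.i" ;;        # preprocessed C/C++
        -S) echo "$base.s" ;;        # assembly language
        -c) echo "$base.$objext" ;;  # unlinked object file
        *)  return 1 ;;              # not a file-producing stop-after option
    esac
}

default_output -c proto.f
default_output -c proto1.F
default_output -c proto1.F obj
```

The three calls reproduce the names from the proto.f and proto1.F example: proto.o, proto1.o, and, with the Windows suffix, proto1.obj.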
Parallel Programming Using the PGI Compilers

The PGI compilers support three styles of parallel programming:

• Automatic shared-memory parallel programs compiled using the –Mconcur option to pgf77, pgf95, pgcc, or pgcpp. Parallel programs of this variety can be run on shared-memory parallel (SMP) systems such as dual-core or multi-processor workstations.
• OpenMP shared-memory parallel programs compiled using the –mp option to pgf77, pgf95, pgcc, or pgcpp. Parallel programs of this variety can be run on SMP systems. Carefully coded user-directed parallel programs using OpenMP directives can often achieve significant speed-ups on dual-core workstations or large numbers of processors on SMP server systems. Chapter 5, “Using OpenMP” contains complete descriptions of user-directed parallel programming.
• Data parallel shared- or distributed-memory parallel programs compiled using the PGHPF High Performance Fortran compiler. Parallel programs of this variety can be run on SMP workstations or servers, distributed-memory clusters of workstations, or clusters of SMP workstations or servers. Coding a data parallel version of an application can be more work than using OpenMP directives, but has the advantage that the resulting executable is usable on all types of parallel systems regardless of whether shared memory is available. See the PGHPF User’s Guide for a complete description of how to build and execute data parallel HPF programs.

In this manual, the first two types of parallel programs are collectively referred to as SMP parallel programs. The third type is referred to as a data parallel program, or simply as an HPF program.

Some newer CPUs incorporate two or more complete processor cores (functional units, registers, level 1 cache, level 2 cache, and so on) on a single silicon die. These CPUs are known as multi-core processors. For purposes of HPF, threads, or OpenMP parallelism, these cores function as two or more distinct processors.
However, the processing cores are on a single chip occupying a single socket on a system motherboard. For purposes of PGI software licensing, a multi-core processor is treated as a single CPU.

Running SMP Parallel Programs

When you execute an SMP parallel program, by default it uses only one processor. To run on more than one processor, set the NCPUS environment variable to the desired number of processors, subject to a maximum of four for PGI’s workstation-class products. In the commands below, <number> stands for that number of processors.

You can set this environment variable by issuing the following command in a Windows command prompt window:

    > set NCPUS=<number>

In a shell command window under csh, issue the following command:

    % setenv NCPUS <number>

In an sh, ksh, or BASH command window, issue the following command:

    % NCPUS=<number>; export NCPUS

Note: If you set NCPUS to a number larger than the number of physical processors, your program may execute very slowly.

Running Data Parallel HPF Programs

When you execute an HPF program, by default it will use only one processor. If you wish to run on more than one processor, use the -pghpf -np runtime option. For example, to compile and run the hello.f example defined above on one processor, you would issue the following commands:

    % pghpf -o hello hello.f
    Linking:
    % hello
    hello
    %

To execute it on two processors, you would issue the following commands:

    % hello -pghpf -np 2
    hello
    %

Note: If you specify a number larger than the number of physical processors, your program will execute very slowly.

You still only see a single “hello” printed to your screen. This is because HPF is a single-threaded model, meaning that all statements execute with the same semantics as if they were running in serial. However, parallel statements or constructs operating on explicitly distributed data are in fact executed in parallel. The programmer must manually insert compiler directives to cause data to be distributed to the available processors.
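Both notes above warn against asking for more processors than the machine has, whether through NCPUS or through -pghpf -np. One way to stay within that bound from sh, ksh, or bash is sketched below. The function pick_count is hypothetical, and getconf _NPROCESSORS_ONLN is an assumption about the host: it is widely available on Linux and Mac OS X, is not required by POSIX, and reports online logical processors, which can exceed the number of physical processors.

```shell
#!/bin/sh
# Hypothetical helper: clamp a requested processor count to the number of
# processors present and to an optional cap (for example, the limit of four
# that the text gives for workstation-class products).
pick_count() {
    want=$1
    present=$2
    cap=${3:-$present}
    n=$want
    if [ "$n" -gt "$present" ]; then n=$present; fi
    if [ "$n" -gt "$cap" ]; then n=$cap; fi
    echo "$n"
}

# Assumed way to count processors; fall back to 1 if it is unavailable
# or returns something that is not a number.
nproc_online=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
case $nproc_online in
    ''|*[!0-9]*) nproc_online=1 ;;
esac

# Ask for 8 processors, but never more than are present or more than 4.
NCPUS=$(pick_count 8 "$nproc_online" 4)
export NCPUS
echo "NCPUS=$NCPUS"
```

The same value could be passed to an HPF executable after -pghpf -np, in which case the cap argument can simply be omitted.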
See the PGHPF User’s Guide and The High Performance Fortran Handbook for more details on constructing and executing data parallel programs on shared-memory or distributed-memory cluster systems using PGHPF.

Platform-specific considerations

There are nine platforms supported by the PGI Workstation and PGI Server compilers and tools:

• 32-bit Linux: supported on 32-bit Linux operating systems running on either a 32-bit x86 compatible or an x64 compatible processor.
• 64-bit/32-bit Linux: includes all features and capabilities of the 32-bit Linux version, and is also supported on 64-bit Linux operating systems running on an x64 compatible processor.
• 32-bit Windows: supported on 32-bit Windows operating systems running on either a 32-bit x86 compatible or an x64 compatible processor.
• 64-bit/32-bit Windows: includes all features and capabilities of the 32-bit Windows version, and is also supported on 64-bit Windows operating systems running on an x64 compatible processor.
• 32-bit SFU: supported on 32-bit Windows operating systems running on either a 32-bit x86 compatible or an x64 compatible processor.
• 32-bit SUA: supported on 32-bit Windows operating systems running on either a 32-bit x86 compatible or an x64 compatible processor.
• 64-bit/32-bit SUA: includes all features and capabilities of the 32-bit SUA version, and is also supported on 64-bit Windows operating systems running on an x64 compatible processor.
• 32-bit Apple Mac OS X: supported on 32-bit Apple Mac operating systems running on either a 32-bit or 64-bit Intel-based Mac system.
• 64-bit Apple Mac OS X: supported on 64-bit Apple Mac operating systems running on a 64-bit Intel-based Mac system.

The following sections describe the specific considerations required to use the PGI compilers on the various platforms: Linux, Windows, and Apple Mac OS X.

Using the PGI Compilers on Linux

Linux Header Files

The Linux system header files contain many GNU gcc extensions.
PGI supports many of these extensions, thus allowing the PGCC C and C++ compilers to compile most programs that the GNU compilers can compile. A few header files not interoperable with the PGI compilers have been rewritten and are included in $PGI/linux86/include. These files are: sigset.h, asm/byteorder.h, stddef.h, asm/posix_types.h and others. Also, PGI’s version of stdarg.h supports changes in newer versions of Linux. If you are using the PGCC C or C++ compilers, please make sure that the supplied versions of these include files are found before the system versions. This will happen by default unless you explicitly add a –I option that references one of the system include directories.

Running Parallel Programs on Linux

You may encounter difficulties running auto-parallel or OpenMP programs on Linux systems when the per-thread stack size is set to the default (2MB). If you have unexplained failures, please try setting the environment variable OMP_STACK_SIZE to a larger value, such as 8MB. This can be accomplished with the following command in csh:

    % setenv OMP_STACK_SIZE 8M

In bash, sh, or ksh, use:

    % OMP_STACK_SIZE=8M; export OMP_STACK_SIZE

If your program is still failing, you may be encountering the hard 8 MB limit on main process stack sizes in Linux. You can work around the problem by issuing the following command in csh:

    % limit stacksize unlimited

In bash, sh, or ksh, use:

    % ulimit -s unlimited

Using the PGI Compilers on Windows

BASH Shell Environment

On Windows platforms, the tools that ship with the PGI Workstation or PGI Server command-level compilers include a full-featured shell command environment. After installation, you should have a PGI icon on your Windows desktop. Double-left-click on this icon to cause an instance of the BASH command shell to appear on your screen.
Working within BASH is very much like working within the sh or ksh shells on a Linux system, but in addition BASH has a command history feature similar to csh and several other unique features. Shell programming is fully supported. A complete BASH User’s Guide is available through the PGI online manual set. Select “PGI Workstation” under Start->Programs and double-left-click on the documentation icon to see the online manual set. You must have a web browser installed on your system in order to read the online manuals.

The BASH shell window is pre-initialized for usage of the PGI compilers and tools, so there is no need to set environment variables or modify your command path when the command window comes up. In addition to the PGI compiler commands referenced above, within BASH you have access to over 100 common commands and utilities, including but not limited to the following:

    vi                     emacs           make
    tar / untar            gzip / gunzip   ftp
    sed                    grep / egrep / fgrep   awk
    cat                    cksum           cp
    date                   diff            du
    find                   kill            ls
    more / less            mv              printenv / env
    rm / rmdir             touch           wc

If you are familiar with program development in a Linux environment, editing, compiling, and executing programs within BASH will be very comfortable. If you have not previously used such an environment, you should take time to familiarize yourself with either the vi or emacs editors and with makefiles. The emacs editor has an extensive online tutorial, which you can start by bringing up emacs and selecting the appropriate option under the pull-down help menu. You can get a thorough introduction to the construction and use of makefiles through the online Makefile User’s Guide.

For library compatibility, PGI provides versions of ar and ranlib that are compatible with native Windows object-file formats. For more information on these commands, refer to “Creating and Using Static Libraries on Windows.”

Windows Command Prompt

The PGI Workstation entry in the Windows Start menu contains a submenu titled PGI Workstation Tools. This submenu contains a shortcut labeled PGI Command Prompt (32-bit). The shortcut is used to launch a Windows command shell using an environment pre-initialized for the use of the 32-bit PGI compilers and tools. On x64 systems, a second shortcut labeled PGI Command Prompt (64-bit) will also be present. This shortcut launches a Windows command shell using an environment pre-initialized for the use of the 64-bit PGI compilers and tools.

Using the PGI Compilers on SUA and SFU

Subsystem for Unix Applications (SUA and SFU)

Subsystem for Unix Applications (SUA) is a source-compatibility subsystem for running Unix applications on 32-bit and 64-bit Windows server-class operating systems. PGI Workstation for Windows includes compilers and tools for SUA and its 32-bit-only predecessor, Services For Unix (SFU).

SUA provides an operating system for POSIX processes. There is a package of support utilities available for download from Microsoft that provides a more complete Unix environment, including features like shells, scripting utilities, a telnet client, development tools, and so on.

SUA/SFU Header Files

The SUA/SFU system header files contain numerous non-standard extensions. PGI supports many of these extensions, thus allowing the PGCC C and C++ compilers to compile most programs that the GNU compilers can compile. A few header files not interoperable with the PGI compilers have been rewritten and are included in $PGI/sua32/include or $PGI/sua64/include. These files are: stdarg.h, stddef.h, and others. If you are using the PGCC C or C++ compilers, please make sure that the supplied versions of these include files are found before the system versions. This happens by default unless you explicitly add a –I option that references one of the system include directories.
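The caution about –I options can be checked mechanically before a build. The sketch below scans a set of compiler arguments for a -I option that names a system include directory. The function name has_system_include, the SYSTEM_INCLUDE_DIR variable, and the default of /usr/include are all assumptions made for this illustration; substitute the system include directory that applies to your platform.

```shell
#!/bin/sh
# Hypothetical helper: return status 0 if any argument is a -I option that
# names the system include directory or one of its subdirectories.
# /usr/include is an assumed default, overridable via SYSTEM_INCLUDE_DIR.
has_system_include() {
    sysdir=${SYSTEM_INCLUDE_DIR:-/usr/include}
    for arg in "$@"; do
        case "$arg" in
            -I"$sysdir"|-I"$sysdir"/*) return 0 ;;
        esac
    done
    return 1
}

# Example: these flags would place a system directory on the include path,
# which can cause system headers to be found before the PGI-supplied ones.
if has_system_include -O2 -I/usr/include -c hello.c; then
    echo "warning: -I names a system include directory" >&2
fi
```

A check like this fits naturally in a makefile rule or wrapper script, where the full list of flags is known before the compiler is invoked.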
Running Parallel Programs on SUA and SFU

You may encounter difficulties running auto-parallel or OpenMP programs on SUA/SFU systems when the per-thread stack size is set to the default (2MB). If you have unexplained failures, please try setting the environment variable OMP_STACK_SIZE to a larger value, such as 8MB. This can be accomplished with the following command:

in csh:
    % setenv OMP_STACK_SIZE 8M
in bash, sh, or ksh:
    % OMP_STACK_SIZE=8M; export OMP_STACK_SIZE

Using the PGI Compilers on Mac OS X

Mac OS X Header Files

The Mac OS X header files contain numerous non-standard extensions. PGI supports many of these extensions, thus allowing the PGCC C and C++ compilers to compile most programs that the GNU compilers can compile. A few header files not interoperable with the PGI compilers have been rewritten and are included in $PGI/sua32/include or $PGI/sua64/include. These files are: stdarg.h, stddef.h, and others. If you are using the PGCC C or C++ compilers, please make sure that the supplied versions of these include files are found before the system versions. This will happen by default unless you explicitly add a –I option that references one of the system include directories.

Running Parallel Programs on Mac OS

You may encounter difficulties running auto-parallel or OpenMP programs on Mac OS X systems when the per-thread stack size is set to the default (8MB). If you have unexplained failures, please try setting the environment variable OMP_STACK_SIZE to a larger value, such as 16MB. This can be accomplished with the following command:

in csh:
    % setenv OMP_STACK_SIZE 16M
in bash, sh, or ksh:
    % OMP_STACK_SIZE=16M; export OMP_STACK_SIZE

Site-specific Customization of the Compilers

If you are using the PGI compilers and want all your users to have access to specific libraries or other files, there are special files that allow you to customize the compilers for your site.
Using siterc Files

The PGI compiler drivers utilize a file named siterc to enable site-specific customization of the behavior of the PGI compilers. The siterc file is located in the bin subdirectory of the PGI installation directory. Using siterc, you can control how the compiler drivers invoke the various components in the compilation tool chain.

Using User rc Files

In addition to the siterc file, user rc files can reside in a given user’s home directory, as specified by the user’s HOME environment variable. You can use these files to control the respective PGI compilers. All of these files are optional.

On Linux and SUA these files are named .mypgf77rc, .mypgf90rc, .mypgccrc, .mypgcpprc, and .mypghpfrc. On native Windows, these files are named mypgf77rc, mypgf95rc, mypgccrc, mypgcpprc, and mypghpfrc.

The following examples show how these rc files can be used to tailor a given installation for a particular purpose.

Table 1.2. Examples of Using siterc and User rc Files

To do this...
    Add the line shown to the indicated file

Make the libraries found in /opt/newlibs/64 available to all linux86-64 compilations.
    set SITELIB=/opt/newlibs/64;
    to /opt/pgi/linux86-64/7.1/bin/siterc

Make the libraries found in /opt/newlibs/32 available to all linux86 compilations.
    set SITELIB=/opt/newlibs/32;
    to /opt/pgi/linux86/7.1/bin/siterc

Add the new library path /opt/local/fast to all linux86-64 compilations.
    append SITELIB=/opt/local/fast;
    to /opt/pgi/linux86-64/7.1/bin/siterc

Make the include path -I/opt/acml/include available to all compilations.
    set SITEINC=/opt/acml/include;
    to /opt/pgi/linux86/7.1/bin/siterc and /opt/pgi/linux86-64/7.1/bin/siterc

Change –Mmpi to link in /opt/mympi/64/libmpix.a with linux86-64 compilations.
    set MPILIBDIR=/opt/mympi/64;
    set MPILIBNAME=mpix;
    to /opt/pgi/linux86-64/7.1/bin/siterc

Have linux86-64 compilations always add –DIS64BIT –DAMD.
    set SITEDEF=IS64BIT AMD;
    to /opt/pgi/linux86-64/7.1/bin/siterc

Build an F90 executable for linux86-64 or linux86 that resolves PGI shared objects in the relative directory ./REDIST.
    set RPATH=./REDIST;
    to ~/.mypgf95rc
    Note: This only affects the behavior of PGF95 for the given user.

Common Development Tasks

Now that you have a brief introduction to the compiler, let’s look at some common development tasks that you might wish to perform.

• When you compile code you can specify a number of options on the command line that define specific characteristics related to how the program is compiled and linked, typically enhancing or overriding the default behavior of the compiler. For a list of the most common command line options and information on all the command line options, refer to Chapter 2, “Using Command Line Options”.

• Code optimization and parallelization allow you to organize your code for efficient execution. While possibly increasing compilation time and making the code more difficult to debug, these techniques typically produce code that runs significantly faster than code that does not use them. For more information on optimization and parallelization, refer to Chapter 3, “Using Optimization & Parallelization”.

• Function inlining, a special type of optimization, replaces a call to a function or a subroutine with the body of the function or subroutine. This process can speed up execution by eliminating parameter passing and the function or subroutine call and return overhead. In addition, function inlining allows the compiler to optimize the function with the rest of the code. However, function inlining may also result in much larger code size with no increase in execution speed. For more information on function inlining, refer to Chapter 4, “Using Function Inlining”.

• Directives and pragmas allow users to place hints in the source code to help the compiler generate better assembly code. You typically use directives and pragmas to control the actions of the compiler in a particular portion of a program without affecting the program as a whole. You place them in your source code where you want them to take effect. A directive or pragma typically stays in effect from the point where included until the end of the compilation unit or until another directive or pragma changes its status. For more information on directives and pragmas, refer to Chapter 5, “Using OpenMP” and Chapter 6, “Using Directives and Pragmas”.

• A library is a collection of functions or subprograms used to develop software. Libraries contain "helper" code and data, which provide services to independent programs, allowing code and data to be shared and changed in a modular fashion. The functions and programs in a library are grouped for ease of use and linking. When creating your programs, it is often useful to incorporate standard libraries or proprietary ones. For more information on this topic, refer to Chapter 7, “Creating and Using Libraries”.

• Environment variables define a set of dynamic values that can affect the way running processes behave on a computer. It is often useful to use these variables to set and pass information that alters the default behavior of the PGI compilers and the executables which they generate. For more information on these variables, refer to Chapter 8, “Using Environment Variables”.

• Deployment, though possibly an infrequent task, can present some unique issues related to concerns of porting the code to other systems. Deployment, in this context, involves distribution of a specific file or set of files that are already compiled and configured.
The distribution must occur in such a way that the application executes accurately on another system which may not be configured exactly the same as the system on which the code was created. For more information on what you might need to know to successfully deploy your code, refer to Chapter 9, “Distributing Files - Deployment”.

• An intrinsic is a function available in a given language whose implementation is handled specially by the compiler. Intrinsics make using processor-specific enhancements easier because they provide a C/C++ language interface to assembly instructions. In doing so, the compiler manages details that the user would normally have to be concerned with, such as register names, register allocations, and memory locations of data. For C/C++ programs, PGI provides support for MMX and SSE/SSE2/SSE3 intrinsics. For more information on these intrinsics, refer to Chapter 20, “C/C++ MMX/SSE Inline Intrinsics”.

Chapter 2. Using Command Line Options

A command line option allows you to control specific behavior when a program is compiled and linked. This chapter describes the syntax for properly using command-line options and provides a brief overview of a few of the more common options.

Note
For a complete list of command-line options, their descriptions and use, refer to Chapter 15, “Command-Line Options Reference,” on page 163.

Command Line Option Overview

Before looking at all the command-line options, first become familiar with the syntax for these options. There are a large number of options available to you, yet most users only use a few of them. So, start simple and progress into using the more advanced options.

By default, the PGI 7.1 compilers generate code that is optimized for the type of processor on which compilation is performed, the compilation host. Before adding options to your command-line, review the sections “Help with Command-line Options,” on page 16 and “Frequently-used Options,” on page 19.

Command-line Options Syntax

On a command-line, options need to be preceded by a hyphen (-). If the compiler does not recognize an option, it passes the option to the linker.

This document uses the following notation when describing options:

[item]
    Square brackets indicate that the enclosed item is optional.
{item | item}
    Braces indicate that you must select one and only one of the enclosed items. A vertical bar (|) separates the choices.
...
    Horizontal ellipses indicate that zero or more instances of the preceding item are valid.

NOTE
Some options do not allow a space between the option and its argument or within an argument. When applicable, the syntax section of the option description in Chapter 15, “Command-Line Options Reference,” on page 163 contains this information.

Command-line Suboptions

Some options accept several suboptions. You can specify these suboptions either by using the full option statement multiple times or by using a comma-separated list for the suboptions. The following two command lines are equivalent:

    pgf95 -Mvect=sse -Mvect=noaltcode
    pgf95 -Mvect=sse,noaltcode

Command-line Conflicting Options

Some options have an opposite or negated counterpart. For example, both –Mvect and –Mnovect are available. –Mvect enables vectorization and –Mnovect disables it. If you used both of these options on a command line, they would conflict.

Note
Rule: When you use conflicting options on a command line, the last encountered option takes precedence over any previous one.

This rule is important for a number of reasons.

• Some options, such as –fast, include other options. Therefore, it is possible for you to be unaware that you have conflicting options.

• You can use this rule to create makefiles that apply specific flags to a set of files, as shown in Example 2.1.

Example 2.1. Makefiles with Options

In this makefile, CCFLAGS uses vectorization. CCNOVECTFLAGS uses the flags defined for CCFLAGS but disables vectorization.

    CCFLAGS=-c -Mvect=sse
    CCNOVECTFLAGS=$(CCFLAGS) -Mnovect

Help with Command-line Options

If you are just getting started with the PGI compilers and tools, it is helpful to know which options are available, when to use them, and which options most users find effective.

Using –help

The –help option is useful because it provides information about all options supported by a given compiler. You can use –help in one of three ways:

• Use –help with no parameters to obtain a list of all the available options with a brief one-line description of each.

• Add a parameter to –help to restrict the output to information about a specific option. The syntax for this usage is –help followed by the option of interest. For example, suppose you use the following command to restrict the output to information about the -fast option:

    pgf95 -help -fast

  The output you see is similar to this:

    -fast Common optimizations; includes -O2 -Munroll=c:1 -Mnoframe -Mlre

  In the following example, usage information for –help shows how groups of options can be listed or examined according to function:

    $ pgf95 -help -help
    -help[=groups|asm|debug|language|linker|opt|other|overall|phase|prepro|suffix|switch|target|variable]
                    Show compiler switches

• Add a parameter to –help to restrict the output to a specific set of options or to a building process. The syntax for this usage is -help= followed by the name of a subgroup. The previous output from the command pgf95 -help -help shows the available subgroups. For example, you can use the following command to restrict the output to information about options related to debug information generation:

    pgf95 -help=debug

  The output you see is similar to this:

    Debugging switches:
    -M[no]bounds    Generate code to check array bounds
    -Mchkfpstk      Check consistency of floating point stack at subprogram calls (32-bit only)
                    Note: This switch only works on 32-bit. On 64-bit, the switch is ignored.
    -Mchkstk        Check for sufficient stack space upon subprogram entry
    -Mcoff          Generate COFF format object
    -Mdwarf1        Generate DWARF1 debug information with -g
    -Mdwarf2        Generate DWARF2 debug information with -g
    -Mdwarf3        Generate DWARF3 debug information with -g
    -Melf           Generate ELF format object
    -g              Generate information for debugger
    -gopt           Generate information for debugger without disabling optimizations

For a complete description of subgroups, refer to “–help,” on page 178.

Getting Started with Performance

One of the top priorities of most users is performance and optimization. This section provides a quick overview of a few of the command-line options that are useful in improving performance.

Using –fast and –fastsse Options

PGI compilers implement a wide range of options that allow users a fine degree of control on each optimization phase. When it comes to optimization of code, the quickest way to start is to use –fast and –fastsse. These options create a generally optimal set of flags for targets that support SSE/SSE2 capability. They incorporate optimization options to enable use of vector streaming SIMD (SSE/SSE2) instructions for 64-bit targets. They enable vectorization with SSE instructions and cache alignment, and they set SSE arithmetic to flush-to-zero mode.

Note
The contents of the –fast and –fastsse options are host-dependent. Further, you should use these options on both compile and link command lines.

• –fast and –fastsse typically include these options:

    –O2
        Specifies a code optimization level of 2.
    –Munroll=c:1
        Unrolls loops, executing multiple instances of the loop during each iteration.
    –Mnoframe
        Indicates to not generate code to set up a stack frame.
    –Mlre
        Indicates loop-carried redundancy elimination.

• These additional options are also typically available when using –fast for 64-bit targets and when using –fastsse for both 32- and 64-bit targets:

    –Mvect=sse
        Generates SSE instructions.
    –Mscalarsse
        Generates scalar SSE code with xmm registers; implies –Mflushz.
    –Mcache_align
        Aligns long objects on cache-line boundaries.
    –Mflushz
        Sets SSE to flush-to-zero mode.

Note
For best performance on processors that support SSE instructions, use the PGF95 compiler, even for FORTRAN 77 code, and the –fast option.

To see the specific behavior of –fast for your target, use the following command:

    pgf95 -help -fast

Other Performance-related Options

While –fast and –fastsse are options designed to be the quickest route to best performance, they are limited to routine boundaries. Depending on the nature and writing style of the source code, the compiler often can perform further optimization by knowing the global context of usage of a given routine. For instance, determining the possible value range of actual parameters of a routine could enable a loop to be vectorized; similarly, determining static occurrence of calls helps to decide which routine is beneficial to inline.

These types of global optimizations are under control of Inter Procedural Analysis (IPA) in PGI compilers. Option –Mipa enables Inter Procedural Analysis. –Mipa=fast is the recommended option to get the best performance from global optimization. You can also add the suboption inline to enable automatic global inlining across files. You might consider using –Mipa=fast,inline. This option for inter-procedural analysis and global optimization can improve performance.

You may also be able to obtain further performance improvements by experimenting with the individual –Mpgflag options detailed in the section “–M Options by Category,” on page 219. These options include –Mvect, –Munroll, –Minline, –Mconcur, and –Mpfi/–Mpfo. However, performance improvements using these options are typically application- and system-dependent. It is important to time your application carefully when using these options to ensure no performance degradations occur.
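
The advice above to time an application carefully can be followed with a small harness. The following is a minimal sketch in C, not part of the PGI tools: the names best_cpu_seconds, workload, and sum_squares are hypothetical, and the workload is only a stand-in for your own kernel. It takes the fastest of several runs, measured with the standard clock() function, so that one-off disturbances do not mask the effect of an option change.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for the code being tuned: sums the first n squares. */
double sum_squares(size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; ++i)
        total += (double)i * (double)i;
    return total;
}

/* Volatile sink so the optimizer cannot discard the workload entirely. */
static volatile double sink;

void workload(void)
{
    sink = sum_squares(100000);
}

/* Runs fn `repeats` times and returns the smallest CPU time, in seconds. */
double best_cpu_seconds(void (*fn)(void), int repeats)
{
    assert(fn != NULL && repeats > 0);
    double best = -1.0;
    for (int r = 0; r < repeats; ++r) {
        clock_t start = clock();
        fn();
        clock_t stop = clock();
        double elapsed = (double)(stop - start) / CLOCKS_PER_SEC;
        if (best < 0.0 || elapsed < best)
            best = elapsed;
    }
    return best;
}
```

Build the same source once per option set, call best_cpu_seconds(workload, 5) in each build, and compare the results before adopting an option.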

For more information on optimization, refer to Chapter 3, “Using Optimization & Parallelization,” on page 21. For specific information about these options, refer to “–M Optimization Controls,” on page 229.

Targeting Multiple Systems; Using the -tp Option

The –tp option allows you to set the target architecture. By default, the PGI compiler uses all supported instructions wherever possible when compiling on a given system. As a result, executables created on a given system may not be usable on previous generation systems. For example, executables created on a Pentium 4 may fail to execute on a Pentium III or Pentium II.

Processor-specific optimizations can be specified or limited explicitly by using the -tp option. Thus, it is possible to create executables that are usable on previous generation systems. With the exception of k8-64, k8-64e, p7-64, and x64, any of these sub-options are valid on any x86 or x64 processor-based system. The k8-64, k8-64e, p7-64 and x64 options are valid only on x64 processor-based systems.

For more information about the -tp option, refer to “–tp [,target...],” on page 202.

Frequently-used Options

In addition to overall performance, there are a number of other options that many users find useful when getting started. The following table provides a brief summary of these options. For more information on these options, refer to the complete description of each option available in Chapter 15, “Command-Line Options Reference,” on page 163. Also, there are a number of suboptions available with each of the –M options listed. For more information on those options, refer to “–M Options by Category”.

Table 2.1. Commonly Used Command Line Options

Option
    Description

–fast or –fastsse
    These options create a generally optimal set of flags for targets that support SSE/SSE2 capability. They incorporate optimization options to enable use of vector streaming SIMD instructions (64-bit targets) and enable vectorization with SSE instructions, cache alignment, and flush-to-zero mode.
–g
    Instructs the compiler to include symbolic debugging information in the object module.
–gopt
    Instructs the compiler to include symbolic debugging information in the object file, and to generate optimized code identical to that generated when –g is not specified.
–help
    Provides information about available options.
–mcmodel=medium
    Enables medium-model code generation for 64-bit targets; useful when the data space of the program exceeds 4GB.
–Mconcur
    Instructs the compiler to enable auto-concurrentization of loops. If specified, the compiler uses multiple processors to execute loops that it determines to be parallelizable; thus, loop iterations are split to execute optimally in a multithreaded execution context.
–Minfo
    Instructs the compiler to produce information on standard error.
–Minline
    Passes options to the function inliner.
–Mipa=fast,inline
    Enables interprocedural analysis and optimization. Also enables automatic procedure inlining.
–Mneginfo
    Instructs the compiler to produce information on standard error about why certain optimizations are inhibited.
–Mpfi and –Mpfo
    Enable profile feedback driven optimizations.
–Mkeepasm
    Keeps the generated assembly files.
–Munroll
    Invokes the loop unroller to unroll loops, executing multiple instances of the loop during each iteration. This also sets the optimization level to 2 if the level is set to less than 2, or if no –O or –g options are supplied.
–M[no]vect
    Enables/Disables the code vectorizer.
--[no_]exceptions
    Enables (--exceptions) or removes (--no_exceptions) exception handling in user code.
–o
    Names the output file.
–Olevel
    Specifies code optimization level, where level is 0, 1, 2, 3, or 4.
–tp [,target...]
    Specify the type(s) of the target processor(s) to enable generation of PGI Unified Binary executables.

Chapter 3. Using Optimization & Parallelization

Source code that is readable, maintainable, and produces correct results is not always organized for efficient execution. Normally, the first step in the program development process involves producing code that executes and produces the correct results. This first step usually involves compiling without much worry about optimization. After code is compiled and debugged, code optimization and parallelization become an issue.

Invoking one of the PGI compiler commands with certain options instructs the compiler to generate optimized code. Optimization is not always performed since it increases compilation time and may make debugging difficult. However, optimization produces more efficient code that usually runs significantly faster than code that is not optimized.

The compilers optimize code according to the specified optimization level. Using the –O, –Mvect, –Mipa, and –Mconcur options, you can specify the optimization levels. In addition, you can use several –M switches to control specific types of optimization and parallelization.

This chapter describes the optimization options displayed in the following list.

    –fast           –Mpfi           –Mvect
    –Mconcur        –Mpfo           –O
    –Mipa=fast      –Munroll

This chapter also describes how to choose optimization options to use with the PGI compilers. This overview will help if you are just getting started with one of the PGI compilers, or wish to experiment with individual optimizations. Complete specifications of each of these options are available in Chapter 15, “Command-Line Options Reference”.

Overview of Optimization

In general, optimization involves using transformations and replacements that generate more efficient code. This is done by the compiler and involves replacements that are independent of the particular target processor’s architecture as well as replacements that take advantage of the x86 or x64 architecture, instruction set and registers.
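
As a small illustration of a target-independent replacement of the kind mentioned above, the following C sketch shows a loop before and after a hand-applied strength reduction, in which a multiplication per iteration becomes a running addition. The function names are hypothetical, and the second function is written by hand for illustration; it is not output produced by the PGI compilers.

```c
#include <assert.h>
#include <stddef.h>

/* Straightforward form: one multiplication per iteration. */
void fill_multiples(long *out, size_t n, long step)
{
    assert(out != NULL);
    for (size_t i = 0; i < n; ++i)
        out[i] = (long)i * step;
}

/* Strength-reduced form: the multiplication is replaced by an addition
   carried from one iteration to the next. */
void fill_multiples_reduced(long *out, size_t n, long step)
{
    assert(out != NULL);
    long value = 0;
    for (size_t i = 0; i < n; ++i) {
        out[i] = value;
        value += step;
    }
}
```

Both functions store the same values; only the work done per iteration differs, which is the property an optimizer must preserve when it makes such a replacement.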

For the discussion in this and the following chapters, optimization is divided into the following categories:

Local Optimization

This optimization is performed on a block-by-block basis within a program’s basic blocks. A basic block is a sequence of statements in which the flow of control enters at the beginning and leaves at the end without the possibility of branching, except at the end. The PGI compilers perform many types of local optimization including: algebraic identity removal, constant folding, common sub-expression elimination, redundant load and store elimination, scheduling, strength reduction, and peephole optimizations.

Global Optimization

This optimization is performed on a program unit over all its basic blocks. The optimizer performs control-flow and data-flow analysis for an entire program unit. All loops, including those formed by IFs and GOTOs, are detected and optimized. Global optimization includes: constant propagation, copy propagation, dead store elimination, global register allocation, invariant code motion, and induction variable elimination.

Loop Optimization: Unrolling, Vectorization, and Parallelization

The performance of certain classes of loops may be improved through vectorization or unrolling options. Vectorization transforms loops to improve memory access performance and make use of packed SSE instructions which perform the same operation on multiple data items concurrently. Unrolling replicates the body of loops to reduce loop branching overhead and provide better opportunities for local optimization, vectorization and scheduling of instructions. Performance for loops on systems with multiple processors may also improve using the parallelization features of the PGI compilers.

Interprocedural Analysis (IPA) and Optimization

Interprocedural analysis (IPA) allows use of information across function call boundaries to perform optimizations that would otherwise be unavailable.
For example, if the actual argument to a function is in fact a constant in the caller, it may be possible to propagate that constant into the callee and perform optimizations that are not valid if the dummy argument is treated as a variable. A wide range of optimizations are enabled or improved by using IPA, including but not limited to data alignment optimizations, argument removal, constant propagation, pointer disambiguation, pure function detection, F90/F95 array shape propagation, data placement, vestigial function removal, automatic function inlining, inlining of functions from pre-compiled libraries, and interprocedural optimization of functions from pre-compiled libraries.

Function Inlining

This optimization allows a call to a function to be replaced by a copy of the body of that function. This optimization will sometimes speed up execution by eliminating the function call and return overhead. Function inlining may also create opportunities for other types of optimization. Function inlining is not always beneficial. When used improperly it may increase code size and generate less efficient code.

Profile-Feedback Optimization (PFO)

Profile-feedback optimization (PFO) makes use of information from a trace file produced by specially instrumented executables which capture and save information on branch frequency, function and subroutine call frequency, semi-invariant values, loop index ranges, and other input data dependent information that can only be collected dynamically during execution of a program. By definition, use of profile-feedback optimization is a two-phase process: compilation and execution of a specially-instrumented executable, followed by a subsequent compilation which reads a trace file generated during the first phase and uses the information in that trace file to guide compiler optimizations.
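
To make the interprocedural constant propagation described earlier in this section concrete, here is a minimal C sketch. All function names are hypothetical, and the specialized routine is written by hand to show the kind of code that becomes possible once the constant argument is known; it is not a claim about what the PGI compilers actually generate.

```c
#include <assert.h>
#include <stddef.h>

/* General callee: stride is a variable, so the loop must work for any
   positive stride. */
void scale_strided(double *data, size_t n, size_t stride, double factor)
{
    assert(stride > 0);
    for (size_t i = 0; i < n; i += stride)
        data[i] *= factor;
}

/* Caller in which the actual argument for stride is always the constant 1. */
void scale_all(double *data, size_t n, double factor)
{
    scale_strided(data, n, 1, factor);
}

/* Hand-written equivalent of the callee after the constant stride of 1 has
   been propagated into it: a unit-stride loop, which is a simpler candidate
   for vectorization. */
void scale_all_specialized(double *data, size_t n, double factor)
{
    for (size_t i = 0; i < n; ++i)
        data[i] *= factor;
}
```

Without knowledge of the caller, the compiler must treat stride as unknown when compiling scale_strided; with it, the two entry points scale_all and scale_all_specialized compute the same result.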

Getting Started with Optimizations

Your first concern should be getting your program to execute and produce correct results. To get your program running, start by compiling and linking without optimization. Use the optimization level –O0 or select –g to perform minimal optimization. At this level, you will be able to debug your program easily and isolate any coding errors exposed during porting to x86 or x64 platforms.

If you want to get started quickly with optimization, a good set of options to use with any of the PGI compilers is –fast –Mipa=fast. For example:

    $ pgf95 -fast -Mipa=fast prog.f

For all of the PGI Fortran, C, and C++ compilers, the –fast and –Mipa=fast options generally produce code that is well-optimized without the possibility of significant slowdowns due to pathological cases. The –fast option is an aggregate option that includes a number of individual PGI compiler options; which PGI compiler options are included depends on the target for which compilation is performed. The –Mipa=fast option invokes interprocedural analysis including several IPA suboptions. For C++ programs, add -Minline=levels:10 --no_exceptions:

    $ pgcpp -fast -Mipa=fast -Minline=levels:10 --no_exceptions prog.cc

Note
A C++ program compiled with --no_exceptions will fail if the program uses exception handling.

By experimenting with individual compiler options on a file-by-file basis, further significant performance gains can sometimes be realized. However, depending on the coding style, individual optimizations can sometimes cause slowdowns, and must be used carefully to ensure performance improvements. In addition to -fast, the optimization flags most likely to further improve performance are -O3, -Mpfi, -Mpfo, -Minline, and on targets with multiple processors -Mconcur. In addition, the –Msafeptr option can significantly improve performance of C/C++ programs in which there is known to be no pointer aliasing.
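
The following C sketch illustrates why pointer aliasing matters for a flag such as –Msafeptr. The function names are hypothetical, and the second function is a hand-written picture of what a compiler is free to do once it assumes that pointer arguments never overlap; it is not actual PGI output.

```c
#include <assert.h>
#include <stddef.h>

/* Literal semantics: *addend is reloaded on every iteration, because a
   store to dst[i] could change it if the two pointers overlap. */
void add_value(int *dst, const int *addend, size_t n)
{
    assert(dst != NULL && addend != NULL);
    for (size_t i = 0; i < n; ++i)
        dst[i] += *addend;
}

/* Under a no-aliasing assumption the load can be hoisted out of the loop. */
void add_value_hoisted(int *dst, const int *addend, size_t n)
{
    assert(dst != NULL && addend != NULL);
    const int value = *addend;
    for (size_t i = 0; i < n; ++i)
        dst[i] += value;
}
```

When addend points to a separate variable, both functions give the same result. When it points into dst, for example add_value(v, &v[0], 3) on v = {1, 2, 3}, the literal version yields {2, 4, 5} while the hoisted version yields {2, 3, 4}, so the no-aliasing assumption changes the program's result.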
However, for obvious reasons this command-line option must be used carefully.

Three other options which are extremely useful are -help, -Minfo, and -dryrun.

–help

As described in “Help with Command-line Options,” on page 16, you can see a specification of any command-line option by invoking any of the PGI compilers with -help in combination with the option in question, without specifying any input files. For example:

    $ pgf95 -help -O
    Reading rcfile /usr/pgi/linux86-64/7.0/bin/.pgf95rc
    -O[] Set optimization level, -O0 to -O4, default -O2

Or you can see the full functionality of -help itself, which can return information on either an individual option or groups of options; type:

    $ pgf95 -help -help
    Reading rcfile /usr/pgi_rel/linux86-64/7.0/bin/.pgf95rc
    -help[=groups|asm|debug|language|linker|opt|other|overall|phase|prepro|suffix|switch|target|variable]

–Minfo

You can use the -Minfo option to display compile-time optimization listings. When this option is used, the PGI compilers issue informational messages to stderr as compilation proceeds. From these messages, you can determine which loops are optimized using unrolling, SSE instructions, vectorization, parallelization, interprocedural optimizations and various miscellaneous optimizations. You can also see where and whether functions are inlined.

You can use the -Mneginfo option to display informational messages listing why certain optimizations are inhibited.

For more information on -Minfo, refer to “–M Optimization Controls,” on page 229.

–dryrun

The –dryrun option can be useful as a diagnostic tool if you need to see the steps used by the compiler driver to preprocess, compile, assemble and link in the presence of a given set of command line inputs. When you specify the –dryrun option, these steps will be printed to stderr but are not actually performed.
For example, you can use this option to inspect the default and user-specified libraries that are searched during the link phase, and the order in which they are searched by the linker.

The remainder of this chapter describes the –O options, the loop unroller option –Munroll, the vectorizer option –Mvect, the auto-parallelization option –Mconcur, the interprocedural analysis optimization –Mipa, and the profile-feedback instrumentation (–Mpfi) and optimization (–Mpfo) options. You should be able to get very near optimal compiled performance using some combination of these switches.

Local and Global Optimization using -O

Using the PGI compiler commands with the –Olevel option (the capital O is for Optimize), you can specify any of the following optimization levels:

–O0
    Level zero specifies no optimization. A basic block is generated for each language statement.
–O1
    Level one specifies local optimization. Scheduling of basic blocks is performed. Register allocation is performed.
–O2
    Level two specifies global optimization. This level performs all level-one local optimization as well as level-two global optimization. If optimization is specified on the command line without a level, level 2 is the default.
–O3
    Level three specifies aggressive global optimization. This level performs all level-one and level-two optimizations and enables more aggressive hoisting and scalar replacement optimizations that may or may not be profitable.
–O4
    Level four performs all level-one, level-two, and level-three optimizations and enables hoisting of guarded invariant floating point expressions.

Note
If you use the -O option to specify optimization and do not specify a level, then level two optimization (-O2) is the default.

Level-zero optimization specifies no optimization (–O0). At this level, the compiler generates a basic block for each statement. Performance will almost always be slowest using this optimization level.
This level is useful for the initial execution of a program. It is also useful for debugging, since there is a direct correlation between the program text and the code generated.

Level-one optimization specifies local optimization (-O1). The compiler performs scheduling of basic blocks as well as register allocation. Local optimization is a good choice when the code is very irregular, such as code that contains many short statements containing IF statements and does not contain loops (DO or DO WHILE statements). Although this case rarely occurs, for certain types of code, this optimization level may perform better than level-two (-O2).

The PGI compilers perform many different types of local optimizations, including but not limited to:

- Algebraic identity removal
- Peephole optimizations
- Constant folding
- Redundant load and store elimination
- Common subexpression elimination
- Strength reductions
- Local register optimization

Level-two optimization (-O2 or -O) specifies global optimization. The -fast option generally will specify global optimization; however, the -fast switch varies from release to release, depending on a reasonable selection of switches for any one particular release. The -O or -O2 level performs all level-one local optimizations as well as global optimizations. Control flow analysis is applied and global registers are allocated for all functions and subroutines. Loop regions are given special consideration. This optimization level is a good choice when the program contains loops, the loops are short, and the structure of the code is regular.

The PGI compilers perform many different types of global optimizations, including but not limited to:

- Branch to branch elimination
- Global register allocation
- Constant propagation
- Invariant code motion
- Copy propagation
- Induction variable elimination
- Dead store elimination

You can explicitly select the optimization level on the command line.
For example, the following command line specifies level-two optimization which results in global optimization:

    $ pgf95 -O2 prog.f

Specifying -O on the command line without a level designation is equivalent to -O2. The default optimization level changes depending on which options you select on the command line. For example, when you select the -g debugging option, the default optimization level is set to level-zero (-O0). However, you can use the -gopt option to generate debug information without perturbing optimization if you need to debug optimized code. Refer to “Default Optimization Levels,” on page 42 for a description of the default levels.

As noted above, the -fast option includes -O2 on all x86 and x64 targets. If you wish to override this with -O3 while maintaining all other elements of -fast, simply compile as follows:

    $ pgf95 -fast -O3 prog.f

Scalar SSE Code Generation

For all processors prior to Intel Pentium 4 and AMD Opteron/Athlon64, for example Intel Pentium III and AMD AthlonXP/MP processors, scalar floating-point arithmetic as generated by the PGI Workstation compilers is performed using x87 floating-point stack instructions. With the advent of SSE/SSE2 instructions on Intel Pentium 4/Xeon and AMD Opteron/Athlon64, it is possible to perform all scalar floating-point arithmetic using SSE/SSE2 instructions. In most cases, this is beneficial from a performance standpoint.

The default on 32-bit Intel Pentium II/III (-tp p6, -tp piii, etc.) or AMD AthlonXP/MP (-tp k7) is to use x87 instructions for scalar floating-point arithmetic. The default on Intel Pentium 4/Xeon or Intel EM64T running a 32-bit operating system (-tp p7), AMD Opteron/Athlon64 running a 32-bit operating system (-tp k8-32), or AMD Opteron/Athlon64 or Intel EM64T processors running a 64-bit operating system (-tp k8-64 and -tp p7-64 respectively) is to use SSE/SSE2 instructions for scalar floating-point arithmetic.
The only way to override this default on AMD Opteron/Athlon64 or Intel EM64T processors running a 64-bit operating system is to specify an older 32-bit target (for example -tp k7 or -tp piii).

Note
There can be significant arithmetic differences between calculations performed using x87 instructions versus SSE/SSE2.

By default, all floating-point data is promoted to IEEE 80-bit format when stored on the x87 floating-point stack, and all x87 operations are performed register-to-register in this same format. Values are converted back to IEEE 32-bit or IEEE 64-bit when stored back to memory (for REAL/float and DOUBLE PRECISION/double data respectively). The default precision of the x87 floating-point stack can be reduced to IEEE 32-bit or IEEE 64-bit globally by compiling the main program with the -pc {32 | 64} option to the PGI Workstation compilers, which is described in detail in Chapter 2, “Using Command Line Options”. However, there is no way to ensure that operations performed in mixed precision will match those produced on a traditional load-store RISC/UNIX system which implements IEEE 64-bit and IEEE 32-bit registers and associated floating-point arithmetic instructions.

In contrast, arithmetic results produced on Intel Pentium 4/Xeon, AMD Opteron/Athlon64 or Intel EM64T processors will usually closely match or be identical to those produced on a traditional RISC/UNIX system if all scalar arithmetic is performed using SSE/SSE2 instructions. You should keep this in mind when porting applications to and from systems which support both x87 and full SSE/SSE2 floating-point arithmetic. Many subtle issues can arise which affect your numerical results, sometimes to several digits of accuracy.

Loop Unrolling using -Munroll

This optimization unrolls loops, executing multiple instances of the loop during each iteration.
This reduces branch overhead, and can improve execution speed by creating better opportunities for instruction scheduling. A loop with a constant count may be completely unrolled or partially unrolled. A loop with a non-constant count may also be unrolled. A candidate loop must be an innermost loop containing one to four blocks of code. The following shows the use of the -Munroll option:

    $ pgf95 -Munroll prog.f

The -Munroll option is included as part of -fast on all x86 and x64 targets. The loop unroller expands the contents of a loop and reduces the number of times a loop is executed. Branching overhead is reduced when a loop is unrolled two or more times, since each iteration of the unrolled loop corresponds to two or more iterations of the original loop; the number of branch instructions executed is proportionately reduced. When a loop is unrolled completely, the loop’s branch overhead is eliminated altogether.

Loop unrolling may be beneficial for the instruction scheduler. When a loop is completely unrolled or unrolled two or more times, opportunities for improved scheduling may be presented. The code generator can take advantage of more possibilities for instruction grouping or filling instruction delays found within the loop. Example 3.1 and Example 3.2 show the effect of code unrolling on a segment that computes a dot product.

Example 3.1. Dot Product Code

      REAL*4 A(100), B(100), Z
      INTEGER I
      DO I=1, 100
        Z = Z + A(i) * B(i)
      END DO
      END

Example 3.2. Unrolled Dot Product Code

      REAL*4 A(100), B(100), Z
      INTEGER I
      DO I=1, 100, 2
        Z = Z + A(i) * B(i)
        Z = Z + A(i+1) * B(i+1)
      END DO
      END

Using the -Minfo option, the compiler informs you when a loop is being unrolled.
For example, a message indicating the line number, and the number of times the code is unrolled, similar to the following will display when a loop is unrolled:

    dot:
        5, Loop unrolled 5 times

Using the c: and n: sub-options to -Munroll, or using -Mnounroll, you can control whether and how loops are unrolled on a file-by-file basis. Using directives or pragmas as specified in Chapter 6, “Using Directives and Pragmas”, you can precisely control whether and how a given loop is unrolled. Refer to Chapter 2, “Using Command Line Options”, for a detailed description of the -Munroll option.

Vectorization using -Mvect

The -Mvect option is included as part of -fast on all x86 and x64 targets. If your program contains computationally-intensive loops, the -Mvect option may be helpful. If in addition you specify -Minfo, and your code contains loops that can be vectorized, the compiler reports relevant information on the optimizations applied.

When a PGI compiler command is invoked with the -Mvect option, the vectorizer scans code searching for loops that are candidates for high-level transformations such as loop distribution, loop interchange, cache tiling, and idiom recognition (replacement of a recognizable code sequence, such as a reduction loop, with optimized code sequences or function calls). When the vectorizer finds vectorization opportunities, it internally rearranges or replaces sections of loops (the vectorizer changes the code generated; your source code’s loops are not altered). In addition to performing these loop transformations, the vectorizer produces extensive data dependence information for use by other phases of compilation and detects opportunities to use vector or packed Streaming SIMD Extensions (SSE) instructions on processors where these are supported.
The -Mvect option can speed up code which contains well-behaved countable loops which operate on large REAL, REAL*4, REAL*8, INTEGER*4, COMPLEX or COMPLEX DOUBLE arrays in Fortran and their C/C++ counterparts. However, it is possible that some codes will show a decrease in performance when compiled with -Mvect due to the generation of conditionally executed code segments, inability to determine data alignment, and other code generation factors. For this reason, it is recommended that you check carefully whether particular program units or loops show improved performance when compiled with this option enabled.

Vectorization Sub-options

The vectorizer performs high-level loop transformations on countable loops. A loop is countable if the number of iterations is set only before loop execution and cannot be modified during loop execution. Some of the vectorizer transformations can be controlled by arguments to the -Mvect command-line option. The following sections describe the arguments that affect the operation of the vectorizer. In addition, some of these vectorizer operations can be controlled from within code using directives and pragmas. For details on the use of directives and pragmas, refer to Chapter 6, “Using Directives and Pragmas,” on page 63.

The vectorizer performs the following operations:

• Loop interchange
• Loop splitting
• Loop fusion
• Memory-hierarchy (cache tiling) optimizations
• Generation of SSE instructions on processors where these are supported
• Generation of prefetch instructions on processors where these are supported
• Loop iteration peeling to maximize vector alignment
• Alternate code generation

By default, -Mvect without any sub-options is equivalent to:

    -Mvect=assoc,cachesize=c

where c is the actual cache size of the machine.

This enables the options for nested loop transformation and various other vectorizer options. These defaults may vary depending on the target system.
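To make the definition of a countable loop given above more concrete, the following C sketch contrasts a loop whose trip count is fixed on entry with one whose exit depends on data read inside the loop. The function names scale_countable and copy_until_negative are hypothetical and chosen only for illustration; the sketch makes no claim about what a particular PGI release reports or vectorizes for either loop.

```c
#include <stddef.h>

/* Countable: the trip count n is known before the loop starts,
   and nothing in the body modifies i or n. */
static void scale_countable(float *a, const float *b, size_t n, float s)
{
    for (size_t i = 0; i < n; i++)
        a[i] = s * b[i];
}

/* Not countable: the loop may exit early depending on the data,
   so the number of iterations is not known on entry.
   Returns the number of elements copied. */
static size_t copy_until_negative(float *a, const float *b, size_t n)
{
    size_t i = 0;
    while (i < n && b[i] >= 0.0f) {
        a[i] = b[i];
        i++;
    }
    return i;
}
```

Calling scale_countable on a six-element array always performs six iterations, whereas copy_until_negative stops at the first negative element, which is why only the first form fits the definition of a countable loop.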
Assoc Option

The option -Mvect=assoc instructs the vectorizer to perform associativity conversions that can change the results of a computation due to a round-off error (-Mvect=noassoc disables this option). For example, a typical optimization is to change one arithmetic operation to another arithmetic operation that is mathematically correct, but can be computationally different and generate faster code. This option is provided to enable or disable this transformation, since a round-off error for such associativity conversions may produce unacceptable results.

Cachesize Option

The option -Mvect=cachesize:n instructs the vectorizer to tile nested loop operations assuming a data cache size of n bytes. By default, the vectorizer attempts to tile nested loop operations, such as matrix multiply, using multi-dimensional strip-mining techniques to maximize re-use of items in the data cache.

SSE Option

The option -Mvect=sse instructs the vectorizer to automatically generate packed SSE (Streaming SIMD Extensions), SSE2, and prefetch instructions when vectorizable loops are encountered. SSE instructions, first introduced on Pentium III and AthlonXP processors, operate on single-precision floating-point data, and hence apply only to vectorizable loops that operate on single-precision floating-point data. SSE2 instructions, first introduced on Pentium 4, Xeon and Opteron processors, operate on double-precision floating-point data. Prefetch instructions, first introduced on Pentium III and AthlonXP processors, can be used to improve the performance of vectorizable loops that operate on either 32-bit or 64-bit floating-point data. Refer to Table 2, “Processor Options,” on page xxvi for a concise list of processors that support SSE, SSE2 and prefetch instructions.

Note
Program units compiled with -Mvect=sse will not execute on Pentium, Pentium Pro, Pentium II or first generation AMD Athlon processors.
They will only execute correctly on Pentium III, Pentium 4, Xeon, EM64T, AthlonXP, Athlon64 and Opteron systems running an SSE-enabled operating system.

Prefetch Option

The option -Mvect=prefetch instructs the vectorizer to automatically generate prefetch instructions when vectorizable loops are encountered, even in cases where SSE or SSE2 instructions are not generated. Usually, explicit prefetching is not necessary on Pentium 4, Xeon and Opteron because these processors support hardware prefetching; nonetheless, it sometimes can be worthwhile to experiment with explicit prefetching. Prefetching can be controlled on a loop-by-loop level using prefetch directives, which are described in detail in “Prefetch Directives,” on page 69.

Note
Program units compiled with -Mvect=prefetch will not execute correctly on Pentium, Pentium Pro, or Pentium II processors. They will execute correctly only on Pentium III, Pentium 4, Xeon, EM64T, AthlonXP, Athlon64 or Opteron systems. In addition, the prefetchw instruction is only supported on AthlonXP, Athlon64 or Opteron systems and can cause instruction faults on non-AMD processors. For this reason, the PGI compilers do not generate prefetchw instructions by default on any target.

In addition to these sub-options to -Mvect, several other sub-options are supported. Refer to the description of -M[no]vect in Chapter 15, “Command-Line Options Reference” for a detailed description of all available sub-options.

Vectorization Example Using SSE/SSE2 Instructions

One of the most important vectorization options is -Mvect=sse. When you use this option, the compiler automatically generates SSE and SSE2 instructions, where possible, when targeting processors on which these instructions are supported. This process can improve performance by up to a factor of two compared with the equivalent scalar code. All of the PGI Fortran, C and C++ compilers support this capability.
Table 2, “Processor Options,” on page xxvi shows which x86 and x64 processors support these instructions.

Prior to release 7.0, -Mvect=sse was omitted from the compiler switch -fast but included in -fastsse. Since release 7.0, -fast is synonymous with -fastsse and therefore includes -Mvect=sse.

In the program in Example 3.3, “Vector operation using SSE instructions”, the vectorizer recognizes the vector operation in subroutine 'loop' when either the compiler switch -Mvect=sse or -fast is used. This example shows the compilation, informational messages, and runtime results using the SSE instructions on an AMD Opteron processor-based system, along with issues that affect SSE performance.

First note that the arrays in Example 3.3 are single-precision and that the vector operation is done using a unit stride loop. Thus, this loop can potentially be vectorized using SSE instructions on any processor that supports SSE or SSE2 instructions. SSE operations can be used to operate on pairs of single-precision floating-point numbers, and do not apply to double-precision floating-point numbers. SSE2 instructions can be used to operate on quads of single-precision floating-point numbers or on pairs of double-precision floating-point numbers.

Loops vectorized using SSE or SSE2 instructions operate much more efficiently when processing vectors that are aligned to a cache-line boundary. You can cause unconstrained data objects of size 16 bytes or greater to be cache-aligned by compiling with the -Mcache_align switch. An unconstrained data object is a data object that is not a common block member and not a member of an aggregate data structure.

Note
For stack-based local variables to be properly aligned, the main program or function must be compiled with -Mcache_align.

The -Mcache_align switch has no effect on the alignment of Fortran allocatable or automatic arrays.
If you have arrays that are constrained, such as vectors that are members of Fortran common blocks, you must specifically pad your data structures to ensure proper cache alignment; -Mcache_align causes only the beginning address of each common block to be cache-aligned.

The following examples show the results of compiling the example code with and without -Mvect=sse.

Example 3.3. Vector operation using SSE instructions

      program vector_op
      parameter (N = 9999)
      real*4 x(N), y(N), z(N), W(N)
      do i = 1, n
         y(i) = i
         z(i) = 2*i
         w(i) = 4*i
      enddo
      do j = 1, 200000
         call loop(x,y,z,w,1.0e0,N)
      enddo
      print *, x(1),x(771),x(3618),x(6498),x(9999)
      end

      subroutine loop(a,b,c,d,s,n)
      integer i, n
      real*4 a(n), b(n), c(n), d(n),s
      do i = 1, n
         a(i) = b(i) + c(i) - s * d(i)
      enddo
      end

Assume the preceding program is compiled as follows, where -Mvect=nosse disables SSE vectorization:

    % pgf95 -fast -Mvect=nosse -Minfo vadd.f
    vector_op:
         4, Loop unrolled 4 times
    loop:
        18, Loop unrolled 4 times

The following output shows a sample result if the generated executable is run and timed on a standalone AMD Opteron 2.2 GHz system:

    % /bin/time vadd
    -1.000000 -771.000 -3618.000 -6498.00 -9999.00
    5.39user 0.00system 0:05.40elapsed 99%CP

Now, recompile with SSE vectorization enabled, and you see results similar to these:

    % pgf95 -fast -Minfo vadd.f -o vadd
    vector_op:
         4, Unrolled inner loop 8 times
            Loop unrolled 7 times (completely unrolled)
    loop:
        18, Generated 4 alternate loops for the inner loop
            Generated vector sse code for inner loop
            Generated 3 prefetch instructions for this loop

Notice the informational message for the loop at line 18.

• The first two lines of the message indicate that the loop has been vectorized, SSE instructions have been generated, and four alternate versions of the loop have also been generated. The loop count and alignments of the arrays determine which of these versions is executed.
• The last line of the informational message indicates that prefetch instructions have been generated for three loads to minimize latency of data transfers from main memory.

Executing again, you should see results similar to the following:

    % /bin/time vadd
    -1.000000 -771.000 -3618.00 -6498.00 -9999.0
    3.59user 0.00system 0:03.59elapsed 100%CPU

The result is a 50% speed-up over the equivalent scalar, that is, the non-SSE, version of the program. Speed-up realized by a given loop or program can vary widely based on a number of factors:

• When the vectors of data are resident in the data cache, performance improvement using vector SSE or SSE2 instructions is most effective.
• If data is aligned properly, performance will be better in general than when using vector SSE operations on unaligned data.
• If the compiler can guarantee that data is aligned properly, even more efficient sequences of SSE instructions can be generated.
• The efficiency of loops that operate on single-precision data can be higher. SSE2 vector instructions can operate on four single-precision elements concurrently, but only two double-precision elements.

Note
Compiling with -Mvect=sse can result in numerical differences from the executables generated with less optimization. Certain vectorizable operations, for example dot products, are sensitive to order of operations and the associative transformations necessary to enable vectorization (or parallelization).

Auto-Parallelization using -Mconcur

With the -Mconcur option the compiler scans code searching for loops that are candidates for auto-parallelization. -Mconcur must be used at both compile-time and link-time. When the parallelizer finds opportunities for auto-parallelization, it parallelizes loops and you are informed of the line or loop being parallelized if the -Minfo option is present on the compile line. See “-M Optimization Controls,” on page 229, for a complete specification of -Mconcur.
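The sensitivity to order of operations mentioned in the note above applies to reductions whether they are vectorized or parallelized. As a hedged illustration, the C sketch below sums the same array in strict serial order and in a four-lane order that mimics how a packed reduction might accumulate partial sums. The function names sum_serial and sum_four_lanes are hypothetical, and the code is plain scalar C, not output from any PGI compiler.

```c
#include <stddef.h>

/* Serial order: ((x[0] + x[1]) + x[2]) + ... */
static float sum_serial(const float *x, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Four-lane order: four independent partial sums, combined at the
   end, followed by a scalar loop for any leftover elements. */
static float sum_four_lanes(const float *x, size_t n)
{
    float lane[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        lane[0] += x[i];
        lane[1] += x[i + 1];
        lane[2] += x[i + 2];
        lane[3] += x[i + 3];
    }
    float s = (lane[0] + lane[1]) + (lane[2] + lane[3]);
    for (; i < n; i++)
        s += x[i];
    return s;
}
```

For well-scaled data the two functions agree. For an input such as {1.0e8f, 1.0f, -1.0e8f, 1.0f}, on a platform that evaluates float expressions in IEEE single precision, the serial order would be expected to return 1 while the four-lane order returns 0, because each small term is absorbed when added to a large one. Both results are legitimate roundings of different operation orders.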
A loop is considered parallelizable if it doesn't contain any cross-iteration data dependencies. Cross-iteration dependencies from reductions and expandable scalars are excluded from consideration, enabling more loops to be parallelizable. In general, loops with calls are not parallelized due to unknown side effects. Also, loops with low trip counts are not parallelized since the overhead in setting up and starting a parallel loop will likely outweigh the potential benefits. In addition, the default is to not parallelize innermost loops, since these often by definition are vectorizable using SSE instructions and it is seldom profitable to both vectorize and parallelize the same loop, especially on multi-core processors. Compiler switches and directives are available to let you override most of these restrictions on auto-parallelization.

Auto-parallelization Sub-options

The parallelizer performs various operations that can be controlled by arguments to the -Mconcur command-line option. The following sections describe the arguments that affect the operation of the parallelizer. In addition, these parallelizer operations can be controlled from within code using directives and pragmas. For details on the use of directives and pragmas, refer to Chapter 6, “Using Directives and Pragmas”.

By default, -Mconcur without any sub-options is equivalent to:

    -Mconcur=dist:block

This enables parallelization of loops with blocked iteration allocation across the available threads of execution. These defaults may vary depending on the target system.

Altcode Option

The option -Mconcur=altcode instructs the parallelizer to generate alternate serial code for parallelized loops. If altcode is specified without arguments, the parallelizer determines an appropriate cutoff length and generates serial code to be executed whenever the loop count is less than or equal to that length.
If altcode:n is specified, the serial altcode is executed whenever the loop count is less than or equal to n. If noaltcode is specified, no alternate serial code is generated.

Dist Option

The option -Mconcur=dist:{block|cyclic} specifies whether to assign loop iterations to the available threads in blocks or in a cyclic (round-robin) fashion. Block distribution is the default. If cyclic is specified, iterations are allocated to processors cyclically. That is, with three processors, processor 0 performs iterations 0, 3, 6, etc.; processor 1 performs iterations 1, 4, 7, etc.; and processor 2 performs iterations 2, 5, 8, etc.

Cncall Option

The option -Mconcur=cncall specifies that it is safe to parallelize loops that contain subroutine or function calls. By default, such loops are excluded from consideration for auto-parallelization. Also, no minimum loop count threshold must be satisfied before parallelization will occur, and last values of scalars are assumed to be safe.

The environment variable NCPUS is checked at runtime for a parallel program. If NCPUS is set to 1, a parallel program runs serially, but will use the parallel routines generated during compilation. If NCPUS is set to a value greater than 1, the specified number of processors will be used to execute the program. Setting NCPUS to a value exceeding the number of physical processors can produce inefficient execution. Executing a program on multiple processors in an environment where some of the processors are being time-shared with another executing job can also result in inefficient execution.

As with the vectorizer, the -Mconcur option can speed up code if it contains well-behaved countable loops and/or computationally intensive nested loops that operate on arrays.
However, it is possible that some codes will show a decrease in performance on multi-processor systems when compiled with -Mconcur due to parallelization overheads, memory bandwidth limitations in the target system, false-sharing of cache lines, or other architectural or code-generation factors. For this reason, it is recommended that you check carefully whether particular program units or loops show improved performance when compiled using this option.

If the compiler is not able to successfully auto-parallelize your application, you should refer to Chapter 5, “Using OpenMP”. It is possible that insertion of explicit parallelization directives or pragmas, and use of the -mp compiler option might enable the application to run in parallel.

Loops That Fail to Parallelize

In spite of the sophisticated analysis and transformations performed by the compiler, programmers will often note loops that are seemingly parallel, but are not parallelized. In this subsection, we look at some examples of common situations where parallelization does not occur.

Innermost Loops

As noted earlier in this chapter, the PGI compilers will not parallelize innermost loops by default, because it is usually not profitable. You can override this default using the command-line option -Mconcur=innermost.

Timing Loops

Often, loops will occur in programs that are similar to timing loops. The outer loop in the following example is one such loop.

      do j = 1, 2
        do i = 1, n
          a(i) = b(i) + c(i)
    1   enddo
      enddo

The outer loop above is not parallelized because the compiler detects a cross-iteration dependence in the assignment to a(i). Suppose the outer loop were parallelized. Then both processors would simultaneously attempt to make assignments into a(1:n). Now in general the values computed by each processor for a(1:n) will differ, so that simultaneous assignment into a(1:n) will produce values different from sequential execution of the loops.
In this example, values computed for a(1:n) don’t depend on j, so that simultaneous assignment by both processors will not yield incorrect results. However, it is beyond the scope of the compilers’ dependence analysis to determine that values computed in one iteration of a loop don’t differ from values computed in another iteration. So the worst case is assumed, and different iterations of the outer loop are assumed to compute different values for a(1:n). Is this assumption too pessimistic? If j doesn’t occur anywhere within a loop, the loop exists only to cause some delay, most probably to improve timing resolution. It is not usually valid to parallelize timing loops; to do so would distort the timing information for the inner loops.

Scalars

Quite often, scalars will inhibit parallelization of non-innermost loops. There are two separate cases that present problems. In the first case, scalars appear to be expandable, but appear in non-innermost loops, as in the following example.

      do j = 1, n
        x = b(j)
        do i = 1, n
          a(i,j) = x + c(i,j)
        enddo
      enddo

There are a number of technical problems to be resolved in order to recognize expandable scalars in non-innermost loops. Until this generalization occurs, scalars like x in the preceding code segment inhibit parallelization of loops in which they are assigned. In the following example, scalar k is not expandable, and it is not an accumulator for a reduction.

      k = 1
      do i = 1, n
        do j = 1, n
    1     a(j,i) = b(k) * x
        enddo
        k = i
    2   if (i .gt. n/2) k = n - (i - n/2)
      enddo

If the outer loop is parallelized, conflicting values are stored into k by the various processors. The variable k cannot be made local to each processor because the value of k must remain coherent among the processors. It is possible the loop could be parallelized if all assignments to k are placed in critical sections.
However, it is not clear where critical sections should be introduced because in general the value for k could depend on another scalar (or on k itself), and code to obtain the value of other scalars must reside in the same critical section.

In the example above, the assignment to k within a conditional at label 2 prevents k from being recognized as an induction variable. If the conditional statement at label 2 is removed, k would be an induction variable whose value varies linearly with i, and the loop could be parallelized.

Scalar Last Values

During parallelization, scalars within loops often need to be privatized; that is, each execution thread has its own independent copy of the scalar. Problems can arise if a privatized scalar is accessed outside the loop. For example, consider the following loop:

    for (i = 1; i < n; i++) {
      if ( x[i] > 5.0 )
        t = x[i];
    }
    v = t;

The value of t may not be computed on the last iteration of the loop. Normally, if a scalar is assigned within a loop and used following the loop, the PGI compilers save the last value of the scalar. However, if the loop is parallelized and the scalar is not assigned on every iteration, it may be difficult, without resorting to costly critical sections, to determine on what iteration t is last assigned. Analysis allows the compiler to determine that a scalar is assigned on each iteration and hence that the loop is safe to parallelize if the scalar is used later, as illustrated in the following example.

    for ( i = 1; i < n; i++) {
      if ( x[i] > 0.0 ) {
        t = 2.0;
      }
      else {
        t = 3.0;
        y[i] = ...t;
      }
    }
    v = t;

where t is assigned on every iteration of the loop. However, there are cases where a scalar may be privatizable, but if it is used after the loop, it is unsafe to parallelize. Examine the following loop in which each use of t within the loop is reached by a definition from the same iteration.

    for ( i = 1; i < N; i++ ){
      if( x[i] > 0.0 ){
        t = x[i];
        ...
        ...
        y[i] = ...t;
      }
    }
    v = t;

Here t is privatizable, but the use of t outside the loop may yield incorrect results, since the compiler may not be able to detect on which iteration of the parallelized loop t is last assigned. The compiler detects the previous cases. When a scalar is used after the loop but is not defined on every iteration of the loop, parallelization does not occur.

When the programmer knows that the scalar is assigned on the last iteration of the loop, the programmer may use a directive or pragma to let the compiler know the loop is safe to parallelize. The Fortran directive safe_lastval informs the compiler that, for a given loop, all scalars are assigned in the last iteration of the loop; thus, it is safe to parallelize the loop. We could add the following line to any of our previous examples.

    cpgi$l safe_lastval

The resulting code looks similar to this:

    cpgi$l safe_lastval
    ...
    for (i = 1; i < n; i++) {
      if ( x[i] > 5.0 )
        t = x[i];
    }
    v = t;

In addition, a command-line option, -Msafe_lastval, provides this information for all loops within the routines being compiled, which essentially provides global scope.

Processor-Specific Optimization and the Unified Binary

Different processors have differences, some subtle, in hardware features such as instruction sets and cache size. The compilers make architecture-specific decisions about things such as instruction selection, instruction scheduling, and vectorization. By default, the PGI compilers produce code specifically targeted to the type of processor on which the compilation is performed. That is, the default is to use all supported instructions wherever possible when compiling on a given system. As a result, executables created on a given system may not be usable on previous generation systems. For example, executables created on a Pentium 4 may fail to execute on a Pentium III or Pentium II.
All PGI compilers can generate unified binaries, which provide a low-overhead means of producing a single executable that is compatible with, and performs well on, more than one hardware platform.

You can use the -tp option to control compilation behavior by specifying the processor or processors with which the generated code is compatible. The compilers generate multiple binary code streams, each optimized for a specific platform, and combine them into one executable. At runtime, the one executable senses the environment and dynamically selects the appropriate code stream. For specific information on the -tp option, see -tp [,target...].

Executable size is automatically controlled via unified binary culling. Only those functions and subroutines where the target affects the generated code have unique binary images, resulting in a code-size savings of 10% to 90% compared to generating full copies of code for each target.

Programs can use PGI Unified Binary even if not all of the object files and libraries are compiled as unified binaries. Like any other object file, you can use PGI Unified Binary object files to create programs or libraries. No special start-up code is needed; support is linked in from the PGI libraries.

The -Mpfi option disables generation of PGI Unified Binary. Instead, the default target auto-detect rules for the host are used to select the target processor.

Interprocedural Analysis and Optimization using -Mipa

The PGI Fortran, C and C++ compilers use interprocedural analysis (IPA) that results in minimal changes to makefiles and the standard edit-build-run application development cycle. Other than adding -Mipa to the command line, no other changes are required. For reference and background, the process of building a program without IPA is described below, followed by the minor modifications required to use IPA with the PGI compilers.
While the PGCC compiler is used here to show how IPA works, similar capabilities apply to each of the PGI Fortran, C and C++ compilers.

Note
The examples use Linux file naming conventions. On Windows, '.o' files would be '.obj' files, and 'a.out' files would be '.exe' files.

Building a Program Without IPA - Single Step

Using the pgcc command-level compiler driver, multiple source files can be compiled and linked into a single executable with one command. The following example compiles and links three source files:

    % pgcc -o a.out file1.c file2.c file3.c

In actuality, the pgcc driver executes several steps to produce the assembly code and object files corresponding to each source file, and subsequently to link the object files together into a single executable file. Thus, the command above is roughly equivalent to the following commands performed individually:

    % pgcc -S -o file1.s file1.c
    % as -o file1.o file1.s
    % pgcc -S -o file2.s file2.c
    % as -o file2.o file2.s
    % pgcc -S -o file3.s file3.c
    % as -o file3.o file3.s
    % pgcc -o a.out file1.o file2.o file3.o

If any of the three source files is edited, the executable can be rebuilt with the same command line:

    % pgcc -o a.out file1.c file2.c file3.c

This always works as intended, but has the side effect of recompiling all of the source files, even if only one has changed. For applications with a large number of source files, this can be time-consuming and inefficient.

Building a Program Without IPA - Several Steps

It is also possible to use individual pgcc commands to compile each source file into a corresponding object file, and one to link the resulting object files into an executable:

    % pgcc -c file1.c
    % pgcc -c file2.c
    % pgcc -c file3.c
    % pgcc -o a.out file1.o file2.o file3.o

The pgcc driver invokes the compiler and assembler as required to process each source file, and invokes the linker for the final link command.
If you modify one of the source files, the executable can be rebuilt by compiling just that file and then relinking:

    % pgcc -c file1.c
    % pgcc -o a.out file1.o file2.o file3.o

Building a Program Without IPA Using Make

The program compilation and linking process can be simplified greatly using the make utility on systems where it is supported. Suppose you create a makefile containing the following lines:

    a.out: file1.o file2.o file3.o
            pgcc $(OPT) -o a.out file1.o file2.o file3.o
    file1.o: file1.c
            pgcc $(OPT) -c file1.c
    file2.o: file2.c
            pgcc $(OPT) -c file2.c
    file3.o: file3.c
            pgcc $(OPT) -c file3.c

It is then possible to type a single make command:

    % make

The make utility determines which object files are out of date with respect to their corresponding source files, and invokes the compiler to recompile only those source files and to relink the executable. If you subsequently edit one or more source files, the executable can be rebuilt with the minimum number of recompilations using the same single make command.

Building a Program with IPA

Interprocedural analysis and optimization (IPA) by the PGI compilers alters the standard and make utility command-level interfaces as little as possible. IPA occurs in three phases:

• Collection: Create a summary of each function or procedure, collecting the useful information for interprocedural optimizations. This is done during the compile step if the -Mipa switch is present on the command line; summary information is collected and stored in the object file.
• Propagation: Process all the object files to propagate the interprocedural summary information across function and file boundaries. This is done during the link step, when all the object files are combined, if the -Mipa switch is present on the link command line.
• Recompile/Optimization: Recompile each of the object files with the propagated interprocedural information, producing a specialized object file.
This process is also done during the link step when the -Mipa switch is present on the link command line.

When linking with -Mipa, the PGI compilers automatically regenerate IPA-optimized versions of each object file, essentially recompiling each file. If there are IPA-optimized objects from a previous build, the compilers minimize the recompile time by reusing those objects if they are still valid. An IPA-optimized object is still valid if it is newer than the original object file and the propagated IPA information for that file has not changed since it was optimized.

After each object file has been recompiled, the regular linker is invoked to build the application with the IPA-optimized object files. The IPA-optimized object files are saved in the same directory as the original object files, for use in subsequent program builds.

Building a Program with IPA - Single Step

By adding the -Mipa command-line switch, several source files can be compiled and linked with interprocedural optimizations with one command:

    % pgcc -Mipa=fast -o a.out file1.c file2.c file3.c

Just like compiling without -Mipa, the driver executes several steps to produce the assembly and object files to create the executable:

    % pgcc -Mipa=fast -S -o file1.s file1.c
    % as -o file1.o file1.s
    % pgcc -Mipa=fast -S -o file2.s file2.c
    % as -o file2.o file2.s
    % pgcc -Mipa=fast -S -o file3.s file3.c
    % as -o file3.o file3.s
    % pgcc -Mipa=fast -o a.out file1.o file2.o file3.o

In the last step, an IPA linker is invoked to read all the IPA summary information and perform the interprocedural propagation. The IPA linker reinvokes the compiler on each of the object files to recompile them with interprocedural information. This creates three new objects with mangled names:

    file1_ipa5_a.out.oo.o, file2_ipa5_a.out.oo.o, file3_ipa5_a.out.oo.o

The system linker is then invoked to link these IPA-optimized objects into the final executable.
Later, if one of the three source files is edited, the executable can be rebuilt with the same command line:

    % pgcc -Mipa=fast -o a.out file1.c file2.c file3.c

This will work, but again has the side effect of compiling each source file, and recompiling each object file at link time.

Building a Program with IPA - Several Steps

Just by adding the -Mipa command-line switch, it is possible to use individual pgcc commands to compile each source file, followed by a command to link the resulting object files into an executable:

    % pgcc -Mipa=fast -c file1.c
    % pgcc -Mipa=fast -c file2.c
    % pgcc -Mipa=fast -c file3.c
    % pgcc -Mipa=fast -o a.out file1.o file2.o file3.o

The pgcc driver invokes the compiler and assembler as required to process each source file, and invokes the IPA linker for the final link command. If you modify one of the source files, the executable can be rebuilt by compiling just that file and then relinking:

    % pgcc -Mipa=fast -c file1.c
    % pgcc -Mipa=fast -o a.out file1.o file2.o file3.o

When the IPA linker is invoked, it determines that the IPA-optimized object for file1.o (file1_ipa5_a.out.oo.o) is stale, since it is older than the object file1.o, and reinvokes the compiler to rebuild it. In addition, depending on the nature of the changes to the source file file1.c, the interprocedural optimizations previously performed for file2 and file3 may now be inaccurate. For instance, IPA may have propagated a constant argument value in a call from a function in file1.c to a function in file2.c; if the value of the argument has changed, any optimizations based on that constant value are invalid. The IPA linker determines which, if any, of the previously created IPA-optimized objects need to be regenerated, and reinvokes the compiler as appropriate to regenerate them.
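The constant-argument case mentioned above can be pictured with a small C sketch. It is only an illustration under stated assumptions: the function names are hypothetical, everything is kept in one file for brevity, and scale_by_2 is a hand-written stand-in for the kind of specialization an interprocedural optimizer might derive, not actual PGI compiler output.

```c
#include <assert.h>

/* Would live in file2.c: a general routine with a run-time scale factor. */
double scale(double x, double factor)
{
    return x * factor;
}

/* Would live in file1.c: every call site passes the constant 2.0. */
double twice_sum(double a, double b)
{
    return scale(a, 2.0) + scale(b, 2.0);
}

/* Hypothetical specialization of scale() that is only correct while
 * every caller passes factor == 2.0. */
double scale_by_2(double x)
{
    return x * 2.0;
}

/* The specialization agrees with the general routine for factor 2.0.
 * If file1.c were edited to pass 3.0 instead, this would no longer hold,
 * which is why the optimized object for file2 would need regenerating. */
void check_specialization(void)
{
    assert(scale(5.0, 2.0) == scale_by_2(5.0));
}
```

Calling check_specialization() succeeds as written; changing the constant at the call sites without regenerating the specialized routine is the situation the IPA linker is described as detecting.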
Only those objects that are stale or which have new or different IPA information are regenerated, which saves compile time.

Building a Program with IPA Using Make

As in the previous two sections, programs can be built with IPA using the make utility, just by adding the -Mipa command-line switch:

    OPT=-Mipa=fast
    a.out: file1.o file2.o file3.o
            pgcc $(OPT) -o a.out file1.o file2.o file3.o
    file1.o: file1.c
            pgcc $(OPT) -c file1.c
    file2.o: file2.c
            pgcc $(OPT) -c file2.c
    file3.o: file3.c
            pgcc $(OPT) -c file3.c

Using the single make command invokes the compiler to generate any object files that are out of date, then invokes pgcc to link the objects into the executable; at link time, pgcc calls the IPA linker to regenerate any stale or invalid IPA-optimized objects.

    % make

Questions about IPA

1. Why is the object file so large?
An object file created with -Mipa contains several additional sections. One is the summary information used to drive the interprocedural analysis. In addition, the object file contains the compiler internal representation of the source file, so the file can be recompiled at link time with interprocedural optimizations. There may be additional information when inlining is enabled. The total size of the object file may be 5-10 times its original size. The extra sections are not added to the final executable.

2. What if I compile with -Mipa and link without -Mipa?
The PGI compilers generate a legal object file, even when the source file is compiled with -Mipa. If you compile with -Mipa and link without -Mipa, the linker is invoked on the original object files. A legal executable is generated; while it does not have the benefit of interprocedural optimizations, any other optimizations still apply.

3. What if I compile without -Mipa and link with -Mipa?
At link time, the IPA linker must have summary information about all the functions or routines used in the program. This information is created only when a file is compiled with -Mipa.
If you compile a file without -Mipa and then try to get interprocedural optimizations by linking with -Mipa, the IPA linker issues a message that some routines have no IPA summary information, and proceeds to run the system linker using the original object files. If some files were compiled with -Mipa and others were not, it determines the safest approximation of the IPA summary information for those files not compiled with -Mipa, and uses that to recompile the other files using interprocedural optimizations.

4. Can I build multiple applications in the same directory with -Mipa?
Yes. Suppose you have three source files: main1.c, main2.c, and sub.c, where sub.c is shared between the two applications. Suppose you build the first application with -Mipa, using this command:

    % pgcc -Mipa=fast -o app1 main1.c sub.c

The IPA linker creates two IPA-optimized object files:

    main1_ipa4_app1.oo sub_ipa4_app1.oo

It uses them to build the first application. Now suppose you build the second application using this command:

    % pgcc -Mipa=fast -o app2 main2.c sub.c

The IPA linker creates two more IPA-optimized object files:

    main2_ipa4_app2.oo sub_ipa4_app2.oo

Note
There are now three object files for sub.c: the original sub.o, and two IPA-optimized objects, one for each application in which it appears.

5. How is the mangled name for the IPA-optimized object files generated?
The mangled name has '_ipa' appended, followed by the decimal number of the length of the executable file name, followed by an underscore and the executable file name itself. The suffix is changed to .oo (on Linux) or .oobj (on Windows) so linking *.o or *.obj does not pull in the IPA-optimized objects. If the IPA linker determines that the file would not benefit from any interprocedural optimizations, it does not have to recompile the file at link time and uses the original object.
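The naming rule in question 5 can be written out as a short C sketch. This is a minimal illustration of the rule as stated, not code from the PGI toolchain: the function name ipa_object_name is hypothetical, only the Linux .oo suffix is handled, and the input is assumed to be a plain file name without a directory part.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build "<base>_ipa<len(exe)>_<exe>.oo" from an object file name and the
 * name of the executable being linked.
 * Returns 0 on success, -1 if the result does not fit in out. */
int ipa_object_name(const char *object_name, const char *exe_name,
                    char *out, size_t out_size)
{
    assert(object_name != NULL && exe_name != NULL && out != NULL);

    /* Drop the final suffix (for example ".o"), if there is one. */
    const char *dot = strrchr(object_name, '.');
    size_t base_len = dot ? (size_t)(dot - object_name) : strlen(object_name);

    int n = snprintf(out, out_size, "%.*s_ipa%zu_%s.oo",
                     (int)base_len, object_name,
                     strlen(exe_name), exe_name);
    return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}
```

For the inputs "sub.o" and "app1" this produces sub_ipa4_app1.oo, which matches the file names listed in question 4.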
Profile-Feedback Optimization using -Mpfi/-Mpfo

The PGI compilers support many common profile-feedback optimizations, including semi-invariant value optimizations and block placement. These are performed under control of the -Mpfi/-Mpfo command-line options.

When invoked with the -Mpfi option, the PGI compilers instrument the generated executable for collection of profile and data feedback information. This information can be used in subsequent compilations that include the -Mpfo optimization option. -Mpfi must be used at both compile time and link time. Programs compiled with -Mpfi include extra code to collect run-time statistics and write them out to a trace file. When the resulting program is executed, a profile feedback trace file pgfi.out is generated in the current working directory.

Note
Programs compiled and linked with -Mpfi execute more slowly due to the instrumentation and data collection overhead. You should use executables compiled with -Mpfi only for execution of training runs.

When invoked with the -Mpfo option, the PGI compilers use data from a pgfi.out profile feedback trace file to enable or enhance certain performance optimizations. Use of this option requires the presence of a pgfi.out trace file in the current working directory.

Default Optimization Levels

The following table shows the interaction between the -O<level>, -g, and -M<opt> options. In the table, level can be 0, 1, 2, 3 or 4, and <opt> can be vect, concur, unroll or ipa. The default optimization level is dependent upon these command-line options.

Table 3.1. Optimization and -O, -g and -M<opt> Options

    Optimize Option    Debug Option    -M<opt> Option    Optimization Level
    none               none            none              1
    none               none            -M<opt>           2
    none               -g              none              0
    -O                 none or -g      none              2
    -Olevel            none or -g      none              level
    -Olevel <= 2       none or -g      -M<opt>           2

Unoptimized code compiled using the option -O0 can be significantly slower than code generated at other optimization levels.
The -M<opt> option, where <opt> is vect, concur, unroll or ipa, sets the optimization level to 2 if no -O options are supplied. The -fast and -fastsse options set the optimization level to a target-dependent optimization level if no -O options are supplied.

Local Optimization Using Directives and Pragmas

Command-line options let you specify optimizations for an entire source file. Directives supplied within a Fortran source file and pragmas supplied within a C or C++ source file provide information to the compiler and alter the effects of certain command-line options or the default behavior of the compiler. (Many directives have a corresponding command-line option.)

While a command-line option affects the entire source file that is being compiled, directives and pragmas let you do the following:

• Apply, or disable, the effects of a particular command-line option to selected subprograms or to selected loops in the source file (for example, an optimization).
• Globally override command-line options.
• Tune selected routines or loops based on your knowledge or on information obtained through profiling.

Chapter 6, "Using Directives and Pragmas" provides details on how to add directives and pragmas to your source files.

Execution Timing and Instruction Counting

As this chapter shows, once you have a program that compiles, executes and gives correct results, you may optimize your code for execution efficiency. Selecting the correct optimization level requires some thought and may require that you compare several optimization levels before arriving at the best solution. To compare optimization levels, you need to measure the execution time for your program. There are several approaches you can take for timing execution. You can use shell commands that provide execution time statistics, you can include function calls in your code that provide timing information, or you can profile sections of code.
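As a rough illustration of the second approach (timing calls placed in the code itself), the following C sketch brackets a workload with the standard clock() function from <time.h>. The workload do_work and the wrapper time_work are hypothetical names introduced only for this example; clock() reports processor time, so it is one possible choice rather than a PGI-specific timing routine.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical workload: sum the integers 1..n in floating point. */
static double do_work(long n)
{
    double sum = 0.0;
    for (long i = 1; i <= n; i++)
        sum += (double)i;
    return sum;
}

/* Run the workload once and return the processor time it used, in seconds.
 * The workload's result is stored in *result so the call is not optimized
 * away and can be checked by the caller. */
double time_work(long n, double *result)
{
    assert(result != NULL);

    clock_t start = clock();
    *result = do_work(n);
    clock_t stop = clock();

    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Building the same source at different optimization levels and comparing the values returned by time_work for a sufficiently large n gives a simple, if coarse, comparison; very short runs may fall below the resolution of clock().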
Timing functions available with the PGI compilers include 3F timing routines, the SECNDS pre-declared function in PGF77 or PGF95, or the SYSTEM_CLOCK or CPU_CLOCK intrinsics in PGF95 or PGHPF. In general, when timing a program, you should try to eliminate or reduce the amount of system-level activity such as program loading, I/O and task switching.

The following example shows a fragment that indicates how to use SYSTEM_CLOCK effectively within an F90/F95 or HPF program unit.

Example 3.4. Using SYSTEM_CLOCK code fragment

    . . .
    integer :: nprocs, hz, clock0, clock1
    real :: time
    integer, allocatable :: t(:)
    !hpf$ distribute t(cyclic)
    #if defined (HPF)
    allocate (t(number_of_processors()))
    #elif defined (_OPENMP)
    allocate (t(OMP_GET_NUM_THREADS()))
    #else
    allocate (t(1))
    #endif
    call system_clock (count_rate=hz)
    !
    call system_clock(count=clock0)
    < do work>
    call system_clock(count=clock1)
    !
    t = (clock1 - clock0)
    time = real (sum(t)) / (real(hz) * size(t))
    . . .

Portability of Multi-Threaded Programs on Linux

PGI has created two libraries - libpgbind and libnuma - to handle the variations between various implementations of Linux.

Some older versions of Linux are lacking certain features that support multi-processor and multi-core systems, in particular, the system call 'sched_setaffinity' and the numa library libnuma. The PGI run-time library uses these features to implement some -Mconcur and -mp operations.

These variations have led to the creation of two PGI libraries, libpgbind and libnuma. These libraries are used on all 32-bit and 64-bit Linux systems. These libraries are not needed on Windows.

When a program is linked with the system libnuma library, the program depends on the libnuma library in order to run. On systems without a system libnuma library, the PGI version of libnuma provides the required stubs so that the program links and executes properly.
If the program is linked with libpgbind and libnuma, the differences between systems are masked by the different versions of libpgbind and libnuma. In particular, PGI provides two versions of libpgbind - one for systems with working support for sched_setaffinity and another for systems that do not have it.

When a program is deployed to the target system, the proper set of libraries, real or stub, should be deployed with the program. This facility requires that the program be dynamically linked with libpgbind and libnuma.

libpgbind

On some versions of Linux, the system call sched_setaffinity does not exist or does not work. The library libpgbind is used to work around this problem. During installation, a small test program is compiled, linked, and executed. If the test program compiles, links, and executes successfully, the installed version of libpgbind calls the system sched_setaffinity; otherwise the stub version is installed.

libnuma

Not all systems have libnuma. Typically, only numa systems have this library. PGI supplies a stub version of libnuma which satisfies the calls from the PGI runtime to libnuma. Note that libnuma is a shared library that is linked dynamically at runtime.

The reason to have a numa library on all systems is to allow multi-threaded programs (e.g. compiled with -Mconcur or -mp) to be compiled, linked, and executed without regard to whether the host or target system has a numa library. When the numa library is not available, a multi-threaded program still runs because the calls to the numa library are satisfied by the PGI stub library.

During installation, the installation procedure checks for the existence of a real libnuma among the system libraries. If the real library is not found, the PGI stub version is substituted.

Chapter 4. Using Function Inlining

Function inlining replaces a call to a function or a subroutine with the body of the function or subroutine.
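The transformation can be pictured with a small hand-written C sketch. The function names are hypothetical, and sum_of_squares_inlined is only an approximation, written by hand, of what a compiler could produce after inlining; it is not output from the PGI inliner.

```c
#include <assert.h>

/* A small function that is a natural inlining candidate. */
static double square(double x)
{
    return x * x;
}

/* As written in the source: two calls to square(). */
double sum_of_squares(double a, double b)
{
    return square(a) + square(b);
}

/* Roughly what the caller looks like once both calls are replaced by the
 * body of square(): no calls, no argument passing, no returns. */
double sum_of_squares_inlined(double a, double b)
{
    return a * a + b * b;
}

/* Both forms compute the same value. */
void check_inlining_equivalence(void)
{
    assert(sum_of_squares(3.0, 4.0) == sum_of_squares_inlined(3.0, 4.0));
}
```

With the call removed, the multiplications sit directly in the caller, where they can be optimized together with the surrounding code.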
This can speed up execution by eliminating parameter passing and function/subroutine call and return overhead. It also allows the compiler to optimize the function with the rest of the code. Note that using function inlining indiscriminately can result in much larger code size and no increase in execution speed.

The PGI compilers provide two categories of inlining:

• Automatic inlining - During the compilation process, a hidden pass precedes the compilation pass. This hidden pass extracts functions that are candidates for inlining. The inlining of functions occurs as the source files are compiled.
• Inline libraries - You create inline libraries, for example using the pgf95 compiler driver and the -Mextract and -o options. There is no hidden extract pass, but you must ensure that any files that depend on the inline library use the latest version of the inline library.

There are important restrictions on inlining. Inlining only applies to certain types of functions. Refer to "Restrictions on Inlining" for more details on function inlining limitations.

This chapter describes how to use the following options related to function inlining:

    -Mextract
    -Minline
    -Mrecursive

Invoking Function Inlining

To invoke the function inliner, use the -Minline option. If you do not specify an inline library, the compiler performs a special prepass on all source files named on the compiler command line before it compiles any of them. This pass extracts functions that meet the requirements for inlining and puts them in a temporary inline library for use by the compilation pass.

Several -Minline suboptions let you determine the selection criteria for functions to be inlined. These suboptions include:

except:func
    Inlines all eligible functions except func, a function in the source text. You can use a comma-separated list to specify multiple functions.

[name:]func
    Inlines all functions in the source text whose name matches func.
    You can use a comma-separated list to specify multiple functions.

[size:]n
    Inlines functions with a statement count less than or equal to n, the specified size.

    Note
    The size n may not exactly equal the number of statements in a selected function; the size parameter is merely a rough gauge.

levels:n
    Inlines n levels of function calls. The default number is one (1). Using a level greater than one indicates that function calls within inlined functions may be replaced with inlined code. This approach allows the function inliner to automatically perform a sequence of inline and extract processes.

[lib:]file.ext
    Instructs the inliner to inline the functions within the library file file.ext. If no inline library is specified, functions are extracted from a temporary library created during an extract prepass.

    Tip
    Create the library file using the -Mextract option.

If you specify both a function name and a size n, the compiler inlines functions that match the function name or have n or fewer statements.

If a name is used without a keyword, then a name with a period is assumed to be an inline library and a name without a period is assumed to be a function name. If a number is used without a keyword, the number is assumed to be a size.

In the following example, the compiler inlines functions with fewer than approximately 100 statements in the source file myprog.f and writes the executable code in the default output file a.out.

    $ pgf95 -Minline=size:100 myprog.f

Refer to "-M Options by Category" for more information on the -Minline options.

Using an Inline Library

If you specify one or more inline libraries on the command line with the -Minline option, the compiler does not perform an initial extract pass. The compiler selects functions to inline from the specified inline library.
If you also specify a size or function name, all functions in the inline library meeting the selection criteria are selected for inline expansion at points in the source text where they are called.

If you do not specify a function name or a size limitation for the -Minline option, the compiler inlines every function in the inline library that matches a function in the source text.

In the following example, the compiler inlines the function proc from the inline library lib.il and writes the executable code in the default output file a.out.

    $ pgf95 -Minline=name:proc,lib:lib.il myprog.f

The following command line is equivalent to the preceding one, except that it does not use the keywords name: and lib:. You typically use keywords to avoid name conflicts when you use an inline library name that does not contain a period. Otherwise, without the keywords, a period informs the compiler that the file on the command line is an inline library.

    $ pgf95 -Minline=proc,lib.il myprog.f

Creating an Inline Library

You can create or update an inline library using the -Mextract command-line option. If you do not specify selection criteria with the -Mextract option, the compiler attempts to extract all subprograms.

Several -Mextract options let you determine the selection criteria for creating or updating an inline library. These selection criteria include:

func
    Extracts the function func. You can use a comma-separated list to specify multiple functions.

[name:]func
    Extracts the functions whose name matches func, a function in the source text.

[size:]n
    Limits the size of the extracted functions to functions with a statement count less than or equal to n, the specified size.

    Note
    The size n may not exactly equal the number of statements in a selected function; the size parameter is merely a rough gauge.

[lib:]ext.lib
    Stores the extracted information in the library directory ext.lib.
    If no inline library is specified, functions are extracted to a temporary library created during an extract prepass for use during the compilation stage.

When you use the -Mextract option, only the extract phase is performed; the compile and link phases are not performed. The output of an extract pass is a library of functions available for inlining. This output is placed in the inline library file specified on the command line with the -o filename specification. If the library file exists, new information is appended to it. If the file does not exist, it is created. You can use a command similar to the following:

    $ pgf95 -Mextract=lib:lib.il myfunc.f

You can use the -Minline option with the -Mextract option. In this case, the extracted library of functions can have other functions inlined into the library. Using both options enables you to obtain more than one level of inlining. In this situation, if you do not specify a library with the -Minline option, the inline process consists of two extract passes. The first pass is a hidden pass implied by the -Minline option, during which the compiler extracts functions and places them into a temporary library. The second pass uses the results of the first pass but puts its results into the library that you specify with the -o option.

Working with Inline Libraries

An inline library is implemented as a directory, with each inline function in the library stored as a file using an encoded form of the inlinable function.

A special file named TOC in the inline library directory serves as a table of contents for the inline library. This is a printable ASCII file which can be examined to find out information about the library contents, such as names and sizes of functions, the source file from which they were extracted, the version number of the extractor which created the entry, etc.

Libraries and their elements can be manipulated using ordinary system commands.

• Inline libraries can be copied or renamed.
• Elements of libraries can be deleted or copied from one library to another.
• The ls or dir command can be used to determine the last-change date of a library entry.

Dependencies

When a library is created or updated using one of the PGI compilers, the last-change date of the library directory is updated. This allows a library to be listed as a dependence in a makefile or a PVF property and ensures that the necessary compilations are performed when a library is changed.

Updating Inline Libraries - Makefiles

If you use inline libraries you need to be certain that they remain up to date with the source files into which they are inlined. One way to ensure inline libraries are updated is to include them in a makefile.

The makefile fragment in the following example assumes the file utils.f contains a number of small functions used in the files parser.f and alloc.f. The makefile also maintains the inline library utils.il. The makefile updates the library whenever you change utils.f or one of the include files it uses. In turn, the makefile compiles parser.f and alloc.f whenever you update the library.

Example 4.1. Sample Makefile

    SRC = mydir
    FC = pgf95
    FFLAGS = -O2
    main.o: $(SRC)/main.f $(SRC)/global.h
            $(FC) $(FFLAGS) -c $(SRC)/main.f
    utils.o: $(SRC)/utils.f $(SRC)/global.h $(SRC)/utils.h
            $(FC) $(FFLAGS) -c $(SRC)/utils.f
    utils.il: $(SRC)/utils.f $(SRC)/global.h $(SRC)/utils.h
            $(FC) $(FFLAGS) -Mextract=15 -o utils.il utils.f
    parser.o: $(SRC)/parser.f $(SRC)/global.h utils.il
            $(FC) $(FFLAGS) -Minline=utils.il -c $(SRC)/parser.f
    alloc.o: $(SRC)/alloc.f $(SRC)/global.h utils.il
            $(FC) $(FFLAGS) -Minline=utils.il -c $(SRC)/alloc.f
    myprog: main.o utils.o parser.o alloc.o
            $(FC) -o myprog main.o utils.o parser.o alloc.o

Error Detection during Inlining

To request inlining information from the compiler when you invoke the inliner, specify the -Minfo=inline option.
For example:

    $ pgf95 -Minline=mylib.il -Minfo=inline myext.f

Examples

Assume the program dhry consists of a single source file dhry.f. The following command line builds an executable file for dhry in which proc7 is inlined wherever it is called:

    $ pgf95 dhry.f -Minline=proc7

The following command lines build an executable file for dhry in which proc7 plus any functions of approximately 10 or fewer statements are inlined (one level only).

Note
The specified functions are inlined only if they are previously placed in the inline library, temp.il, during the extract phase.

    $ pgf95 dhry.f -Mextract=lib:temp.il
    $ pgf95 dhry.f -Minline=10,proc7,temp.il

Using the same source file dhry.f, the following example builds an executable for dhry in which all functions of roughly ten or fewer statements are inlined. Two levels of inlining are performed. This means that if function A calls function B, and B calls C, and both B and C are inlinable, then the version of B which is inlined into A will have had C inlined into it.

    $ pgf95 dhry.f -Minline=size:10,levels:2

Restrictions on Inlining

The following Fortran subprograms cannot be extracted:

• Main or BLOCK DATA programs.
• Subprograms containing alternate return, assigned GO TO, DATA, SAVE, or EQUIVALENCE statements.
• Subprograms containing FORMAT statements.
• Subprograms containing multiple entries.

A Fortran subprogram is not inlined if any of the following applies:

• It is referenced in a statement function.
• A common block mismatch exists; in other words, the caller must contain all common blocks specified in the callee, and elements of the common blocks must agree in name, order, and type (except that the caller's common block can have additional members appended to the end of the common block).
• An argument mismatch exists; in other words, the number and type (size) of actual and formal parameters must be equal.
• A name clash exists, such as a call to subroutine xyz in the extracted subprogram and a variable named xyz in the caller.

The following types of C and C++ functions cannot be inlined:

• Functions containing switch statements
• Functions that reference a static variable whose definition is nested within the function
• Functions that accept a variable number of arguments

Certain C/C++ functions can only be inlined into the file that contains their definition:

• Static functions
• Functions that call a static function
• Functions that reference a static variable

Chapter 5. Using OpenMP

The PGF77 and PGF95 Fortran compilers support the OpenMP Fortran Application Program Interface. The PGCC ANSI C and C++ compilers support the OpenMP C/C++ Application Program Interface.

The OpenMP shared-memory parallel programming model is defined by a collection of compiler directives or pragmas, library routines, and environment variables that can be used to specify shared-memory parallelism in Fortran, C, and C++ programs. The Fortran directives and C/C++ pragmas include a parallel region construct for writing coarse-grain SPMD programs, work-sharing constructs that specify that DO loop iterations or C/C++ for loop iterations should be split among the available threads of execution, and synchronization constructs. The data environment is controlled either by using clauses on the directives or pragmas, or with additional directives or pragmas. Run-time library routines are provided to query the parallel run-time environment, for example to determine how many threads are participating in execution of a parallel region. Finally, environment variables are provided to control the execution behavior of parallel programs. For more information on OpenMP, see www.openmp.org.
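As a rough illustration of the constructs just described, the following C sketch combines a parallel region, a work-sharing for loop with a reduction clause, and a run-time library query. The function names sum_squares and team_size are hypothetical and are not part of the PGI documentation. The _OPENMP guard is an assumption made here so that the code also compiles serially when OpenMP recognition (-mp with the PGI compilers) is not enabled.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>   /* OpenMP run-time library declarations */
#endif

/* Hypothetical helper: returns the sum of i*i for i in [0, n). */
long sum_squares(int n)
{
    long total = 0;
    int i;

    assert(n >= 0);

    /* Work-sharing loop in a parallel region: iterations are split among
       the threads, and reduction(+:total) gives each thread a private
       partial sum that is combined when the loop ends. */
#pragma omp parallel for reduction(+:total)
    for (i = 0; i < n; i++)
        total += (long)i * i;

    return total;
}

/* Hypothetical helper: returns the team size seen inside a parallel
   region, or 1 when compiled without OpenMP. */
int team_size(void)
{
    int size = 1;
#ifdef _OPENMP
#pragma omp parallel
    {
        /* Only the master thread writes the shared variable. */
#pragma omp master
        size = omp_get_num_threads();
    }
#endif
    return size;
}
```

Calling sum_squares(4) adds 0, 1, 4, and 9 regardless of how many threads share the loop. The value returned by team_size() depends on the environment, for example on OMP_NUM_THREADS.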
Fortran directives and C/C++ pragmas allow users to place hints in the source code to help the compiler generate better assembly code. You typically use directives and pragmas to control the actions of the compiler in a particular portion of a program without affecting the program as a whole. You place them in your source code where you want them to take effect. Typically they stay in effect from the point where they are included until the end of the compilation unit or until another directive or pragma changes their status.

Fortran Parallelization Directives

Parallelization directives are comments in a program that are interpreted by the PGI Fortran compilers when the option -mp is specified on the command line. The form of a parallelization directive is:

    sentinel directive_name [clauses]

With the exception of the SGI-compatible DOACROSS directive, the sentinel must comply with these rules:

• Be one of these: !$OMP, C$OMP, or *$OMP.
• Must start in column 1 (one).
• Must appear as a single word without embedded white space.
• The sentinel marking a DOACROSS directive is C$.

The directive_name can be any of the directives listed in Table 5.1, “Directive and Pragma Summary Table,” on page 53. The valid clauses depend on the directive. Chapter 16, “OpenMP Reference Information” provides a list of directives and their clauses, their usage, and examples.

In addition to the sentinel rules, the directive must also comply with these rules:

• Standard Fortran syntax restrictions, such as line length, case insensitivity, and so on, apply to the directive line.
• Initial directive lines must have a space or zero in column six.
• Continuation directive lines must have a character other than a space or a zero in column six. Continuation lines for C$DOACROSS directives are specified using the C$& sentinel.
• Directives that are presented in pairs must be used in pairs.
Clauses associated with directives have these characteristics:

• The order in which clauses appear in the parallelization directives is not significant.
• Commas separate clauses within the directives, but commas are not allowed between the directive name and the first clause.
• Clauses on directives may be repeated as needed, subject to the restrictions listed in the description of each clause.

C/C++ Parallelization Pragmas

Parallelization pragmas are #pragma statements in a C or C++ program that are interpreted by the PGCC C and C++ compilers when the option -mp is specified on the command line. The form of a parallelization pragma is:

    #pragma omp pragma_name [clauses]

Pragmas follow these format rules:

• The pragmas follow the conventions of the C and C++ standards.
• Whitespace can appear before and after the #.
• Preprocessing tokens following the #pragma omp are subject to macro replacement.
• The order in which clauses appear in the parallelization pragmas is not significant.
• Spaces separate clauses within the pragmas.
• Clauses on pragmas may be repeated as needed, subject to the restrictions listed in the description of each clause.

For the purposes of the OpenMP pragmas, a C/C++ structured block is defined to be a statement or compound statement (a sequence of statements beginning with { and ending with }) that has a single entry and a single exit. No statement or compound statement is a C/C++ structured block if there is a jump into or out of that statement.

Directive and Pragma Recognition

The compiler option -mp enables recognition of the parallelization directives and pragmas. The use of this option also implies:

-Mreentrant
Local variables are placed on the stack and optimizations, such as -Mnoframe, that may result in nonreentrant code are disabled.

-Miomutex
For directives, critical sections are generated around Fortran I/O statements.
For pragmas, calls to I/O library functions are system-dependent and are not necessarily guaranteed to be thread-safe. I/O library calls within parallel regions should be protected by critical regions, as shown in the examples in Chapter 16, “OpenMP Reference Information”, to ensure they function correctly on all systems.

Directive and Pragma Summary Table

The following table provides a brief summary of the directives and pragmas that PGI supports. For complete information on these statements and examples, refer to Chapter 16, “OpenMP Reference Information”.

Table 5.1. Directive and Pragma Summary Table

Each entry lists the Fortran directive (with its reference page), the C/C++ pragma where one exists, and a description.

“ATOMIC,” on page 244; omp atomic
Semantically equivalent to enclosing a single statement in the CRITICAL...END CRITICAL directive or omp critical pragma. Note: Only certain statements are allowed.

“BARRIER,” on page 244; omp barrier
Synchronizes all threads at a specific point in a program so that all threads complete work to that point before any thread continues.

“CRITICAL ... END CRITICAL and omp critical,” on page 245
Defines a subsection of code within a parallel region, a critical section, which is executed one thread at a time.

“DO ... END DO and omp for,” on page 247
Provides a mechanism for distribution of loop iterations across the available threads in a parallel region.

“C$DOACROSS,” on page 246
Specifies that the compiler should parallelize the loop to which it applies, even though that loop is not contained within a parallel region.

“FLUSH and omp flush pragma,” on page 249
When this appears, all processor-visible data items, or, when a list is present (FLUSH [list]), only those specified in the list, are written to memory, thus ensuring that all the threads in a team have a consistent view of certain objects in memory.

“MASTER ... END MASTER and omp master pragma”
Designates code that executes on the master thread and that is skipped by the other threads.

“ORDERED,” on page 251; omp ordered
Defines a code block that is executed by only one thread at a time, and in the order of the loop iterations; this makes the ordered code block sequential, while allowing parallel execution of statements outside the code block.

“PARALLEL DO,” on page 254; omp parallel for
Enables you to specify which loops the compiler should parallelize.

“PARALLEL ... END PARALLEL and omp parallel,” on page 251
Supports a fork/join execution model in which a single thread executes all statements until a parallel region is encountered.

“PARALLEL SECTIONS,” on page 255; omp parallel sections
Defines a non-iterative work-sharing construct without the need to define an enclosing parallel region.

“PARALLEL WORKSHARE,” on page 256
Provides a short form method for including a WORKSHARE directive inside a PARALLEL construct.

“SECTIONS ... END SECTIONS,” on page 257; omp sections
Defines a non-iterative work-sharing construct within a parallel region.

“SINGLE ... END SINGLE,” on page 257; omp single
Designates code that executes on a single thread and that is skipped by the other threads.

“THREADPRIVATE,” on page 258; omp threadprivate
When a common block or variable that is initialized appears in this directive or pragma, each thread’s copy is initialized once prior to its first use.

“WORKSHARE ... END WORKSHARE,” on page 259 (Fortran only; OpenMP defines no C/C++ equivalent)
Provides a mechanism to effect parallel execution of noniterative but implicitly data parallel constructs.

Directive and Pragma Clauses

Some directives and pragmas accept clauses that further allow a user to control the scope attributes of variables for the duration of the directive or pragma. Not all clauses are allowed on all directives, so the clauses that are valid are included with the description of the directive and pragma.
Typically, if no data scope clause is specified for variables, the default scope is shared. Table 16.2, “Directive and Pragma Clauses,” on page 260 provides a brief summary of the clauses associated with OpenMP directives and pragmas that PGI supports. For complete information on these clauses, refer to the OpenMP documentation available on the World Wide Web.

Run-time Library Routines

User-callable functions are available to the Fortran and to the OpenMP C/C++ programmer to query and alter the parallel execution environment. Any C/C++ program unit that invokes these functions should include the statement #include <omp.h>. The omp.h include file contains definitions for each of the C/C++ library routines and two required type definitions. For example, to use the omp_get_num_threads function, use this syntax:

    #include <omp.h>
    int omp_get_num_threads(void);

The following table summarizes the run-time library calls.

Note: The Fortran call is shown first, followed by the equivalent C/C++ call.

Table 5.2. Run-time Library Call Summary

omp_get_num_threads
Returns the number of threads in the team executing the parallel region from which it is called. When called from a serial region, this function returns 1. A nested parallel region is the same as a single parallel region. By default, the value returned by this function is equal to the value of the environment variable OMP_NUM_THREADS or to the value set by the last previous call to omp_set_num_threads().
Fortran: integer function omp_get_num_threads()
C/C++: #include <omp.h>
       int omp_get_num_threads(void);

omp_set_num_threads
Sets the number of threads to use for the next parallel region. This subroutine or function can only be called from a serial region of code. If it is called from within a parallel region, or from within a subroutine or function that is called from within a parallel region, the results are undefined.
Further, this subroutine or function has precedence over the OMP_NUM_THREADS environment variable.
Fortran: subroutine omp_set_num_threads(scalar_integer_exp)
C/C++: #include <omp.h>
       void omp_set_num_threads(int num_threads);

omp_get_thread_num
Returns the thread number within the team. The thread number lies between 0 and omp_get_num_threads()-1. When called from a serial region, this function returns 0. A nested parallel region is the same as a single parallel region.
Fortran: integer function omp_get_thread_num()
C/C++: #include <omp.h>
       int omp_get_thread_num(void);

omp_get_max_threads
Returns the maximum value that can be returned by calls to omp_get_num_threads(). If omp_set_num_threads() is used to change the number of threads, subsequent calls to omp_get_max_threads() return the new value. Further, this function returns the maximum value whether executing from a parallel or serial region of code.
Fortran: integer function omp_get_max_threads()
C/C++: #include <omp.h>
       int omp_get_max_threads(void);

omp_get_num_procs
Returns the number of processors that are available to the program.
Fortran: integer function omp_get_num_procs()
C/C++: #include <omp.h>
       int omp_get_num_procs(void);

omp_get_stack_size
Returns the value of the OpenMP internal control variable that specifies the size that is used to create a stack for a newly created thread. This value may not be the size of the stack of the current thread.
Fortran: !omp_get_stack_size interface
           function omp_get_stack_size ()
             use omp_lib_kinds
             integer ( kind=OMP_STACK_SIZE_KIND ) :: omp_get_stack_size
           end function omp_get_stack_size
         end interface
C/C++: #include <omp.h>
       size_t omp_get_stack_size(void);

omp_set_stack_size
Changes the value of the OpenMP internal control variable that specifies the size to be used to create a stack for a newly created thread. The integer argument specifies the stack size in kilobytes. The size of the stack of the current thread cannot be changed.
In the PGI implementation, all OpenMP or auto-parallelization threads are created just prior to the first parallel region; therefore, only calls to omp_set_stack_size() that occur prior to the first region have an effect.
Fortran: subroutine omp_set_stack_size(integer(KIND=OMP_STACK_SIZE_KIND))
C/C++: #include <omp.h>
       void omp_set_stack_size(size_t);

omp_in_parallel
Returns whether or not the call is within a parallel region. Returns .TRUE. for directives and non-zero for pragmas if called from within a parallel region, and .FALSE. for directives and zero for pragmas if called outside of a parallel region. When called from within a parallel region that is serialized, for example in the presence of an IF clause evaluating to .FALSE. for directives and zero for pragmas, the function returns .FALSE. for directives and zero for pragmas.
Fortran: logical function omp_in_parallel()
C/C++: #include <omp.h>
       int omp_in_parallel(void);

omp_set_dynamic
Allows automatic dynamic adjustment of the number of threads used for execution of parallel regions. This function is recognized, but currently has no effect.
Fortran: subroutine omp_set_dynamic(scalar_logical_exp)
C/C++: #include <omp.h>
       void omp_set_dynamic(int dynamic_threads);

omp_get_dynamic
Allows the user to query whether automatic dynamic adjustment of the number of threads used for execution of parallel regions is enabled. This function is recognized, but currently always returns .FALSE. for directives and zero for pragmas.
Fortran: logical function omp_get_dynamic()
C/C++: #include <omp.h>
       int omp_get_dynamic(void);

omp_set_nested
Allows enabling/disabling of nested parallel regions. This function is recognized, but currently has no effect.
Fortran: subroutine omp_set_nested(scalar_logical_exp)
C/C++: #include <omp.h>
       void omp_set_nested(int nested);

omp_get_nested
Allows the user to query whether nested parallelism is enabled.
This function is recognized, but currently always returns .FALSE. for directives and zero for pragmas.
Fortran: logical function omp_get_nested()
C/C++: #include <omp.h>
       int omp_get_nested(void);

omp_get_wtime
Returns the elapsed wall clock time, in seconds, as a DOUBLE PRECISION value for directives and as a floating-point double value for pragmas. Times returned are per-thread times, and are not necessarily globally consistent across all threads.
Fortran: double precision function omp_get_wtime()
C/C++: #include <omp.h>
       double omp_get_wtime(void);

omp_get_wtick
Returns the resolution of omp_get_wtime(), in seconds, as a DOUBLE PRECISION value for Fortran directives and as a floating-point double value for C/C++ pragmas.
Fortran: double precision function omp_get_wtick()
C/C++: #include <omp.h>
       double omp_get_wtick(void);

omp_init_lock
Initializes a lock associated with the variable lock for use in subsequent calls to lock routines. The initial state of the lock is unlocked. If the variable is already associated with a lock, it is illegal to make a call to this routine.
Fortran: subroutine omp_init_lock(integer_var)
C/C++: #include <omp.h>
       void omp_init_lock(omp_lock_t *lock);
       void omp_init_nest_lock(omp_nest_lock_t *lock);

omp_destroy_lock
Disassociates a lock associated with the variable.
Fortran: subroutine omp_destroy_lock(integer_var)
C/C++: #include <omp.h>
       void omp_destroy_lock(omp_lock_t *lock);
       void omp_destroy_nest_lock(omp_nest_lock_t *lock);

omp_set_lock
Causes the calling thread to wait until the specified lock is available. The thread gains ownership of the lock when it is available. If the variable is not already associated with a lock, it is illegal to make a call to this routine.
Fortran: subroutine omp_set_lock(integer_var)
C/C++: #include <omp.h>
       void omp_set_lock(omp_lock_t *lock);
       void omp_set_nest_lock(omp_nest_lock_t *lock);

omp_unset_lock
Causes the calling thread to release ownership of the lock associated with integer_var.
If the variable is not already associated with a lock, it is illegal to make a call to this routine.
Fortran: subroutine omp_unset_lock(integer_var)
C/C++: #include <omp.h>
       void omp_unset_lock(omp_lock_t *lock);
       void omp_unset_nest_lock(omp_nest_lock_t *lock);

omp_test_lock
Causes the calling thread to try to gain ownership of the lock associated with the variable. The function returns .TRUE. for directives and non-zero for pragmas if the thread gains ownership of the lock; otherwise it returns .FALSE. for directives and zero for pragmas. If the variable is not already associated with a lock, it is illegal to make a call to this routine.
Fortran: logical function omp_test_lock(integer_var)
C/C++: #include <omp.h>
       int omp_test_lock(omp_lock_t *lock);
       int omp_test_nest_lock(omp_nest_lock_t *lock);

Environment Variables

You can use OpenMP environment variables to control the behavior of OpenMP programs. These environment variables allow you to set and pass information that can alter the behavior of directives and pragmas.

The following summary table is a quick reference for the OpenMP environment variables that PGI uses. Detailed descriptions of each of these variables immediately follow the table.

Table 5.3. OpenMP-related Environment Variable Summary Table

OMP_DYNAMIC
Default: FALSE
Currently has no effect. Typically enables (TRUE) or disables (FALSE) the dynamic adjustment of the number of threads.

OMP_NESTED
Default: FALSE
Currently has no effect. Typically enables (TRUE) or disables (FALSE) nested parallelism.

OMP_NUM_THREADS
Default: 1
Specifies the number of threads to use during execution of parallel regions.

OMP_SCHEDULE
Default: STATIC with chunk size of 1
Specifies the type of iteration scheduling and optionally the chunk size to use for omp for and omp parallel for loops that include the run-time schedule clause.

OMP_STACK_SIZE
Overrides the default stack size for a newly created thread.
OMP_WAIT_POLICY
Default: ACTIVE
Sets the behavior of idle threads, defining whether they spin or sleep when idle. The values are ACTIVE and PASSIVE.

OMP_DYNAMIC

OMP_DYNAMIC currently has no effect. Typically this variable enables (TRUE) or disables (FALSE) the dynamic adjustment of the number of threads.

OMP_NESTED

OMP_NESTED currently has no effect. Typically this variable enables (TRUE) or disables (FALSE) nested parallelism.

OMP_NUM_THREADS

OMP_NUM_THREADS specifies the number of threads to use during execution of parallel regions. The default value for this variable is 1. For historical reasons, the environment variable NCPUS is supported with the same functionality. In the event that both OMP_NUM_THREADS and NCPUS are defined, the value of OMP_NUM_THREADS takes precedence.

Note: The program is executed with OMP_NUM_THREADS threads regardless of the number of physical processors available in the system. As a result, you can run programs using more threads than physical processors and they execute correctly. However, performance of programs executed in this manner can be unpredictable, and is often inefficient.

OMP_SCHEDULE

OMP_SCHEDULE specifies the type of iteration scheduling to use for DO and PARALLEL DO loop directives and for omp for and omp parallel for loop pragmas that include the SCHEDULE(RUNTIME) clause, described in “Schedule Clause,” on page 261. The default value for this variable is STATIC. If the optional chunk size is not set, a chunk size of 1 is assumed except in the case of a static schedule. For a static schedule, the default is as defined in “DO ... END DO and omp for,” on page 247.
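To make the role of the run-time schedule clause concrete, here is a minimal C sketch of a loop whose scheduling is deferred to OMP_SCHEDULE. The function name scale_array is hypothetical and chosen only for illustration. If OpenMP recognition is not enabled, the pragma is ignored and the loop simply runs serially.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: multiplies each of the n elements of data by factor. */
void scale_array(double *data, int n, double factor)
{
    int i;

    assert(data != NULL || n == 0);

    /* schedule(runtime) leaves the schedule type and chunk size to be
       taken from the OMP_SCHEDULE environment variable at run time. */
#pragma omp parallel for schedule(runtime)
    for (i = 0; i < n; i++)
        data[i] *= factor;
}
```

Because each iteration touches a different element, the result is the same under any schedule; only the distribution of iterations among threads changes with the OMP_SCHEDULE setting.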
Examples of the use of OMP_SCHEDULE are as follows:

For Fortran:

    $ setenv OMP_SCHEDULE "STATIC, 5"
    $ setenv OMP_SCHEDULE "GUIDED, 8"
    $ setenv OMP_SCHEDULE "DYNAMIC"

For C/C++:

    $ setenv OMP_SCHEDULE "static, 5"
    $ setenv OMP_SCHEDULE "guided, 8"
    $ setenv OMP_SCHEDULE "dynamic"

OMP_STACK_SIZE

OMP_STACK_SIZE is an OpenMP 3.0 feature that controls the size of the stack for newly created threads. This variable overrides the default stack size for a newly created thread. The value is a decimal integer followed by an optional letter B, K, M, or G, to specify bytes, kilobytes, megabytes, and gigabytes, respectively. If no letter is used, the default is kilobytes. There is no space between the value and the letter; for example, one megabyte is specified as 1M. The following example specifies a stack size of 8 megabytes.

    $ setenv OMP_STACK_SIZE 8M

The API functions related to OMP_STACK_SIZE are omp_set_stack_size and omp_get_stack_size.

The environment variable OMP_STACK_SIZE is read on program start-up. If the program changes its own environment, the variable is not re-checked. This environment variable takes precedence over MPSTKZ, described in “MPSTKZ,” on page 94.

Once a thread is created, its stack size cannot be changed. In the PGI implementation, threads are created prior to the first parallel region and persist for the life of the program. The stack size of the main program is set at program start-up and is not affected by OMP_STACK_SIZE. For more information on controlling the program stack size in Linux, refer to “Running Parallel Programs on Linux,” on page 9.

OMP_WAIT_POLICY

OMP_WAIT_POLICY sets the behavior of idle threads, specifically whether they spin or sleep when idle. The values are ACTIVE and PASSIVE, with ACTIVE the default. The behavior defined by OMP_WAIT_POLICY is also shared by threads created by auto-parallelization.

• Threads are considered idle when waiting at a barrier, when waiting to enter a critical region, or when unemployed between parallel regions.
• Threads waiting for critical sections always busy wait (ACTIVE).
• Barriers always busy wait (ACTIVE), with calls to sched_yield determined by the environment variable MP_SPIN, described in “MP_SPIN,” on page 95.
• Unemployed threads during a serial region can either busy wait using the barrier (ACTIVE) or politely wait using a mutex (PASSIVE). This choice is set by OMP_WAIT_POLICY, so the default is ACTIVE.

When ACTIVE is set, idle threads consume 100% of their CPU allotment spinning in a busy loop waiting to restart in a parallel region. This mechanism allows for very quick entry into parallel regions, which suits programs that enter and leave parallel regions frequently.

When PASSIVE is set, idle threads wait on a mutex in the operating system and consume no CPU time until being restarted. Passive idle is best when a program has long periods of serial activity or when the program runs on a multi-user machine or otherwise shares CPU resources.

Chapter 6. Using Directives and Pragmas

It is often useful to be able to alter the effects of certain command line options or default behavior of the compiler. Fortran directives and C/C++ pragmas provide pragmatic information that controls the actions of the compiler in a particular portion of a program without affecting the program as a whole. That is, while a command line option affects the entire source file that is being compiled, directives and pragmas apply, or disable, the effects of a command line option to selected subprograms or to selected loops in the source file, for example, to optimize a specific area of code. Use directives and pragmas to tune selected routines or loops.
PGI Proprietary Fortran Directives

PGI Fortran compilers support proprietary directives that may have any of the following forms:

    !pgi$g directive
    !pgi$r directive
    !pgi$l directive
    !pgi$ directive

Note: If the input is in fixed format, the comment character must begin in column 1 and either * or C is allowed in place of !.

The scope indicator occurs after the $; this indicator controls the scope of the directive. Some directives ignore the scope indicator. The valid scopes, shown above, are:

g
(global) indicates the directive applies to the end of the source file.

r
(routine) indicates the directive applies to the next subprogram.

l
(loop) indicates the directive applies to the next loop (but not to any loop contained within the loop body). Loop-scoped directives are only applied to DO loops.

blank
indicates that the default scope for the directive is applied.

The body of the directive may immediately follow the scope indicator. Alternatively, any number of blanks may precede the name of the directive. Any names in the body of the directive, including the directive name, may not contain embedded blanks. Blanks may surround any special characters, such as a comma or an equal sign.

The directive name, including the directive prefix, may contain upper or lower case letters, and the case is not significant. Case is significant for any variable names that appear in the body of the directive if the command line option -Mupcase is selected. For compatibility with other vendors’ directives, the prefix cpgi$ may be substituted with cdir$ or cvd$.

Note: If the input is in fixed format, the comment character must begin in column 1.

PGI Proprietary C and C++ Pragmas

Pragmas may be supplied in a C/C++ source file to provide information to the compiler. Many pragmas have a corresponding command-line option. Pragmas may also toggle an option, selectively enabling and disabling the option.
The general syntax of a pragma is:

    #pragma [ scope ] pragma-body

The optional scope field is an indicator for the scope of the pragma; some pragmas ignore the scope indicator. The valid scopes are:

global
indicates the pragma applies to the entire source file.

routine
indicates the pragma applies to the next function.

loop
indicates the pragma applies to the next loop (but not to any loop contained within the loop body). Loop-scoped pragmas are only applied to for and while loops.

If a scope indicator is not present, the default scope, if any, is applied. Whitespace must appear after the pragma keyword and between the scope indicator and the body of the pragma. Whitespace may also surround any special characters, such as a comma or an equal sign. Case is significant for the names of the pragmas and any variable names that appear in the body of the pragma.

PGI Proprietary Optimization Fortran Directive and C/C++ Pragma Summary

The following table summarizes the supported Fortran directives and C/C++ pragmas. The following terms are useful in understanding the table.

• Functionality is a brief summary of the way to use the directive or pragma. For a complete description, refer to Chapter 17, “Directives and Pragmas Reference,” on page 263.
• Many of the directives and pragmas can be preceded by NO. The default entry indicates the default for the directive or pragma. N/A appears if a default does not apply.
• The scope entry indicates the allowed scope indicators for each directive or pragma, with L for loop, R for routine, and G for global. The default scope is surrounded by parentheses and N/A appears if the directive or pragma is not available in the given language.

Note: The “*” in the scope indicates this:

For routine-scoped directive
The scope includes the code following the directive or pragma until the end of the routine.
For globally-scoped directive
The scope includes the code following the directive or pragma until the end of the file rather than for the entire file.

The name of a directive or pragma may also be prefixed with -M. For example, the directive -Mbounds is equivalent to bounds and -Mopt is equivalent to opt; and the pragma -Mnoassoc is equivalent to noassoc, and -Mvintr is equivalent to vintr.

Table 6.1. Proprietary Optimization-Related Fortran Directive and C/C++ Pragma Summary

altcode (noaltcode)
Do/don’t generate alternate code for vectorized and parallelized loops.
Default: altcode. Fortran scope: (L)RG. C/C++ scope: (L)RG.

assoc (noassoc)
Do/don’t perform associative transformations.
Default: assoc. Fortran scope: (L)RG. C/C++ scope: (L)RG.

bounds (nobounds)
Do/don’t perform array bounds checking.
Default: nobounds. Fortran scope: (R)G*. C/C++ scope: (R)G.

cncall (nocncall)
Loops are considered for parallelization, even if they contain calls to user-defined subroutines or functions, or if their loop counts do not exceed usual thresholds.
Default: nocncall. Fortran scope: (L)RG. C/C++ scope: (L)RG.

concur (noconcur)
Do/don’t enable auto-concurrentization of loops.
Default: concur. Fortran scope: (L)RG. C/C++ scope: (L)RG.

depchk (nodepchk)
Do/don’t ignore potential data dependencies.
Default: depchk. Fortran scope: (L)RG. C/C++ scope: (L)RG.

eqvchk (noeqvchk)
Do/don’t check EQUIVALENCE for data dependencies.
Default: eqvchk. Fortran scope: (L)RG. C/C++ scope: N/A.

fcon (nofcon)
Do/don’t assume unsuffixed real constants are single precision.
Default: nofcon. Fortran scope: N/A. C/C++ scope: (R)G.

invarif (noinvarif)
Do/don’t remove invariant if constructs from loops.
Default: invarif. Fortran scope: (L)RG. C/C++ scope: (L)RG.

ivdep
Ignore potential data dependencies.
Default: ivdep. Fortran scope: (L)RG. C/C++ scope: N/A.

lstval (nolstval)
Do/don’t compute last values.
Default: lstval. Fortran scope: (L)RG. C/C++ scope: (L)RG.

opt
Select optimization level.
Default: N/A. Fortran scope: (R)G. C/C++ scope: (R)G.

safe (nosafe)
Do/don’t treat pointer arguments as safe.
Default: safe. Fortran scope: N/A. C/C++ scope: (R)G.

safe_lastval
Parallelize when loop contains a scalar used outside of loop.
Default: not enabled. Fortran scope: (L). C/C++ scope: (L).

safeptr (nosafeptr)
Do/don’t ignore potential data dependencies to pointers.
Default: nosafeptr. Fortran scope: N/A. C/C++ scope: L(R)G.

single (nosingle)
Do/don’t convert float parameters to double.
Default: nosingle. Fortran scope: N/A. C/C++ scope: (R)G*.

tp
Generate PGI Unified Binary code optimized for specified targets.
Default: N/A. Fortran scope: (R)G. C/C++ scope: (R)G.

unroll (nounroll)
Do/don’t unroll loops.
Default: nounroll. Fortran scope: (L)RG. C/C++ scope: (L)RG.

vector (novector)
Do/don’t perform vectorizations.
Default: vector. Fortran scope: (L)RG*. C/C++ scope: (L)RG.

vintr (novintr)
Do/don’t recognize vector intrinsics.
Default: vintr. Fortran scope: (L)RG. C/C++ scope: (L)RG.

Scope of Fortran Directives and Command-Line Options

During compilation the effect of a directive may be to either turn an option on, or turn an option off. Directives apply to the section of code following the directive, corresponding to the specified scope, which may include the following loop, the following routine, or the rest of the program. This section presents several examples that show the effect of directives as well as their scope. Consider the following Fortran code:

          integer maxtime, time
          parameter (n = 1000, maxtime = 10)
          double precision a(n,n), b(n,n), c(n,n)
          do time = 1, maxtime
            do i = 1, n
              do j = 1, n
                c(i,j) = a(i,j) + b(i,j)
              enddo
            enddo
          enddo
          end

When compiled with -Mvect, both interior loops are interchanged with the outer loop.

    $ pgf95 -Mvect dirvect1.f

Directives alter this behavior either globally or on a routine or loop by loop basis. To ensure that vectorization is not applied, use the novector directive with global scope.

    cpgi$g novector
          integer maxtime, time
          parameter (n = 1000, maxtime = 10)
          double precision a(n,n), b(n,n), c(n,n)
          do time = 1, maxtime
            do i = 1, n
              do j = 1, n
                c(i,j) = a(i,j) + b(i,j)
              enddo
            enddo
          enddo
          end

In this version, the compiler disables vectorization for the entire source file.
Another use of the directive scoping mechanism turns an option on or off locally, either for a specific procedure or for a specific loop:

      integer maxtime, time
      parameter (n = 1000, maxtime = 10)
      double precision a(n,n), b(n,n), c(n,n)
cpgi$l novector
      do time = 1, maxtime
        do i = 1, n
          do j = 1, n
            c(i,j) = a(i,j) + b(i,j)
          enddo
        enddo
      enddo
      end

Loop level scoping does not apply to nested loops. That is, the directive only applies to the following loop. In this example, the directive turns off vector transformations for the top-level loop. If the outer loop were a timing loop, this would be a practical use for a loop-scoped directive.

Scope of C/C++ Pragmas and Command-Line Options

During compilation a pragma either turns an option on or turns an option off. Pragmas apply to the section of code corresponding to the specified scope: either the entire file, the following loop, or the following or current routine. This section presents several examples showing the effect of pragmas and the use of the pragma scope indicators.

Note: In all cases, pragmas override a corresponding command-line option.

For pragmas that have only routine and global scope, there are two rules for determining the scope of a pragma. We cover these special scope rules at the end of this section.

Consider the program:

    main() {
      float a[100][100], b[100][100], c[100][100];
      int time, maxtime, n, i, j;
      maxtime=10;
      n=100;
      for (time=0; time[,[,...]]

where is any valid variable, member, or array element reference.

Format Requirements

NOTE: The sentinel for prefetch directives is c$mem, which is distinct from the cpgi$ sentinel used for optimization directives. Any prefetch directives that use the cpgi$ sentinel will be ignored by the PGI compilers.

• The "c" must be in column 1.
• Either * or ! is allowed in place of c.
• The scope indicators g, r and l used with the cpgi$ sentinel are not supported.
• The directive name, including the directive prefix, may contain upper or lower case letters and is case insensitive (case is not significant).
• Any variable names that appear in the body of the directive are case sensitive if the command line option –Mupcase is selected.

Sample Usage

Example 6.1. Prefetch Directive Use

This example uses prefetch directives to prefetch data in a matrix multiplication inner loop where a row of one source matrix has been gathered into a contiguous vector.

      real*8 a(m,n), b(n,p), c(m,p), arow(n)
      ...
      do j = 1, p
c$mem prefetch arow(1),b(1,j)
c$mem prefetch arow(5),b(5,j)
c$mem prefetch arow(9),b(9,j)
        do k = 1, n, 4
c$mem prefetch arow(k+12),b(k+12,j)
          c(i,j) = c(i,j) + arow(k) * b(k,j)
          c(i,j) = c(i,j) + arow(k+1) * b(k+1,j)
          c(i,j) = c(i,j) + arow(k+2) * b(k+2,j)
          c(i,j) = c(i,j) + arow(k+3) * b(k+3,j)
        enddo
      enddo

This pattern of prefetch directives causes the compiler to emit prefetch instructions whereby elements of arow and b are fetched into the data cache starting four iterations prior to first use. By varying the prefetch distance in this way, it is sometimes possible to reduce the effects of main memory latency and improve performance.

!DEC$ Directive

PGI Fortran compilers for Microsoft Windows support several de-facto standard Fortran directives that help with interlanguage calling and importing and exporting routines to and from DLLs. These directives all take the form:

    !DEC$ directive

Format Requirements

The directive is recognized in your program only if it meets the following format requirements:

• The directive must begin in line 1 when the file is fixed format or compiled with –Mfixed.
• The directive prefix !DEC$ requires a space between the prefix and the directive keyword ATTRIBUTES.
• The ! must begin the prefix when compiling Fortran 90 freeform format.
• The characters C or * can be used in place of !
in either form of the prefix when compiling fixed-form (F77-style) format.
• The directives are completely case insensitive.

ALIAS Directive

This directive specifies an alternative name with which to resolve a routine. The syntax for the ALIAS directive is either of the following:

    !DEC$ ALIAS routine_name , external_name
    !DEC$ ALIAS routine_name : external_name

In this syntax, external_name is used as the external name for the specified routine_name. If external_name is an identifier name, the name (in uppercase) is used as the external name for the specified routine_name. If external_name is a character constant, it is used as-is; the string is not changed to uppercase, nor are blanks removed.

You can also supply an alias for a routine using the ATTRIBUTES directive, described in the next section:

    !DEC$ ATTRIBUTES ALIAS : 'alias_name' :: routine_name

The following code fragment provides external names for three routines. In this fragment, the external name for sub1 is name1, for sub2 is name2, and for sub3 is name3.

    subroutine sub
    !DEC$ alias sub1 , 'name1'
    !DEC$ alias sub2 : 'name2'
    !DEC$ attributes alias : 'name3' :: sub3

ATTRIBUTES Directive

The ATTRIBUTES directive takes the form !DEC$ ATTRIBUTES followed by one of:

ALIAS : 'alias_name' :: routine_name
    Specifies an alternative name with which to resolve routine_name.

C :: routine_name
    Specifies that the routine routine_name will have its arguments passed by value. When a routine marked C is called, arguments, except arrays, are sent by value. For characters, only the first character is passed. The standard Fortran calling convention is pass by reference.

DLLEXPORT :: name
    Specifies that 'name' is being exported from a DLL.

DLLIMPORT :: name
    Specifies that 'name' is being imported from a DLL.

REFERENCE :: name
    Specifies that the argument 'name' is being passed by reference.
    Often this attribute is used in conjunction with STDCALL, where STDCALL refers to an entire routine; then individual arguments are modified with REFERENCE.

STDCALL :: routine_name
    Specifies that routine 'routine_name' will have its arguments passed by value. When a routine marked STDCALL is called, arguments (except arrays and characters) will be sent by value. The standard Fortran calling convention is pass by reference.

VALUE :: name
    Specifies that the argument 'name' is being passed by value.

DISTRIBUTE Directive

The syntax for the DISTRIBUTE directive is either of the following:

    !DEC$ DISTRIBUTE POINT
    !DEC$ DISTRIBUTEPOINT

This directive is front-end based, and tells the compiler at what point within a loop to split into two loops.

    subroutine dist(a,b,n)
      integer i
      integer n
      integer a(*)
      integer b(*)
      do i = 1,n
        a(i) = a(i)+2
    !DEC$ DISTRIBUTE POINT
        b(i) = b(i)*4
      enddo
    end subroutine

ALIAS Directive

!DEC$ ALIAS is the same as !DEC$ ATTRIBUTES ALIAS

C$PRAGMA C

When programs are compiled using one of the PGI Fortran compilers on Linux, Win64, OSX, and SUA systems, an underscore is appended to Fortran global names, including names of functions, subroutines, and common blocks. This mechanism distinguishes Fortran name space from C/C++ name space. You can use C$PRAGMA C in the Fortran program to call a C/C++ function from Fortran. The statement would look similar to this:

    C$PRAGMA C(name[,name]...)

NOTE: This statement directs the compiler to recognize the routine 'name' as a C function, thus preventing the Fortran compiler from appending an underscore to the routine name.

On Win32 systems the C$PRAGMA C as well as the attributes C and STDCALL may effect other changes on argument passing as well as on the names of the routine. For more information on this topic, refer to “Win32 Calling Conventions,” on page 120.

Chapter 7.
Creating and Using Libraries

A library is a collection of functions or subprograms that are grouped for reference and ease of linking. This chapter discusses issues related to PGI-supplied compiler libraries. Specifically, it addresses the use of C/C++ builtin functions in place of the corresponding libc routines, creation of dynamically linked libraries, known as shared objects or shared libraries, and math libraries.

Note: This chapter does not duplicate material related to using libraries for inlining, described in “Creating an Inline Library,” on page 47 or information related to run-time library routines available to OpenMP programmers, described in “Run-time Library Routines,” on page 55.

This chapter has examples that include the following options related to creating and using libraries.

    –Bdynamic –fpic –Mmakeimplib –Bstatic –implib –o –c –l –shared –def –Mmakedll

Using builtin Math Functions in C/C++

The name of the math header file is math.h. Include the math header file in all of your source files that use a math library routine as in the following example, which calculates the inverse cosine of 0.5 (that is, pi/3).

    #include <math.h>

    int main()
    {
        double x, y;
        x = 0.5;        /* acos is defined only for arguments in [-1, 1] */
        y = acos(x);    /* y is pi/3 */
        return 0;
    }

Including math.h will cause PGCC C and C++ to use builtin functions, which are much more efficient than library calls. In particular, the following intrinsics calls will be processed using builtins if you include math.h:

    abs atan atan2 cos exp fabs fmax fmaxf fmin fminf log log10 pow sin sqrt tan

Creating and Using Shared Object Files on Linux

All of the PGI Fortran, C, and C++ compilers support creation of shared object files. Unlike statically linked object and library files, shared object files link and resolve references with an executable at runtime via a dynamic linker supplied with your operating system. The PGI compilers must generate position independent code to support creation of shared objects by the linker. However, this is not the default.
You must create object files with position independent code and shared object files that will include them. The following steps describe how to create and use a shared object file.

1. Create an object file with position independent code. To do this, compile your code with the appropriate PGI compiler using the –fpic option, or one of the equivalent options, such as –fPIC, –Kpic, and –KPIC, which are supported for compatibility with other systems. For example, use the following command to create an object file with position independent code using pgf95:

    % pgf95 -c -fpic tobeshared.f

2. Produce a shared object file. To do this, use the appropriate PGI compiler to invoke the linker supplied with your system. It is customary to name such files using a .so filename extension. On Linux, you do this by passing the –shared option to the linker:

    % pgf95 -shared -o tobeshared.so tobeshared.o

Note: Compilation and generation of the shared object can be performed in one step using both the –fpic option and the appropriate option for generation of a shared object file.

3. Use a shared object file. To do this, use the appropriate PGI compiler to compile and link the program which will reference functions or subroutines in the shared object file, and list the shared object on the link line, as shown here:

    % pgf95 -o myprog myprog.f tobeshared.so

4. Make the executable available. You now have an executable myprog which does not include any code from functions or subroutines in tobeshared.so, but which can be executed and dynamically linked to that code.

By default, when the program is linked to produce myprog, no assumptions are made on the location of tobeshared.so. Therefore, for myprog to execute correctly, you must initialize the environment variable LD_LIBRARY_PATH to include the directory containing tobeshared.so. If LD_LIBRARY_PATH is already initialized, it is important not to overwrite its contents.
Assuming you have placed tobeshared.so in a directory /home/myusername/bin, you can initialize LD_LIBRARY_PATH to include that directory and preserve its existing contents, as shown in the following:

    % setenv LD_LIBRARY_PATH "$LD_LIBRARY_PATH":/home/myusername/bin

If you know that tobeshared.so will always reside in a specific directory, you can create the executable myprog in a form that assumes this using the –R link-time option. For example, you can link as follows:

    % pgf95 -o myprog myprog.f tobeshared.so -R/home/myusername/bin

Note: As with the –L option, there is no space between –R and the directory name. If the –R option is used, it is not necessary to initialize LD_LIBRARY_PATH.

In the previous example, the dynamic linker will always look in /home/myusername/bin to resolve references to tobeshared.so. By default, if the LD_LIBRARY_PATH environment variable is not set, the linker will only search /usr/lib and /lib for shared objects.

The command ldd is a useful tool when working with shared object files and executables that reference them. When applied to an executable, as shown in the following example, ldd lists all shared object files referenced in the executable along with the pathname of the directory from which they will be extracted.

    % ldd myprog

If the pathname is not hard-coded using the –R option, and if LD_LIBRARY_PATH is not initialized, the pathname is listed as “not found”. For more information on ldd, its options and usage, see the online man page for ldd.

Creating and Using Shared Object Files in SFU and 32-bit SUA

Note: The information included in this section is valid for 32-bit only.

The 32-bit version of PGI Workstation for SFU and SUA uses the GNU ld for its linker, unlike previous versions that used the Windows LINK.EXE. With this change, the PGI compilers and tools for SFU and 32-bit SUA are now able to generate shared object (.so) files. You use the –shared switch to generate a shared object file.
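The example that follows builds hello.so from hello.c and links hi.c against it, but the guide does not show either source file. As a minimal sketch of what they might contain, the block below puts both halves in one listing so it compiles on its own. The function names hello_message and hi, the greeting text, and the buffer size are all hypothetical; in a real hi.c the caller would be main().

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* ---- would live in hello.c and be built into hello.so ---- */

/* Writes a greeting for name into buf.
   Returns the greeting length, or -1 if buf is too small. */
int hello_message(char *buf, size_t size, const char *name)
{
    int n;

    assert(buf != NULL && name != NULL);
    n = snprintf(buf, size, "hello, %s", name);
    return (n < 0 || (size_t)n >= size) ? -1 : n;
}

/* ---- would live in hi.c and be linked against hello.so ---- */

/* Stand-in for main() in hi.c: calls into the library and prints
   the result. Returns 0 on success, 1 on failure. */
int hi(void)
{
    char buf[64];

    if (hello_message(buf, sizeof buf, "world") < 0)
        return 1;
    puts(buf);
    return 0;
}
```

Nothing in the library half is specific to shared objects; whether the code ends up in a .so or in the executable is decided entirely by the compile and link commands.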
The following example creates a shared object file, hello.so, and then creates a program called hello that uses it.

1. Create a shared object file. To produce a shared object file, use the appropriate PGI compiler to invoke the linker supplied with your system. It is customary to name such files using a .so filename extension. In the following example, we use hello.so:

    % pgcc -shared hello.c -o hello.so

2. Create a program that uses the shared object, in this example, hello.so:

    % pgcc hi.c hello.so -o hello

Shared Object Error Message

When running a program that uses a shared object, you may encounter an error message similar to the following:

    hello: error in loading shared libraries
    hello.so: cannot open shared object file: No such file or directory

This error message means either that the shared object file does not exist or that the location of this file is not specified in your LD_LIBRARY_PATH variable. To specify the location of the .so, add the shared object’s directory to your LD_LIBRARY_PATH variable. For example, the following command adds the current directory to your LD_LIBRARY_PATH variable using C shell syntax:

    % setenv LD_LIBRARY_PATH "$LD_LIBRARY_PATH":"./"

Shared Object-Related Compiler Switches

The following switches support shared object files in SFU and SUA. For more detailed information on these switches, refer to Chapter 15, “Command-Line Options Reference,” on page 163.

–shared
    Used to produce shared libraries.
–Bdynamic
    Passed to linker; specify dynamic binding.
    Note: On Windows, -Bstatic and -Bdynamic must be used for both compiling and linking.
–Bstatic
    Passed to linker; specify static binding.
–Bstatic_pgi
    Use to link static PGI libraries with dynamic system libraries; implies –Mnorpath.
–L
    Passed to linker; add directory to library search path.
–Mnorpath
    Don't add –rpath paths to link line.
–Mnostartup
    Do not use standard linker startup file.
–Mnostdlib
    Do not use standard linker libraries.
–R
    Passed to linker; just link symbols from object, or add directory to run time search path.

PGI Runtime Libraries on Windows

The PGI runtime libraries on Windows are available in both static and dynamically-linked (DLL) versions. The static libraries are used by default.

• You can use the dynamically-linked version of the runtime by specifying –Bdynamic at both compile and link time.
• You can explicitly specify static linking, the default, by using -Bstatic at compile and link time.

For details on why you might choose one type of linking over another type, refer to “Creating and Using Dynamic-Link Libraries on Windows,” on page 80.

Creating and Using Static Libraries on Windows

The Microsoft Library Manager (LIB.EXE) is the tool that is typically used to create and manage a static library of object files on Windows. LIB is provided with the PGI compilers as part of the Microsoft Open Tools. Refer to www.msdn2.com for a complete LIB reference; search for LIB.EXE. For a list of available options, invoke LIB with the /? switch.

For compatibility with legacy makefiles, PGI provides a wrapper for LIB and LINK called ar. This version of ar is compatible with Windows and object-file formats. PGI also provides ranlib as a placeholder for legacy makefile support.

ar command

The ar command is a legacy archive wrapper that interprets legacy ar command line options and translates these to LINK/LIB options. You can use it to create libraries of object files.

Syntax: The syntax for the ar command is this:

    ar [options] [archive] [object file]

Where:

• The first argument must be a command line switch, and the leading dash on the first option is optional.
• The single character options, such as –d and –v, may be combined into a single option, as –dv. Thus, ar dv, ar -dv, and ar -d -v all mean the same thing.
• The first non-switch argument must be the library name.
• One (and only one) of –d, –r, –t, or –x must appear on the command line.

Options

The options available for the ar command are these:

–c
    This switch is for compatibility; it is ignored.
–d
    The named object files are deleted from the library.
–r
    The named object files are replaced in or added to the library.

ranlib command

The ranlib command is a wrapper that allows use of legacy scripts and makefiles that use the ranlib command. The command actually does nothing; it merely exists for compatibility.

Syntax: The syntax for the ranlib command is this:

    DOS> ranlib [options] [archive]

Options

The options available for the ranlib command are these:

–help
    Short help information is printed out.
–V
    Version information is printed out.

Creating and Using Dynamic-Link Libraries on Windows

There are several differences between static and dynamic-link libraries on Windows. Libraries of either type are used when resolving external references for linking an executable, but the process differs for each type of library. When linking with a static library, the code needed from the library is incorporated into the executable. When linking with a DLL, external references are resolved using the DLL's import library, not the DLL itself. The code in the DLL associated with the external references does not become a part of the executable. The DLL is loaded when the executable that needs it is run. For the DLL to be loaded in this manner, the DLL must be in your path.

Static libraries and DLLs also handle global data differently. Global data in static libraries is automatically accessible to other objects linked into an executable. Global data in a DLL can only be accessed from outside the DLL if the DLL exports the data and the image that uses the data imports it. To this end the C compilers support the Microsoft storage class extensions __declspec(dllimport) and __declspec(dllexport).
These extensions may appear as storage class modifiers and enable functions and data to be imported and exported:

    extern int __declspec(dllimport) intfunc();
    float __declspec(dllexport) fdata;

The PGI Fortran compilers support the DEC$ATTRIBUTES extensions DLLIMPORT and DLLEXPORT:

    cDEC$ ATTRIBUTES DLLEXPORT :: object [,object] ...
    cDEC$ ATTRIBUTES DLLIMPORT :: object [,object] ...

Here c is one of C, c, !, or *. object is the name of the subprogram or common block that is exported or imported. Note that common block names are enclosed within slashes (/). For example:

    cDEC$ ATTRIBUTES DLLIMPORT :: intfunc
    !DEC$ ATTRIBUTES DLLEXPORT :: /fdata/

For more information on these extensions, refer to “!DEC$ Directive,” on page 70. The examples in this section further illustrate the use of these extensions.

To create a DLL from the command line, use the –Mmakedll option. The following switches apply to making and using DLLs with the PGI compilers:

–Bdynamic
    Compile for and link to the DLL version of the PGI runtime libraries. This flag is required when linking with any DLL built by the PGI compilers. This flag corresponds to the /MD flag used by Microsoft’s cl compilers.
–Bstatic
    Compile for and link to the static version of the PGI runtime libraries. This flag corresponds to the /MT flag used by Microsoft’s cl compilers.
–Mmakedll
    Generate a dynamic-link library or DLL. Implies –Bdynamic.
–Mmakeimplib
    Generate an import library without generating a DLL. Use this flag when you want to generate an import library for a DLL but are not yet ready to build the DLL itself. This situation might arise, for example, when building DLLs with mutual imports, as shown in Example 7.4, “Build DLLs Containing Circular Mutual Imports: Fortran,” on page 86.
–o
    Passed to the linker. Name the DLL or import library.
–def
    When used with –Mmakedll, this flag is passed to the linker and a .def file with the specified name is generated for the DLL.
    The .def file contains the symbols exported by the DLL. Generating a .def file is not required when building a DLL but can be a useful debugging tool if the DLL does not contain the symbols that you expect it to contain. When used with –Mmakeimplib, this flag is passed to lib, which requires a .def file to create an import library. The .def file can be empty if the symbols to export are passed to lib on the command line or explicitly marked as dllexport in the source code.
–implib
    Passed to the linker. Generate an import library with the specified name for the DLL. A DLL’s import library is the interface used when linking an executable that depends on routines in a DLL.

To use the PGI compilers to create an executable that links to the DLL form of the runtime, use the compiler flag –Bdynamic. The executable built will be smaller than one built without –Bdynamic; the PGI runtime DLLs, however, must be available on the system where the executable is run. The –Bdynamic flag must be used when an executable is linked against a DLL built by the PGI compilers.

The following examples outline how to use –Bdynamic, –Mmakedll and –Mmakeimplib to build and use DLLs with the PGI compilers.

Example 7.1. Build a DLL: Fortran

In this example we build a DLL out of a single source file, object1.f, which exports data and a subroutine using DLLEXPORT. The main source file, prog1.f, uses DLLIMPORT to import the data and subroutine from the DLL.
object1.f

      subroutine sub1(i)
!DEC$ ATTRIBUTES DLLEXPORT :: sub1
      integer i
      common /acommon/ adata
      integer adata
!DEC$ ATTRIBUTES DLLEXPORT :: /acommon/
      print *, "sub1 adata", adata
      print *, "sub1 i ", i
      adata = i
      end

prog1.f

      program prog1
      common /acommon/ adata
      integer adata
      external sub1
!DEC$ ATTRIBUTES DLLIMPORT:: sub1, /acommon/
      adata = 11
      call sub1(12)
      print *, "main adata", adata
      end

Step 1: Create the DLL obj1.dll and its import library obj1.lib using the following series of commands:

    % pgf95 -Bdynamic -c object1.f
    % pgf95 -Mmakedll object1.obj -o obj1.dll

Step 2: Compile the main program:

    % pgf95 -Bdynamic -o prog1 prog1.f -defaultlib:obj1

The –Bdynamic switch causes the compiler to link against the PGI runtime DLLs instead of the PGI runtime static libraries. The –Bdynamic switch is required when linking against any PGI-compiled DLL, such as obj1.dll. The -defaultlib: switch specifies that obj1.lib, the DLL’s import library, should be used to resolve imports.

Step 3: Ensure that obj1.dll is in your path, then run the executable prog1 to determine if the DLL was successfully created and linked:

    % prog1
    sub1 adata 11
    sub1 i 12
    main adata 12

Should you wish to change obj1.dll without changing the subroutine or function interfaces, no rebuilding of prog1 is necessary. Just recreate obj1.dll and the new obj1.dll is loaded at runtime.

Example 7.2. Build a DLL: C

In this example, we build a DLL out of a single source file, object2.c, which exports data and a subroutine using __declspec(dllexport). The main source file, prog2.c, uses __declspec(dllimport) to import the data and subroutine from the DLL.
object2.c

    #include <stdio.h>

    int __declspec(dllexport) data;

    void __declspec(dllexport) func2(int i)
    {
        printf("func2: data == %d\n", data);
        printf("func2: i == %d\n", i);
        data = i;
    }

prog2.c

    #include <stdio.h>

    int __declspec(dllimport) data;
    void __declspec(dllimport) func2(int);

    int main()
    {
        data = 11;
        func2(12);
        printf("main: data == %d\n",data);
        return 0;
    }

Step 1: Create the DLL obj2.dll and its import library obj2.lib using the following series of commands:

    % pgcc -Bdynamic -c object2.c
    % pgcc -Mmakedll object2.obj -o obj2.dll

Step 2: Compile the main program:

    % pgcc -Bdynamic -o prog2 prog2.c -defaultlib:obj2

The –Bdynamic switch causes the compiler to link against the PGI runtime DLLs instead of the PGI runtime static libraries. The –Bdynamic switch is required when linking against any PGI-compiled DLL such as obj2.dll. The -defaultlib: switch specifies that obj2.lib, the DLL’s import library, should be used to resolve the imported data and subroutine in prog2.c.

Step 3: Ensure that obj2.dll is in your path, then run the executable prog2 to determine if the DLL was successfully created and linked:

    % prog2
    func2: data == 11
    func2: i == 12
    main: data == 12

Should you wish to change obj2.dll without changing the subroutine or function interfaces, no rebuilding of prog2 is necessary. Just recreate obj2.dll and the new obj2.dll is loaded at runtime.

Example 7.3. Build DLLs Containing Circular Mutual Imports: C

In this example we build two DLLs, obj3.dll and obj4.dll, each of which imports a routine that is exported by the other. To link the first DLL, the import library for the second DLL must be available. Usually an import library is created when a DLL is linked. In this case, however, the second DLL cannot be linked without the import library for the first DLL. When such circular imports exist, an import library for one of the DLLs must be created in a separate step without creating the DLL.
The PGI drivers call the Microsoft lib tool to create import libraries in this situation. Once the DLLs are built, we can use them to build the main program.

    /* object3.c */
    #include <stdio.h>

    void __declspec(dllimport) func_4b(void);

    void __declspec(dllexport) func_3a(void)
    {
        printf("func_3a, calling a routine in obj4.dll\n");
        func_4b();
    }

    void __declspec(dllexport) func_3b(void)
    {
        printf("func_3b\n");
    }

    /* object4.c */
    #include <stdio.h>

    void __declspec(dllimport) func_3b(void);

    void __declspec(dllexport) func_4a(void)
    {
        printf("func_4a, calling a routine in obj3.dll\n");
        func_3b();
    }

    void __declspec(dllexport) func_4b(void)
    {
        printf("func_4b\n");
    }

    /* prog3.c */
    void __declspec(dllimport) func_3a(void);
    void __declspec(dllimport) func_4a(void);

    int main()
    {
        func_3a();
        func_4a();
        return 0;
    }

Step 1: Use –Mmakeimplib with the PGI compilers to build an import library for the first DLL without building the DLL itself.

    % pgcc -Bdynamic -c object3.c
    % pgcc -Mmakeimplib -o obj3.lib object3.obj

The –def= option can also be used with –Mmakeimplib. Use a .def file when you need to export additional symbols from the DLL. A .def file is not needed in this example because all symbols are exported using __declspec(dllexport).

Step 2: Use the import library, obj3.lib, created in Step 1, to link the second DLL.

    % pgcc -Bdynamic -c object4.c
    % pgcc -Mmakedll -o obj4.dll object4.obj -defaultlib:obj3

Step 3: Use the import library, obj4.lib, created in Step 2, to link the first DLL.

    % pgcc -Mmakedll -o obj3.dll object3.obj -defaultlib:obj4

Step 4: Compile the main program and link against the import libraries for the two DLLs.

    % pgcc -Bdynamic prog3.c -o prog3 -defaultlib:obj3 -defaultlib:obj4

Step 5: Execute prog3.exe to ensure that the DLLs were created properly.

    % prog3
    func_3a, calling a routine in obj4.dll
    func_4b
    func_4a, calling a routine in obj3.dll
    func_3b

Example 7.4.
Build DLLs Containing Circular Mutual Imports: Fortran

In this example we build two DLLs, obj2.dll and obj3.dll, each of which imports a routine that is exported by the other, and use them to build the main program. In the following source files, object2.f95 makes calls to routines defined in object3.f95, and vice versa. This situation of mutual imports requires two steps to build each DLL.

To link the first DLL, the import library for the second DLL must be available. Usually an import library is created when a DLL is linked. In this case, however, the second DLL cannot be linked without the import library for the first DLL. When such circular imports exist, an import library for one of the DLLs must be created in a separate step without creating the DLL. The PGI drivers call the Microsoft lib tool to create import libraries in this situation. Once the DLLs are built, we can use them to build the main program.

object2.f95

    subroutine func_2a
      external func_3b
    !DEC$ ATTRIBUTES DLLEXPORT :: func_2a
    !DEC$ ATTRIBUTES DLLIMPORT :: func_3b
      print*,"func_2a, calling a routine in obj3.dll"
      call func_3b()
    end subroutine

    subroutine func_2b
    !DEC$ ATTRIBUTES DLLEXPORT :: func_2b
      print*,"func_2b"
    end subroutine

object3.f95

    subroutine func_3a
      external func_2b
    !DEC$ ATTRIBUTES DLLEXPORT :: func_3a
    !DEC$ ATTRIBUTES DLLIMPORT :: func_2b
      print*,"func_3a, calling a routine in obj2.dll"
      call func_2b()
    end subroutine

    subroutine func_3b
    !DEC$ ATTRIBUTES DLLEXPORT :: func_3b
      print*,"func_3b"
    end subroutine

prog2.f95

    program prog2
      external func_2a
      external func_3a
    !DEC$ ATTRIBUTES DLLIMPORT :: func_2a
    !DEC$ ATTRIBUTES DLLIMPORT :: func_3a
      call func_2a()
      call func_3a()
    end program

Step 1: Use –Mmakeimplib with the PGI compilers to build an import library for the first DLL without building the DLL itself.

    % pgf95 -Bdynamic -c object2.f95
    % pgf95 -Mmakeimplib -o obj2.lib object2.obj

Tip: The -def= option can also be used with -Mmakeimplib. Use a .def file when you need to export additional symbols from the DLL. A .def file is not needed in this example because all symbols are exported using DLLEXPORT.

Step 2: Use the import library, obj2.lib, created in Step 1, to link the second DLL.

    % pgf95 -Bdynamic -c object3.f95
    % pgf95 -Mmakedll -o obj3.dll object3.obj -defaultlib:obj2

Step 3: Use the import library, obj3.lib, created in Step 2, to link the first DLL.

    % pgf95 -Mmakedll -o obj2.dll object2.obj -defaultlib:obj3

Step 4: Compile the main program and link against the import libraries for the two DLLs.

    % pgf95 -Bdynamic prog2.f95 -o prog2 -defaultlib:obj2 -defaultlib:obj3

Step 5: Execute prog2 to ensure that the DLLs were created properly:

    % prog2
    func_2a, calling a routine in obj3.dll
    func_3b
    func_3a, calling a routine in obj2.dll
    func_2b

Example 7.5. Import a Fortran module from a DLL

In this example we import a Fortran module from a DLL. We use the source file defmod.f90 to create a DLL containing a Fortran module. We then use the source file usemod.f90 to build a program that imports and uses the Fortran module from defmod.f90.

defmod.f90

    module testm
      type a_type
        integer :: an_int
      end type a_type
      type(a_type) :: a, b
    !DEC$ ATTRIBUTES DLLEXPORT :: a,b
    contains
      subroutine print_a
    !DEC$ ATTRIBUTES DLLEXPORT :: print_a
        write(*,*) a%an_int
      end subroutine
      subroutine print_b
    !DEC$ ATTRIBUTES DLLEXPORT :: print_b
        write(*,*) b%an_int
      end subroutine
    end module

usemod.f90

    use testm
    a%an_int = 1
    b%an_int = 2
    call print_a
    call print_b
    end

Step 1: Create the DLL.

    % pgf90 -Mmakedll -o defmod.dll defmod.f90
    Creating library defmod.lib and object defmod.exp

Step 2: Create the exe and link against the import library for the imported DLL.

    % pgf90 -Bdynamic -o usemod usemod.f90 -defaultlib:defmod.lib

Step 3: Run the exe to ensure that the module was imported from the DLL properly.
    % usemod
    1
    2

Using LIB3F

The PGI Fortran compilers include complete support for the de facto standard LIB3F library routines on both Linux and Windows operating systems. See the PGI Fortran Reference manual for a complete list of available routines in the PGI implementation of LIB3F.

LAPACK, BLAS and FFTs

Pre-compiled versions of the public domain LAPACK and BLAS libraries are included with the PGI compilers. The LAPACK library is called liblapack.a or, on Windows, liblapack.lib. The BLAS library is called libblas.a or, on Windows, libblas.lib. These libraries are installed to $PGI/target/lib, where target is replaced with the appropriate target name (linux86, linux86-64, osx86, osx86-64, win32, win64, sfu32, sua32, or sua64).

To use these libraries, link them in using the -l option when linking your main program:

    % pgf95 myprog.f -llapack -lblas

Highly optimized assembly-coded versions of BLAS and certain FFT routines may be available for your platform. In some cases, these are shipped with the PGI compilers. See the current release notes for the PGI compilers you are using to determine if these optimized libraries exist, where they can be downloaded (if necessary), and how to incorporate them into your installation as the default.

The C++ Standard Template Library

The PGC++ compiler includes a bundled copy of the STLPort Standard C++ Library. See the online Standard C++ Library tutorial and reference manual at www.stlport.com for further details and licensing.

Chapter 8. Using Environment Variables

Environment variables allow you to set and pass information that can alter the default behavior of the PGI compilers and the executables which they generate. This chapter includes explanations of the environment variables specific to PGI compilers. Other environment variables are referenced and documented in other sections of this User’s Guide or the PGI Tools Guide.

• You use OpenMP environment variables to control the behavior of OpenMP programs.
For consistency related to the OpenMP environment, the details of the OpenMP-related environment variables are included in Chapter 5, “Using OpenMP”.

• You can use environment variables to control the behavior of the PGDBG debugger or PGPROF profiler. For a description of environment variables that affect these tools, refer to the PGI Tools Guide.

Setting Environment Variables

Before we look at the environment variables that you might use with the PGI compilers and tools, let’s take a look at how to set environment variables. To illustrate how to set these variables in various environments, let’s look at how a user might initialize the shell environment prior to using the PGI compilers and tools.

Setting Environment Variables on Linux

Let’s assume that you want access to the PGI products when you log on. Let’s further assume that you installed the PGI compilers in /opt/pgi and that the license file is in /opt/pgi/license.dat. For access at startup, you can add the following lines to your startup file.

In csh, use these commands:

    % setenv PGI /opt/pgi
    % setenv MANPATH "$MANPATH":$PGI/linux86/7.1/man
    % setenv LM_LICENSE_FILE $PGI/license.dat
    % set path = ($PGI/linux86/7.1/bin $path)

In bash, sh or ksh, use these commands:

    % PGI=/opt/pgi; export PGI
    % MANPATH=$MANPATH:$PGI/linux86/7.1/man; export MANPATH
    % LM_LICENSE_FILE=$PGI/license.dat; export LM_LICENSE_FILE
    % PATH=$PGI/linux86/7.1/bin:$PATH; export PATH

Setting Environment Variables on Windows

In Windows, when you access PGI Workstation 7.1 (Start | PGI Workstation 7.1), PGI provides two shells in which you can set environment variables: the DOS command environment and the Cygwin Bash environment. When you open either of these shells, the default environment variables are already set and available to you.

You may want to use other environment variables, such as the OpenMP ones. This section explains how to do that.

Suppose that your home directory is C:\tmp.
The following examples show how you might set the temporary directory to your home directory, and then verify that it is set.

Command prompt: From PGI Workstation 7.1, select PGI Workstation Tools | PGI Command Prompt (32-bit or 64-bit), and enter the following:

    DOS> set TMPDIR=C:\tmp
    DOS> echo %TMPDIR%
    C:\tmp
    DOS>

Cygwin Bash prompt: From PGI Workstation 7.1, select PGI Workstation (32-bit or 64-bit) and at the Cygwin Bash prompt, enter the following:

    PGI$ export TMPDIR=C:\\tmp
    PGI$ echo $TMPDIR
    C:\tmp
    PGI$

Setting Environment Variables on Mac OSX

Let’s assume that you want access to the PGI products when you log on. Let’s further assume that you installed the PGI compilers in /opt/pgi and that the license file is in /opt/pgi/license.dat. For access at startup, you can add the following lines to your startup file.

For x64 osx86-64 in csh:

    % set path = (/opt/pgi/osx86-64/7.0/bin $path)
    % setenv MANPATH "$MANPATH":/opt/pgi/osx86-64/7.0/man

For x64 osx86-64 in bash, zsh, or ksh:

    % PATH=/opt/pgi/osx86-64/7.0/bin:$PATH; export PATH
    % MANPATH=$MANPATH:/opt/pgi/osx86-64/7.0/man; export MANPATH

For 32-bit osx86 in csh:

    % set path = (/opt/pgi/osx86/7.0/bin $path)
    % setenv MANPATH "$MANPATH":/opt/pgi/osx86/7.0/man

For 32-bit osx86 in bash, zsh, or ksh:

    % PATH=/opt/pgi/osx86/7.0/bin:$PATH
    % export PATH
    % MANPATH=$MANPATH:/opt/pgi/osx86/7.0/man
    % export MANPATH

PGI-Related Environment Variables

For easy reference, the following summary table provides a quick listing of the OpenMP and PGI compiler-related environment variables. Later in this chapter are more detailed descriptions of the environment variables specific to PGI compilers and the executables they generate.

Table 8.1. PGI-related Environment Variable Summary Table

Environment Variable  Description
FLEXLM_BATCH          (Windows only) When set to 1, prevents interactive pop-ups from appearing by sending all licensing errors and warnings to standard out rather than to a pop-up window.
FORTRAN_OPT           Allows the user to specify that the PGI Fortran compilers use VAX I/O conventions.
GMON_OUT_PREFIX       Specifies the name of the output file for programs that are compiled and linked with the -pg option.
LD_LIBRARY_PATH       Specifies a colon-separated set of directories where libraries should first be searched, prior to searching the standard set of directories.
LM_LICENSE_FILE       Specifies the full path of the license file that is required for running the PGI software. On Windows, LM_LICENSE_FILE does not need to be set.
MANPATH               Sets the directories that are searched for manual pages associated with the command that the user types.
MPSTKZ                Increases the size of the stacks used by threads executing in parallel regions. The value should be an integer concatenated with M or m to specify stack sizes of n megabytes.
MP_BIND               Specifies whether to bind processes or threads executing in a parallel region to a physical processor.
MP_BLIST              When MP_BIND is yes, this variable specifically defines the thread-CPU relationship, overriding the default values.
MP_SPIN               Specifies the number of times to check a semaphore before calling sched_yield() (on Linux) or _sleep() (on Windows).
MP_WARN               Allows you to eliminate certain default warning messages.
NCPUS                 Sets the number of processes or threads used in parallel regions.
NCPUS_MAX             Limits the maximum number of processes or threads that can be used in a parallel region.
NO_STOP_MESSAGE       If used, the execution of a plain STOP statement does not produce the message FORTRAN STOP.
OMP_DYNAMIC           Currently has no effect. Enables (TRUE) or disables (FALSE) the dynamic adjustment of the number of threads. The default is FALSE.
OMP_NESTED            Currently has no effect. Enables (TRUE) or disables (FALSE) nested parallelism. The default is FALSE.
OMP_NUM_THREADS       Specifies the number of threads to use during execution of parallel regions. Default is 1.
OMP_SCHEDULE          Specifies the type of iteration scheduling and, optionally, the chunk size to use for omp for and omp parallel for loops that include the run-time schedule clause. The default is STATIC with chunk size = 1.
OMP_STACK_SIZE        Overrides the default stack size for a newly created thread.
OMP_WAIT_POLICY       Sets the behavior of idle threads, defining whether they spin or sleep when idle. The values are ACTIVE and PASSIVE. The default is ACTIVE.
PATH                  Determines which locations are searched for commands the user may type.
PGI                   Specifies, at compile-time, the root directory where the PGI compilers and tools are installed.
PGI_CONTINUE          If set, when a program compiled with -Mchkfpstk is executed, the stack is automatically cleaned up and execution then continues.
PGI_OBJSUFFIX         Allows you to control the suffix on generated object files.
PGI_STACK_USAGE       (Windows only) Allows you to explicitly set stack properties for your program.
PGI_TERM              Controls the stack traceback and just-in-time debugging functionality.
PGI_TERM_DEBUG        Overrides the default behavior when PGI_TERM is set to debug.
PWD                   Allows you to display the current directory.
STATIC_RANDOM_SEED    Forces the seed returned by RANDOM_SEED to be constant.
TMP                   Sets the directory to use for temporary files created during execution of the PGI compilers and tools; interchangeable with TMPDIR.
TMPDIR                Sets the directory to use for temporary files created during execution of the PGI compilers and tools.

PGI Environment Variables

You use the environment variables listed in Table 8.1, “PGI-related Environment Variable Summary Table” to alter the default behavior of the PGI compilers and the executables which they generate. This section provides more detailed descriptions of the variables in this table that are not OpenMP environment variables.

FLEXLM_BATCH

By default, on Windows the license server creates interactive pop-up messages to issue warnings and errors.
You can use the environment variable FLEXLM_BATCH to prevent interactive pop-up windows. To do this, set the environment variable FLEXLM_BATCH to 1.

The following csh example prevents interactive pop-up messages for licensing warnings and errors:

    % setenv FLEXLM_BATCH 1

FORTRAN_OPT

FORTRAN_OPT allows the user to specify that the PGI Fortran compilers use VAX I/O conventions.

• If FORTRAN_OPT exists and contains the value vaxio, the record length in the open statement is in units of 4-byte words, and the $ edit descriptor only has an effect for lines beginning with a space or a plus sign (+).

• If this variable exists and contains the value format_relaxed, an I/O item corresponding to a numerical edit descriptor (such as F, E, I, and so on) is not required to be a type implied by the descriptor.

The following example causes the PGI Fortran compilers to use VAX I/O conventions:

    % setenv FORTRAN_OPT vaxio

GMON_OUT_PREFIX

GMON_OUT_PREFIX specifies the name of the output file for programs that are compiled and linked with the -pg option. The default name is gmon.out.

If GMON_OUT_PREFIX is set, the name of the output file has the value of GMON_OUT_PREFIX as a prefix. Further, the suffix is the pid of the running process. The prefix and suffix are separated by a dot. For example, if GMON_OUT_PREFIX is set to mygmon, then the full filename may look similar to this: mygmon.0012348567.

The following example causes programs compiled and linked with the -pg option to use pgout as the output file prefix.

    % setenv GMON_OUT_PREFIX pgout

LD_LIBRARY_PATH

The LD_LIBRARY_PATH variable is a colon-separated set of directories specifying where libraries should first be searched, prior to searching the standard set of directories. This variable is useful when debugging a new library or using a nonstandard library for special purposes.

The following csh example adds the current directory to your LD_LIBRARY_PATH variable.

    % setenv LD_LIBRARY_PATH "$LD_LIBRARY_PATH":"./"

LM_LICENSE_FILE

The LM_LICENSE_FILE variable specifies the full path of the license file that is required for running the PGI software.

For example, once the license file is in place, you can execute the following csh commands to make the products you have purchased accessible and to initialize your environment for use of FLEXlm. These commands assume that you use the default installation directory, /opt/pgi:

    % setenv PGI /opt/pgi
    % setenv LM_LICENSE_FILE "$LM_LICENSE_FILE":/opt/pgi/license.dat

On Windows, to set the environment variable LM_LICENSE_FILE to the full path of the license key file, do this:

1. Open the System Properties dialog: Start | Control Panel | System.
2. Select the Advanced tab.
3. Click the Environment Variables button.

• If LM_LICENSE_FILE is not already an environment variable, create a new system variable for it. Set its value to the full path, including the name of the file, for the license key file, license.dat.

• If LM_LICENSE_FILE already exists as an environment variable, append the path to the license file to the variable’s current value, using a semi-colon to separate entries.

MANPATH

The MANPATH variable sets the directories that are searched for manual pages associated with the commands that the user types. When using PGI products, it is important that you set your PATH to include the location of the PGI products and then set the MANPATH variable to include the man pages associated with the products.
The following csh example targets the x64 linux86-64 version of the compilers and tools and allows the user access to the manual pages associated with them.

    % set path = (/opt/pgi/linux86-64/7.1/bin $path)
    % setenv MANPATH "$MANPATH":/opt/pgi/linux86-64/7.1/man

MPSTKZ

MPSTKZ increases the size of the stacks used by threads executing in parallel regions. You typically use this variable with programs that utilize large amounts of thread-local storage in the form of private variables or local variables in functions or subroutines called within parallel regions. The value should be an integer concatenated with M or m to specify stack sizes of n megabytes.

For example, the following setting specifies a stack size of 8 megabytes.

    % setenv MPSTKZ 8M

MP_BIND

You can set MP_BIND to yes or y to bind processes or threads executing in a parallel region to a physical processor. Set it to no or n to disable such binding. The default is to not bind processes to processors. This variable is an execution-time environment variable interpreted by the PGI runtime-support libraries. It does not affect the behavior of the PGI compilers in any way.

Note
The MP_BIND environment variable is not supported on all platforms.

    % setenv MP_BIND y

MP_BLIST

MP_BLIST allows you to specifically define the thread-CPU relationship.

Note
This variable is only in effect when MP_BIND is yes.

While the MP_BIND variable binds processes or threads to a physical processor, MP_BLIST allows you to specifically define which thread is associated with which processor. The list defines the processor-thread relationship order, beginning with thread 0. This list overrides the default binding.

For example, the following setting for MP_BLIST maps CPUs 3, 2, 1 and 0 to threads 0, 1, 2 and 3 respectively.

    % setenv MP_BLIST 3,2,1,0

MP_SPIN

When a thread executing in a parallel region enters a barrier, it spins on a semaphore.
You can use MP_SPIN to specify the number of times it checks the semaphore before calling sched_yield() (on Linux) or _sleep() (on Windows). These calls cause the thread to be re-scheduled, allowing other processes to run. The default values are 100 (on Linux) and 10000 (on Windows).

    % setenv MP_SPIN 200

MP_WARN

MP_WARN allows you to eliminate certain default warning messages.

By default, a warning is printed to stderr if you execute an OpenMP or auto-parallelized program with NCPUS or OMP_NUM_THREADS set to a value larger than the number of physical processors in the system. For example, suppose you produce a parallelized executable a.out and execute it as follows on a system with only one processor:

    % setenv OMP_NUM_THREADS 2
    % a.out
    Warning: OMP_NUM_THREADS or NCPUS (2) greater than available cpus (1)
    FORTRAN STOP

Setting MP_WARN to NO eliminates these warning messages.

NCPUS

You can use the NCPUS environment variable to set the number of processes or threads used in parallel regions. The default is to use only one process or thread, which is known as serial mode.

Note
OMP_NUM_THREADS has the same functionality as NCPUS. For historical reasons, PGI supports the environment variable NCPUS. If both OMP_NUM_THREADS and NCPUS are set, the value of OMP_NUM_THREADS takes precedence.

Warning
Setting NCPUS to a value larger than the number of physical processors or cores in your system can cause parallel programs to run very slowly.

NCPUS_MAX

You can use the NCPUS_MAX environment variable to limit the maximum number of processes or threads used in a parallel program. Attempts to dynamically set the number of processes or threads to a higher value, for example using omp_set_num_threads(), will cause the number of processes or threads to be set at the value of NCPUS_MAX rather than the value specified in the function call.

NO_STOP_MESSAGE

If the NO_STOP_MESSAGE variable exists, the execution of a plain STOP statement does not produce the message FORTRAN STOP.
The default behavior of the PGI Fortran compilers is to issue this message.

PATH

The PATH variable sets the directories that are searched for commands that the user types. When using PGI products, it is important that you set your PATH to include the location of the PGI products.

You can also use this variable to specify that you want to use only the linux86 version of the compilers and tools, or to target linux86 as the default.

The following csh example targets the x64 linux86-64 version of the compilers and tools.

    % set path = (/opt/pgi/linux86-64/7.1/bin $path)

PGI

The PGI environment variable specifies the root directory where the PGI compilers and tools are installed. This variable is recognized at compile-time. If it is not set, the default value depends on your system as well as which compilers are installed:

• On Linux, the default value of this variable is /opt/pgi.

• On Windows, the default value is C:\Program Files\PGI, where C represents the system drive. If both 32- and 64-bit compilers are installed, the 32-bit compilers are in C:\Program Files (x86)\PGI.

• For SFU/SUA compilers, the default value of this variable is /opt/pgi in the SFU/SUA file system. The corresponding Windows-style path is C:/SFU/opt/pgi for SFU and C:/WINDOWS/SUA/opt/pgi for SUA, where C represents the system drive.

In most cases, if the PGI environment variable is not set, the PGI compilers and tools dynamically determine the location of this root directory based on the instance of the compiler or tool that was invoked. However, there are still some dependencies on the PGI environment variable, and it can be used as a convenience when initializing your environment for use of the PGI compilers and tools.

For example, assuming you use csh and want the 64-bit linux86-64 versions of the PGI compilers and tools to be the default, you would use this syntax:

    % setenv PGI /usr/pgi
    % setenv MANPATH "$MANPATH":$PGI/linux86-64/6.0/man
    % setenv LM_LICENSE_FILE $PGI/license.dat
    % set path = ($PGI/linux86-64/6.0/bin $path)

PGI_CONTINUE

You set the PGI_CONTINUE variable to specify the actions to take before continuing with execution. For example, if the PGI_CONTINUE environment variable is set and a program compiled with -Mchkfpstk is executed, the stack is automatically cleaned up and execution then continues. If PGI_CONTINUE is set to verbose, the stack is automatically cleaned up, a warning message is printed, and then execution continues.

Note
There is a performance penalty associated with the stack cleanup.

PGI_OBJSUFFIX

You can set the PGI_OBJSUFFIX environment variable to generate object files that have a specific suffix. For example, if you set PGI_OBJSUFFIX to .o, the object files have a suffix of .o rather than .obj.

PGI_STACK_USAGE

(Windows only) The PGI_STACK_USAGE variable allows you to explicitly set stack properties for your program. When the user compiles a program with the -Mchkstk option and sets the PGI_STACK_USAGE environment variable to any value, the program displays the stack space allocated and used after the program exits. You might see something similar to the following message:

    thread 0 stack: max 8180KB, used 48KB

This message indicates that the program used 48KB of an 8180KB allocated stack. For more information on the -Mchkstk option, refer to -Mchkstk.

PGI_TERM

The PGI_TERM environment variable controls the stack traceback and just-in-time debugging functionality. The runtime libraries use the value of PGI_TERM to determine what action to take when a program abnormally terminates.

The value of PGI_TERM is a comma-separated list of options. The commands for setting the environment variable follow.
• In csh:

    % setenv PGI_TERM option[,option...]

• In bash or sh:

    $ PGI_TERM=option[,option...]
    $ export PGI_TERM

• In the Windows Command Prompt:

    C:\> set PGI_TERM=option[,option...]

Table 8.2 lists the supported values for option. Following the table is a complete description of each option that indicates specifically how you might apply the option. By default, all of these options are disabled.

Table 8.2. Supported PGI_TERM Values

[no]debug   Enables/disables just-in-time debugging (debugging invoked on error)
[no]trace   Enables/disables stack traceback on error
[no]signal  Enables/disables establishment of signal handlers for common signals that cause program termination
[no]abort   Enables/disables calling the system termination routine abort()

[no]debug

This enables/disables just-in-time debugging. The default is nodebug.

When PGI_TERM is set to debug, the following command is invoked on error, unless you use PGI_TERM_DEBUG to override this default.

    pgdbg -text -attach pid

Here pid is the process ID of the process being debugged.

The PGI_TERM_DEBUG environment variable may be set to override the default setting. For more information, refer to “PGI_TERM_DEBUG,” on page 99.

[no]trace

This enables/disables the stack traceback. The default is notrace.

[no]signal

This enables/disables the establishment of signal handlers for the most common signals that cause program termination. The default is nosignal. Setting trace or debug automatically enables signal. Specifically setting nosignal allows you to override this behavior.

[no]abort

This enables/disables calling the system termination routine abort(). The default is noabort. When noabort is in effect the process terminates by calling _exit(127).

On Linux and SUA, when abort is in effect, the abort routine creates a core file and exits with code 127.

On Windows, when abort is in effect, the abort routine exits with the status of the exception received. For example, if the program receives an access violation, abort() exits with status 0xC0000005.

A few runtime errors just print an error message and call exit(127), regardless of the status of PGI_TERM. These are mainly errors such as specifying an invalid environment variable value where a traceback would not be useful.

If it appears that abort() does not generate core files on a Linux system, be sure to unlimit the coredumpsize. You can do this in these ways:

• Using csh:

    % limit coredumpsize unlimited
    % setenv PGI_TERM abort

• Using bash or sh:

    $ ulimit -c unlimited
    $ export PGI_TERM=abort

To debug a core file with pgdbg, start pgdbg with the -core option. For example, to view a core file named “core” for a program named “a.out”:

    $ pgdbg -core core a.out

For more information on why to use this variable, refer to “Stack Traceback and JIT Debugging,” on page 101.

PGI_TERM_DEBUG

The PGI_TERM_DEBUG variable may be set to override the default behavior when PGI_TERM is set to debug.

The value of PGI_TERM_DEBUG should be set to the command line used to invoke the debugger. For example:

    gdb --quiet --pid %d

The first occurrence of %d in the PGI_TERM_DEBUG string will be replaced by the process id. The program named in the PGI_TERM_DEBUG string must be found on the current PATH or specified with a full path name.

PWD

The PWD variable allows you to display the current directory.

STATIC_RANDOM_SEED

You can use STATIC_RANDOM_SEED to force the seed returned by the Fortran 90/95 RANDOM_SEED intrinsic to be constant. The first call to RANDOM_SEED without arguments resets the random seed to a default value, then advances the seed by a variable amount based on time. Subsequent calls to RANDOM_SEED without arguments reset the random seed to the same initial value as the first call. Unless the time is exactly the same, each time a program is run a different random number sequence is generated.
Setting the environment variable STATIC_RANDOM_SEED to YES forces the seed returned by RANDOM_SEED to be constant, thereby generating the same sequence of random numbers at each execution of the program.

TMP

You can use TMP to specify the directory to use for placement of any temporary files created during execution of the PGI compilers and tools. This variable is interchangeable with TMPDIR.

TMPDIR

You can use TMPDIR to specify the directory to use for placement of any temporary files created during execution of the PGI compilers and tools.

Using Environment Modules

On Linux, if you use the Environment Modules package, that is, the module load command, PGI 7.1 includes a script to set up the appropriate module files.

Assuming your installation base directory is /opt/pgi, and your MODULEPATH environment variable is /usr/local/Modules/modulefiles, execute this command:

    % /opt/pgi/linux86/7.1-1/etc/modulefiles/pgi.module.install \
          -all -install /usr/local/Modules/modulefiles

This command creates module files for all installed versions of the PGI compilers. You must have write permission to the modulefiles directory. Once the module files are installed, these module commands are enabled:

    % module load pgi32/7.1
    % module load pgi64/7.1
    % module load pgi/7.1

where "pgi/7.1" uses the 32-bit compilers on a 32-bit system and uses 64-bit compilers on a 64-bit system.

To see what versions are available, use this command:

    % module avail pgi

The module load command sets or modifies the environment variables as indicated in the following table.

This Environment Variable...  Is set or modified to ...
CC                            Full path to pgcc
V                             Path to pgCC
V                             Full path to pgCC
CXX                           Path to pgCC
FC                            Full path to pgf95
F77                           Full path to pgf77
F90                           Full path to pgf95
LD_LIBRARY_PATH               Prepends the PGI library directory
MANPATH                       Prepends the PGI man page directory
PATH                          Prepends the PGI compiler and tools bin directory
PGI                           The base installation directory

PGI does not provide support for the Environment Modules package. For more information about the package, go to: modules.sourceforge.net.

Stack Traceback and JIT Debugging

When a programming error results in a run-time error message or an application exception, a program will usually exit, perhaps with an error message. The PGI run-time library includes a mechanism to override this default action and instead print a stack traceback, start a debugger, or (on Linux) create a core file for postmortem debugging.

The stack traceback and just-in-time debugging functionality is controlled by an environment variable, PGI_TERM. The run-time libraries use the value of PGI_TERM to determine what action to take when a program abnormally terminates.

When the PGI runtime library detects an error or catches a signal, it calls the routine pgi_stop_here() prior to generating a stack traceback or starting the debugger. The pgi_stop_here routine is a convenient spot to set a breakpoint when debugging a program.

For more information on PGI_TERM and the supported values, refer to “PGI_TERM,” on page 97.

Chapter 9. Distributing Files - Deployment

Once you have successfully built, debugged and tuned your application, you may want to distribute it to users who need to run it on a variety of systems. This chapter addresses how to effectively distribute applications built using PGI compilers and tools. The application must be installed in such a way that it executes accurately on a system other than the one on which it was built, and which may be configured differently.
Deploying Applications on Linux

To successfully deploy your application on Linux, there are a number of issues to consider, including these:

• Runtime Libraries
• 64-bit Linux Systems
• Redistribution of Files
• Linux Portability of files and packages
• Licensing

Runtime Library Considerations

On Linux systems, the system runtime libraries can be linked to an application either statically or dynamically. For example, for the C runtime library, libc, you can use either the static version libc.a or the shared object libc.so. If the application is intended to run on Linux systems other than the one on which it was built, it is generally safer to use the shared object version of the library. This approach ensures that the application uses a version of the library that is compatible with the system on which the application is running. Further, it works best when the application is linked on a system that has an equivalent or earlier version of the system software than the system on which the application will be run.

Note
Building on a newer system and running the application on an older system may not produce the desired output.

To use the shared object version of a library, the application must also link to shared object versions of the PGI runtime libraries. To execute an application built in such a way on a system on which PGI compilers are not installed, those shared objects must be available. To build using the shared object versions of the runtime libraries, use the -Bdynamic option, as shown here:

    $ pgf90 -Bdynamic myprog.f90

64-bit Linux Considerations

On 64-bit Linux systems, 64-bit applications that use the -mcmodel=medium option sometimes cannot be successfully linked statically. Therefore, users with executables built with the -mcmodel=medium option may need to use shared libraries, linking dynamically. Also, runtime libraries built using the -fpic option use 32-bit offsets, so they sometimes need to reside near other runtime libs in a shared area of Linux program memory.

Note
If your application is linked dynamically using shared objects, then the shared object versions of the PGI runtime are required.

Linux Redistributable Files

There are two methods for installing the shared object versions of the runtime libraries required for applications built with PGI compilers and tools: Linux Portability Package and manual distribution.

PGI provides the Linux Portability Package, an installation package that can be downloaded from the PGI web site. In addition, when the PGI compilers are installed, there is a directory named REDIST for each platform (linux86 and linux86-64) that contains the redistributed shared object libraries. These may be redistributed by licensed PGI customers under the terms of the PGI End-User License Agreement.

Restrictions on Linux Portability

You cannot expect to be able to run an executable on any given Linux machine. Portability depends on the system you build on as well as how much your program uses system routines that may have changed from Linux release to Linux release. For example, one area of significant change between some versions of Linux is in libpthread.so. PGI compilers use this shared object for the options -Mconcur (auto-parallel) and -mp (OpenMP) programs. Typically, portability is supported for forward execution, meaning running a program on the same or a later version of Linux; but not for backward compatibility, that is, running on a prior release. For example, a user who compiles and links a program under Suse 9.1 should not expect the program to run without incident on a Red Hat 8.0 system, which is an earlier version of Linux. It may run, but it is less likely. Developers might consider building applications on earlier Linux versions for wider usage.
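The forward-execution rule of thumb above can be checked before an application is shipped. The following bash sketch is not part of the PGI tools. It assumes that the GNU C library version is a usable stand-in for the "Linux release" (on glibc-based systems, getconf GNU_LIBC_VERSION reports it) and that GNU sort with the -V option is available. The function names version_le and can_run_on are hypothetical and exist only for this illustration.

```shell
#!/bin/bash
# Sketch only: compares dotted version strings to apply the
# "run on the same or a later release" rule of thumb.
# Assumption: the library version stands in for the Linux release.

# version_le A B: succeeds when version A is less than or equal to version B.
# Relies on the version-sort option (-V) of GNU sort.
version_le() {
    [ "$1" = "$2" ] && return 0
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# can_run_on BUILD TARGET: prints a verdict for a program linked on a system
# with version BUILD and executed on a system with version TARGET.
can_run_on() {
    if version_le "$1" "$2"; then
        echo "likely"      # forward execution: same or later release
    else
        echo "unlikely"    # backward execution: earlier release
    fi
}

# On a glibc-based build host, the version can be read with:
#   getconf GNU_LIBC_VERSION
# The version numbers below are made-up inputs for illustration.
can_run_on 2.3.3 2.3.4
can_run_on 2.3.3 2.2.5
```

With these made-up inputs the script prints likely and then unlikely. A verdict of "likely" is only a hint: as noted above, a program may still fail if it depends on system routines that changed between releases.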
Installing the Linux Portability Package

You can download the Linux Portability Packages from the Downloads page at http://www.pgroup.com. First download the package you need, then untar it, and run the install script. Then you can add the installation directory to your library path.

To use the installed libraries, you can either modify /etc/ld.so.conf and run ldconfig(1) or modify the environment variable LD_LIBRARY_PATH, as shown here:

    setenv LD_LIBRARY_PATH /usr/local/pgi

or

    export LD_LIBRARY_PATH=/usr/local/pgi

Licensing for Redistributable Files

The installation of the Linux Portability Package presents the standard PGI usage license. The libraries can be distributed for use with PGI compiled applications, within the provisions of that license.

The files in the REDIST directories may be redistributed under the terms of the End-User License Agreement for the product in which they were included.

Deploying Applications on Windows

Windows programs may be linked statically or dynamically.

• A statically linked program is completely self-contained, created by linking to static versions of the PGI and Microsoft runtime libraries.
• A dynamically linked program depends on separate dynamically-linked libraries (DLLs) that must be installed on a system for the application to run on that system.

Although it may be simpler to install a statically linked executable, there are advantages to using the DLL versions of the runtime, including these:

• Executable binary file size is smaller.
• Multiple processes can use DLLs at once, saving system resources.
• New versions of the runtime can be installed and used by the application without rebuilding the application.

Dynamically-linked Windows programs built with PGI compilers depend on dynamic run-time library files (DLLs). These DLLs must be distributed with such programs to enable them to execute on systems where the PGI compilers are not installed.
These redistributable libraries include both PGI runtime libraries and Microsoft runtime libraries.

PGI Redistributables

PGI Redistributable directories contain all of the PGI Linux runtime library shared object files or Windows dynamically-linked libraries that can be redistributed by PGI 7.1 licensees under the terms of the PGI End-User License Agreement (EULA).

Microsoft Redistributables

The PGI products on Windows include Microsoft Open Tools. The Microsoft Open Tools directory contains a subdirectory named redist. PGI licensees may redistribute the files contained in this directory in accordance with the terms of the PGI End-User License Agreement.

Microsoft supplies installation packages, vcredist_x86.exe and vcredist_x64.exe, containing these runtime files. You can download these packages from www.microsoft.com.

Code Generation and Processor Architecture

The PGI compilers can generate much more efficient code if they know the specific x86 processor architecture on which the program will run. When preparing to deploy your application, you should determine whether you want the application to run on the widest possible set of x86 processors, or if you want to restrict the application to run on a specific processor or set of processors. The restricted approach allows you to optimize performance for that set of processors.

Different processors have differences, some subtle, in hardware features, such as instruction sets and cache size. The compilers make architecture-specific decisions about such things as instruction selection, instruction scheduling, and vectorization, all of which can have a profound effect on the performance of your application.

Processor-specific code generation is controlled by the -tp option, described in “-tp [,target...],” on page 202. When an application is compiled without any -tp options, the compiler generates code for the type of processor on which the compiler is run.
Generating Generic x86 Code

To generate generic x86 code, use one of the following forms of the -tp option on your command line:

    -tp px    ! generate code for any x86 cpu type
    -tp p6    ! generate code for Pentium 2 or greater

While both of these examples are good choices for portable execution, most users have Pentium 2 or greater CPUs.

Generating Code for a Specific Processor

You can use the -tp option to request that the compiler generate code optimized for a specific processor. The PGI Release Notes contain a list of supported processors, or you can look at the -tp entry in the compiler output generated by using the -help option, described in “-help,” on page 178.

Generating Code for Multiple Types of Processors in One Executable

PGI unified binaries provide a low-overhead method for a single program to run well on a number of hardware platforms.

All 64-bit PGI compilers can produce PGI Unified Binary programs that contain code streams fully optimized and supported for both AMD64 and Intel EM64T processors using the -tp target option. The compilers generate and combine multiple binary code streams into one executable, where each stream is optimized for a specific platform. At runtime, this one executable senses the environment and dynamically selects the appropriate code stream.

Executable size is automatically controlled via unified binary culling.
Only those functions and subroutines where the target affects the generated code will have unique binary images, resulting in a code-size savings of 10-90% compared to generating full copies of code for each target.

Programs can use PGI Unified Binary even if all of the object files and libraries are not compiled as unified binaries. Like any other object file, you can use PGI Unified Binary object files to create programs or libraries. No special start up code is needed; support is linked in from the PGI libraries.

The -Mpfi option disables generation of PGI Unified Binary. Instead, the default target auto-detect rules for the host are used to select the target processor.

Unified Binary Command-line Switches

The PGI Unified Binary command-line switch is an extension of the target processor switch, -tp, which may be applied to individual files during compilation.

The target processor switch, -tp, accepts a comma-separated list of 64-bit targets and generates code optimized for each listed target. The following example generates optimized code for three targets:

    -tp k8-64,p7-64,core2-64

A special target switch, -tp x64, is the same as -tp k8-64,p7-64.

Unified Binary Directives and Pragma

Unified binary directives and pragmas may be applied to functions, subroutines, or whole files. The directives and pragmas cause the compiler to generate PGI Unified Binary code optimized for one or more targets. No special command line options are needed for these pragmas and directives to take effect.

The syntax of the Fortran directive is this:

    pgi$[g|r| ] pgi tp [target]...

where the scope is g (global), r (routine) or blank. The default is r, routine.

For example, the following syntax indicates that the whole file, represented by g, should be optimized for both k8_64 and p7_64.

    pgi$g pgi tp k8_64 p7_64

The syntax of the C/C++ pragma is this:

    #pragma [global|routine|] tp [target]...

where the scope is global, routine, or blank. The default is routine.
For example, the following syntax indicates that the next function should be optimized for k8_64, p7_64, and core2_64.

    #pragma routine tp k8_64 p7_64 core2_64

Chapter 10. Inter-language Calling

This chapter describes inter-language calling conventions for C, C++, and Fortran programs using the PGI compilers. The following sections describe how to call a Fortran function or subroutine from a C or C++ program and how to call a C or C++ function from a Fortran program. For information on calling assembly language programs, refer to Chapter 18, “Run-time Environment”.

This chapter provides examples that use the following options related to inter-language calling. For more information on these options, refer to Chapter 15, “Command-Line Options Reference,” on page 163.

    -c
    -Mnomain

Overview of Calling Conventions

This chapter includes information on the following topics:

• Functions and subroutines in Fortran, C, and C++
• Naming and case conversion conventions
• Compatible data types
• Argument passing and special return values
• Arrays and indexes
• Win32 calling conventions

The sections “Inter-language Calling Considerations,” on page 110 through “Example - C++ Calling Fortran,” on page 119 describe how to perform inter-language calling using the Linux/Win64/SUA convention. Default Fortran calling conventions for Win32 differ, although Win32 programs compiled using the -Munix Fortran command-line option use the Linux/Win64 convention rather than the default Win32 conventions. All information in those sections pertaining to compatibility of arguments applies to Win32 as well. For details on the symbol name and argument passing conventions used on Win32 platforms, refer to “Win32 Calling Conventions,” on page 120.

Inter-language Calling Considerations

In general, when argument data types and function return values agree, you can call a C or C++ function from Fortran as well as call a Fortran function from C or C++.
When data types for arguments do not agree, you may need to develop custom mechanisms to handle them. For example, the Fortran COMPLEX type has a matching type in C99 but does not have a matching type in C90. Inter-language calls are still possible in such cases, but there are no general calling conventions for them.

Note

• If a C++ function contains objects with constructors and destructors, calling such a function from either C or Fortran is not possible unless the initialization in the main program is performed from a C++ program in which constructors and destructors are properly initialized.
• In general, you can call a C or Fortran function from C++ without problems as long as you use the extern "C" keyword to declare the function in the C++ program. This declaration prevents name mangling for the C function name. If you want to call a C++ function from C or Fortran, you also have to use the extern "C" keyword to declare the C++ function. This keeps the C++ compiler from mangling the name of the function.
• You can use the __cplusplus macro to allow a program or header file to work for both C and C++. For example, the following defines in the header file stdio.h allow this file to work for both C and C++.

    #ifndef _STDIO_H
    #define _STDIO_H
    #ifdef __cplusplus
    extern "C" {
    #endif /* __cplusplus */
    .
    . /* Functions and data types defined... */
    .
    #ifdef __cplusplus
    }
    #endif /* __cplusplus */
    #endif

• C++ member functions cannot be declared extern "C", since their names will always be mangled. Therefore, C++ member functions cannot be called from C or Fortran.

Functions and Subroutines

Fortran, C, and C++ define functions and subroutines differently.

For a Fortran program calling a C or C++ function, observe the following return value convention:

• When a C or C++ function returns a value, call it from Fortran as a function.
• When a C or C++ function does not return a value, call it as a subroutine.
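As a minimal sketch of this return value convention, the following C file defines one routine of each kind. The names add_ints_ and scale_in_place_ are hypothetical; the trailing underscore and the pointer arguments assume the default Linux/Win64 naming and pass-by-reference conventions described elsewhere in this chapter.

```c
/* Returns a value, so a Fortran caller would treat it as a function,
   for example:
       integer add_ints
       k = add_ints(i, j)                                             */
int add_ints_(int *a, int *b)
{
    return *a + *b;
}

/* Returns nothing, so a Fortran caller would treat it as a subroutine,
   for example:
       call scale_in_place(x, factor)                                 */
void scale_in_place_(double *x, double *factor)
{
    *x *= *factor;
}
```

Both routines take pointers because Fortran passes the address of each argument. The same routines can be exercised from C by passing addresses explicitly, for example add_ints_(&i, &j).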
For a C/C++ program calling a Fortran function, the call should return a similar type. Table 10.1, “Fortran and C/C++ Data Type Compatibility,” on page 111 lists compatible types. If the call is to a Fortran subroutine, a Fortran CHARACTER function, or a Fortran COMPLEX function, call it from C/C++ as a function that returns void. The exception to this convention is when a Fortran subroutine has alternate returns; call such a subroutine from C/C++ as a function returning int whose value is the value of the integer expression specified in the alternate RETURN statement.

Upper and Lower Case Conventions, Underscores

By default on Linux, Win64, OSX, and SUA systems, all Fortran symbol names are converted to lower case. C and C++ are case sensitive, so upper-case function names stay upper-case. When you use inter-language calling, you can either name your C/C++ functions with lower-case names, or invoke the Fortran compiler command with the option -Mupcase, in which case it will not convert symbol names to lower-case.

When programs are compiled using one of the PGI Fortran compilers on Linux, Win64, OSX, and SUA systems, an underscore is appended to Fortran global names (names of functions, subroutines and common blocks). This mechanism distinguishes Fortran name space from C/C++ name space. Use these naming conventions:

• If you call a C/C++ function from Fortran, you should rename the C/C++ function by appending an underscore or use C$PRAGMA C in the Fortran program. For more information on C$PRAGMA C, refer to “C$PRAGMA C,” on page 72.
• If you call a Fortran function from C/C++, you should append an underscore to the Fortran function name in the calling program.

Compatible Data Types

Table 10.1 shows compatible data types between Fortran and C/C++. Table 10.2, “Fortran and C/C++ Representation of the COMPLEX Type,” on page 112 shows how the Fortran COMPLEX type may be represented in C/C++.
If you can make your function/subroutine parameters as well as your return values match types, you should be able to use inter-language calling.

Table 10.1. Fortran and C/C++ Data Type Compatibility

Fortran Type (lower case)    C/C++ Type        Size (bytes)
character x                  char x            1
character*n x                char x[n]         n
real x                       float x           4
real*4 x                     float x           4
real*8 x                     double x          8
double precision             double x          8
integer x                    int x             4
integer*1 x                  signed char x     1
integer*2 x                  short x           2
integer*4 x                  int x             4
integer*8 x                  long long x       8
logical x                    int x             4
logical*1 x                  char x            1
logical*2 x                  short x           2
logical*4                    int x             4
logical*8                    long long x       8

Table 10.2. Fortran and C/C++ Representation of the COMPLEX Type

Fortran Type (lower case)    C/C++ Type                    Size (bytes)
complex x                    struct {float r,i;} x;        8
                             float complex x;              8
complex*8 x                  struct {float r,i;} x;        8
                             float complex x;              8
double complex x             struct {double dr,di;} x;     16
                             double complex x;             16
complex *16 x                struct {double dr,di;} x;     16
                             double complex x;             16

Note: For C/C++, the complex type implies C99 or later.

Fortran Named Common Blocks

A named Fortran common block can be represented in C/C++ by a structure whose members correspond to the members of the common block. The name of the structure in C/C++ must have the added underscore. For example, the Fortran common block:

      INTEGER I
      COMPLEX C
      DOUBLE COMPLEX CD
      DOUBLE PRECISION D
      COMMON /COM/ i, c, cd, d

is represented in C with the following equivalent:

    extern struct {
        int i;
        struct {float real, imag;} c;
        struct {double real, imag;} cd;
        double d;
    } com_;

and in C++ with the following equivalent:

    extern "C" struct {
        int i;
        struct {float real, imag;} c;
        struct {double real, imag;} cd;
        double d;
    } com_;

Tip: For global or external data sharing, extern "C" is not required.

Argument Passing and Return Values

In Fortran, arguments are passed by reference, that is, the address of the argument is passed, rather than the argument itself.
In C/C++, arguments are passed by value, except for strings and arrays, which are passed by reference. Due to the flexibility provided in C/C++, you can work around these differences. Solving the parameter passing differences generally involves intelligent use of the & and * operators in argument passing when C/C++ calls Fortran and in argument declarations when Fortran calls C/C++.

For strings declared in Fortran as type CHARACTER, an argument representing the length of the string is also passed to the called function. On Linux systems, or when using the UNIX calling convention on Windows (-Munix), the compiler places the length argument(s) at the end of the parameter list, following the other formal arguments. The length argument is passed by value, not by reference.

Passing by Value (%VAL)

When passing parameters from a Fortran subprogram to a C/C++ function, it is possible to pass by value using the %VAL function. If you enclose a Fortran parameter with %VAL(), the parameter is passed by value. For example, the following call passes the integer i and the logical bvar by value.

      integer*1 i
      logical*1 bvar
      call cvalue (%VAL(i), %VAL(bvar))

Character Return Values

“Functions and Subroutines,” on page 110 describes the general rules for return values for C/C++ and Fortran inter-language calling. There is a special return value to consider. When a Fortran function returns a character, two arguments need to be added at the beginning of the C/C++ calling function’s argument list:

• The address of the return character or characters
• The length of the return character

Example 10.1, “Character Return Parameters” illustrates the extra parameters, tmp and 10, supplied by the caller:

Example 10.1. Character Return Parameters

! Fortran function returns a character
      CHARACTER*(*) FUNCTION CHF( C1,I)
      CHARACTER*(*) C1
      INTEGER I
      END

    /* C declaration of Fortran function */
    extern void chf_();
    char tmp[10];
    char c1[9];
    int i;
    chf_(tmp, 10, c1, &i, 9);

If the Fortran function is declared to return a character value of constant length, for example CHARACTER*4 FUNCTION CHF(), the second extra parameter representing the length must still be supplied, but is not used.

NOTE: The value of the character function is not automatically NULL-terminated.

Complex Return Values

When a Fortran function returns a complex value, an argument needs to be added at the beginning of the C/C++ calling function’s argument list; this argument is the address of the complex return value. Example 10.2, “COMPLEX Return Values” illustrates the extra parameter, c1 of type cplx, supplied by the caller.

Example 10.2. COMPLEX Return Values

      COMPLEX FUNCTION CF(C, I)
      INTEGER I
      . . .
      END

    extern void cf_();
    typedef struct {float real, imag;} cplx;
    cplx c1;   /* receives the COMPLEX return value */
    float c;   /* argument C, implicitly typed REAL in CF */
    int i;
    cf_(&c1, &c, &i);

Array Indices

C/C++ arrays and Fortran arrays use different default initial array index values. By default, C/C++ arrays start at 0 and Fortran arrays start at 1. If you adjust your array comparisons so that the Fortran element at index i is compared to the C/C++ element at index i-1 (for example, the Fortran element at index 2 corresponds to the C/C++ element at index 1), you should not have problems working with this difference. If this is not satisfactory, you can declare your Fortran arrays to start at zero.

Another difference between Fortran and C/C++ arrays is the storage method used. Fortran uses column-major order and C/C++ uses row-major order. For one-dimensional arrays, this poses no problems. For two-dimensional arrays, where there are an equal number of rows and columns, row and column indexes can simply be reversed. For arrays other than single dimensional arrays and square two-dimensional arrays, inter-language function mixing is not recommended.
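To make the index and storage-order differences concrete, the following sketch shows one possible way for C code to read an element of a two-dimensional Fortran array that it has received as a flat block of memory. The helper names fortran_offset and fortran_get are hypothetical, and the sketch assumes the Fortran array uses the default lower bound of 1 in both dimensions.

```c
#include <stddef.h>

/* Offset, in elements, of Fortran element A(i,j) inside the storage of
   an array declared A(rows, cols). Column-major order means that
   consecutive elements of one column are adjacent in memory. */
size_t fortran_offset(size_t i, size_t j, size_t rows)
{
    return (j - 1) * rows + (i - 1);   /* convert 1-based to 0-based */
}

/* Read A(i,j) from C through a pointer to the first element. */
double fortran_get(const double *a, size_t i, size_t j, size_t rows)
{
    return a[fortran_offset(i, j, rows)];
}
```

For a Fortran array declared A(2,3), the element A(1,2) is the third element in memory, so fortran_offset(1, 2, 2) is 2. Writing the mapping out explicitly like this avoids relying on C's own row-major two-dimensional indexing.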
Examples

This section contains examples that illustrate inter-language calling.

Example - Fortran Calling C

Example 10.4, “C function cfunc_” shows a C function that is called by the Fortran main program shown in Example 10.3, “Fortran Main Program fmain.f”. Notice that each argument is defined as a pointer, since Fortran passes by reference. Also notice that the C function name uses all lower-case and a trailing "_".

Example 10.3. Fortran Main Program fmain.f

      logical*1 bool1
      character letter1
      integer*4 numint1, numint2
      real numfloat1
      double precision numdoub1
      integer*2 numshor1
      external cfunc
      call cfunc (bool1, letter1, numint1, numint2,
     +            numfloat1, numdoub1, numshor1)
      write( *, "(L2, A2, I5, I5, F6.1, F6.1, I5)")
     +       bool1, letter1, numint1, numint2, numfloat1,
     +       numdoub1, numshor1
      end

Example 10.4. C function cfunc_

    #define TRUE 0xff
    #define FALSE 0

    /* len_letter1 is the length of the CHARACTER argument letter1,
       which Fortran appends by value after the other arguments. */
    void cfunc_(char *bool1, char *letter1, int *numint1, int *numint2,
                float *numfloat1, double *numdoub1, short *numshor1,
                int len_letter1)
    {
        *bool1 = TRUE;
        *letter1 = 'v';
        *numint1 = 11;
        *numint2 = -44;
        *numfloat1 = 39.6;
        *numdoub1 = 39.2;
        *numshor1 = 981;
    }

Compile and execute the program fmain.f with the call to cfunc_ using the following command lines:

    $ pgcc -c cfunc.c
    $ pgf95 cfunc.o fmain.f

Executing the a.out file should produce the following output:

    T v 11 -44 39.6 39.2 981

Example - C Calling Fortran

Example 10.6, “C Main Program cmain.c” shows a C main program that calls the Fortran subroutine shown in Example 10.5, “Fortran Subroutine forts.f”. Notice that each call uses the & operator to pass by reference. Also notice that the call to the Fortran subroutine uses all lower-case and a trailing "_".

Example 10.5. Fortran Subroutine forts.f

      subroutine forts ( bool1, letter1, numint1,
     +      numint2, numfloat1, numdoub1, numshor1)
      logical*1 bool1
      character letter1
      integer numint1, numint2
      double precision numdoub1
      real numfloat1
      integer*2 numshor1
      bool1 = .true.
      letter1 = "v"
      numint1 = 11
      numint2 = -44
      numdoub1 = 902
      numfloat1 = 39.6
      numshor1 = 299
      return
      end

Example 10.6. C Main Program cmain.c

    #include <stdio.h>

    int main(void)
    {
        char bool1, letter1;
        int numint1, numint2;
        float numfloat1;
        double numdoub1;
        short numshor1;
        extern void forts_();

        /* The trailing 1 is the length of the CHARACTER argument letter1. */
        forts_(&bool1, &letter1, &numint1, &numint2, &numfloat1,
               &numdoub1, &numshor1, 1);
        printf(" %s %c %d %d %3.1f %.0f %d\n",
               bool1 ? "TRUE" : "FALSE", letter1, numint1,
               numint2, numfloat1, numdoub1, numshor1);
        return 0;
    }

To compile this Fortran subroutine and C program, use the following commands:

    $ pgcc -c cmain.c
    $ pgf95 -Mnomain cmain.o forts.f

Executing the resulting a.out file should produce the following output:

    TRUE v 11 -44 39.6 902 299

Example - C++ Calling C

Example 10.8, “C++ Main Program cpmain.C Calling a C Function” shows a C++ main program that calls the C function shown in Example 10.7, “Simple C Function cfunc.c”.

Example 10.7. Simple C Function cfunc.c

    #include <stdio.h>

    void cfunc(int num1, int num2, int *res)
    {
        printf("func: a = %d b = %d ptr c = %p\n", num1, num2, (void *)res);
        *res = num1 / num2;
        printf("func: res = %d\n", *res);
    }

Example 10.8. C++ Main Program cpmain.C Calling a C Function

    extern "C" void cfunc(int n, int m, int *p);
    #include
    main()
    {
      int a,b,c;
      a=8;
      b=2;
      cout << "main: a = "<

    extern "C" { extern void forts_(char *,char *,int *,int *, float *,double *,short *); }
    main ()
    {
      char bool1, letter1;
      int numint1, numint2;
      float numfloat1;
      double numdoub1;
      short numshor1;
      forts_(&bool1,&letter1,&numint1,&numint2,&numfloat1, &numdoub1,&numshor1);
      cout << " bool1 = ";
      bool1?cout << "TRUE ":cout << "FALSE ";
      cout <

2GB in size.
Note that if you execute with the above settings in your environment, you may see the following:

    % bigadd
    Segmentation fault

Execution fails because the stack size is not large enough. Try resetting the stack size in your environment:

    % limit stacksize 3000M

Note that ‘limit stacksize unlimited’ will probably not provide as large a stack as we are using above.

    % bigadd
    a[0]=1 b[0]=2 c[0]=3
    n=599990000 a[599990000]=5.9999e+08 b[599990000]=1.19998e+09 c[599990000]=1.79997e+09

The size of the bss section of the bigadd executable is now larger than 2GB:

    % size --format=sysv bigadd | grep bss
    .bss 4800000008 5245696
    % size --format=sysv bigadd | grep Total
    Total 4800005080

Example: Medium Memory Model and Large Array in Fortran

The following example works with both the PGF95 and PGF77 compilers included in Release 7.0. Both compilers use 64-bit addresses and index arithmetic when the -mcmodel=medium option is used.

Consider the following example:

    % cat mat.f

      program mat
      integer i, j, k, size, l, m, n
      parameter (size=16000) ! >2GB
      parameter (m=size,n=size)
      real*8 a(m,n),b(m,n),c(m,n),d

      do i = 1, m
        do j = 1, n
          a(i,j)=10000.0D0*dble(i)+dble(j)
          b(i,j)=20000.0D0*dble(i)+dble(j)
        enddo
      enddo

!$omp parallel
!$omp do
      do i = 1, m
        do j = 1, n
          c(i,j) = a(i,j) + b(i,j)
        enddo
      enddo
!$omp do
      do i=1,m
        do j = 1, n
          d = 30000.0D0*dble(i)+dble(j)+dble(j)
          if(d .ne. c(i,j)) then
            print *,"err i=",i,"j=",j
            print *,"c(i,j)=",c(i,j)
            print *,"d=",d
            stop
          endif
        enddo
      enddo
!$omp end parallel
      print *, "M =",M,", N =",N
      print *, "c(M,N) = ", c(m,n)
      end

When compiled with the PGF95 compiler using -mcmodel=medium:

    % pgf95 -mp -o mat mat.f -i8 -mcmodel=medium
    % setenv OMP_NUM_THREADS 2
    % mat
    M = 16000 , N = 16000
    c(M,N) = 480032000.0000000

Example: Large Array and Small Memory Model in Fortran

The following example uses large, dynamically-allocated arrays.
The code is divided into a main program and a subroutine so you could put the subroutine into a shared library. Dynamic allocation of large arrays saves space in the size of the executable and saves time initializing data. Further, the routines can be compiled with 32-bit compilers, by just decreasing the parameter size below.

    % cat mat_allo.f90

    program mat_allo
      integer i, j
      integer size, m, n
      parameter (size=16000)
      parameter (m=size,n=size)
      double precision, allocatable::a(:,:),b(:,:),c(:,:)

      allocate(a(m,n), b(m,n), c(m,n))
      do i = 100, m, 1
        do j = 100, n, 1
          a(i,j) = 10000.0D0 * dble(i) + dble(j)
          b(i,j) = 20000.0D0 * dble(i) + dble(j)
        enddo
      enddo
      call mat_add(a,b,c,m,n)
      print *, "M =",m,",N =",n
      print *, "c(M,N) = ", c(m,n)
    end

    subroutine mat_add(a,b,c,m,n)
      integer m, n, i, j
      double precision a(m,n),b(m,n),c(m,n)
    !$omp do
      do i = 1, m
        do j = 1, n
          c(i,j) = a(i,j) + b(i,j)
        enddo
      enddo
      return
    end

    % pgf95 -o mat_allo mat_allo.f90 -i8 -Mlarge_arrays -mp -fast

Chapter 12. C/C++ Inline Assembly and Intrinsics

Inline Assembly

Inline Assembly lets you specify machine instructions inside a "C" function. The format for an inline assembly instruction is this:

    { asm | __asm__ } ("string");

The asm statement begins with the asm or __asm__ keyword. The __asm__ keyword is typically used in header files that may be included in ISO "C" programs.

"string" is one or more machine specific instructions separated with a semi-colon (;) or newline (\n) character. These instructions are inserted directly into the compiler's assembly-language output for the enclosing function.

Some simple asm statements are:

    asm ("cli");
    asm ("sti");

The asm statements above disable and enable system interrupts respectively.

In the following example, the eax register is set to zero.

    asm( "pushl %eax\n\t"
         "movl $0, %eax\n\t"
         "popl %eax");

Notice that eax is pushed on the stack so that it is not clobbered. When the statement is done with eax, it is restored with the popl instruction.
Typically a program uses macros that enclose asm statements. The interrupt constructs shown above are used in the following two examples:

    #define disableInt __asm__ ("cli");
    #define enableInt __asm__ ("sti");

Extended Inline Assembly

“Inline Assembly,” on page 133 explains how to use inline assembly to specify machine specific instructions inside a "C" function. This approach works well for simple machine operations such as disabling and enabling system interrupts. However, inline assembly has three distinct limitations:

1. The programmer must choose the registers required by the inline assembly.
2. To prevent register clobbering, the inline assembly must include push and pop code for registers that get modified by the inline assembly.
3. There is no easy way to access stack variables in an inline assembly statement.

Extended Inline Assembly was created to address these limitations. The format for extended inline assembly, also known as extended asm, is as follows:

    { asm | __asm__ } [ volatile | __volatile__ ]
      ("string" [: [output operands]] [: [input operands]] [: [clobber list]]);

• Extended asm statements begin with the asm or __asm__ keyword. Typically the __asm__ keyword is used in header files that may be included by ISO "C" programs.
• An optional volatile or __volatile__ keyword may appear after the asm keyword. This keyword instructs the compiler not to delete the asm statement, move it significantly, or combine it with any other asm statement. Like __asm__, the __volatile__ keyword is typically used with header files that may be included by ISO "C" programs.
• "string" is one or more machine specific instructions separated with a semi-colon (;) or newline (\n) character. The string can also contain operands specified in the [output operands], [input operands], and [clobber list]. The instructions are inserted directly into the compiler's assembly-language output for the enclosing function.
• The [output operands], [input operands], and [clobber list] items each describe the effect of the instruction for the compiler. For example:

    asm( "movl %1, %%eax\n"
         "movl %%eax, %0"
         : "=r" (x) : "r" (y) : "%eax" );

  where
    "=r" (x) is an output operand.
    "r" (y) is an input operand.
    "%eax" is the clobber list consisting of one register, "%eax".

The notation for the output and input operands is a constraint string surrounded by quotes, followed by an expression surrounded by parentheses. The constraint string describes how the input and output operands are used in the asm "string". For example, "r" tells the compiler that the operand is a register. The "=" tells the compiler that the operand is write only, which means that a value is stored in an output operand's expression at the end of the asm statement.

Each operand is referenced in the asm "string" by a percent "%" and its number. The first operand is number 0, the second is number 1, the third is number 2, and so on. In the preceding example, "%0" references the output operand, and "%1" references the input operand. The asm "string" also contains "%%eax", which references machine register "%eax". Hard coded registers like "%eax" should be specified in the clobber list to prevent conflicts with other instructions in the compiler's assembly-language output.

[output operands], [input operands], and [clobber list] items are described in more detail in the following sections.

Output Operands

The [output operands] are an optional list of output constraint and expression pairs that specify the result(s) of the asm statement. An output constraint is a string that specifies how a result is delivered to the expression. For example, "=r" (x) says the output operand is a write-only register that stores its value in the "C" variable x at the end of the asm statement.
An example follows:

    int x;
    void example()
    {
        asm( "movl $0, %0" : "=r" (x) );
    }

The previous example assigns 0 to the "C" variable x. For the function in this example, the compiler produces the following assembly. If you want to produce an assembly listing, compile the example with the pgcc -S compiler option:

    example:
    ..Dcfb0:
            pushq %rbp
    ..Dcfi0:
            movq %rsp, %rbp
    ..Dcfi1:
    ..EN1:
    ## lineno: 8
            movl $0, %eax
            movl %eax, x(%rip)
    ## lineno: 0
            popq %rbp
            ret

In the generated assembly shown, notice that the compiler generated two statements for the asm statement at line number 5. The compiler generated "movl $0, %eax" from the asm "string". Also notice that %eax appears in place of "%0" because the compiler assigned the %eax register to variable x. Since item 0 is an output operand, the result must be stored in its expression (x). The instruction movl %eax, x(%rip) assigns the output operand to variable x.

In addition to write-only output operands, there are read/write output operands designated with a "+" instead of a "=". For example, "+r" (x) tells the compiler to initialize the output operand with variable x at the beginning of the asm statement.

To illustrate this point, the following example increments variable x by 1:

    int x=1;
    void example2()
    {
        asm( "addl $1, %0" : "+r" (x) );
    }

To perform the increment, the output operand must be initialized with variable x. The read/write constraint modifier ("+") instructs the compiler to initialize the output operand with its expression. The compiler generates the following assembly code for the example2() function:

    example2:
    ..Dcfb0:
            pushq %rbp
    ..Dcfi0:
            movq %rsp, %rbp
    ..Dcfi1:
    ..EN1:
    ## lineno: 5
            movl x(%rip), %eax
            addl $1, %eax
            movl %eax, x(%rip)
    ## lineno: 0
            popq %rbp
            ret

From the example2() code, two extraneous moves are generated in the assembly: one movl for initializing the output register and a second movl to write it to variable x.
To eliminate these moves, use a memory constraint type instead of a register constraint type, as shown in the following example:

    int x=1;
    void example2()
    {
        asm( "addl $1, %0" : "+m" (x) );
    }

The compiler generates a memory reference in place of a memory constraint. This eliminates the two extraneous moves:

    example2:
    ..Dcfb0:
            pushq   %rbp
    ..Dcfi0:
            movq    %rsp, %rbp
    ..Dcfi1:
    ..EN1:
    ## lineno: 5
            addl    $1, x(%rip)
    ## lineno: 0
            popq    %rbp
            ret

Because the assembly uses a memory reference to variable x, it does not have to move x into a register prior to the asm statement; nor does it need to store the result after the asm statement. Additional constraint types are found in “Additional Constraints,” on page 139.

The examples thus far have used only one output operand. Because extended asm accepts a list of output operands, asm statements can have more than one result. For example:

    void example4()
    {
        int x=1;
        int y=2;
        asm( "addl $1, %1\n"
             "addl %1, %0"
             : "+r" (x), "+m" (y) );
    }

The example above increments variable y by 1 then adds it to variable x. Multiple output operands are separated with a comma. The first output operand is item 0 ("%0") and the second is item 1 ("%1") in the asm "string". The resulting values for x and y are 4 and 3 respectively.

Input Operands

The [input operands] are an optional list of input constraint and expression pairs that specify what "C" values are needed by the asm statement. The input constraints specify how the data is delivered to the asm statement. For example, "r" (x) says that the input operand is a register that has a copy of the value stored in "C" variable x. Another example is "m" (x) which says that the input item is the memory location associated with variable x. Other constraint types are discussed in “Additional Constraints,” on page 139.
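As a minimal sketch of the two input constraints just described, the functions below copy a "C" value into a result register, once through an "r" input and once through an "m" input. The function names are hypothetical, and the sketch assumes a compiler that accepts GCC-style extended asm on an x86 or x86_64 target; on any other target the preprocessor guard falls back to plain C. The __asm__ spelling is used because GCC keeps it available in strict ISO modes, where the plain asm keyword is disabled.

```c
/* Hypothetical helpers for illustration; not taken from the PGI documentation. */

/* "r" input: the compiler places a copy of x in a general purpose register. */
int copy_from_register(int x)
{
    int result;
#if defined(__i386__) || defined(__x86_64__)
    __asm__("movl %1, %0" : "=r" (result) : "r" (x));
#else
    result = x;   /* plain C fallback for non-x86 targets */
#endif
    return result;
}

/* "m" input: the instruction reads x directly from its memory location. */
int copy_from_memory(int x)
{
    int result;
#if defined(__i386__) || defined(__x86_64__)
    __asm__("movl %1, %0" : "=r" (result) : "m" (x));
#else
    result = x;   /* plain C fallback for non-x86 targets */
#endif
    return result;
}
```

Both functions return their argument unchanged; the only difference is whether operand 1 appears in the generated instruction as a register or as a memory reference.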
An example follows:

    void example5()
    {
        int x=1;
        int y=2;
        int z=3;
        asm( "addl %2, %1\n"
             "addl %2, %0"
             : "+r" (x), "+m" (y)
             : "r" (z) );
    }

The previous example adds variable z, item 2, to variable x and variable y. The resulting values for x and y are 4 and 5 respectively.

Another type of input constraint worth mentioning here is the matching constraint. A matching constraint is used to specify an operand that fills both an input as well as an output role. An example follows:

    int x=1;
    void example6()
    {
        asm( "addl $1, %1"
             : "=r" (x)
             : "0" (x) );
    }

The previous example is equivalent to the example2() function shown in “Output Operands,” on page 135. The constraint/expression pair, "0" (x), tells the compiler to initialize output item 0 with variable x at the beginning of the asm statement. The resulting value for x is 2. Also note that "%1" in the asm "string" means the same thing as "%0" in this case. That is because there is only one operand with both an input and an output role.

Matching constraints are very similar to the read/write output operands mentioned in “Output Operands,” on page 135. However, there is one key difference between read/write output operands and matching constraints. The matching constraint can have an input expression that differs from its output expression. The example below uses different values for the input and output roles:

    int x;
    int y=2;
    void example7()
    {
        asm( "addl $1, %1"
             : "=r" (x)
             : "0" (y) );
    }

The compiler generates the following assembly for example7():

    example7:
    ..Dcfb0:
            pushq   %rbp
    ..Dcfi0:
            movq    %rsp, %rbp
    ..Dcfi1:
    ..EN1:
    ## lineno: 8
            movl    y(%rip), %eax
            addl    $1, %eax
            movl    %eax, x(%rip)
    ## lineno: 0
            popq    %rbp
            ret

Variable x gets initialized with the value stored in y, which is 2. After adding 1, the resulting value for variable x is 3.

Because matching constraints perform an input role for an output operand, it does not make sense for the output operand to have the read/write ("+") modifier.
In fact, the compiler disallows matching constraints with read/write output operands. The output operand must have a write-only ("=") modifier.

Clobber List

The [clobber list] is an optional list of strings that hold machine registers used in the asm "string". Essentially, these strings tell the compiler which registers may be clobbered by the asm statement. By placing registers in this list, the programmer does not have to explicitly save and restore them as required in traditional inline assembly (described in “Inline Assembly,” on page 133). The compiler takes care of any required saving and restoring of the registers in this list.

Each machine register in the [clobber list] is a string separated by a comma. The leading '%' is optional in the register name. For example, "%eax" is equivalent to "eax". When specifying the register inside the asm "string", you must include two leading '%' characters in front of the name (for example, "%%eax"). Otherwise, the compiler will behave as if a bad input/output operand was specified and generate an error message. An example follows:

    void example8()
    {
        int x;
        int y=2;
        asm( "movl %1, %%eax\n"
             "movl %1, %%edx\n"
             "addl %%edx, %%eax\n"
             "addl %%eax, %0"
             : "=r" (x)
             : "0" (y)
             : "eax", "edx" );
    }

The code shown above uses two hard-coded registers, eax and edx. It performs the equivalent of 3*y and assigns it to x, producing a result of 6.

In addition to machine registers, the clobber list may contain the following special flags:

    "cc"       The asm statement may alter the condition code register.
    "memory"   The asm statement may modify memory in an unpredictable fashion.

The "memory" flag causes the compiler not to keep memory values cached in registers across the asm statement and not to optimize stores or loads to that memory. For example:

    asm("call MyFunc":::"memory");

This asm statement contains a "memory" flag because it contains a call.
Without the "memory" flag, the callee might clobber registers in use by the caller.

The following function uses extended asm and the "cc" flag to compute the exponent of the largest power of 2 that is less than or equal to the input parameter n.

    #pragma noinline
    int asmDivideConquer(int n)
    {
        int ax = 0;
        int bx = 1;
        asm (
            "LogLoop:\n"
            "cmp %2, %1\n"
            "jnle Done\n"
            "inc %0\n"
            "add %1,%1\n"
            "jmp LogLoop\n"
            "Done:\n"
            "dec %0\n"
            : "+r" (ax), "+r" (bx)
            : "r" (n)
            : "cc");
        return ax;
    }

The "cc" flag is used because the asm statement contains some control flow that may alter the condition code register. The #pragma noinline statement prevents the compiler from inlining the asmDivideConquer() function. If the compiler inlines asmDivideConquer(), then it may illegally duplicate the labels LogLoop and Done in the generated assembly.

Additional Constraints

Operand constraints can be divided into four main categories:

• Simple Constraints
• Machine Constraints
• Multiple Alternative Constraints
• Constraint Modifiers

Simple Constraints

The simplest kind of constraint is a string of letters or characters, known as Simple Constraints, such as the "r" and "m" constraints introduced in “Output Operands,” on page 135. Table 12.1, “Simple Constraints” describes these constraints.

Table 12.1. Simple Constraints

    Constraint   Description
    whitespace   Whitespace characters are ignored.
    E            An immediate floating point operand.
    F            Same as "E".
    g            Any general purpose register, memory, or immediate integer
                 operand is allowed.
    i            An immediate integer operand.
    m            A memory operand. Any address supported by the machine is
                 allowed.
    n            Same as "i".
    o            Same as "m".
    p            An operand that is a valid memory address. The expression
                 associated with the constraint is expected to evaluate to an
                 address (for example, "p" (&x) ).
    r            A general purpose register operand.
    X            Same as "g".
    0,1,2,..9    Matching Constraint. See “Input Operands,” on page 137 for a
                 description.
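For readers who want to check the result of the asmDivideConquer() function shown before Table 12.1, the following plain "C" sketch mirrors its loop one instruction at a time. The function name log2_floor_reference is hypothetical, and the sketch is only a reference model traced from the asm "string", not code from the PGI documentation. The doubling variable is declared wider than int so that the "C" version cannot overflow for large n.

```c
/* Hypothetical plain-C counterpart of asmDivideConquer(), for comparison only. */
int log2_floor_reference(int n)
{
    int ax = 0;          /* exponent counter, operand %0 in the asm version */
    long long bx = 1;    /* current power of 2, operand %1; wider than int to avoid overflow */

    while (bx <= n) {    /* "cmp %2, %1" then "jnle Done": leave the loop once bx > n */
        ax++;            /* "inc %0" */
        bx += bx;        /* "add %1,%1" doubles bx */
    }
    return ax - 1;       /* "dec %0" after the Done label */
}
```

Under this model an input of 8 or 9 yields 3, because 2 to the power 3 is the largest power of 2 that does not exceed either value, and an input of 0 yields -1.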
The following example uses the general or "g" constraint, which allows the compiler to pick an appropriate constraint type for the operand; the compiler chooses from a general purpose register, memory, or immediate operand. This code lets the compiler choose the constraint type for "y".

    void example9()
    {
        int x, y=2;
        asm( "movl %1, %0\n"
             : "=r" (x)
             : "g" (y) );
    }

This technique can result in more efficient code. For example, when compiling example9() the compiler replaces the load and store of y with a constant 2. The compiler can then generate an immediate 2 for the y operand in the example. The assembly generated by pgcc for our example is as follows:

    example9:
    ..Dcfb0:
            pushq   %rbp
    ..Dcfi0:
            movq    %rsp, %rbp
    ..Dcfi1:
    ..EN1:
    ## lineno: 3
            movl    $2, %eax
    ## lineno: 6
            popq    %rbp
            ret

In this example, notice the use of $2 for the "y" operand.

Of course, if y is always 2, then the immediate value may be used instead of the variable with the "i" constraint, as shown here:

    void example10()
    {
        int x;
        asm( "movl %1, %0\n"
             : "=r" (x)
             : "i" (2) );
    }

Compiling example10() with pgcc produces assembly similar to that produced for example9().

Machine Constraints

Another category of constraints is Machine Constraints. The x86 and x86_64 architectures have several classes of registers. To choose a particular class of register, you can use the x86/x86_64 machine constraints described in Table 12.2, “x86/x86_64 Machine Constraints”.

Table 12.2. x86/x86_64 Machine Constraints

    Constraint   Description
    a            a register (e.g., %al, %ax, %eax, %rax)
    A            Specifies a or d registers. This is used primarily for
                 holding 64-bit integer values on 32-bit targets. The d
                 register holds the most significant bits and the a register
                 holds the least significant bits.
    b            b register (e.g., %bl, %bx, %ebx, %rbx)
    c            c register (e.g., %cl, %cx, %ecx, %rcx)
    C            Not supported.
    d            d register (e.g., %dl, %dx, %edx, %rdx)
    D            di register (e.g., %dil, %di, %edi, %rdi)
    e            Constant in range of 0xffffffff to 0x7fffffff
    f            Not supported.
    G            Floating point constant in range of 0.0 to 1.0.
    I            Constant in range of 0 to 31 (e.g., for 32-bit shifts).
    J            Constant in range of 0 to 63 (e.g., for 64-bit shifts).
    K            Constant in range of 0 to 127.
    L            Constant in range of 0 to 65535.
    M            Constant in range of 0 to 3 (e.g., shifts for lea
                 instruction).
    N            Constant in range of 0 to 255 (e.g., for out instruction).
    q            Same as "r" simple constraint.
    Q            Same as "r" simple constraint.
    R            Same as "r" simple constraint.
    S            si register (e.g., %sil, %si, %esi, %rsi)
    t            Not supported.
    u            Not supported.
    x            XMM SSE register
    y            Not supported.
    Z            Constant in range of 0 to 0x7fffffff.

The following example uses the "x" or XMM register constraint to subtract c from b and store the result in a.

    double example11()
    {
        double a;
        double b = 400.99;
        double c = 300.98;
        asm ( "subpd %2, %0;"
              : "=x" (a)
              : "0" (b), "x" (c) );
        return a;
    }

The generated assembly for this example is this:

    example11:
    ..Dcfb0:
            pushq   %rbp
    ..Dcfi0:
            movq    %rsp, %rbp
    ..Dcfi1:
    ..EN1:
    ## lineno: 4
            movsd   .C00128(%rip), %xmm1
            movsd   .C00130(%rip), %xmm2
            movapd  %xmm1, %xmm0
            subpd   %xmm2, %xmm0;
    ## lineno: 10
    ## lineno: 11
            popq    %rbp
            ret

If a specified register is not available, the pgcc and pgcpp compilers issue an error message. For example, pgcc and pgcpp reserve the "%ebx" register for Position Independent Code (PIC) on 32-bit system targets. If a program has an asm statement with a "b" register for one of the operands, the compiler will not be able to obtain that register when compiling for 32-bit with the -fPIC switch (which generates PIC). To illustrate this point, the following example is compiled for a 32-bit target using PIC:

    void example12()
    {
        int x=1;
        int y=1;
        asm( "addl %1, %0\n"
             : "+a" (x)
             : "b" (y) );
    }

Compiling with the "-tp p7" switch chooses a 32-bit target.

    % pgcc example12.c -fPIC -c -tp p7
    PGC-S-0354-Can't find a register in class 'BREG' for extended ASM operand 1 (example12.c: 3)
    PGC/x86 Linux/x86 Rel Dev: compilation completed with severe errors

Multiple Alternative Constraints

Sometimes a single instruction can take a variety of operand types. For example, the x86 permits register-to-memory and memory-to-register operations. To allow this flexibility in inline assembly, use multiple alternative constraints. An alternative is a series of constraints for each operand. To specify multiple alternatives, separate each alternative with a comma.

Table 12.3. Multiple Alternative Constraints

    Constraint   Description
    ,            Separates each alternative for a particular operand.
    ?            Ignored
    !            Ignored

The following example uses multiple alternatives for an add operation.

    void example13()
    {
        int x=1;
        int y=1;
        asm( "addl %1, %0\n"
             : "+ab,cd" (x)
             : "db,cam" (y) );
    }

example13() has two alternatives for each operand: "ab,cd" for the output operand and "db,cam" for the input operand. Each operand must have the same number of alternatives; however, each alternative can have any number of constraints (for example, the output operand in example13() has two constraints for its second alternative and the input operand has three for its second alternative).

The compiler first tries to satisfy the left-most alternative of the first operand (for example, the output operand in example13()). When satisfying the operand, the compiler starts with the left-most constraint. If the compiler cannot satisfy an alternative with this constraint (for example, if the desired register is not available), it tries to use any subsequent constraints. If the compiler runs out of constraints, it moves on to the next alternative. If the compiler runs out of alternatives, it issues an error similar to the one mentioned in example12(). If an alternative is found, the compiler uses the same alternative for subsequent operands.
For example, if the compiler chooses the "c" register for the output operand in example13(), then it will use either the "a" or "m" constraint for the input operand.

Constraint Modifiers

Characters that affect the compiler's interpretation of a constraint are known as Constraint Modifiers. Two constraint modifiers, the "=" and the "+", were introduced in “Output Operands,” on page 135. Table 12.4 summarizes each constraint modifier.

Table 12.4. Constraint Modifier Characters

    Constraint
    Modifier     Description
    =            This operand is write-only. It is valid for output operands
                 only. If specified, the "=" must appear as the first
                 character of the constraint string.
    +            This operand is both read and written by the instruction. It
                 is valid for output operands only. The output operand is
                 initialized with its expression before the first instruction
                 in the asm statement. If specified, the "+" must appear as
                 the first character of the constraint string.
    &            A constraint or an alternative constraint, as defined in
                 “Multiple Alternative Constraints,” on page 143, containing
                 an "&" indicates that the output operand is an early clobber
                 operand. This type of operand is an output operand that may
                 be modified before the asm statement finishes using all of
                 the input operands. The compiler will not place this operand
                 in a register that may be used as an input operand or part
                 of any memory address.
    %            Ignored.
    #            Characters following a "#" up to the first comma (if
                 present) are to be ignored in the constraint.
    *            The character that follows the "*" is to be ignored in the
                 constraint.

The "=" and "+" modifiers apply to the operand, regardless of the number of alternatives in the constraint string. For example, the "+" in the output operand of example13() appears once and applies to both alternatives in the constraint string. The "&", "#", and "*" modifiers apply only to the alternative in which they appear.
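The difference between the "=" and "+" modifiers can be sketched with two small functions. The names load_constant and add_constant are hypothetical, and the sketch assumes a compiler that accepts GCC-style extended asm on an x86 or x86_64 target; elsewhere the guard substitutes equivalent plain C. With "=", the previous value of the variable is never read, so it may be left uninitialized; with "+", the operand is initialized from the variable before the instruction runs.

```c
/* Hypothetical helpers for illustration; not taken from the PGI documentation. */

/* "=": write-only output. The old value of x is never read. */
int load_constant(void)
{
    int x;
#if defined(__i386__) || defined(__x86_64__)
    __asm__("movl $5, %0" : "=r" (x));
#else
    x = 5;        /* plain C fallback for non-x86 targets */
#endif
    return x;
}

/* "+": read/write output. The register starts out holding the value of x. */
int add_constant(int x)
{
#if defined(__i386__) || defined(__x86_64__)
    /* addl changes the flags, so "cc" is listed as clobbered. */
    __asm__("addl $5, %0" : "+r" (x) : : "cc");
#else
    x += 5;       /* plain C fallback for non-x86 targets */
#endif
    return x;
}
```

Replacing "+r" with "=r" in add_constant() would tell the compiler that the incoming value of x is not needed, so the addition could operate on an arbitrary register value.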
Normally, the compiler assumes that input operands are used before assigning results to the output operands. This assumption lets the compiler reuse registers as needed inside the asm statement. However, if the asm statement does not follow this convention, the compiler may indiscriminately clobber a result register with an input operand. To prevent this behavior, apply the early clobber "&" modifier. An example follows:

    void example15()
    {
        int w=1;
        int z;
        asm( "movl $1, %0\n"
             "addl %2, %0\n"
             "movl %2, %1"
             : "=a" (w), "=r" (z)
             : "r" (w) );
    }

The previous code example presents an interesting ambiguity because "w" appears both as an output and as an input operand. So, the value of "z" can be either 1 or 2, depending on whether the compiler uses the same register for operand 0 and operand 2. The use of constraint "r" for operand 2 allows the compiler to pick any general purpose register, so it may (or may not) pick register "a" for operand 2. This ambiguity can be eliminated by changing the constraint for operand 2 from "r" to "a" so the value of "z" will be 2, or by adding an early clobber "&" modifier so that "z" will be 1. The following example shows the same function with an early clobber "&" modifier:

    void example16()
    {
        int w=1;
        int z;
        asm( "movl $1, %0\n"
             "addl %2, %0\n"
             "movl %2, %1"
             : "=&a" (w), "=r" (z)
             : "r" (w) );
    }

Adding the early clobber "&" forces the compiler not to use the "a" register for anything other than operand 0. Operand 2 will therefore get its own register with its own copy of "w". The result for "z" in example16() is 1.

Operand Aliases

Extended asm specifies operands in assembly strings with a percent '%' followed by the operand number. For example, "%0" references operand 0 or the output item "=&a" (w) in function example16() shown above. Extended asm also supports operand aliasing, which allows use of a symbolic name instead of a number for specifying operands.
An example follows:

    void example17()
    {
        int w=1, z=0;
        asm( "movl $1, %[output1]\n"
             "addl %[input], %[output1]\n"
             "movl %[input], %[output2]"
             : [output1] "=&a" (w), [output2] "=r" (z)
             : [input] "r" (w));
    }

In example17(), "%[output1]" is an alias for "%0", "%[output2]" is an alias for "%1", and "%[input]" is an alias for "%2". Aliases and numeric references can be mixed, as shown in the following example:

    void example18()
    {
        int w=1, z=0;
        asm( "movl $1, %[output1]\n"
             "addl %[input], %0\n"
             "movl %[input], %[output2]"
             : [output1] "=&a" (w), [output2] "=r" (z)
             : [input] "r" (w));
    }

In example18(), "%0" and "%[output1]" both represent the output operand.

Assembly String Modifiers

Special character sequences in the assembly string affect the way the assembly is generated by the compiler. For example, the "%" is an escape sequence for specifying an operand, "%%" produces a percent for hard coded registers, and "\n" specifies a new line. Table 12.5, “Assembly String Modifier Characters” summarizes these modifiers, known as Assembly String Modifiers.

Table 12.5. Assembly String Modifier Characters

    Modifier     Description
    \            Same as \ in printf format strings.
    %*           Adds a '*' in the assembly string.
    %%           Adds a '%' in the assembly string.
    %A           Adds a '*' in front of an operand in the assembly string.
                 (For example, %A0 adds a '*' in front of operand 0 in the
                 assembly output.)
    %B           Produces the byte op code suffix for this operand. (For
                 example, %B0 produces 'b' on x86 and x86_64.)
    %L           Produces the word op code suffix for this operand. (For
                 example, %L0 produces 'l' on x86 and x86_64.)
    %P           If producing Position Independent Code (PIC), the compiler
                 adds the PIC suffix for this operand. (For example, %P0
                 produces @PLT on x86 and x86_64.)
    %Q           Produces a quad word op code suffix for this operand if it
                 is supported by the target. Otherwise, it produces a word op
                 code suffix. (For example, %Q0 produces 'q' on x86_64 and
                 'l' on x86.)
    %S           Produces 's' suffix for this operand. (For example, %S0
                 produces 's' on x86 and x86_64.)
    %T           Produces 't' suffix for this operand. (For example, %T0
                 produces 't' on x86 and x86_64.)
    %W           Produces the half word op code suffix for this operand. (For
                 example, %W0 produces 'w' on x86 and x86_64.)
    %a           Adds open and close parentheses ( ) around the operand.
    %b           Produces the byte register name for an operand. (For
                 example, if operand 0 is in register 'a', then %b0 will
                 produce '%al'.)
    %c           Cuts the '$' character from an immediate operand.
    %k           Produces the word register name for an operand. (For
                 example, if operand 0 is in register 'a', then %k0 will
                 produce '%eax'.)
    %q           Produces the quad word register name for an operand if the
                 target supports quad word. Otherwise, it produces a word
                 register name. (For example, if operand 0 is in register
                 'a', then %q0 produces %rax on x86_64 or %eax on x86.)
    %w           Produces the half word register name for an operand. (For
                 example, if operand 0 is in register 'a', then %w0 will
                 produce '%ax'.)
    %z           Produces an op code suffix based on the size of an operand.
                 (For example, 'b' for byte, 'w' for half word, 'l' for word,
                 and 'q' for quad word.)
    %+ %C %D %F %O %X %f %h %l %n %s %y
                 Not supported.

These modifiers begin with either a backslash "\" or a percent "%". The modifiers that begin with a backslash "\" (e.g., "\n") have the same effect as they do in a printf format string. The modifiers that are preceded with a "%" are used to modify a particular operand. For example, "%b0" means, "produce the byte or 8 bit version of operand 0". If operand 0 is a register, it will produce a byte register such as %al, %bl, %cl, and so on.
Consider this example:

    void example19()
    {
        int a = 1;
        int *p = &a;
        asm ("add%z0 %q1, %a0"
             : "=&p" (p)
             : "r" (a), "0" (p) );
    }

On an x86 target, the compiler produces the following instruction for the asm string shown in the preceding example:

    addl %ecx, (%eax)

The "%z0" modifier produced an 'l' (lower-case 'L') suffix because the size of pointer p is 32 bits on x86. The "%q1" modifier produced the word register name for variable a. The "%a0" instructs the compiler to add parentheses around operand 0, hence "(%eax)".

On an x86_64 target, the compiler produces the following instruction for the asm string shown in the preceding example:

    addq %rcx, (%rax)

The "%z0" modifier produced a 'q' suffix because the size of pointer p is 64 bits on x86_64. Because x86_64 supports quad word registers, the "%q1" modifier produced the quad word register name (%rcx) for variable a.

Extended Asm Macros

As with traditional inline assembly, described in “Inline Assembly,” on page 133, extended asm can be used in a macro. For example, you can use the following macro to access the runtime stack pointer.

    #define GET_SP(x) \
    asm("mov %%sp, %0": "=m" (x):: "%sp" );

    void example20()
    {
        void * stack_pointer;
        GET_SP(stack_pointer);
    }

The GET_SP macro assigns the value of the stack pointer to whatever is inserted in its argument (for example, stack_pointer). Another "C" extension known as statement expressions is used to write the GET_SP macro another way:

    #define GET_SP2 ({ \
    void *my_stack_ptr; \
    asm("mov %%sp, %0": "=m" (my_stack_ptr) :: "%sp" ); \
    my_stack_ptr; \
    })

    void example21()
    {
        void * stack_pointer = GET_SP2;
    }

The statement expression allows a body of code to evaluate to a single value. This value is specified as the last statement in the statement expression. In this case, the value is the result of the asm statement, my_stack_ptr.
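The statement-expression mechanism itself can be tried without any assembly. The sketch below assumes a compiler that supports the GNU statement-expression extension; the macro CLAMP_TO_BYTE and the function clamp_sum are hypothetical names chosen for illustration. As with GET_SP2, the value of the whole ({ ... }) block is the value of its last statement.

```c
/* Hypothetical macro; relies on the GNU statement-expression extension. */
#define CLAMP_TO_BYTE(v) ({                 \
    int clamped_ = (v);                     \
    if (clamped_ < 0)   clamped_ = 0;       \
    if (clamped_ > 255) clamped_ = 255;     \
    clamped_;  /* value of the whole expression */ \
})

/* The macro result is used inside a larger expression. */
int clamp_sum(int a, int b)
{
    return CLAMP_TO_BYTE(a + b) + 1;
}
```

Because the macro expands to an expression rather than a statement, it can appear on the right-hand side of an assignment or inside arithmetic, which a macro consisting of a bare asm statement cannot do.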
By writing an asm macro with a statement expression, the asm result may be assigned directly to another variable (for example, void * stack_pointer = GET_SP2) or included in a larger expression, such as: void * stack_pointer = GET_SP2 - sizeof(long).

Which style of macro to use depends on the application. If the asm statement needs to be a part of an expression, then a macro with a statement expression is a good approach. Otherwise, a traditional macro, like GET_SP(x), will probably suffice.

Intrinsics

Inline intrinsic functions map to actual x86 or x64 machine instructions. Intrinsics are inserted inline to avoid the overhead of a function call. The compiler has special knowledge of intrinsics, so with use of intrinsics, better code may be generated as compared to extended inline assembly code.

The PGI Workstation version 7.0 or higher compiler intrinsics library implements MMX, SSE, SSE2, SSE3, SSSE3, SSE4a, and ABM instructions. The intrinsic functions are available to C and C++ programs on Linux and Windows. Unlike most functions which are in libraries, intrinsics are implemented internally by the compiler. A program can call the intrinsic functions from C/C++ source code after including the corresponding header file.

The intrinsics are divided into header files as follows:

Table 12.6. Intrinsic Header File Organization

    Instructions   Header File
    MMX            mmintrin.h
    SSE            xmmintrin.h
    SSE2           emmintrin.h
    SSE3           pmmintrin.h
    SSSE3          tmmintrin.h
    SSE4a          ammintrin.h
    ABM            intrin.h

The following is a simple example program that calls XMM intrinsics.

    #include <xmmintrin.h>
    int main(){
        __m128 __A, __B, result;
        __A = _mm_set_ps(23.3, 43.7, 234.234, 98.746);
        __B = _mm_set_ps(15.4, 34.3, 4.1, 8.6);
        result = _mm_add_ps(__A,__B);
        return 0;
    }

Chapter 13.
Fortran, C and C++ Data Types

This chapter describes the scalar and aggregate data types recognized by the PGI Fortran, C, and C++ compilers, the format and alignment of each type in memory, and the range of values each type can have on x86 or x64 processor-based systems running a 32-bit operating system. For more information on x86-specific data representation, refer to the System V Application Binary Interface, Processor Supplement, listed in “Related Publications,” on page xxvii.

This chapter specifically does not address x64 processor-based systems running a 64-bit operating system, because the application binary interface (ABI) for those systems is still evolving. For the latest version of the ABI, refer to http://www.x86-64.org/abi.pdf.

Fortran Data Types

Fortran Scalars

A scalar data type holds a single value, such as the integer value 42 or the real value 112.6. The next table lists scalar data types, their size, format and range. Table 13.2, “Real Data Type Ranges,” on page 152 shows the range and approximate precision for Fortran real data types. Table 13.3, “Scalar Type Alignment,” on page 152 shows the alignment for different scalar data types. The alignments apply to all scalars, whether they are independent or contained in an array, a structure or a union.

Table 13.1.
Representation of Fortran Data Types

    Fortran Data Type   Format                            Range
    INTEGER             2's complement integer            -2^31 to 2^31 - 1
    INTEGER*2           2's complement integer            -32768 to 32767
    INTEGER*4           2's complement integer            -2^31 to 2^31 - 1
    INTEGER*8           2's complement integer            -2^63 to 2^63 - 1
    LOGICAL             32-bit value                      true or false
    LOGICAL*1           8-bit value                       true or false
    LOGICAL*2           16-bit value                      true or false
    LOGICAL*4           32-bit value                      true or false
    LOGICAL*8           64-bit value                      true or false
    BYTE                2's complement                    -128 to 127
    REAL                Single-precision floating point   10^-37 to 10^38 (1)
    REAL*4              Single-precision floating point   10^-37 to 10^38 (1)
    REAL*8              Double-precision floating point   10^-307 to 10^308 (1)
    DOUBLE PRECISION    Double-precision floating point   10^-307 to 10^308 (1)
    COMPLEX             Single-precision floating point   10^-37 to 10^38 (1)
    DOUBLE COMPLEX      Double-precision floating point   10^-307 to 10^308 (1)
    COMPLEX*16          Double-precision floating point   10^-307 to 10^308 (1)
    CHARACTER*n         Sequence of n bytes

    (1) Approximate value

The logical constants .TRUE. and .FALSE. are all ones and all zeroes, respectively. Internally, the value of a logical variable is true if the least significant bit is one and false otherwise. When the option –Munixlogical is set, a logical variable with a non-zero value is true and with a zero value is false.

Table 13.2. Real Data Type Ranges

    Data Type   Binary Range         Decimal Range           Digits of Precision
    REAL        2^-126 to 2^128      10^-37 to 10^38 (1)     7-8
    REAL*8      2^-1022 to 2^1024    10^-307 to 10^308 (1)   15-16

Table 13.3. Scalar Type Alignment

    This Type...   ...Is aligned on this size boundary
    LOGICAL*1      1-byte
    LOGICAL*2      2-byte
    LOGICAL*4      4-byte
    LOGICAL*8      8-byte
    BYTE           1-byte
    INTEGER*2      2-byte
    INTEGER*4      4-byte
    INTEGER*8      8-byte
    REAL*4         4-byte
    REAL*8         8-byte
    COMPLEX*8      4-byte
    COMPLEX*16     8-byte

FORTRAN 77 Aggregate Data Type Extensions

The PGF77 compiler supports de facto standard extensions to FORTRAN 77 that allow for aggregate data types. An aggregate data type consists of one or more scalar data type objects. You can declare the following aggregate data types:

array
    consists of one or more elements of a single data type placed in contiguous locations from first to last.

structure
    is a structure that can contain different data types. The members are allocated in the order they appear in the definition but may not occupy contiguous locations.

union
    is a single location that can contain any of a specified set of scalar or aggregate data types. A union can have only one value at a time. The data type of the union member to which data is assigned determines the data type of the union after that assignment.

The alignment of an array, a structure or union (an aggregate) affects how much space the object occupies and how efficiently the processor can address members. Arrays align according to the alignment of the array elements. For example, an array of INTEGER*2 data aligns on a 2-byte boundary.

Structures and unions align according to the alignment of the most restrictive data type of the structure or union. In the next example, the union aligns on a 4-byte boundary since the alignment of c, the most restrictive element, is four.

    STRUCTURE /astr/
      UNION
        MAP
          INTEGER*2 a   ! 2 bytes
        END MAP
        MAP
          BYTE b        ! 1 byte
        END MAP
        MAP
          INTEGER*4 c   ! 4 bytes
        END MAP
      END UNION
    END STRUCTURE

Structure alignment can result in unused space called padding. Padding between members of the structure is called internal padding. Padding between the last member and the end of the space is called tail padding.
The offset of a structure member from the beginning of the structure is a multiple of the member’s alignment. For example, since an INTEGER*2 aligns on a 2-byte boundary, the offset of an INTEGER*2 member from the beginning of a structure is a multiple of two bytes.

Fortran 90 Aggregate Data Types (Derived Types)

The Fortran 90 standard added formal support for aggregate data types. The TYPE statement begins a derived type data specification or declares variables of a specified user-defined type. For example, the following would define a derived type ATTENDEE:

    TYPE ATTENDEE
      CHARACTER(LEN=30) NAME
      CHARACTER(LEN=30) ORGANIZATION
      CHARACTER(LEN=30) EMAIL
    END TYPE ATTENDEE

In order to declare a variable of type ATTENDEE and access the contents of such a variable, code such as the following would be used:

    TYPE (ATTENDEE) ATTLIST(100)
    . . .
    ATTLIST(1)%NAME = 'JOHN DOE'

C and C++ Data Types

C and C++ Scalars

Table 13.4, “C/C++ Scalar Data Types” lists C and C++ scalar data types, providing their size and format. The alignment of a scalar data type is equal to its size. Table 13.5, “Scalar Alignment,” on page 155 shows scalar alignments that apply to individual scalars and to scalars that are elements of an array or members of a structure or union. Wide characters are supported (character constants prefixed with an L). The size of each wide character is 4 bytes.

Table 13.4. C/C++ Scalar Data Types

    Data Type                    Size (bytes)   Format                               Range
    unsigned char                1              ordinal                              0 to 255
    [signed] char                1              2's complement integer               -128 to 127
    unsigned short               2              ordinal                              0 to 65535
    [signed] short               2              2's complement integer               -32768 to 32767
    unsigned int                 4              ordinal                              0 to 2^32 - 1
    [signed] int                 4              2's complement integer               -2^31 to 2^31 - 1
    [signed] long [int]          4              2's complement integer               -2^31 to 2^31 - 1
      (32-bit operating systems and win64)
    [signed] long [int]          8              2's complement integer               -2^63 to 2^63 - 1
      (linux86-64 and sua64)
    unsigned long [int]          4              ordinal                              0 to 2^32 - 1
      (32-bit operating systems and win64)
    unsigned long [int]          8              ordinal                              0 to 2^64 - 1
      (linux86-64 and sua64)
    [signed] long long [int]     8              2's complement integer               -2^63 to 2^63 - 1
    unsigned long long [int]     8              ordinal                              0 to 2^64 - 1
    float                        4              IEEE single-precision floating-point 10^-37 to 10^38 (1)
    double                       8              IEEE double-precision floating-point 10^-307 to 10^308 (1)
    long double                  8              IEEE double-precision floating-point 10^-307 to 10^308 (1)
    bit field (2)                1 to 32 bits   ordinal                              0 to 2^size - 1, where size is the
      (unsigned value)                                                               number of bits in the bit field
    bit field (2)                1 to 32 bits   2's complement integer               -2^(size-1) to 2^(size-1) - 1, where size
      (signed value)                                                                 is the number of bits in the bit field
    pointer                      4              address                              0 to 2^32 - 1
    enum                         4              2's complement integer               -2^31 to 2^31 - 1

    (1) Approximate value
    (2) Bit fields occupy as many bits as you assign them, up to 4 bytes, and their length need not be a multiple of 8 bits (1 byte)

Table 13.5. Scalar Alignment

    Data Type                                          Alignment on this size boundary
    char                                               1-byte boundary, signed or unsigned.
    short                                              2-byte boundary, signed or unsigned.
    int                                                4-byte boundary, signed or unsigned.
    enum                                               4-byte boundary.
    pointer                                            4-byte boundary.
    float                                              4-byte boundary.
    double                                             8-byte boundary.
    long double                                        8-byte boundary.
    long [int] (32-bit operating systems and win64)    4-byte boundary, signed or unsigned.
    long [int] (linux86-64 and sua64)                  8-byte boundary, signed or unsigned.
    long long [int]                                    8-byte boundary, signed or unsigned.

C and C++ Aggregate Data Types

An aggregate data type consists of one or more scalar data type objects.
You can declare the following aggregate data types:

array
   consists of one or more elements of a single data type placed in contiguous locations from first to last.
class (C++ only)
   is a class that defines an object and its member functions. The object can contain fundamental data types or other aggregates including other classes. The class members are allocated in the order they appear in the definition but may not occupy contiguous locations.
struct
   is a structure that can contain different data types. The members are allocated in the order they appear in the definition but may not occupy contiguous locations. When a struct is defined with member functions, its alignment rules are the same as those for a class.
union
   is a single location that can contain any of a specified set of scalar or aggregate data types. A union can have only one value at a time. The data type of the union member to which data is assigned determines the data type of the union after that assignment.

Class and Object Data Layout

Class and structure objects with no virtual entities and no base classes, that is, with only direct data field members, are laid out in the same manner as C structures. The following section describes the alignment and size of these C-like structures.

C++ classes (and structures as a special case of a class) are more difficult to describe. Their alignment and size are determined by compiler-generated fields in addition to user-specified fields. The following paragraphs describe how storage is laid out for more general classes. Be aware that the alignment and size of a class (or structure) depend on the existence and placement of direct and virtual base classes and of virtual function information. The information that follows is for informational purposes only, reflects the current implementation, and is subject to change. Do not make assumptions about the layout of complex classes or structures.
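The distinction between C-like classes and more general classes can be observed from inside a program. The following is a minimal sketch, not taken from the guide: the types PlainPoint and VirtualPoint and the helpers is_c_like, extra_bytes, and check_class_layout are hypothetical names, and the checks assume a typical implementation in which a virtual function adds a hidden pointer to each object.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical types for illustration only; these names are not from the guide.
struct PlainPoint {            // no base classes, no virtual entities
    int x;
    int y;
};

class VirtualPoint {           // same two fields plus a virtual function
public:
    virtual ~VirtualPoint() {}
    int x;
    int y;
};

// True when T is a standard-layout type, which is the language-level notion
// closest to "laid out in the same manner as a C structure".
template <typename T>
constexpr bool is_c_like() {
    return std::is_standard_layout<T>::value;
}

// Bytes occupied by T beyond the user-declared fields: compiler-generated
// fields (such as a virtual function table pointer) plus any padding.
template <typename T>
constexpr std::size_t extra_bytes(std::size_t user_field_bytes) {
    return sizeof(T) - user_field_bytes;
}

// Checks the assumptions stated above; aborts if one does not hold.
void check_class_layout() {
    assert(is_c_like<PlainPoint>());
    assert(!is_c_like<VirtualPoint>());
    assert(extra_bytes<PlainPoint>(2 * sizeof(int)) == 0);
    assert(extra_bytes<VirtualPoint>(2 * sizeof(int)) > 0);
}
```

Calling check_class_layout() from a test program confirms only that extra storage exists in the class with a virtual function. It deliberately does not check where that storage sits, since the guide warns that the placement is implementation-specific and subject to change.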
All classes are laid out in the same general way, using the following pattern (in the sequence indicated):

• First, storage for all of the direct base classes (which implicitly includes storage for non-virtual indirect base classes as well):
   • When the direct base class is also virtual, only enough space is set aside for a pointer to the actual storage, which appears later.
   • In the case of a non-virtual direct base class, enough storage is set aside for its own non-virtual base classes, its virtual base class pointers, its own fields, and its virtual function information, but no space is allocated for its virtual base classes.
• Next, storage for the class’s own fields.
• Next, storage for virtual function information (typically, a pointer to a virtual function table).
• Finally, storage for its virtual base classes, with space enough in each case for its own non-virtual base classes, virtual base class pointers, fields, and virtual function information.

Aggregate Alignment

The alignment of an array, a structure or union (an aggregate) affects how much space the object occupies and how efficiently the processor can address members.

Arrays align according to the alignment of the array elements. For example, an array of short data type aligns on a 2-byte boundary.

Structures and unions align according to the most restrictive alignment of the enclosing members. For example, the union un1 below aligns on a 4-byte boundary since the alignment of c, the most restrictive element, is four:

union un1 {
   short a; /* 2 bytes */
   char b;  /* 1 byte */
   int c;   /* 4 bytes */
};

Structure alignment can result in unused space, called padding. Padding between members of a structure is called internal padding. Padding between the last member and the end of the space occupied by the structure is called tail padding.

Figure 13.1, “Internal Padding in a Structure,” on page 157, illustrates structure alignment.
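The alignment and padding rules above can be checked with alignof, sizeof, and offsetof. The sketch below reuses the union un1 from the text; the structure padded and the helper functions are hypothetical names introduced only for this illustration, and the checks assume that the size of int is a multiple of its alignment, as it is on common platforms.

```cpp
#include <cassert>
#include <cstddef>

// The union from the text above.
union un1 {
    short a;
    char  b;
    int   c;
};

// Hypothetical structure, not from the guide, that needs both kinds of padding.
struct padded {
    char tag;    // offset 0
    int  count;  // must start on an int boundary, so internal padding precedes it
    char flag;   // followed by tail padding up to the structure's alignment
};

// Unused bytes between the end of 'tag' and the start of 'count'.
constexpr std::size_t internal_padding() {
    return offsetof(padded, count) - (offsetof(padded, tag) + sizeof(char));
}

// Unused bytes between the end of 'flag' and the end of the structure.
constexpr std::size_t tail_padding() {
    return sizeof(padded) - (offsetof(padded, flag) + sizeof(char));
}

// Checks the alignment rules stated above; aborts if one does not hold.
void check_aggregate_alignment() {
    assert(alignof(short[8]) == alignof(short));  // arrays align like their elements
    assert(alignof(un1) == alignof(int));         // most restrictive member wins
    assert(sizeof(un1) == sizeof(int));
    assert(internal_padding() == alignof(int) - 1);
    assert(tail_padding() == alignof(int) - 1);
}
```

Expressing the expected padding in terms of alignof(int) rather than a literal 3 keeps the checks valid on any platform that satisfies the stated assumption.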
Consider the following structure:

struct strc1 {
   char a;  /* occupies byte 0 */
   short b; /* occupies bytes 2 and 3 */
   char c;  /* occupies byte 4 */
   int d;   /* occupies bytes 8 through 11 */
};

Figure 13.1. Internal Padding in a Structure

Figure 13.2, “Tail Padding in a Structure,” on page 158, shows how tail padding is applied to a structure aligned on a doubleword (8 byte) boundary.

struct strc2 {
   int m1[4];  /* occupies bytes 0 through 15 */
   double m2;  /* occupies bytes 16 through 23 */
   short m3;   /* occupies bytes 24 and 25 */
} st;

Figure 13.2. Tail Padding in a Structure

Bit-field Alignment

Bit-fields have the same size and alignment rules as other aggregates, with several additions to these rules:

• Bit-fields are allocated from right to left.
• A bit-field must entirely reside in a storage unit appropriate for its type. Bit-fields never cross unit boundaries.
• Bit-fields may share a storage unit with other structure/union members, including members that are not bit-fields.
• The types of unnamed bit-fields do not affect the alignment of a structure or union.
• Items of [signed/unsigned] long long type may not appear in field declarations on 32-bit systems.

Other Type Keywords in C and C++

The void data type is neither a scalar nor an aggregate. You can use void or void* as the return type of a function to indicate the function does not return a value, or as a pointer to an unspecified data type, respectively.

The const and volatile type qualifiers do not in themselves define data types, but associate attributes with other types. Use const to specify that an identifier is a constant and is not to be changed. Use volatile to prevent optimization problems with data that can be changed from outside the program, such as memory-mapped I/O buffers.

Chapter 14. C++ Name Mangling

Name mangling transforms the names of entities so that the names include information on aspects of the entity’s type and fully qualified name. This ability is necessary since the intermediate language into which a program is translated contains fewer and simpler name spaces than there are in the C++ language; specifically:

• Overloaded function names are not allowed in the intermediate language.
• Classes have their own scopes in C++, but not in the generated intermediate language. For example, an entity x from inside a class must not conflict with an entity x from the file scope.
• External names in the object code form a completely flat name space. The names of entities with external linkage must be projected onto that name space so that they do not conflict with one another. A function f from a class A, for example, must not have the same external name as a function f from class B.
• Some names are not names in the conventional sense of the word; they are not strings of alphanumeric characters, for example, operator=.

There are two main problems here:

1. Generating external names that will not clash.
2. Generating alphanumeric names for entities with strange names in C++.

Name mangling solves these problems by generating external names that will not clash, and alphanumeric names for entities with strange names in C++. It also solves the problem of generating hidden names for some behind-the-scenes language support in such a way that they match up across separate compilations.

You see mangled names if you view files that are translated by PGC++ and you do not use tools that demangle the C++ names. Intermediate files that use mangled names include the assembly and object files created by the pgcpp command. To view demangled names, use the tool pgdecode, which takes input from stdin.

prompt> pgdecode
g__1ASFf
A::g(float)

The name mangling algorithm for the PGC++ compiler is the same as that for cfront, and, except for a few minor details, also matches the description in Section 7.2, Function Name Encoding, of The Annotated C++ Reference Manual (ARM).
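The pgdecode example above pairs g__1ASFf with A::g(float). A minimal sketch of how such a name can be composed is shown here, assuming only the rules this chapter gives for function names: the function name, then __, then for member functions the class name preceded by its length (with an S added for static members), then F and the encoded parameter types. The function mangle_function is a hypothetical helper, not part of any PGI tool, and it expects the caller to supply already-encoded parameter types such as f for float.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper for illustration; it is not the compiler's algorithm
// and covers only plain and member function names.
//   name         unqualified function name, e.g. "g"
//   class_name   enclosing class, or "" for a non-member function
//   is_static    true for a static member function (ignored for non-members)
//   param_codes  already-encoded parameter types, e.g. "f" for float
std::string mangle_function(const std::string& name,
                            const std::string& class_name,
                            bool is_static,
                            const std::vector<std::string>& param_codes) {
    std::string out = name + "__";
    if (!class_name.empty()) {
        // A class type is its name preceded by the number of characters in it.
        out += std::to_string(class_name.size()) + class_name;
        if (is_static) {
            out += "S";
        }
    }
    out += "F";
    for (const std::string& code : param_codes) {
        out += code;
    }
    return out;
}
```

For instance, mangle_function("g", "A", true, {"f"}) yields the string passed to pgdecode above. Operators, constructors, nested classes, and templates use additional encodings that this sketch does not attempt.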
Refer to the ARM for a complete description of name mangling.

Types of Mangling

The following entity names are mangled:

• Function names, including non-member function names, are mangled to deal with overloading. Names of functions with extern "C" linkage are not mangled.
• Mangled function names have the function name followed by __ followed by F followed by the mangled description of the types of the parameters of the function. If the function is a member function, the mangled form of the class name precedes the F. If the member function is static, an S also precedes the F.

int f(float); // f__Ff
class A {
   int f(float);        // f__1AFf
   static int g(float); // g__1ASFf
};

• Special and operator function names, like constructors and operator=(). The encoding is similar to that for normal functions, but a coded name is used instead of the routine name:

class A {
   int operator+(float); // __pl__1Aff
   A(float);             // __ct__1Aff
};
int operator+(A, float); // __pl__F1Af

• Static data member names. The mangled form is the member name followed by __ followed by the mangled form of the class name:

class A {
   static int i; // i__1A
};

• Names of variables generated for virtual function tables. These have names like vtblmangled-classname or vtblmangled-base-class-namemangled-class-name.
• Names of variables generated to contain runtime type information. These have names like Ttypeencoding and TIDtype-encoding.

Mangling Summary

This section lists some of the C++ entities that are mangled and provides some details on the mangling algorithm. For more details, refer to The Annotated C++ Reference Manual.

Type Name Mangling

Using PGC++, each type has a corresponding mangled encoding. For example, a class type is represented as the class name preceded by the number of characters in the class name, as in 5abcde for abcde. Simple types are encoded as lower-case letters, as in i for int or f for float. Type modifiers and declarators are encoded as upper-case letters preceding the types they modify, as in U for unsigned or P for pointer.

Nested Class Name Mangling

Nested class types are encoded as a Q followed by a digit indicating the depth of nesting, followed by a _, followed by the mangled-form names of the class types in the fully-qualified name of the class, from outermost to innermost:

class A {
   class B { // Q2_1A1B
   };
};

The name of the nested class itself is mangled to the form described above with a prefix __, which serves to make the class name distinct from all user names.

Local Class Name Mangling

Local class names are encoded as L followed by a number (which has no special meaning; it is just an identifying number assigned to the class) followed by __ followed by the mangled name of the class (this is not in the ARM, and cfront encodes local class names slightly differently):

void f() {
   class A { // L1__1A
   };
}

This form is used when encoding the local class name as a type. It is not necessary to mangle the name of the local class itself unless it is also a nested class.

Template Class Name Mangling

Template classes have mangled names that encode the arguments of the template:

template <class T1, class T2> class abc {};
abc<int, int> x;
abc__pt__3_ii

The mangled name consists of the name of the class template (abc), a fixed string (__pt__) that indicates a parameterized type, the total length of the template argument list string including the underscore (3), and the argument list string itself (_ii), which describes two template arguments of type int.

Chapter 15. Command-Line Options Reference

A command-line option allows you to specify specific behavior when a program is compiled and linked. Compiler options perform a variety of functions, such as setting compiler characteristics, describing the object code to be produced, controlling the diagnostic messages emitted, and performing some preprocessor functions. Most options that are not explicitly set take the default settings.
This reference chapter describes the syntax and operation of each compiler option. For easy reference, the options are arranged in alphabetical order. For an overview and tips on which options are best for which tasks, refer to Chapter 2, “Using Command Line Options,” on page 15, which also provides summary tables of the different options.

This chapter uses the following notation:

[item]
   Square brackets indicate that the enclosed item is optional.
{item | item}
   Braces indicate that you must select one and only one of the enclosed items. A vertical bar (|) separates the choices.
...
   Horizontal ellipses indicate that zero or more instances of the preceding item are valid.

PGI Compiler Option Summary

The following tables include all the PGI compiler options that are not language-specific. The options are separated by category for easier reference. For a complete description of each option, see the detailed information later in this chapter.

Build-Related PGI Options

The options included in the following table are the ones you use when you are initially building your program or application.

Table 15.1. PGI Build-Related Compiler Options

Option | Description
–# | Display invocation information.
–### | Show but do not execute the driver commands (same as –dryrun).
–c | Stops after the assembly phase and saves the object code in filename.o.
–D | Defines a preprocessor macro.
–d | Prints additional information from the preprocessor.
–dryrun | Show but do not execute driver commands.
–E | Stops after the preprocessing phase and displays the preprocessed file on the standard output.
–F | Stops after the preprocessing phase and saves the preprocessed file in filename.f (this option is only valid for the PGI Fortran compilers).
--flagcheck | Simply return zero status if flags are correct.
–flags | Display valid driver options.
–fpic | (Linux only) Generate position-independent code.
–fPIC | (Linux only) Equivalent to –fpic.
–G | (Linux only) Passed to the linker. Instructs the linker to produce a shared object file.
–g77libs | (Linux only) Allow object files generated by g77 to be linked into PGI main programs.
–help | Display driver help message.
–I | Adds a directory to the search path for #include files.
–i2, –i4 and –i8 | –i2: Treat INTEGER variables as 2 bytes. –i4: Treat INTEGER variables as 4 bytes. –i8: Treat INTEGER and LOGICAL variables as 8 bytes and use 64 bits for INTEGER*8 operations.
–K | Requests special compilation semantics with regard to conformance to IEEE 754.
--keeplnk | If the compiler generates a temporary indirect file for a long linker command, preserves the temporary file instead of deleting it.
–L | Specifies a library directory.
–l | Loads a library.
–m | Displays a link map on the standard output.
–M | Selects variations for code generation and optimization.
–mcmodel=medium | (–tp k8-64 and –tp p7-64 targets only) Generate code which supports the medium memory model in the linux86-64 environment.
–module | (F90/F95/HPF only) Save/search for module files in the specified directory.
–mp[=align,[no]numa] | Interpret and process user-inserted shared-memory parallel programming directives (see Chapters 5 and 6).
–noswitcherror | Ignore unknown command line switches after printing a warning message.
–o | Names the object file.
–pc | (–tp px/p5/p6/piii targets only) Set precision globally for x87 floating-point calculations; must be used when compiling the main program. The precision value may be one of 32, 64 or 80.
–pg | Instrument the generated executable to produce a gprof-style gmon.out sample-based profiling trace file (–qp is also supported, and is equivalent).
–pgf77libs | Append PGF77 runtime libraries to the link line.
–pgf90libs | Append PGF90/PGF95 runtime libraries to the link line.
–Q | Selects variations for compiler steps.
–R | (Linux only) Passed to the linker. Hard codes the specified directory into the search path for shared object files.
–r | Creates a relocatable object file.
–r4 and –r8 | –r4: Interpret DOUBLE PRECISION variables as REAL. –r8: Interpret REAL variables as DOUBLE PRECISION.
–rc file | Specifies the name of the driver's startup file.
–s | Strips the symbol-table information from the object file.
–S | Stops after the compiling phase and saves the assembly-language code in filename.s.
–shared | (Linux only) Passed to the linker. Instructs the linker to generate a shared object file. Implies –fpic.
–show | Display driver's configuration parameters after startup.
–silent | Do not print warning messages.
–soname | Pass the soname option and its argument to the linker.
–time | Print execution times for the various compilation steps.
–tp [,target...] | Specify the type(s) of the target processor(s).
–u | Initializes the symbol table with the specified symbol, which is undefined for the linker. An undefined symbol triggers loading of the first member of an archive library.
–U | Undefine a preprocessor macro.
–V[release_number] | Displays the version messages and other information, or allows invocation of a version of the compiler other than the default.
–v | Displays the compiler, assembler, and linker phase invocations.
–W | Passes arguments to a specific phase.
–w | Do not print warning messages.

PGI Debug-Related Compiler Options

The options included in the following table are the ones you typically use when you are debugging your program or application.

Table 15.2. PGI Debug-Related Compiler Options

Option | Description
–C | Exposes ANSI warnings only.
–c | Instrument the generated executable to perform array bounds checking at runtime.
–E | Stops after the preprocessing phase and displays the preprocessed file on the standard output.
--flagcheck | Simply return zero status if flags are correct.
–flags | Display valid driver options.
–g | Includes debugging information in the object module.
–gopt | Includes debugging information in the object module, but forces assembly code generation identical to that obtained when the option is not present on the command line.
–K | Requests special compilation semantics with regard to conformance to IEEE 754.
--keeplnk | If the compiler generates a temporary indirect file for a long linker command, preserves the temporary file instead of deleting it.
–M | Selects variations for code generation and optimization.
–pc | (–tp px/p5/p6/piii targets only) Set precision globally for x87 floating-point calculations; must be used when compiling the main program. The precision value may be one of 32, 64 or 80.
–Mprof=time | Instrument the generated executable to produce a gprof-style gmon.out sample-based profiling trace file (–qp is also supported, and is equivalent).

PGI Optimization-Related Compiler Options

The options included in the following table are the ones you typically use when you are optimizing your program or application code.

Table 15.3. Optimization-Related PGI Compiler Options

Option | Description
–fast | Generally optimal set of flags for targets that support SSE capability.
–fastsse | Generally optimal set of flags for targets that include SSE/SSE2 capability.
–M | Selects variations for code generation and optimization.
–mp[=align,[no]numa] | Interpret and process user-inserted shared-memory parallel programming directives (see Chapters 5 and 6).
–nfast | Generally optimal set of flags for the target. Doesn’t use SSE.
–O | Specifies the code optimization level: 0, 1, 2, 3, or 4.
–pc | (–tp px/p5/p6/piii targets only) Set precision globally for x87 floating-point calculations; must be used when compiling the main program. The precision value may be one of 32, 64 or 80.
–Mprof=time | Instrument the generated executable to produce a gprof-style gmon.out sample-based profiling trace file (–qp is also supported, and is equivalent).

PGI Linking and Runtime-Related Compiler Options

The options included in the following table are the ones you typically use to define parameters related to linking and running your program or application code.

Table 15.4. Linking and Runtime-Related PGI Compiler Options

Option | Description
–byteswapio | (Fortran only) Swap bytes from big-endian to little-endian or vice versa on input/output of unformatted data.
–fpic | (Linux only) Generate position-independent code.
–fPIC | (Linux only) Equivalent to –fpic.
–G | (Linux only) Passed to the linker. Instructs the linker to produce a shared object file.
–g77libs | (Linux only) Allow object files generated by g77 to be linked into PGI main programs.
–i2, –i4 and –i8 | –i2: Treat INTEGER variables as 2 bytes. –i4: Treat INTEGER variables as 4 bytes. –i8: Treat INTEGER and LOGICAL variables as 8 bytes and use 64 bits for INTEGER*8 operations.
–K | Requests special compilation semantics with regard to conformance to IEEE 754.
–M | Selects variations for code generation and optimization.
–mcmodel=medium | (–tp k8-64 and –tp p7-64 targets only) Generate code which supports the medium memory model in the linux86-64 environment.
–shared | (Linux only) Passed to the linker. Instructs the linker to generate a shared object file. Implies –fpic.
–soname | Pass the soname option and its argument to the linker.
–tp [,target...] | Specify the type(s) of the target processor(s).

C and C++ Compiler Options

There are a large number of compiler options specific to the PGCC and PGC++ compilers, especially PGC++. The next table lists several of these options, but is not exhaustive. For a complete list of available options, including an exhaustive list of PGC++ options, use the –help command-line option. For further detail on a given option, use –help and specify the option explicitly. The majority of these options are related to building your program or application.

Table 15.5. C and C++-specific Compiler Options

Option | Description
–A | (pgcpp only) Accept proposed ANSI C++, issuing errors for non-conforming code.
–a | (pgcpp only) Accept proposed ANSI C++, issuing warnings for non-conforming code.
--[no_]alternative_tokens | (pgcpp only) Enable/disable recognition of alternative tokens. These are tokens that make it possible to write C++ without the use of certain punctuation characters such as [, ], #, &, and ^. The alternative tokens include the operator keywords (e.g., and, bitand, etc.) and digraphs. The default is --no_alternative_tokens.
–B | Allow C++ comments (using //) in C source.
–b | (pgcpp only) Compile with cfront 2.1 compatibility. This accepts constructs and a version of C++ that is not part of the language definition but is accepted by cfront. EDG option.
–b3 | (pgcpp only) Compile with cfront 3.0 compatibility. See –b above.
--[no_]bool | (pgcpp only) Enable or disable recognition of bool. The default value is --bool.
--[no_]builtin | Do/don’t compile with math subroutine builtin support, which causes selected math library routines to be inlined. The default is --builtin.
--cfront_2.1 | (pgcpp only) Enable compilation of C++ with compatibility with cfront version 2.1.
--cfront_3.0 | (pgcpp only) Enable compilation of C++ with compatibility with cfront version 3.0.
--compress_names | (pgcpp only) Create a precompiled header file with the name filename.
--dependencies (see –M) | (pgcpp only) Print makefile dependencies to stdout.
--dependencies_to_file filename | (pgcpp only) Print makefile dependencies to file filename.
--display_error_number | (pgcpp only) Display the error message number in any diagnostic messages that are generated.
--diag_error tag | (pgcpp only) Override the normal error severity of the specified diagnostic messages.
--diag_remark tag | (pgcpp only) Override the normal error severity of the specified diagnostic messages.
--diag_suppress tag | (pgcpp only) Override the normal error severity of the specified diagnostic messages.
--diag_warning tag | (pgcpp only) Override the normal error severity of the specified diagnostic messages.
-e | (pgcpp only) Set the C++ front-end error limit to the specified number.
--[no_]exceptions | (pgcpp only) Disable/enable exception handling support. The default is --exceptions.
--gnu_extensions | (pgcpp only) Allow GNU extensions like “include next” which are required to compile Linux system header files.
--[no]llalign | (pgcpp only) Do/don’t align long long integers on integer boundaries. The default is --llalign.
–M | Generate make dependence lists.
–MD | Generate make dependence lists.
–MD,filename | (pgcpp only) Generate make dependence lists and print them to file filename.
--optk_allow_dollar_in_id_chars | (pgcpp only) Accept dollar signs in identifiers.
–P | Stops after the preprocessing phase and saves the preprocessed file in filename.i.
-+p | (pgcpp only) Disallow all anachronistic constructs. cfront option.
--pch | (pgcpp only) Automatically use and/or create a precompiled header file.
--pch_dir directoryname | (pgcpp only) The directory directoryname in which to search for and/or create a precompiled header file.
--[no_]pch_messages | (pgcpp only) Enable/disable the display of a message indicating that a precompiled header file was created or used.
--preinclude= | (pgcpp only) Specify file to be included at the beginning of compilation so you can set system-dependent macros, types, and so on.
-suffix (see –P) | (pgcpp only) Use with –E, –F, or –P to save intermediate file in a file with the specified suffix.
–t | Control instantiation of template functions. EDG option.
--use_pch filename | (pgcpp only) Use a precompiled header file of the specified name as part of the current compilation.
--[no_]using_std | (pgcpp only) Enable/disable implicit use of the std namespace when standard header files are included.
–X | (pgcpp only) Allow $ in names.

Generic PGI Compiler Options

The following descriptions are for the PGI options. For easy reference, the options are arranged in alphabetical order. For a list of options by tasks, refer to Chapter 2, “Using Command Line Options,” on page 15.
–#

Displays the invocations of the compiler, assembler and linker.

Default: The compiler does not display individual phase invocations.

Usage: The following command-line requests verbose invocation information.

$ pgf95 -# prog.f

Description: The –# option displays the invocations of the compiler, assembler and linker. These invocations are command lines created by the driver from your command-line input and the default values.

Related options: –Minfo, –V, –v.

–###

Displays the invocations of the compiler, assembler and linker, but does not execute them.

Default: The compiler does not display individual phase invocations.

Usage: The following command-line requests verbose invocation information.

$ pgf95 -### myprog.f

Description: Use the –### option to display the invocations of the compiler, assembler and linker but not to execute them. These invocations are command lines created by the compiler driver from the PGIRC files and the command-line options.

Related options: –#, –dryrun, –Minfo, –V

–Bdynamic

Compiles for and links to the DLL version of the PGI runtime libraries.

Default: The compiler uses static libraries.

Usage: You can create the DLL obj1.dll and its import library obj1.lib using the following series of commands:

% pgf95 -Bdynamic -c object1.f
% pgf95 -Mmakedll object1.obj -o obj1.dll

Then compile the main program using this command:

$ pgf95 -# prog.f

For a complete example, refer to Example 7.1, “Build a DLL: Fortran,” on page 82.

Description: Use this option to compile for and link to the DLL version of the PGI runtime libraries. This flag is required when linking with any DLL built by the PGI compilers. This flag corresponds to the /MD flag used by Microsoft’s cl compilers.

Note
On Windows, -Bdynamic must be used for both compiling and linking.

When you use the PGI compiler flag –Bdynamic to create an executable that links to the DLL form of the runtime, the executable built is smaller than one built without –Bdynamic. The PGI runtime DLLs, however, must be available on the system where the executable is run. The –Bdynamic flag must be used when an executable is linked against a DLL built by the PGI compilers.

Related options: –Bstatic, –Mdll

–Bstatic

Compiles for and links to the static version of the PGI runtime libraries.

Default: The compiler uses static libraries.

Usage: The following command line explicitly compiles for and links to the static version of the PGI runtime libraries:

% pgf95 -Bstatic -c object1.f

Description: You can use this option to explicitly compile for and link to the static version of the PGI runtime libraries.

Note
On Windows, -Bstatic must be used for both compiling and linking.

For more information on using static libraries on Windows, refer to “Creating and Using Static Libraries on Windows,” on page 79.

Related options: –Bdynamic, –Mdll

–byteswapio

Swaps the byte-order of data in unformatted Fortran data files on input/output.

Default: The compiler does not byte-swap data on input/output.

Usage: The following command-line requests that byte-swapping be performed on input/output.

$ pgf95 -byteswapio myprog.f

Description: Use the –byteswapio option to swap the byte-order of data in unformatted Fortran data files on input/output. When this option is used, the order of bytes is swapped in both the data and record control words; the latter occurs in unformatted sequential files.

You can use this option to convert big-endian format data files produced by most RISC workstations and high-end servers to the little-endian format used on x86 or x64 systems on the fly during file reads/writes.

This option assumes that the record layouts of unformatted sequential access and direct access files are the same on the systems. It further assumes that the IEEE representation is used for floating-point numbers. In particular, the format of unformatted data files produced by PGI Fortran compilers is identical to the format used on Sun and SGI workstations; this format allows you to read and write unformatted Fortran data files produced on those platforms from a program compiled for an x86 or x64 platform using the –byteswapio option.

Related options:

–C

Enables array bounds checking.

Default: The compiler does not enable array bounds checking.

Usage: In this example, the compiler instruments the executable produced from myprog.f to perform array bounds checking at runtime:

$ pgf95 -C myprog.f

Description: Use this option to enable array bounds checking. If an array is an assumed size array, the bounds checking only applies to the lower bound. If an array bounds violation occurs during execution, an error message describing the error is printed and the program terminates. The text of the error message includes the name of the array, the location where the error occurred (the source file and the line number in the source), and information about the out of bounds subscript (its value, its lower and upper bounds, and its dimension).

Related options: –Mbounds.

–c

Halts the compilation process after the assembling phase and writes the object code to a file.

Default: The compiler produces an executable file (does not use the –c option).

Usage: In this example, the compiler produces the object file myprog.o in the current directory.

$ pgf95 -c myprog.f

Description: Use the –c option to halt the compilation process after the assembling phase and write the object code to a file. If the input file is filename.f, the output file is filename.o.

Related options: –E, –Mkeepasm, –o, and –S.

–d

Prints additional information from the preprocessor.

Default:

Syntax: -d[D|I|M|N]

-dD Print macros and values from source files.
-dI Print include file names.
-dM Print macros and values, including predefined and command-line macros.
-dN Print macro names from source files.
Usage: In the following example, the compiler prints macro names from the source file.
$ pgf95 -dN myprog.f
Description: Use the –d option to print additional information from the preprocessor.
Related options: –E, –D, –U

–D
Creates a preprocessor macro with a given value.
Note: You can use the –D option more than once on a compiler command line. The number of active macro definitions is limited only by available memory.
Syntax: -Dname[=value]
Where name is the symbolic name and value is either an integer value or a character string.
Default: If you define a macro name without specifying a value, the preprocessor assigns the string 1 to the macro name.
Usage: In the following example, the macro PATHLENGTH has the value 256 until a subsequent compilation. If the –D option is not used, PATHLENGTH is set to 128.
$ pgf95 -DPATHLENGTH=256 myprog.F
The source text in myprog.F is this:
    #ifndef PATHLENGTH
    #define PATHLENGTH 128
    #endif
          SUBROUTINE SUB
          CHARACTER*PATHLENGTH path
          ...
          END
Description: Use the –D option to create a preprocessor macro with a given value. The value must be either an integer or a character string. You can use macros with conditional compilation to select source text during preprocessing. A macro defined in the compiler invocation remains in effect for each module on the command line, unless you remove the macro with an #undef preprocessor directive or with the –U option. The compiler processes all of the –U options in a command line after processing the –D options.
Related options: –U

–dryrun
Displays the invocations of the compiler, assembler, and linker but does not execute them.
Default: The compiler does not display individual phase invocations.
Usage: The following command line requests verbose invocation information.
$ pgf95 -dryrun myprog.f
Description: Use the –dryrun option to display the invocations of the compiler, assembler, and linker without executing them. These invocations are command lines created by the compiler driver from the PGIRC file and the command line supplied with –dryrun.
Related options: –Minfo, –V, –###

–E
Halts the compilation process after the preprocessing phase and displays the preprocessed output on the standard output.
Default: The compiler produces an executable file.
Usage: In the following example the compiler displays the preprocessed myprog.f on the standard output.
$ pgf95 -E myprog.f
Description: Use the –E option to halt the compilation process after the preprocessing phase and display the preprocessed output on the standard output.
Related options: –C, –c, –Mkeepasm, –o, –F, –S

–F
Stops compilation after the preprocessing phase and writes the preprocessed output to a file.
Default: The compiler produces an executable file.
Usage: In the following example the compiler produces the preprocessed file myprog.f in the current directory.
$ pgf95 -F myprog.F
Description: Use the –F option to halt the compilation process after preprocessing and write the preprocessed output to a file. If the input file is filename.F, then the output file is filename.f.
Related options: –c, –E, –Mkeepasm, –o, –S

–fast
Enables vectorization with SSE instructions, cache alignment, and flushz for 64-bit targets.
Default: The compiler enables vectorization with SSE instructions, cache alignment, and flushz.
Usage: In the following example the compiler produces vector SSE code when targeting a 64-bit machine.
$ pgf95 -fast vadd.f95
Description: When you use this option, a generally optimal set of options is chosen for targets that support SSE capability. In addition, the appropriate –tp option is automatically included to enable generation of code optimized for the type of system on which compilation is performed.
This option enables vectorization with SSE instructions, cache alignment, and flushz.
Note: Auto-selection of the appropriate –tp option means that programs built using the –fastsse option on a given system are not necessarily backward-compatible with older systems.
Note: C/C++ compilers enable –Mautoinline with –fast.
Related options: –nfast, –O, –Munroll, –Mnoframe, –Mscalarsse, –Mvect, –Mcache_align, –tp

–fastsse
Synonymous with –fast.

--flagcheck
Causes the compiler to check that flags are correct and then exit.
Default: The compiler begins a compile without first validating that flags are correct.
Usage: In the following example the compiler checks that flags are correct, and then exits.
$ pgf95 --flagcheck myprog.f
Description: Use this option to make the compiler check that flags are correct and then exit. If all flags are correct, the compiler returns a zero status.
Related options: None.

–flags
Displays driver options on the standard output.
Default: The compiler does not display the driver options.
Usage: In the following example the user requests information about the known switches.
$ pgf95 -flags
Description: Use this option to display driver options on the standard output. When you use this option with –v, in addition to the valid options, the compiler lists options that are recognized and ignored.
Related options: –#, –###, –v

–fpic
(Linux only) Generates position-independent code suitable for inclusion in shared object (dynamically linked library) files.
Default: The compiler does not generate position-independent code.
Usage: In the following example the resulting object file, myprog.o, can be used to generate a shared object.
$ pgf95 -fpic myprog.f
Description: (Linux only) Use the –fpic option to generate position-independent code suitable for inclusion in shared object (dynamically linked library) files.
Related options: –shared, –fPIC, –G, –R

–fPIC
(Linux only) Equivalent to –fpic. Provided for compatibility with other compilers.

–G
(Linux only) Instructs the linker to produce a shared object file.
Default: The compiler does not instruct the linker to produce a shared object file.
Usage: In the following example the linker produces a shared object file.
$ pgf95 -G myprog.f
Description: (Linux only) Use this option to pass information to the linker that instructs the linker to produce a shared object file.
Related options: –fpic, –shared, –R

–g
Instructs the compiler to include symbolic debugging information in the object module.
Default: The compiler does not put debugging information into the object module.
Usage: In the following example, the object file a.out contains symbolic debugging information.
$ pgf95 -g myprog.f
Description: Use the –g option to instruct the compiler to include symbolic debugging information in the object module. Debuggers, such as PGDBG, require symbolic debugging information in the object module to display and manipulate program variables and source code. If you specify the –g option on the command line, the compiler sets the optimization level to –O0 (zero), unless you specify the –O option. For more information on the interaction between the –g and –O options, see the –O entry. Symbolic debugging may give confusing results if an optimization level other than zero is selected.
Note: Including symbolic debugging information increases the size of the object module.
Related options: –O

–gopt
Instructs the compiler to include symbolic debugging information in the object file, and to generate optimized code identical to that generated when –g is not specified.
Default: The compiler does not put debugging information into the object module.
Usage: In the following example, the object file a.out contains symbolic debugging information.
$ pgf95 -gopt myprog.f
Description: Using –g alters how optimized code is generated in ways that are intended to enable or improve debugging of optimized code. The –gopt option instructs the compiler to include symbolic debugging information in the object file, and to generate optimized code identical to that generated when –g is not specified.
Related options: None.

–g77libs
(Linux only) Used on the link line, this option instructs the pgf95 driver to search the necessary g77 support libraries to resolve references specific to g77-compiled program units.
Note: The g77 compiler must be installed on the system on which linking occurs in order for this option to function correctly.
Default: The compiler does not search g77 support libraries to resolve references at link time.
Usage: The following command line requests that g77 support libraries be searched at link time:
$ pgf95 -g77libs myprog.f g77_object.o
Description: (Linux only) Use the –g77libs option on the link line if you are linking g77-compiled program units into a pgf95-compiled main program using the pgf95 driver. When this option is present, the pgf95 driver searches the necessary g77 support libraries to resolve references specific to g77-compiled program units.
Related options: None.

–help
Used with no other options, –help displays options recognized by the driver on the standard output. When used in combination with one or more additional options, usage information for those options is displayed on standard output.
Default: The compiler does not display usage information.
Usage: In the following example, usage information for –Minline is printed to standard output.
$ pgcc -help -Minline
-Minline[=lib:<extlib>|<func>|except:<func>|name:<func>|size:<n>|levels:<n>]
                    Enable function inlining
    lib:<extlib>    Use extracted functions from extlib
    <func>          Inline function func
    except:<func>   Do not inline function func
    name:<func>     Inline function func
    size:<n>        Inline only functions smaller than n
    levels:<n>      Inline n levels of functions
-Minline            Inline all functions that were extracted
In the following example, usage information for –help shows how groups of options can be listed or examined according to function.
$ pgcc -help -help
-help[=groups|asm|debug|language|linker|opt|other|overall|phase|prepro|suffix|switch|target|variable]
                    Show compiler switches
Description: Use the –help option to obtain information about available options and their syntax. You can use –help in one of three ways:
• Use –help with no parameters to obtain a list of all the available options with a brief one-line description of each.
• Add a parameter to –help to restrict the output to information about a specific option. The syntax for this usage is this: -help <option>
• Add a parameter to –help to restrict the output to a specific set of options or to a building process. The syntax for this usage is this: -help=<subgroup>
The following table lists and describes the subgroups available with –help.
–help=groups Gives available groups for help.

Table 15.6. Subgroups for –help Option
Use this –help option    To get this information...
–help=asm    A list of options specific to the assembly phase.
–help=debug    A list of options related to debug information generation.
–help=groups    A list of available groups to use with the help option.
–help=language    A list of language-specific options.
–help=linker    A list of options specific to the link phase.
–help=opt    A list of options specific to the optimization phase.
–help=other    A list of other options, such as ANSI conformance and pointer aliasing for C.
–help=overall    A list of options generic to any compiler.
–help=phase    A list of build process phases and to which compiler they apply.
–help=prepro    A list of options specific to the preprocessing phase.
–help=suffix    A list of known file suffixes and to which phases they apply.
–help=switch    A list of all known options; this is equivalent to using –help without any parameter.
–help=target    A list of options specific to the target processor.
–help=variable    A list of all variables and their current values. They can be redefined on the command line using the syntax VAR=VALUE.

For more examples of –help, refer to “Help with Command-line Options,” on page 16.
Related options: –#, –###, –show, –V, –flags

–I
Adds a directory to the search path for files that are included using either the INCLUDE statement or the preprocessor directive #include.
Default: The compiler searches only certain directories for included files.
• For gcc-lib includes: /usr/lib64/gcc-lib
• For system includes: /usr/include
Syntax: -Idirectory
Where directory is the name of the directory added to the standard search path for include files.
Usage: In the following example, the compiler first searches the directory mydir and then searches the default directories for include files.
$ pgf95 -Imydir
Description: Use the –I option to add a directory to the list of directories searched for files that are included using the INCLUDE statement or the preprocessor directive #include. The compiler searches the directory specified by the –I option before the default directories.
The Fortran INCLUDE statement directs the compiler to begin reading from another file. The compiler uses two rules to locate the file:
1. If the file name specified in the INCLUDE statement includes a path name, the compiler begins reading from the file it specifies.
2. If no path name is provided in the INCLUDE statement, the compiler searches (in order):
   • Any directories specified using the –I option (in the order specified)
   • The directory containing the source file
   • The current directory
For example, the compiler applies rule (1) to the following statements:
    INCLUDE '/bob/include/file1'    (absolute path name)
    INCLUDE '../../file1'           (relative path name)
and rule (2) to this statement:
    INCLUDE 'file1'
Related options: –Mnostdinc

–i2, –i4 and –i8
Treat INTEGER and LOGICAL variables as either two, four, or eight bytes.
Default: The compiler treats INTEGER and LOGICAL variables as four bytes.
Usage: In the following example, using the –i8 switch causes the integer variables to be treated as 64 bits.
$ pgf95 -i8 int.f
int.f is a function similar to this:
          print *, "Integer size:", bit_size(i)
          end
Description: Use this option to treat INTEGER and LOGICAL variables as either two, four, or eight bytes. INTEGER*8 values not only occupy 8 bytes of storage, but operations on them use 64 bits instead of 32 bits.
Related options: None.

–K
Requests that the compiler provide special compilation semantics.
Default: The compiler does not provide special compilation semantics.
Syntax: –K<flag>
Where flag is one of the following:
ieee
    Perform floating-point operations in strict conformance with the IEEE 754 standard. Some optimizations are disabled, and on some systems a more accurate math library is linked if –Kieee is used during the link step.
noieee
    Default flag. Use the fastest available means to perform floating-point operations, link in faster non-IEEE libraries if available, and disable underflow traps.
PIC
    (Linux only) Generate position-independent code. Equivalent to –fpic. Provided for compatibility with other compilers.
pic
    (Linux only) Generate position-independent code. Equivalent to –fpic. Provided for compatibility with other compilers.
trap=option [,option]...
    Controls the behavior of the processor when floating-point exceptions occur.
Possible options include:
• fp
• align (ignored)
• inv
• denorm
• divz
• ovf
• unf
• inexact
Usage: In the following example, the compiler performs floating-point operations in strict conformance with the IEEE 754 standard.
$ pgf95 -Kieee myprog.f
Description: Use –K to instruct the compiler to provide special compilation semantics. The default is –Knoieee.
–Ktrap is only processed by the compilers when compiling main functions or programs. The options inv, denorm, divz, ovf, unf, and inexact correspond to the processor’s exception mask bits: invalid operation, denormalized operand, divide-by-zero, overflow, underflow, and precision, respectively. Normally, the processor’s exception mask bits are on, meaning that floating-point exceptions are masked: the processor recovers from the exceptions and continues. If a floating-point exception occurs and its corresponding mask bit is off, or “unmasked”, execution terminates with an arithmetic exception (C's SIGFPE signal). –Ktrap=fp is equivalent to –Ktrap=inv,divz,ovf.
Note: The PGI compilers do not support exception-free execution for –Ktrap=inexact. This hardware support is intended for those who have specific uses for it, along with the appropriate signal handlers for the exceptions it produces. It is not designed to support normal floating-point code.
Related options: None.

--keeplnk
(Windows only) Preserves the temporary file when the compiler generates a temporary indirect file for a long linker command.
Usage: In the following example the compiler preserves each temporary file rather than deleting it.
$ pgf95 --keeplnk myprog.f
Description: If the compiler generates a temporary indirect file for a long linker command, use this option to instruct the compiler to preserve the temporary file instead of deleting it.
Related options: None.

–L
Specifies a directory to search for libraries.
Note: Multiple –L options are valid.
However, the position of multiple –L options is important relative to the –l options supplied.
Syntax: -Ldirectory
Where directory is the name of the library directory.
Default: The compiler searches the standard library directory.
Usage: In the following example, the library directory is /lib and the linker links in the standard libraries required by PGF95 from this directory.
$ pgf95 -L/lib myprog.f
In the following example, the library directory /lib is searched for the library file libx.a and both the directories /lib and /libz are searched for liby.a.
$ pgf95 -L/lib -lx -L/libz -ly myprog.f
Description: Use the –L option to specify a directory to search for libraries. Using –L allows you to add directories to the search path for library files.
Related options: –l

–l
Instructs the linker to load the specified library. The linker searches the specified library in addition to the standard libraries.
Note: The linker searches the libraries specified with –l in order of appearance before searching the standard libraries.
Syntax: -llibrary
Where library is the name of the library to search.
Usage: In the following example, if the standard library directory is /lib, the linker loads the library /lib/libmylib.a in addition to the standard libraries.
$ pgf95 myprog.f -lmylib
Description: Use this option to instruct the linker to load the specified library. The compiler prepends the characters lib to the library name and adds the .a extension following the library name. The linker searches each library specified before searching the standard libraries.
Related options: –L

–m
Displays a link map on the standard output.
Default: The compiler does not display the link map and does not use the –m option.
Usage: When the following example is executed on Windows, pgf95 creates a link map in the file myprog.map.
$ pgf95 -m myprog.f
Description: Use this option to display a link map.
• On Linux, the map is written to stdout.
• On Windows, the map is written to a .map file whose name depends on the executable. For example, for an executable built from myprog.f, the map file is myprog.map.
Related options: –c, –o, –s, –u

–M
Selects options for code generation. The options are divided into the following categories:
• Code generation
• Fortran Language Controls
• Optimization
• Environment
• C/C++ Language Controls
• Miscellaneous
• Inlining
The following table lists and briefly describes the options alphabetically and includes a field showing the category. For more details about the options as they relate to these categories, refer to “–M Options by Category,” on page 219.

Table 15.7. –M Options Summary
pgflag    Description    Category
allocatable=95|03    Controls whether to use Fortran 95 or Fortran 2003 semantics in allocatable array assignments.    Fortran Language
anno    Annotate the assembly code with source code.    Miscellaneous
[no]autoinline    C/C++: when a function is declared with the inline keyword, inline it at –O2 and above.    Inlining
[no]asmkeyword    Specifies whether the compiler allows the asm keyword in C/C++ source files (pgcc and pgcpp only).    C/C++ Language
[no]backslash    Determines how the backslash character is treated in quoted strings (pgf77, pgf95, and pghpf only).    Fortran Language
[no]bounds    Specifies whether array bounds checking is enabled or disabled.    Miscellaneous
--[no_]builtin    Do/don’t compile with math subroutine builtin support, which causes selected math library routines to be inlined (pgcc and pgcpp only).    Optimization
byteswapio    Swap byte order (big-endian to little-endian or vice versa) during I/O of Fortran unformatted data.    Miscellaneous
cache_align    Where possible, align data objects of size greater than or equal to 16 bytes on cache-line boundaries.    Optimization
chkfpstk    Check for internal consistency of the x87 FP stack in the prologue of a function and after returning from a function or subroutine call (–tp px/p5/p6/piii targets only).    Miscellaneous
chkptr    Check for NULL pointers (pgf95 and pghpf only).    Miscellaneous
chkstk    Check the stack for available space upon entry to and before the start of a parallel region. Useful when many private variables are declared.    Miscellaneous
concur    Enable auto-concurrentization of loops. Multiple processors or cores will be used to execute parallelizable loops.    Optimization
cpp    Run the PGI cpp-like preprocessor without performing subsequent compilation steps.    Miscellaneous
cray    Force Cray Fortran (CF77) compatibility (pgf77, pgf95, and pghpf only).    Optimization
[no]daz    Do/don’t treat denormalized numbers as zero.    Code Generation
[no]dclchk    Determines whether all program variables must be declared (pgf77, pgf95, and pghpf only).    Fortran Language
[no]defaultunit    Determines how the asterisk character (“*”) is treated in relation to standard input and standard output, regardless of the status of I/O units 5 and 6 (pgf77, pgf95, and pghpf only).    Fortran Language
[no]depchk    Checks for potential data dependencies.    Optimization
[no]dse    Enables [disables] the dead store elimination phase for programs making extensive use of function inlining.    Optimization
[no]dlines    Determines whether the compiler treats lines containing the letter "D" in column one as executable statements (pgf77, pgf95, and pghpf only).    Fortran Language
dll    Link with the DLL version of the runtime libraries (Windows only).    Miscellaneous
dollar,char    Specifies the character to which the compiler maps the dollar sign code (pgf77, pgf95, and pghpf only).    Fortran Language
dwarf1    When used with –g, generate DWARF1 format debug information.    Code Generation
dwarf2    When used with –g, generate DWARF2 format debug information.    Code Generation
dwarf3    When used with –g, generate DWARF3 format debug information.    Code Generation
extend    Instructs the compiler to accept 132-column source code; otherwise it accepts 72-column code (pgf77, pgf95, and pghpf only).    Fortran Language
extract    Invokes the function extractor.    Inlining
fcon    Instructs the compiler to treat floating-point constants as float data types (pgcc and pgcpp only).    C/C++ Language
fixed    Instructs the compiler to assume F77-style fixed format source code (pgf95 and pghpf only).    Fortran Language
[no]flushz    Do/don’t set SSE flush-to-zero mode.    Code Generation
[no]fprelaxed[=option]    Perform certain floating-point intrinsic functions using relaxed precision.    Optimization
free    Instructs the compiler to assume F90-style free format source code (pgf95 and pghpf only).    Fortran Language
func32    The compiler aligns all functions to 32-byte boundaries.    Code Generation
gccbug[s]    Matches behavior of certain gcc bugs.    Miscellaneous
noi4    Determines how the compiler treats INTEGER variables (pgf77, pgf95, and pghpf only).    Optimization
info    Prints informational messages regarding optimization and code generation to standard output as compilation proceeds.    Miscellaneous
inform    Specifies the minimum level of error severity that the compiler displays.    Miscellaneous
inline    Invokes the function inliner.    Inlining
[no]ipa    Invokes inter-procedural analysis and optimization.    Optimization
[no]iomutex    Determines whether critical sections are generated around Fortran I/O calls (pgf77, pgf95, and pghpf only).    Fortran Language
keepasm    Instructs the compiler to keep the assembly file.    Miscellaneous
[no]large_arrays    Enables support for 64-bit indexing and single static data objects of size larger than 2GB.    Code Generation
lfs    Links in libraries that allow file I/O to files of size larger than 2GB on 32-bit systems (32-bit Linux only).    Environment
[no]lre    Disable/enable loop-carried redundancy elimination.    Optimization
list    Specifies whether the compiler creates a listing file.    Miscellaneous
makedll    Generate a dynamic link library (DLL) (Windows only).    Miscellaneous
makeimplib    Passes the -def switch to the librarian without a deffile, when used without –def:deffile.    Miscellaneous
mpi=option    Link to MPI libraries: MPICH1, MPICH2, or Microsoft MPI libraries.    Code Generation
[no]loop32    Aligns/does not align innermost loops on 32-byte boundaries with –tp barcelona.    Code Generation
[no]movnt    Force/disable generation of non-temporal moves and prefetching.    Code Generation
neginfo    Instructs the compiler to produce information on why certain optimizations are not performed.    Miscellaneous
noframe    Eliminates operations that set up a true stack frame pointer for functions.    Optimization
nomain    When the link step is called, don’t include the object file that calls the Fortran main program (pgf77, pgf95, and pghpf only).    Code Generation
noopenmp    When used in combination with the –mp option, causes the compiler to ignore OpenMP parallelization directives or pragmas, but still process SGI-style parallelization directives or pragmas.    Miscellaneous
nopgdllmain    Do not link the module containing the default DllMain() into the DLL (Windows only).    Miscellaneous
norpath    On Linux, do not add –rpath paths to the link line.    Miscellaneous
nosgimp    When used in combination with the –mp option, causes the compiler to ignore SGI-style parallelization directives or pragmas, but still process OpenMP directives or pragmas.    Miscellaneous
[no]stddef    Instructs the compiler to not recognize the standard preprocessor macros.    Environment
nostdinc    Instructs the compiler to not search the standard location for include files.    Environment
nostdlib    Instructs the linker to not link in the standard libraries.    Environment
[no]onetrip    Determines whether each DO loop executes at least once (pgf77, pgf95, and pghpf only).    Fortran Language
novintr    Disable idiom recognition and generation of calls to optimized vector functions.    Optimization
pfi    Instrument the generated code and link in libraries for dynamic collection of profile and data information at runtime.    Optimization
pfo    Read a pgfi.out trace file and use the information to enable or guide optimizations.    Optimization
[no]prefetch    Enable/disable generation of prefetch instructions.    Optimization
preprocess    Perform cpp-like preprocessing on assembly language and Fortran input source files.    Miscellaneous
prof    Set profile options; function-level and line-level profiling are supported.    Code Generation
[no]r8    Determines whether the compiler promotes REAL variables and constants to DOUBLE PRECISION (pgf77, pgf95, and pghpf only).    Optimization
[no]r8intrinsics    Determines how the compiler treats the intrinsics CMPLX and REAL (pgf77, pgf95, and pghpf only).    Optimization
[no]recursive    Allocate (do not allocate) local variables on the stack, which allows recursion. SAVEd, data-initialized, or namelist members are always allocated statically, regardless of the setting of this switch (pgf77, pgf95, and pghpf only).    Code Generation
[no]reentrant    Specifies whether the compiler avoids optimizations that can prevent code from being reentrant.    Code Generation
[no]ref_externals    Do/don’t force references to names appearing in EXTERNAL statements (pgf77, pgf95, and pghpf only).    Code Generation
safeptr    Instructs the compiler to override data dependencies between pointers and arrays (pgcc and pgcpp only).    Optimization
safe_lastval    In the case where a scalar is used after a loop, but is not defined on every iteration of the loop, the compiler does not by default parallelize the loop. However, this option tells the compiler it is safe to parallelize the loop. For a given loop, the last value computed for all scalars makes it safe to parallelize the loop.    Code Generation
[no]save    Determines whether the compiler assumes that all local variables are subject to the SAVE statement (pgf77, pgf95, and pghpf only).    Fortran Language
[no]scalarsse    Do/don’t use SSE/SSE2 instructions to perform scalar floating-point arithmetic.    Optimization
schar    Specifies signed char for characters (pgcc and pgcpp only; also see uchar).    C/C++ Language
[no]second_underscore    Do/don’t add the second underscore to the name of a Fortran global if its name already contains an underscore (pgf77, pgf95, and pghpf only).    Code Generation
[no]signextend    Do/don’t extend the sign bit, if it is set.    Code Generation
[no]single    Do/don’t convert float parameters to double parameter characters (pgcc and pgcpp only).    C/C++ Language
[no]smart    Do/don’t enable optional post-pass assembly optimizer.    Optimization
[no]smartalloc[=huge|huge:|hugebss]    Add a call to the routine mallopt in the main routine. Supports large TLBs on Linux and Windows. Tip: To be effective, this switch must be specified when compiling the file containing the Fortran, C, or C++ main program.    Environment
standard    Causes the compiler to flag source code that does not conform to the ANSI standard (pgf77, pgf95, and pghpf only).    Fortran Language
[no]stride0    Do/do not generate alternate code for a loop that contains an induction variable whose increment may be zero (pgf77, pgf95, and pghpf only).    Code Generation
uchar    Specifies unsigned char for characters (pgcc and pgcpp only; also see schar).    C/C++ Language
unix    Uses UNIX calling and naming conventions for Fortran subprograms (pgf77, pgf95, and pghpf for Win32 only).    Code Generation
[no]unixlogical    Determines whether logical .TRUE. and .FALSE. are determined by non-zero (TRUE) and zero (FALSE) values for unixlogical. With nounixlogical, the default, -1 values are TRUE and 0 values are FALSE (pgf77, pgf95, and pghpf only).    Fortran Language
[no]unroll    Controls loop unrolling.    Optimization
[no]upcase    Determines whether the compiler allows uppercase letters in identifiers (pgf77, pgf95, and pghpf only).    Fortran Language
varargs    Forces Fortran program units to assume calls are to C functions with a varargs type interface (pgf77 and pgf95 only).    Code Generation
[no]vect    Do/don’t invoke the code vectorizer.    Optimization

–mcmodel=medium
(For use only on 64-bit Linux targets) Generates code for the medium memory model in the linux86-64 execution environment. Implies –Mlarge_arrays.
Default: The compiler generates code for the small memory model.
Usage: The following command line requests that position-independent code be generated, and that the –mcmodel=medium option be passed to the assembler and linker:
$ pgf95 -mcmodel=medium myprog.f
Description: The default small memory model of the linux86-64 environment limits the combined area for a user’s object or executable to 1GB, with the Linux kernel managing usage of the second 1GB of address for system routines, shared libraries, stacks, and so on. Programs are started at a fixed address, and the program can use a single instruction to make most memory references.
The medium memory model allows for larger than 2GB data areas, or .bss sections. Program units compiled using either –mcmodel=medium or –fpic require additional instructions to reference memory. The effect on performance is a function of the data use of the application. The –mcmodel=medium switch must be used at both compile time and link time to create 64-bit executables. Program units compiled for the default small memory model can be linked into medium memory model executables as long as they are compiled with –fpic, that is, as position-independent code.
The linux86-64 environment provides static libxxx.a archive libraries that are built with and without –fpic, and dynamic libxxx.so shared object libraries that are compiled –fpic.
The –mcmodel=medium link switch implies the –fpic switch and will utilize the shared libraries by default. Similarly, the $PGI/linux86-64//lib directory contains the libraries for building small memory model codes, and the $PGI/linux86-64//libso directory contains shared libraries for building –mcmodel=medium and –fpic executables.
Note: The combination –mcmodel=medium –fpic is not allowed when creating shared libraries. However, you can create static archive libraries (.a) that are –fpic.
Related options: –Mlarge_arrays

–module
Allows you to specify a particular directory in which generated intermediate .mod files should be placed.
Default: The compiler places .mod files in the current working directory, and searches only in the current working directory for pre-compiled intermediate .mod files.
Usage: The following command line requests that any intermediate module file produced during compilation of myprog.f be placed in the directory mymods; specifically, the file ./mymods/myprog.mod is used.
$ pgf95 -module mymods myprog.f
Description: Use the –module option to specify a particular directory in which generated intermediate .mod files should be placed. If the –module option is present, and USE statements are present in a compiled program unit, then the specified directory is searched for .mod intermediate files prior to a search in the default local directory.
Related options: None.

–mp[=align,[no]numa]
Instructs the compiler to interpret user-inserted OpenMP shared-memory parallel programming directives and pragmas, and to generate an executable file which will utilize multiple processors in a shared-memory parallel system.
Default: The compiler ignores user-inserted shared-memory parallel programming directives and pragmas.
Usage: The following command line requests processing of any shared-memory directives present in myprog.f:
$ pgf95 -mp myprog.f
Description: Use the –mp option to instruct the compiler to interpret user-inserted OpenMP shared-memory parallel programming directives and to generate an executable file that uses multiple processors in a shared-memory parallel system.
The align sub-option forces loop iterations to be allocated to OpenMP processes using an algorithm that maximizes alignment of vector sub-sections in loops that are both parallelized and vectorized for SSE. This allocation can improve performance in program units that include many such loops. It can also result in load-balancing problems that significantly decrease performance in program units with relatively short loops that contain a large amount of work in each iteration. The numa sub-option uses libnuma on systems where it is available.
For a detailed description of this programming model and the associated directives and pragmas, refer to Chapter 5, “Using OpenMP”.
Related options: –Mconcur and –Mvect

–nfast
A generally optimal set of options is chosen depending on the target system. In addition, the appropriate –tp option is automatically included to enable generation of code optimized for the type of system on which compilation is performed.
Note: Auto-selection of the appropriate –tp option means that programs built using the –fast option on a given system are not necessarily backward-compatible with older systems.
Usage: In the following example, the compiler selects a generally optimal set of options for the target system.
$ pgf95 -nfast myprog.f
Description: Use this option to instruct the compiler to select a generally optimal set of options for the target system. In addition, the appropriate –tp option is automatically included to enable generation of code optimized for the type of system on which compilation is performed.
Related options: –O, –Munroll, –Mnoframe, –Mvect, –tp, –Mscalarsse

–noswitcherror
Issues warnings instead of errors for unknown switches. Ignores unknown command line switches after printing a warning message.
Default: The compiler prints an error message and then halts.
Usage: In the following example, the compiler ignores unknown command line switches after printing a warning message.
$ pgf95 -noswitcherror myprog.f
Description: Use this option to instruct the compiler to ignore unknown command line switches after printing a warning message.
Tip: You can configure this behavior in the siterc file by adding: set NOSWITCHERROR=1.
Related options: None.

–O
Invokes code optimization at the specified level.
Default: With no –O, –g, or –M options, the compiler optimizes at level 1. With –O and no level, the compiler optimizes at level 2.
Syntax: –O [level]
Where level is an integer from 0 to 4.
Usage: In the following example, since no –O option is specified, the compiler sets the optimization to level 1.
$ pgf95 myprog.f
In the following example, since a –O option is specified without an optimization level, the compiler sets the optimization to level 2.
$ pgf95 -O myprog.f
Description: Use this option to invoke code optimization at the specified level, which is one of the following:
0 Creates a basic block for each statement. Neither scheduling nor global optimization is done. To specify this level, supply a 0 (zero) argument to the –O option.
1 Schedules within basic blocks and performs some register allocations, but does no global optimization.
2 Performs all level-1 optimizations, and also performs global scalar optimizations such as induction variable elimination and loop invariant movement.
3 Specifies aggressive global optimization. This level performs all level-one and level-two optimizations and enables more aggressive hoisting and scalar replacement optimizations that may or may not be profitable.
4 Performs all level-one, level-two, and level-three optimizations and enables hoisting of guarded invariant floating-point expressions.
Table 15.8 shows the interaction between the –O option, –g option, –Mvect, and –Mconcur options.

Table 15.8. Optimization and –O, –g, –Mvect, and –Mconcur Options
Optimize Option | Debug Option | –M Option | Optimization Level
none | none | none | 1
none | none | –Mvect | 2
none | none | –Mconcur | 2
none | –g | none | 0
–O | none or –g | none | 2
–Olevel | none or –g | none | level
–Olevel < 2 | none or –g | –Mvect | 2
–Olevel < 2 | none or –g | –Mconcur | 2

Unoptimized code compiled using the option –O0 can be significantly slower than code generated at other optimization levels. Like the –Mvect option, the –Munroll option sets the optimization level to level 2 if no –O or –g options are supplied. The –gopt option is recommended for generation of debug information with optimized code. For more information on optimization, see Chapter 3, “Using Optimization & Parallelization”.
Related options: –g, –M, –gopt

–o
Names the executable file. Use the –o option to specify the filename of the compiler object file. The final output is the result of linking.
Syntax: –o filename
Where filename is the name of the file for the compilation output. The filename must not have a .f extension.
Default: The compiler creates executable filenames as needed. If you do not specify the –o option, the default filename is the linker output file a.out.
Usage: In the following example, the executable file is myprog instead of the default a.out.
$ pgf95 myprog.f -o myprog
Related options: –c, –E, –F, –S

–pc
Note: This option is available only for –tp px/p5/p6/piii targets.
Allows you to control the precision of operations performed using the x87 floating-point unit, and their representation on the x87 floating-point stack.
Syntax: –pc { 32 | 64 | 80 }
Usage:
$ pgf95 -pc 64 myprog.c
Description: The x87 architecture implements a floating-point stack using eight 80-bit registers. Each register uses bits 0-63 as the significand, bits 64-78 for the exponent, and bit 79 as the sign bit. This 80-bit real format is the default format, called the extended format. When values are loaded into the floating-point stack they are automatically converted into extended real format. The precision of the floating-point stack can be controlled, however, by setting the precision control bits (bits 8 and 9) of the floating-point control word appropriately. In this way, you can explicitly set the precision to standard IEEE double precision using 64 bits, or to single precision using 32 bits. [1] The default precision is system dependent. To alter the precision in a given program unit, the main program must be compiled with the same –pc option. The command line option –pc val lets the programmer set the compiler’s precision preference. Valid values for val are:
• 32 single precision
• 64 double precision
• 80 extended precision
Extended Precision Option: Operations performed exclusively on the floating-point stack using extended precision, without storing into or loading from memory, can cause problems with accumulated values within the extra 16 bits of extended precision values. This can lead to answers, when rounded, that do not match expected results.
For example, if the argument to sin is the result of previous calculations performed on the floating-point stack, then an 80-bit value used instead of a 64-bit value can result in slight discrepancies. Results can even change sign due to the sin curve being too close to an x-intercept value when evaluated. To maintain consistency in this case, you can ensure that the compiler generates code that calls a function.
According to the x86 ABI, a function call must push its arguments on the stack (in this way memory is guaranteed to be accessed, even if the argument is an actual constant). Thus, even if the called function simply performs the inline expansion, using the function call as a wrapper to sin has the effect of trimming the argument precision down to the expected size. Using the –Mnobuiltin option on the command line for C accomplishes this task by resolving all math routines in the library libm, performing a function call of necessity. The other method of generating a function call for math routines, but one that may still produce the inline instructions, is by using the –Kieee switch.
A second example illustrates the precision control problem using a section of code to determine machine precision:

      program find_precision
      w = 1.0
100   w=w+w
      y=w+1
      z=y-w
      if (z .gt. 0) goto 100
C now w is just big enough that |((w+1)-w)-1| >= 1 ...
      print*,w
      end

[1] According to Intel documentation, this only affects the x87 operations of add, subtract, multiply, divide, and square root. In particular, it does not appear to affect the x87 transcendental instructions.

In this case, where the variables are implicitly real*4, operations are performed on the floating-point stack where optimization removed unnecessary loads and stores from memory. The general case of copy propagation being performed follows this pattern:

      a = x
      y = 2.0 + a

Instead of storing x into a, then loading a to perform the addition, the value of x can be left on the floating-point stack and added to 2.0. Thus, memory accesses in some cases can be avoided, leaving answers in the extended real format. If copy propagation is disabled, stores of all left-hand sides will be performed automatically and reloaded when needed. This will have the effect of rounding any results to their declared sizes.
For the above program, w has a value of 1.8446744E+19 when executed using default (extended) precision.
If, however, –Kieee is set, the value becomes 1.6777216E+07 (single precision). This difference is due to the fact that –Kieee disables copy propagation, so all intermediate results are stored into memory, then reloaded when needed. Copy propagation is only disabled for floating-point operations, not integer. With this particular example, setting the –pc switch will also adjust the result.
The switch –Kieee also has the effect of making function calls to perform all transcendental operations. Although the function still produces the x86 machine instruction for computation (unless in C the –Mnobuiltin switch is set), arguments are passed on the stack, which results in a memory store and load.
Finally, –Kieee also disables reciprocal division for constant divisors. That is, for a/b with unknown a and constant b, the expression is usually converted at compile time to a*(1/b), thus turning an expensive divide into a relatively fast scalar multiplication. However, numerical discrepancies can occur when this optimization is used.
Understanding and correctly using the –pc, –Mnobuiltin, and –Kieee switches should enable you to produce the desired and expected precision for calculations that use floating-point operations.
Related options:

–pg
(Linux only) Instructs the compiler to instrument the generated executable for gprof-style sample-based profiling.
Usage: In the following example the program is compiled for profiling using pgprof or gprof.
$ pgf95 -pg myprog.c
Default: The compiler does not instrument the generated executable for gprof-style profiling.
Description: Use this option to instruct the compiler to instrument the generated executable for gprof-style sample-based profiling. You must use this option at both the compile and link steps. A gmon.out style trace is generated when the resulting program is executed, and can be analyzed using gprof or pgprof.

–pgf77libs
Instructs the compiler to append PGF77 runtime libraries to the link line.
Default: The compiler does not append the PGF77 runtime libraries to the link line.
Usage: In the following example a .c main program is linked with an object file compiled with pgf77.
$ pgcc main.c myf77.o -pgf77libs
Description: Use this option to instruct the compiler to append PGF77 runtime libraries to the link line.
Related options: –pgf90libs

–pgf90libs
Instructs the compiler to append PGF90/PGF95 runtime libraries to the link line.
Default: The compiler does not append the PGF90/PGF95 runtime libraries to the link line.
Usage: In the following example a .c main program is linked with an object file compiled with pgf95.
$ pgcc main.c myf95.o -pgf90libs
Description: Use this option to instruct the compiler to append PGF90/PGF95 runtime libraries to the link line.
Related options: –pgf77libs

–Q
Selects variations for compilation. There are four uses for the –Q option.
Usage: The following examples show the different –Q options.
$ pgf95 -Qdir /home/comp/new hello.f
$ pgf95 -Qoption ld,-s hello.f
$ pgf95 -Qpath /home/test hello.f
$ pgf95 -Qproduce .s hello.f
Description: Use this option to select variations for compilation. As illustrated in the Usage section, there are four varieties for the –Q option.
The first variety, using the dir keyword, lets you supply a directory parameter that indicates the directory where the compiler driver is located.
-Qdir directory
The second variety, using the option keyword, lets you supply the option opt to the program prog. The prog parameter can be one of pgftn, as, or ld.
-Qoption prog,opt
The third –Q variety, using the path keyword, lets you supply an additional pathname to the search path for the compiler’s required .o files.
-Qpath pathname
The fourth –Q variety, using the produce keyword, lets you choose a stop-after location for the compilation based on the supplied sourcetype parameter.
Valid sourcetypes are: .i, .c, .s and .o, which respectively indicate the stop-after locations: preprocessing, compiling, assembling, or linking.
-Qproduce sourcetype
Related options: –p

–R
(Linux only) Instructs the linker to hard-code the pathname into the search path for generated shared object (dynamically linked library) files.
Note: There cannot be a space between –R and the directory name.
Usage: In the following example, at runtime the a.out executable searches the specified directory, in this case /home/Joe/myso, for shared objects.
$ pgf95 -Rm/home/Joe/myso myprog.f
Description: Use this option to instruct the compiler to pass information to the linker to hard-code the pathname into the search path for shared object (dynamically linked library) files.
Related options: –fpic, –shared, –G

–r
(Linux only) Creates a relocatable object file.
Default: The compiler does not create a relocatable object file and does not use the –r option.
Usage: In this example, pgf95 creates a relocatable object file.
$ pgf95 -r myprog.f
Description: Use this option to create a relocatable object file.
Related options: –c, –o, –s, –u

–r4 and –r8
Interprets DOUBLE PRECISION variables as REAL (–r4) or REAL variables as DOUBLE PRECISION (–r8).
Usage: In this example, the double precision variables are interpreted as REAL.
$ pgf95 -r4 myprog.f
Description: Interpret DOUBLE PRECISION variables as REAL (–r4) or REAL variables as DOUBLE PRECISION (–r8).
Related options: –i2, –i4, –i8, –nor8

–rc
Specifies the name of the driver startup configuration file. If the file or pathname supplied is not a full pathname, the path for the configuration file loaded is relative to the $DRIVER path (the path of the currently executing driver). If a full pathname is supplied, that file is used for the driver configuration file.
Syntax: -rc [path] filename
Where path is either a relative pathname, relative to the value of $DRIVER, or a full pathname beginning with "/".
Filename is the driver configuration file.
Default: The driver uses the configuration file .pgirc.
Usage: In the following example, the file .pgf95rctest, relative to /usr/pgi/linux86/bin, the value of $DRIVER, is the driver configuration file.
$ pgf95 -rc .pgf95rctest myprog.f
Description: Use this option to specify the name of the driver startup configuration file. If the file or pathname supplied is not a full pathname, the path for the configuration file loaded is relative to the $DRIVER path, that is, the path of the currently executing driver. If a full pathname is supplied, that file is used for the driver configuration file.
Related options: –show

–rpath
(Linux only)
Syntax: -rpath path
Specifies the name of the driver startup configuration file, where path is either a relative pathname, or a full pathname beginning with "/".
Default: The driver uses the configuration file .pgirc.
Usage: In the following example, the file .pgf95rctest, relative to /usr/pgi/linux86/bin, the value of $DRIVER, is the driver configuration file.
$ pgf95 -rc .pgf95rctest myprog.f
Description: Use this option to specify the name of the driver startup configuration file. If the file or pathname supplied is not a full pathname, the path for the configuration file loaded is relative to the $DRIVER path, that is, the path of the currently executing driver. If a full pathname is supplied, that file is used for the driver configuration file.
Related options: –show

–s
(Linux only) Strips the symbol-table information from the executable file.
Default: The compiler includes all symbol-table information and does not use the –s option.
Usage: In this example, pgf95 strips symbol-table information from the a.out executable file.
$ pgf95 -s myprog.f
Description: Use this option to strip the symbol-table information from the executable.
Related options: –c, –o, –u

–S
Stops compilation after the compiling phase and writes the assembly-language output to a file.
Default: The compiler does not produce a .s file.
Usage: In this example, pgf95 produces the file myprog.s in the current directory.
$ pgf95 -S myprog.f
Description: Use this option to stop compilation after the compiling phase and then write the assembly-language output to a file. If the input file is filename.f, then the output file is filename.s.
Related options: –c, –E, –F, –Mkeepasm, –o

–shared
(Linux only) Instructs the compiler to pass information to the linker to produce a shared object (dynamically linked library) file.
Default: The compiler does not pass information to the linker to produce a shared object file.
Usage: In the following example the compiler passes information to the linker to produce the shared object file myso.so.
$ pgf95 -shared myprog.f -o myso.so
Description: Use this option to instruct the compiler to pass information to the linker to produce a shared object (dynamically linked library) file.
Related options: –fpic, –G, –R

–show
Produces driver help information describing the current driver configuration.
Default: The compiler does not show driver help information.
Usage: In the following example, the driver displays configuration information to the standard output after processing the driver configuration file.
$ pgf95 -show myprog.f
Description: Use this option to produce driver help information describing the current driver configuration.
Related options: –V, –v, –###, –help, –rc

–silent
Do not print warning messages.
Default: The compiler prints warning messages.
Usage: In the following example, the driver does not display warning messages.
$ pgf95 -silent myprog.f
Description: Use this option to suppress warning messages.
Related options: –v, –V, –w

–soname
(Linux only) The compiler recognizes the –soname option and passes it through to the linker.
Default: The compiler does not recognize the –soname option.
Usage: In the following example, the driver passes the soname option and its argument through to the linker.
$ pgf95 -soname library.so myprog.f
Description: Use this option to instruct the compiler to recognize the –soname option and pass it through to the linker.
Related options:

–stack
(Windows only) Allows you to explicitly set stack properties for your program.
Default: If –stack is not specified, then the defaults are as follows:
Win32: Setting is -stack:2097152,2097152, which is approximately 2MB for reserved and committed bytes.
Win64: No default setting.
Syntax: -stack={ (reserved bytes)[,(committed bytes)] }{, [no]check }
Usage: The following example demonstrates how to reserve 524,288 stack bytes (512KB), commit 262,144 stack bytes for each routine (256KB), and disable the stack initialization code with the nocheck argument.
$ pgf95 -stack=524288,262144,nocheck myprog.f
Description: Use this option to explicitly set stack properties for your program. The –stack option takes one or more arguments: (reserved bytes), (committed bytes), [no]check.
reserved bytes
Specifies the total stack bytes required in your program.
committed bytes
Specifies the number of stack bytes that the operating system will allocate for each routine in your program. This value must be less than or equal to the stack reserved bytes value. The default for this argument is 4096 bytes.
[no]check
Instructs the compiler to generate or not to generate stack initialization code upon entry of each routine. Check is the default, so stack initialization code is generated.
Stack initialization code is required when a routine's stack exceeds the committed bytes size. When your committed bytes value is equal to the reserved bytes or equal to the stack bytes required for each routine, you can turn off the stack initialization code using the -stack=nocheck compiler option.
If you do this, the compiler assumes that you are specifying enough committed stack space, and therefore your program does not have to manage its own stack size.
For more information on determining the amount of stack required by your program, refer to the –Mchkstk compiler option, described in “–M Miscellaneous Controls”.
Note: -stack=(reserved bytes),(committed bytes) are linker options. -stack=[no]check is a compiler option. If you specify -stack=(reserved bytes),(committed bytes) on your compile line, it is only used during the link step of your build. Similarly, –stack=[no]check can be specified on your link line, but it is only used during the compile step of your build.
Related options: –Mchkstk

–time
Print execution times for various compilation steps.
Default: The compiler does not print execution times for compilation steps.
Usage: In the following example, pgf95 prints the execution times for the various compilation steps.
$ pgf95 -time myprog.f
Description: Use this option to print execution times for various compilation steps.
Related options: –#

–tp target[,target...]
Sets the target architecture.
Default: The PGI compilers produce code specifically targeted to the type of processor on which the compilation is performed. In particular, the default is to use all supported instructions wherever possible when compiling on a given system. The default style of code generation is auto-selected depending on the type of processor on which compilation is performed. Further, the –tp x64 style of unified binary code generation is only enabled by an explicit –tp x64 option.
Note: Executables created on a given system may not be usable on previous generation systems. (For example, executables created on a Pentium 4 may fail to execute on a Pentium III or Pentium II.)
Usage: In the following example, pgf95 sets the target architecture to EM64T:
$ pgf95 -tp p7-64 myprog.f
Description: Use this option to set the target architecture.
By default, the PGI compiler uses all supported instructions wherever possible when compiling on a given system. As a result, executables created on a given system may not be usable on previous generation systems. For example, executables created on a Pentium 4 may fail to execute on a Pentium III or Pentium II.
Processor-specific optimizations can be specified or limited explicitly by using the –tp option. Thus, it is possible to create executables that are usable on previous generation systems. With the exception of k8-64, k8-64e, p7-64, and x64, any of these sub-options are valid on any x86 or x64 processor-based system. The k8-64, k8-64e, p7-64 and x64 options are valid only on x64 processor-based systems.
The –tp x64 option generates unified binary object and executable files, as described in the section called “Using –tp to Generate a Unified Binary”.
The following list shows the possible sub-options for –tp and the processors that each sub-option is intended to target:
k8-32 generate 32-bit code for AMD Athlon64, AMD Opteron and compatible processors.
k8-64 generate 64-bit code for AMD Athlon64, AMD Opteron and compatible processors.
k8-64e generate 64-bit code for AMD Opteron Revision E, AMD Turion, and compatible processors.
p6 generate 32-bit code for Pentium Pro/II/III and AthlonXP compatible processors.
p7 generate 32-bit code for Pentium 4 and compatible processors.
p7-64 generate 64-bit code for Intel P4/Xeon EM64T and compatible processors.
core2 generate 32-bit code for Intel Core 2 Duo and compatible processors.
core2-64 generate 64-bit code for Intel Core 2 Duo EM64T and compatible processors.
piii generate 32-bit code for Pentium III and compatible processors, including support for single-precision vector code using SSE instructions.
px generate 32-bit code that is usable on any x86 processor-based system.
x64 generate 64-bit unified binary code including full optimizations and support for both AMD and Intel x64 processors.
Refer to Table 2, “Processor Options,” on page xxvi for a concise list of the features of these processors that distinguish them as separate targets when using the PGI compilers and tools.
Syntax for 64-bit targets:
-tp {k8-64 | k8-64e | p7-64 | core2-64 | x64}
Syntax for 32-bit targets:
-tp {k8-32 | p6 | p7 | core2 | piii | px}

Using –tp to Generate a Unified Binary
Different processors have differences, some subtle, in hardware features such as instruction sets and cache size. The compilers make architecture-specific decisions about such things as instruction selection, instruction scheduling, and vectorization. Any of these decisions can have significant effects on performance and compatibility. PGI unified binaries provide a low-overhead means for a single program to run well on a number of hardware platforms.
You can use the –tp option to produce PGI Unified Binary programs. The compilers generate, and combine into one executable, multiple binary code streams, each optimized for a specific platform. At runtime, this one executable senses the environment and dynamically selects the appropriate code stream.
The target processor switch, –tp, accepts a comma-separated list of 64-bit targets and will generate code optimized for each listed target. For example, the following switch generates optimized code for three targets: k8-64, p7-64, and core2-64.
Syntax for optimizing for multiple targets:
-tp k8-64,p7-64,core2-64
The –tp k8-64 and –tp k8-64e options result in generation of code supported on and optimized for AMD x64 processors, while the –tp p7-64 option results in generation of code that is supported on and optimized for Intel x64 processors. Performance of k8-64 or k8-64e code executed on Intel x64 processors, or of p7-64 code executed on AMD x64 processors, can often be significantly less than that obtained with a native binary.
The special –tp x64 option is equivalent to –tp k8-64,p7-64.
This switch produces PGI Unified Binary programs containing code streams fully optimized and supported for both AMD64 and Intel EM64T processors.
For more information on unified binaries, refer to “Processor-Specific Optimization and the Unified Binary,” on page 36.
Related options:

–u
Initializes the symbol table with symbol, which is undefined for the linker.
Default: The compiler does not use the –u option.
Syntax: -usymbol
Where symbol is a symbolic name.
Usage: In this example, pgf95 initializes the symbol table with test.
$ pgf95 -utest myprog.f
Description: Use this option to initialize the symbol table with symbol, which is undefined for the linker. An undefined symbol triggers loading of the first member of an archive library.
Related options: –c, –o, –s

–U
Undefines a preprocessor macro.
Syntax: -Usymbol
Where symbol is a symbolic name.
Usage: The following examples undefine the macro test.
$ pgf95 -Utest myprog.F
$ pgf95 -Dtest -Utest myprog.F
Description: Use this option to undefine a preprocessor macro. You can also use the #undef preprocessor directive to undefine macros.
Related options: –D, –Mnostddef

–V[release_number]
Displays additional information, including version messages. Further, if a release_number is appended, the compiler driver attempts to compile using the specified release instead of the default release.
Note: There can be no space between –V and release_number.
Default: The compiler does not display version information and uses the release specified by your path to compile.
Usage: The following command line shows the output using the –V option.
% pgf95 -V myprog.f
The following command line causes PGCC to compile using the 5.2 release instead of the default release.
% pgcc -V5.2 myprog.c
Description: Use this option to display additional information, including version messages or, if a release_number is appended, to instruct the compiler driver to attempt to compile using the specified release instead of the default release.
The specified release must be co-installed with the default release, and must have a release number greater than or equal to 4.1, which was the first release that supported this functionality.
Related options: –Minfo, –v

–v
Displays the invocations of the compiler, assembler, and linker.
Default: The compiler does not display individual phase invocations.
Usage: In the following example you use –v to see the commands sent to compiler tools, assembler, and linker.
$ pgf95 -v myprog.f90
Description: Use the –v option to display the invocations of the compiler, assembler, and linker. These invocations are command lines created by the compiler driver from the files and the –W options you specify on the compiler command line.
Related options: –Minfo, –V, –W

–W
Passes arguments to a specific phase.
Syntax: -W{0 | a | l },option[,option...]
Note: You cannot have a space between the –W and the single-letter pass identifier, between the identifier and the comma, or between the comma and the option.
0 (the number zero) specifies the compiler.
a specifies the assembler.
l (lowercase letter l) specifies the linker.
option is a string that is passed to and interpreted by the compiler, assembler or linker. Options separated by commas are passed as separate command line arguments.
Usage: In the following example the linker loads the text segment at address 0xffc00000 and the data segment at address 0xffe00000.
$ pgf95 -Wl,-k,-t,0xffc00000,-d,0xffe00000 myprog.f
Description: Use this option to pass arguments to a specific phase. You can use the –W option to specify options for the assembler, compiler, or linker.
Note: A given PGI compiler command invokes the compiler driver, which parses the command line and generates the appropriate commands for the compiler, assembler, and linker.
Related options:

–w
Do not print warning messages.
Default: The compiler prints warning messages.
Usage: In the following example no warning messages are printed.
$ pgf95 -w myprog.f
Description: Use the –w option to suppress warning messages. Sometimes the compiler issues many warnings in which you may have no interest. You can use this option to suppress those warnings.
Related options: –silent

–Xs
Use legacy standard mode for C and C++.
Default: None.
Usage: In the following example the compiler uses legacy standard mode.
$ pgcc -Xs myprog.c
Description: Use this option to use legacy standard mode for C and C++. This option implies –alias=traditional.
Related options: –alias, –Xt

–Xt
Use legacy traditional mode for C and C++.
Default: None.
Usage: In the following example the compiler uses legacy traditional mode.
$ pgcc -Xt myprog.c
Description: Use this option to use legacy traditional mode for C and C++. This option implies –alias=traditional.
Related options: –alias, –Xs

C and C++ -specific Compiler Options
There are a large number of compiler options specific to the PGCC and PGC++ compilers, especially PGC++. This section provides the details of several of these options, but is not exhaustive. For a complete list of available options, including an exhaustive list of PGC++ options, use the –help command-line option. For further detail on a given option, use –help and specify the option explicitly, as described in –help.

–A
(pgcpp only) Instructs the PGC++ compiler to accept code conforming to the proposed ANSI C++ standard, issuing errors for non-conforming code.
Default: By default, the compiler accepts code conforming to the standard C++ Annotated Reference Manual.
Usage: The following command line requests ANSI conforming C++.
$ pgcpp -A hello.cc
Description: Use this option to instruct the PGC++ compiler to accept code conforming to the proposed ANSI C++ standard and to issue errors for non-conforming code.
Related options: –a, –b and +p.

–a
(pgcpp only) Instructs the PGC++ compiler to accept code conforming to the proposed ANSI C++ standard, issuing warnings for non-conforming code.
Default: By default, the compiler accepts code conforming to the standard C++ Annotated Reference Manual.
Usage: The following command-line requests ANSI conforming C++, issuing warnings for non-conforming code.
$ pgcpp -a hello.cc
Description: Use this option to instruct the PGC++ compiler to accept code conforming to the proposed ANSI C++ standard and to issue warnings for non-conforming code.
Related options: –A, –b and +p.

–alias
Select optimizations based on type-based pointer alias rules in C and C++.
Syntax: -alias=[ansi|traditional]
Default: None
Usage: The following command-line enables optimizations.
$ pgcpp -alias=ansi hello.cc
Description: Use this option to select optimizations based on type-based pointer alias rules in C and C++.
• ansi - Enable optimizations using ANSI C type-based pointer disambiguation.
• traditional - Disable type-based pointer disambiguation.
Related options:

--[no_]alternative_tokens
(pgcpp only) Enables or disables recognition of alternative tokens. These are tokens that make it possible to write C++ without the use of the {, }, [, ], #, &, |, ^, and ~ characters. The alternative tokens include the operator keywords (e.g., and, bitand, etc.) and digraphs.
Default: The recognition of alternative tokens is disabled: --no_alternative_tokens.
Usage: The following command-line enables alternative token recognition.
$ pgcpp --alternative_tokens hello.cc
Description: (pgcpp only) Use this option to enable or disable recognition of alternative tokens.
These tokens make it possible to write C++ without the use of the {, }, [, ], #, &, |, ^, and ~ characters. The alternative tokens include digraphs and the operator keywords, such as and, bitand, and so on. The default behavior is --no_alternative_tokens.
Related options:

–B
(pgcc and pgcpp only) Enables use of C++ style comments starting with // in C program units.
Default: The PGCC ANSI and K&R C compiler does not allow C++ style comments.
Usage: In the following example the compiler accepts C++ style comments.
$ pgcc -B myprog.cc
Description: Use this option to enable use of C++ style comments starting with // in C program units.
Related options:

–b
(pgcpp only) Enables compilation of C++ with cfront 2.1 compatibility and acceptance of anachronisms.
Default: The compiler does not accept cfront language constructs that are not part of the C++ language definition.
Usage: In the following example the compiler accepts cfront constructs.
$ pgcpp -b myprog.cc
Description: Use this option to enable compilation of C++ with cfront 2.1 compatibility. The compiler then accepts language constructs that, while not part of the C++ language definition, are accepted by the AT&T C++ Language System (cfront release 2.1). This option also enables acceptance of anachronisms.
Related options: --cfront_2.1, –b3, --cfront_3.0, +p, –A

–b3
(pgcpp only) Enables compilation of C++ with cfront 3.0 compatibility and acceptance of anachronisms.
Default: The compiler does not accept cfront language constructs that are not part of the C++ language definition.
Usage: In the following example, the compiler accepts cfront constructs.
$ pgcpp -b3 myprog.cc
Description: Use this option to enable compilation of C++ with cfront 3.0 compatibility. The compiler then accepts language constructs that, while not part of the C++ language definition, are accepted by the AT&T C++ Language System (cfront release 3.0). This option also enables acceptance of anachronisms.
Related options: --cfront_2.1, –b, --cfront_3.0, +p, –A

--[no_]bool
(pgcpp only) Enables or disables recognition of bool.
Default: The compiler recognizes bool: --bool.
Usage: In the following example, the compiler does not recognize bool.
$ pgcpp --no_bool myprog.cc
Description: Use this option to enable or disable recognition of bool.
Related options:

--[no_]builtin
Compile with or without math subroutine builtin support.
Default: The default is to compile with math subroutine support: --builtin.
Usage: In the following example, the compiler does not build with math subroutine support.
$ pgcpp --no_builtin myprog.cc
Description: Use this option to enable or disable compiling with math subroutine builtin support. When you compile with math subroutine builtin support, the selected math library routines are inlined.
Related options:

--cfront_2.1
(pgcpp only) Enables compilation of C++ with cfront 2.1 compatibility and acceptance of anachronisms.
Default: The compiler does not accept cfront language constructs that are not part of the C++ language definition.
Usage: In the following example, the compiler accepts cfront constructs.
$ pgcpp --cfront_2.1 myprog.cc
Description: Use this option to enable compilation of C++ with cfront 2.1 compatibility. The compiler then accepts language constructs that, while not part of the C++ language definition, are accepted by the AT&T C++ Language System (cfront release 2.1). This option also enables acceptance of anachronisms.
Related options: –b, –b3, --cfront_3.0, +p, –A

--cfront_3.0
(pgcpp only) Enables compilation of C++ with cfront 3.0 compatibility and acceptance of anachronisms.
Default: The compiler does not accept cfront language constructs that are not part of the C++ language definition.
Usage: In the following example, the compiler accepts cfront constructs.
$ pgcpp --cfront_3.0 myprog.cc
Description: Use this option to enable compilation of C++ with cfront 3.0 compatibility.
The compiler then accepts language constructs that, while not part of the C++ language definition, are accepted by the AT&T C++ Language System (cfront release 3.0). This option also enables acceptance of anachronisms.
Related options: --cfront_2.1, –b, –b3, +p, –A

--compress_names
Compresses long function names in the file.
Default: The compiler does not compress names: --no_compress_names.
Usage: In the following example, the compiler compresses long function names.
$ pgcpp --compress_names myprog.cc
Description: Use this option to compress long function names. Highly nested template parameters can cause very long function names. These long names can cause problems for older assemblers. Users encountering these problems should compile all C++ code, including library code, with the switch --compress_names. Libraries supplied by PGI work with --compress_names.
Related options:

--create_pch filename
(pgcpp only) If other conditions are satisfied, create a precompiled header file with the specified name.
Note: If --pch (automatic PCH mode) appears on the command line following this option, its effect is erased.
Default: The compiler does not create a precompiled header file.
Usage: In the following example, the compiler creates a precompiled header file, hdr1.
$ pgcpp --create_pch hdr1 myprog.cc
Description: If other conditions are satisfied, use this option to create a precompiled header file with the specified name.
Related options:

--diag_error tag
(pgcpp only) Overrides the normal error severity of the specified diagnostic messages.
Default: The compiler does not override normal error severity.
Description: Use this option to override the normal error severity of the specified diagnostic messages. The message(s) may be specified using a mnemonic error tag or using an error number.
Related options: --diag_remark tag, --diag_suppress tag, --diag_warning tag, --display_error_number

--diag_remark tag
(pgcpp only) Overrides the normal error severity of the specified diagnostic messages.
Default: The compiler does not override normal error severity.
Description: Use this option to override the normal error severity of the specified diagnostic messages. The message(s) may be specified using a mnemonic error tag or using an error number.
Related options: --diag_error tag, --diag_suppress tag, --diag_warning tag, --display_error_number

--diag_suppress tag
(pgcpp only) Overrides the normal error severity of the specified diagnostic messages.
Default: The compiler does not override normal error severity.
Usage: In the following example, the compiler overrides the normal error severity of the specified diagnostic messages.
$ pgcpp --diag_suppress error_tag prog.cc
Description: Use this option to override the normal error severity of the specified diagnostic messages. The message(s) may be specified using a mnemonic error tag or using an error number.
Related options: --diag_error tag, --diag_remark tag, --diag_warning tag, --display_error_number

--diag_warning tag
(pgcpp only) Overrides the normal error severity of the specified diagnostic messages.
Default: The compiler does not override normal error severity.
Usage: In the following example, the compiler overrides the normal error severity of the specified diagnostic messages.
$ pgcpp --diag_warning an_error_tag myprog.cc
Description: Use this option to override the normal error severity of the specified diagnostic messages. The message(s) may be specified using a mnemonic error tag or using an error number.
Related options: --diag_error tag, --diag_remark tag, --diag_suppress tag, --display_error_number

--display_error_number
(pgcpp only) Displays the error message number in any diagnostic messages that are generated.
The option may be used to determine the error number to be used when overriding the severity of a diagnostic message.
Default: The compiler does not display error message numbers for generated diagnostic messages.
Usage: In the following example, the compiler displays the error message number for any generated diagnostic messages.
$ pgcpp --display_error_number myprog.cc
Description: Use this option to display the error message number in any diagnostic messages that are generated. You can use this option to determine the error number to be used when overriding the severity of a diagnostic message.
Related options: --diag_error tag, --diag_remark tag, --diag_suppress tag, --diag_warning tag

-e
(pgcpp only) Set the C++ front-end error limit to the specified value.

--[no_]exceptions
(pgcpp only) Enables or disables exception handling support.
Default: The compiler provides exception handling support: --exceptions.
Usage: In the following example, the compiler does not provide exception handling support.
$ pgcpp --no_exceptions myprog.cc
Description: Use this option to enable or disable exception handling support.
Related options:

--gnu_extensions
(pgcpp only) Allows GNU extensions.
Default: The compiler does not allow GNU extensions.
Usage: In the following example, the compiler allows GNU extensions.
$ pgcpp --gnu_extensions myprog.cc
Description: Use this option to allow GNU extensions, such as #include_next, which are required to compile Linux system header files.
Related options:

--[no]llalign
(pgcpp only) Enables or disables alignment of long long integers on long long boundaries.
Default: The compiler aligns long long integers on long long boundaries: --llalign.
Usage: In the following example, the compiler does not align long long integers on long long boundaries.
$ pgcpp --nollalign myprog.cc
Description: Use this option to enable or disable alignment of long long integers on long long boundaries.
Related options:

–M
Generates a list of make dependencies and prints them to stdout.
Note: The compilation stops after the preprocessing phase.
Default: The compiler does not generate a list of make dependencies.
Usage: In the following example, the compiler generates a list of make dependencies.
$ pgcpp -M myprog.cc
Description: Use this option to generate a list of make dependencies and print them to stdout.
Related options: –MD, –P, –suffix

–MD
Generates a list of make dependencies and prints them to a file.
Default: The compiler does not generate a list of make dependencies.
Usage: In the following example, the compiler generates a list of make dependencies and prints them to the file myprog.d.
$ pgcpp -MD myprog.cc
Description: Use this option to generate a list of make dependencies and print them to a file. The name of the dependencies file is determined by the name of the file under compilation.
Related options: –M, –P, –suffix

--optk_allow_dollar_in_id_chars
(pgcpp only) Accepts dollar signs ($) in identifiers.
Default: The compiler does not accept dollar signs ($) in identifiers.
Usage: In the following example, the compiler allows dollar signs ($) in identifiers.
$ pgcpp --optk_allow_dollar_in_id_chars myprog.cc
Description: Use this option to instruct the compiler to accept dollar signs ($) in identifiers.

–P
Halts the compilation process after preprocessing and writes the preprocessed output to a file.
Default: The compiler produces an executable file.
Usage: In the following example, the compiler produces the preprocessed file myprog.i in the current directory.
$ pgcpp -P myprog.cc
Description: Use this option to halt the compilation process after preprocessing and write the preprocessed output to a file. If the input file is filename.c or filename.cc, then the output file is filename.i.
Note: Use the –suffix option with this option to save the intermediate file in a file with the specified suffix.
Related options: –C, –c, –E, –Mkeepasm, –o, –S

-+p
(pgcpp only) Disallow all anachronistic constructs.
Default: The compiler disallows all anachronistic constructs.
Usage: In the following example, the compiler disallows all anachronistic constructs.
$ pgcpp -+p myprog.cc
Description: Use this option to disallow all anachronistic constructs.
Related options:

--pch
(pgcpp only) Automatically use and/or create a precompiled header file.
Note: If --use_pch or --create_pch (manual PCH mode) appears on the command line following this option, this option has no effect.
Default: The compiler does not automatically use or create a precompiled header file.
Usage: In the following example, the compiler automatically uses a precompiled header file.
$ pgcpp --pch myprog.cc
Description: Use this option to automatically use and/or create a precompiled header file.
Related options:

--pch_dir directoryname
(pgcpp only) Specifies the directory in which to search for and/or create a precompiled header file.
Default: The compiler searches your PATH for precompiled header files to use or create.
Usage: In the following example, the compiler searches in the directory myhdrdir for a precompiled header file.
$ pgcpp --pch_dir myhdrdir myprog.cc
Description: Use this option to specify the directory in which to search for and/or create a precompiled header file. You may use this option with automatic PCH mode (--pch) or manual PCH mode (--create_pch or --use_pch).
Related options: --create_pch, --pch, --use_pch

--[no_]pch_messages
(pgcpp only) Enables or disables the display of a message indicating that the current compilation used or created a precompiled header file.
Default: The compiler displays a message when it uses or creates a precompiled header file.
Usage: In the following example, no message is displayed when the precompiled header file located in myhdrdir is used in the compilation.
$ pgcpp --pch_dir myhdrdir --no_pch_messages myprog.cc
Description: Use this option to enable or disable the display of a message indicating that the current compilation used or created a precompiled header file.
Related options: --pch_dir

--preinclude=
(pgcpp only) Specifies the name of a file to be included at the beginning of the compilation.
Usage: In the following example, the compiler includes the file incl_file.c at the beginning of the compilation.
$ pgcpp --preinclude=incl_file.c myprog.cc
Description: Use this option to specify the name of a file to be included at the beginning of the compilation. For example, you can use this option to set system-dependent macros and types.
Related options:

--use_pch filename
(pgcpp only) Uses a precompiled header file of the specified name as part of the current compilation.
Note: If --pch (automatic PCH mode) appears on the command line following this option, its effect is erased.
Default: The compiler does not use a precompiled header file.
Usage: In the following example, the compiler uses the precompiled header file, hdr1, as part of the current compilation.
$ pgcpp --use_pch hdr1 myprog.cc
Description: Use this option to use a precompiled header file of the specified name as part of the current compilation. If --pch (automatic PCH mode) appears on the command line following this option, its effect is erased.
Related options: --create_pch, --pch_dir, --pch_messages

--[no_]using_std
(pgcpp only) Enables or disables implicit use of the std namespace when standard header files are included.
Default: The compiler uses std namespace when standard header files are included: --using_std.
Usage: The following command-line disables implicit use of the std namespace:
$ pgcpp --no_using_std hello.cc
Description: Use this option to enable or disable implicit use of the std namespace when standard header files are included in the compilation.
Related options:

–t
(pgcpp only) Control instantiation of template functions.
Syntax: –t [arg]
Default: No templates are instantiated.
Usage: In the following example, all templates are instantiated.
$ pgcpp -tall myprog.cc
Description: Use this option to control instantiation of template functions. The argument is one of the following:
• all - Instantiates all functions whether or not they are used.
• local - Instantiates only the functions that are used in this compilation, and forces those functions to be local to this compilation. Note: This may cause multiple copies of local static variables. If this occurs, the program may not execute correctly.
• none - Instantiates no functions. (This is the default.)
• used - Instantiates only the functions that are used in this compilation.

–X
(pgcpp only) Generates cross-reference information and places output in the specified file.
Syntax: –Xfoo where foo is the specified file for the cross-reference information.
Default: The compiler does not generate cross-reference information.
Usage: In the following example, the compiler generates cross-reference information, placing it in the file xreffile.
$ pgcpp -Xxreffile myprog.cc
Description: Use this option to generate cross-reference information and place output in the specified file. This is an EDG option.
Related options:

--zc_eh
(Linux only) Generates zero-overhead exception regions.
Default: The compiler does not use --zc_eh but instead uses --sjlj_eh, which implements exception handling with setjmp and longjmp.
Usage: The following command-line enables zero-overhead exception regions:
$ pgcpp --zc_eh hello.cc
Description: Use this option to generate zero-overhead exception regions. The --zc_eh option defers the cost of exception handling until an exception is thrown. For a program with many exception regions and few throws, this option may lead to improved run-time performance. This option is compatible with C++ code that was compiled with previous versions of PGI C++.
Note: The --zc_eh option is available only on newer Linux systems that supply the system unwind libraries in libgcc_eh and on Windows.
Related options:

–M Options by Category
This section describes each of the options available with –M by the categories:
• Code generation
• Fortran Language Controls
• Optimization
• C/C++ Language Controls
• Inlining
• Miscellaneous
• Environment
For a complete alphabetical list of all the options, refer to “–M Options Summary,” on page 185.
The following sections provide detailed descriptions of several, but not all, of the –M options. These options are grouped according to categories and are listed with exact syntax, defaults, and notes concerning similar or related options. For the latest information and description of a given option, or to see all available options, use the –help command-line option, described in “–help,” on page 178.

–M Code Generation Controls
This section describes the –M options that control code generation.
Default: For arguments that you do not specify, the default code generation controls are these:
nodaz, noreentrant, nostride0, noflushz, noref_externals, signextend, norecursive, nosecond_underscore
Related options: –D, –I, –L, –l, –U

Syntax: Description and Related Options

–Mdaz
Set IEEE denormalized input values to zero; there is a performance benefit but misleading results can occur, such as when dividing a small normalized number by a denormalized number. To take effect, this option must be set for the main program.

–Mnodaz
Do not treat denormalized numbers as zero. To take effect, this option must be set for the main program.

–Mdwarf1
Generate DWARF1 format debug information; must be used in combination with –g.

–Mdwarf2
Generate DWARF2 format debug information; must be used in combination with –g.

–Mdwarf3
Generate DWARF3 format debug information; must be used in combination with –g.

–Mflushz
Set SSE flush-to-zero mode; if a floating-point underflow occurs, the value is set to zero. To take effect, this option must be set for the main program.

–Mnoflushz
Do not set SSE flush-to-zero mode; generate underflows. To take effect, this option must be set for the main program.

–Mfunc32
Align functions on 32-byte boundaries.

–Mlarge_arrays
Enable support for 64-bit indexing and single static data objects larger than 2GB in size. This option is default in the presence of –mcmodel=medium. It can be used separately together with the default small memory model for certain 64-bit applications that manage their own memory space. For more information, refer to Chapter 11, “Programming Considerations for 64-Bit Environments”.

–Mmpi=option
-Mmpi adds the include and library options to the compile and link commands necessary to build an MPI application using MPI libraries installed with the PGI Cluster Development Kit (CDK). On Linux, this option inserts -I$MPIDIR/include into the compile line and -L$MPIDIR/lib into the link line.
The specified option determines whether to select MPICH-1 or MPICH-2 headers and libraries. The base directories for MPICH-1 and MPICH-2 are set in localrc. On Windows, this option inserts -I$CCP_HOME/Include into the compile line and -L$CCP_HOME/lib into the link line. The -Mmpi options are as specified:
• –Mmpi=mpich1 - Selects preconfigured MPICH-1 communication libraries.
• –Mmpi=mpich2 - Selects preconfigured MPICH-2 communication libraries.
• –Mmpi=msmpi - Selects Microsoft MSMPI libraries.
Note: The user can set the environment variables MPIDIR and MPILIBNAME to override the default values for the MPI directory and library name. MPICH1 and MPICH2 apply only for the PGI Cluster Development Kit (CDK); MSMPI applies only on Microsoft Compute Cluster systems. For –Mmpi=msmpi to work, the CCP_HOME environment variable must be set. When the Microsoft Compute Cluster SDK is installed, this variable is typically set to point to the MSMPI library directory.

–Mnolarge_arrays
Disable support for 64-bit indexing and single static data objects larger than 2GB in size. When placed after –mcmodel=medium on the command line, disables use of 64-bit indexing for applications that have no single data object larger than 2GB.

–Mnomain
Instructs the compiler not to include the object file that calls the Fortran main program as part of the link step. This option is useful for linking programs in which the main program is written in C/C++ and one or more subroutines are written in Fortran (pgf77, pgf95, and pghpf only).

–M[no]movnt
Instructs the compiler to generate nontemporal move and prefetch instructions even in cases where the compiler cannot determine statically at compile-time that these instructions will be beneficial.

–Mprof[=option[,option,...]]
Set performance profiling options. Use of these options causes the resulting executable to create a performance profile that can be viewed and analyzed with the PGPROF performance profiler.
In the descriptions that follow, PGI-style profiling implies compiler-generated source instrumentation. MPICH-style profiling implies the use of instrumented wrappers for MPI library routines. The option argument can be any of the following:
• dwarf - Generate limited DWARF symbol information sufficient for most performance profilers.
• func - Perform PGI-style function-level profiling.
• hwcts - Generate a profile using event-based sampling of hardware counters via the PAPI interface. (linux86-64 platforms only; PAPI must be installed.)
• lines - Perform PGI-style line-level profiling.
• mpich1 - Perform MPICH-style profiling for MPICH-1. Implies –Mmpi=mpich1. (Linux only.)
• mpich2 - Perform MPICH-style profiling for MPICH-2. Implies –Mmpi=mpich2. (Linux with MPI support licence privileges only.)
• msmpi - Perform MPICH-style profiling for Microsoft MSMPI. Implies –Mmpi=msmpi. (Microsoft Compute Cluster Server only.) For -Mprof=msmpi to work, the CCP_HOME environment variable must be set. This variable is typically set when the Microsoft Compute Cluster SDK is installed.
• time - Generate a profile using time-based instruction-level statistical sampling. This is equivalent to -pg, except that the profile is saved to a file named pgprof.out rather than gmon.out.

–Mrecursive
Instructs the compiler to allow Fortran subprograms to be called recursively.

–Mnorecursive
Fortran subprograms may not be called recursively.

–Mref_externals
Force references to names appearing in EXTERNAL statements (pgf77, pgf95, and pghpf only).

–Mnoref_externals
Do not force references to names appearing in EXTERNAL statements (pgf77, pgf95, and pghpf only).

–Mreentrant
Instructs the compiler to avoid optimizations that can prevent code from being reentrant.

–Mnoreentrant
Instructs the compiler not to avoid optimizations that can prevent code from being reentrant.
–Msecond_underscore
Instructs the compiler to add a second underscore to the name of a Fortran global symbol if its name already contains an underscore. This option is useful for maintaining compatibility with object code compiled using g77, which uses this convention by default (pgf77, pgf95, and pghpf only).

–Mnosecond_underscore
Instructs the compiler not to add a second underscore to the name of a Fortran global symbol if its name already contains an underscore (pgf77, pgf95, and pghpf only).

–Msignextend
Instructs the compiler to extend the sign bit that is set as a result of converting an object of one data type to an object of a larger signed data type.

–Mnosignextend
Instructs the compiler not to extend the sign bit that is set as the result of converting an object of one data type to an object of a larger data type.

–Msafe_lastval
In the case where a scalar is used after a loop, but is not defined on every iteration of the loop, the compiler does not by default parallelize the loop. However, this option tells the compiler it is safe to parallelize the loop. For a given loop, the last value computed for all scalars makes it safe to parallelize the loop.

–Mstride0
Instructs the compiler to inhibit certain optimizations and to allow for stride 0 array references. This option may degrade performance and should only be used if zero-stride induction variables are possible.

–Mnostride0
Instructs the compiler to perform certain optimizations and to disallow stride 0 array references.

–Munix
Use UNIX symbol and parameter passing conventions for Fortran subprograms (pgf77, pgf95, and pghpf for Win32 only).

–Mvarargs
Force Fortran program units to assume procedure calls are to C functions with a varargs-type interface (pgf77 and pgf95 only).

–M C/C++ Language Controls
This section describes the –M options that affect C/C++ language interpretations by the PGI C and C++ compilers. These options are valid only for the pgcc and pgcpp compiler drivers.
Default: For arguments that you do not specify, the defaults are as follows:
noasmkeyword
nosingle
dollar,_
schar

Usage:
In this example, the compiler allows the asm keyword in the source file.
$ pgcc -Masmkeyword myprog.c
In the following example, the compiler maps the dollar sign to the dot character.
$ pgcc -Mdollar,. myprog.c
In the following example, the compiler treats floating-point constants as float values.
$ pgcc -Mfcon myprog.c
In the following example, the compiler does not convert float parameters to double parameters.
$ pgcc -Msingle myprog.c
Without –Muchar or with –Mschar, the variable ch is a signed character:
char ch;
signed char sch;
If –Muchar is specified on the command line:
$ pgcc -Muchar myprog.c
char ch above is equivalent to:
unsigned char ch;

Syntax: Description and Related Options

–Masmkeyword
Instructs the compiler to allow the asm keyword in C source files. The syntax of the asm statement is as follows:
asm("statement");
Where statement is a legal assembly-language statement. The quote marks are required.
Note: The current default is to support gcc's extended asm, where the syntax of extended asm includes asm strings. The –M[no]asmkeyword switch is useful only if the target device is a Pentium 3 or older cpu type (–tp piii|p6|k7|athlon|athlonxp|px).

–Mnoasmkeyword
Instructs the compiler not to allow the asm keyword in C source files. If you use this option and your program includes the asm keyword, unresolved references will be generated.

–Mdollar,char
char specifies the character to which the compiler maps the dollar sign ($). The PGCC compiler allows the dollar sign in names; ANSI C does not allow the dollar sign in names.

–Mfcon
Instructs the compiler to treat floating-point constants as float data types, instead of double data types. This option can improve the performance of single-precision code.

–Mschar
Specifies signed char characters. The compiler treats "plain" char declarations as signed char.

–Msingle
Do not convert float parameters to double parameters in non-prototyped functions. This option can result in faster code if your program uses only float parameters. However, since ANSI C specifies that routines must convert float parameters to double parameters in non-prototyped functions, this option results in non-ANSI conformant code.

–Mnosingle
Instructs the compiler to convert float parameters to double parameters in non-prototyped functions.

–Muchar
Instructs the compiler to treat "plain" char declarations as unsigned char.

–M Environment Controls
This section describes the –M options that control environments.
Default: For arguments that you do not specify, the default environment option depends on your configuration.

Syntax: Description and Related Options

–Mlfs
(32-bit Linux only) Link in libraries that enable file I/O to files larger than 2GB (Large File Support).

–Mnostartup
Instructs the linker not to link in the standard startup routine that contains the entry point (_start) for the program.
Note: If you use the –Mnostartup option and do not supply an entry point, the linker issues the following error message:
Warning: cannot find entry symbol _start

–M[no]smartalloc[=huge|huge:n|hugebss]
Adds a call to the routine mallopt in the main routine. This option supports large TLBs on Linux and Windows. This option must be used to compile the main routine to enable optimized malloc routines. The option arguments can be any of the following:
• huge - Link in the huge page runtime library. Enables large 2-megabyte pages to be allocated. The effect is to reduce the number of TLB entries required to execute a program. This option is most effective on Barcelona and Core 2 systems; older architectures do not have enough TLB entries for this option to be beneficial. By itself, the huge suboption tries to allocate as many huge pages as required.
• huge:n - Link the huge page runtime library and allocate n huge pages.
Use this suboption to limit the number of huge pages allocated to n. You can also limit the pages allocated by using the environment variable PGI_HUGE_PAGES.

hugebss
Puts the BSS section in huge pages; attempts to put a program's uninitialized data section into huge pages.

Tip: To be effective, this switch must be specified when compiling the file containing the Fortran, C, or C++ main program.

–M[no]stddef
instructs the compiler not to predefine any macros to the preprocessor when compiling a C program.

–Mnostdinc
instructs the compiler not to search the standard location for include files.

–Mnostdlib
instructs the linker not to link in the standard libraries libpgftnrtl.a, libm.a, libc.a and libpgc.a in the library directory lib within the standard directory. You can link in your own library with the –l option or specify a library directory with the –L option.

–M Fortran Language Controls

This section describes the –M options that affect Fortran language interpretations by the PGI Fortran compilers. These options are valid only for the pghpf, pgf77 and pgf95 compiler drivers.

Default: For arguments that you do not specify, the defaults are as follows:
nobackslash, noiomutex, nodclchk, noonetrip, nodefaultunit, nosave, nodlines, nounixlogical, dollar,_ and noupcase

Syntax: Description and Related Options

–Mallocatable=95|03
controls whether Fortran 95 or Fortran 2003 semantics are used in allocatable array assignments. The default behavior is to use Fortran 95 semantics; the 03 option instructs the compiler to use Fortran 2003 semantics.

–Mbackslash
the compiler treats the backslash as a normal character, and not as an escape character in quoted strings.

–Mnobackslash
the compiler recognizes a backslash as an escape character in quoted strings (in accordance with standard C usage).

–Mdclchk
the compiler requires that all program variables be declared.

–Mnodclchk
the compiler does not require that all program variables be declared.

–Mdefaultunit
the compiler treats "*" as a synonym for standard input for reading and standard output for writing.

–Mnodefaultunit
the compiler treats "*" as a synonym for unit 5 on input and unit 6 on output.

–Mdlines
the compiler treats lines containing "D" in column 1 as executable statements (ignoring the "D").

–Mnodlines
the compiler does not treat lines containing "D" in column 1 as executable statements (does not ignore the "D").

–Mdollar,char
char specifies the character to which the compiler maps the dollar sign. The compiler allows the dollar sign in names.

–Mextend
the compiler accepts 132-column source code; otherwise it accepts 72-column code.

–Mfixed
the compiler assumes input source files are in FORTRAN 77-style fixed-form format.

–Mfree
the compiler assumes the input source files are in Fortran 90/95 free-form format.

–Miomutex
the compiler generates critical section calls around Fortran I/O statements.

–Mnoiomutex
the compiler does not generate critical section calls around Fortran I/O statements.

–Monetrip
the compiler forces each DO loop to execute at least once.

–Mnoonetrip
the compiler does not force each DO loop to execute at least once. This option is useful for programs written for earlier versions of Fortran.

–Msave
the compiler assumes that all local variables are subject to the SAVE statement. Note that this may allow older Fortran programs to run, but it can greatly reduce performance.

–Mnosave
the compiler does not assume that all local variables are subject to the SAVE statement.

–Mstandard
the compiler flags non-ANSI-conforming source code.

–Munixlogical
directs the compiler to treat logical values as true if the value is non-zero and false if the value is zero (UNIX F77 convention).
When –Munixlogical is enabled, a logical value or test that is non-zero is .TRUE., and a value or test that is zero is .FALSE. In addition, the value of a logical expression is guaranteed to be one (1) when the result is .TRUE.

–Mnounixlogical
directs the compiler to use the VMS convention for logical values for true and false. Odd values (least significant bit set) are true and even values are false.

–Mupcase
the compiler allows uppercase letters in identifiers. With –Mupcase, the identifiers "X" and "x" are different, and keywords must be in lower case. This selection affects the linking process: if you compile and link the same source code using –Mupcase on one occasion and –Mnoupcase on another, you may get two different executables (depending on whether the source contains uppercase letters). The standard libraries are compiled using the default –Mnoupcase.

–Mnoupcase
the compiler converts all identifiers to lower case. This selection affects the linking process in the same way as described for –Mupcase. The standard libraries are compiled using –Mnoupcase.

–M Inlining Controls

This section describes the –M options that control function inlining. Before looking at all the options, let’s look at a couple of examples.

Usage: In the following example, the compiler extracts functions that have 500 or fewer statements from the source file myprog.f and saves them in the file extract.il.

$ pgf95 -Mextract=500 -oextract.il myprog.f

In the following example, the compiler inlines functions with fewer than approximately 100 statements in the source file myprog.f and writes the executable code in the default output file a.out.
$ pgf95 -Minline=size:100 myprog.f Related options: –o, –Mextract Syntax: Description and Related Options –M[no]autoinline instructs the compiler to inline a C/C++ function at –O2 and above when it is declared with the inline keyword. –Mextract[=option[,option,...]] Extracts functions from the file indicated on the command line and creates or appends to the specified extract directory where option can be any of: name:func instructs the extractor to extract function func from the file. size:number instructs the extractor to extract functions with number or fewer, statements from the file.Chapter 15. Command-Line Options Reference 229 lib:filename.ext Use directory filename.ext as the extract directory (required in order to save and re-use inline libraries). If you specify both name and size, the compiler extracts functions that match func, or that have number or fewer statements. For examples of extracting functions, see Chapter 4, “Using Function Inlining”. –Minline[=option[,option,...]] This passes options to the function inliner, where the option can be any of these: except:func instructs the inliner to inline all eligible functions except func, a function in the source text. Multiple functions can be listed, comma-separated. [name:]func instructs the inliner to inline the function func. The func name should be a non-numeric string that does not contain a period. You can also use a name: prefix followed by the function name. If name: is specified, what follows is always the name of a function. [lib:]filename.ext instructs the inliner to inline the functions within the library file filename.ext. The compiler assumes that a filename.ext option containing a period is a library file. Create the library file using the –Mextract option. You can also use a lib: prefix followed by the library name. If lib: is specified, no period is necessary in the library name. Functions from the specified library are inlined. 
If no library is specified, functions are extracted from a temporary library created during an extract prepass. levels:number instructs the inliner to perform number levels of inlining. The default number is 1. [no]reshape instructs the inliner to allow (disallow)inlining in Fortran even when array shapes do not match. The default is -Minline=noreshape, except with -Mconcur or -mp, where the default is -Minline=reshape. [size:]number instructs the inliner to inline functions with number or fewer statements. You can also use a size: prefix followed by a number. If size: is specified, what follows is always taken as a number. If you specify both func and number, the compiler inlines functions that match the function name or have number or fewer statements. For examples of inlining functions, refer to Chapter 4, “Using Function Inlining”. –M Optimization Controls This section describes the –M options that control optimization. Before looking at all the options, let’s look at the defaults. Default: For arguments that you do not specify, the default optimization control options are as follows: depchk noipa nounroll nor8 i4 nolre novect nor8intrinsics nofprelaxed noprefetchPGI® User’s Guide 230 Note If you do not supply an option to –Mvect, the compiler uses defaults that are dependent upon the target system. Usage: In this example, the compiler invokes the vectorizer with use of packed SSE instructions enabled. $ pgf95 -Mvect=sse -Mcache_align myprog.f Related options: –g, –O Syntax: Description and Related Options –Mcache_align Align unconstrained objects of length greater than or equal to 16 bytes on cache-line boundaries. An unconstrained object is a data object that is not a member of an aggregate structure or common block. This option does not affect the alignment of allocatable or automatic arrays. Note: To effect cache-line alignment of stack-based local variables, the main program or function must be compiled with –Mcache_align. 
–Mconcur[=option [,option,...]] Instructs the compiler to enable auto-concurrentization of loops. If –Mconcur is specified, multiple processors will be used to execute loops that the compiler determines to be parallelizable. Where option is one of the following: [no]altcode:n Instructs the parallelizer to generate alternate serial code for parallelized loops. If altcode is specified without arguments, the parallelizer determines an appropriate cutoff length and generates serial code to be executed whenever the loop count is less than or equal to that length. If altcode:n is specified, the serial altcode is executed whenever the loop count is less than or equal to n. If noaltcode is specified, the parallelized version of the loop is always executed regardless of the loop count. cncall Calls in parallel loops are safe to parallelize. Loops containing calls are candidates for parallelization. Also, no minimum loop count threshold must be satisfied before parallelization will occur, and last values of scalars are assumed to be safe. dist:block Parallelize with block distribution (this is the default). Contiguous blocks of iterations of a parallelizable loop are assigned to the available processors. dist:cyclic Parallelize with cyclic distribution. The outermost parallelizable loop in any loop nest is parallelized. If a parallelized loop is innermost, its iterations are allocated to processors cyclically. For example, if there are 3 processors executing a loop, processor 0 performs iterations 0, 3, 6, etc.; processor 1 performs iterations 1, 4, 7, etc.; and processor 2 performs iterations 2, 5, 8, etc. [no]innermost Enable parallelization of innermost loops. The default is to not parallelize innermost loops, since it is usually not profitable on dual-core processors.Chapter 15. Command-Line Options Reference 231 noassoc Disables parallelization of loops with reductions. When linking, the –Mconcur switch must be specified or unresolved references will result. 
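The difference between dist:block and dist:cyclic can be pictured with a small sketch. This is only an illustration of the two assignment patterns described above, not the PGI runtime's implementation; the function names are hypothetical, and the way leftover iterations are spread over the first processors in the block case is an assumption made here for the example.

```python
def block_distribution(n_iterations, n_procs):
    """Assign contiguous blocks of iterations to each processor.

    Assumption: when the iterations do not divide evenly, the first
    processors each receive one extra iteration.
    """
    base, extra = divmod(n_iterations, n_procs)
    assignment = []
    start = 0
    for proc in range(n_procs):
        size = base + (1 if proc < extra else 0)
        assignment.append(list(range(start, start + size)))
        start += size
    return assignment


def cyclic_distribution(n_iterations, n_procs):
    """Deal iterations out to processors one at a time, round-robin."""
    return [list(range(proc, n_iterations, n_procs)) for proc in range(n_procs)]


if __name__ == "__main__":
    for proc, iters in enumerate(block_distribution(9, 3)):
        print("block  processor", proc, "->", iters)
    for proc, iters in enumerate(cyclic_distribution(9, 3)):
        print("cyclic processor", proc, "->", iters)
```

With 9 iterations and 3 processors, the cyclic case reproduces the pattern in the text: processor 0 gets iterations 0, 3, 6; processor 1 gets 1, 4, 7; processor 2 gets 2, 5, 8.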
The NCPUS environment variable controls how many processors or cores are used to execute parallelized loops.

Note: This option applies only on shared-memory multi-processor (SMP) or multi-core processor-based systems.

–Mcray[=option[,option,...]]
(pgf77 and pgf95 only) Force Cray Fortran (CF77) compatibility with respect to the listed options. Possible values of option include:

pointer
For purposes of optimization, it is assumed that pointer-based variables do not overlay the storage of any other variable.

–Mdepchk
instructs the compiler to assume unresolved data dependencies actually conflict.

–Mnodepchk
instructs the compiler to assume potential data dependencies do not conflict. However, if data dependencies exist, this option can produce incorrect code.

–Mdse
Enables a dead store elimination phase that is useful for programs that rely on extensive use of inline function calls for performance. This is disabled by default.

–Mnodse
(default) Disables the dead store elimination phase.

–M[no]fpapprox[=option]
Perform certain floating-point operations using low-precision approximation. By default -Mfpapprox is not used. If -Mfpapprox is used without suboptions, it defaults to use approximate div, sqrt, and rsqrt. The available suboptions are these:

div
Approximate floating point division

sqrt
Approximate floating point square root

rsqrt
Approximate floating point reciprocal square root

–M[no]fpmisalign
Instructs the compiler to allow (not allow) vector arithmetic instructions with memory operands that are not aligned on 16-byte boundaries. The default is -Mnofpmisalign on all processors.

Note: Applicable only with one of these options: –tp barcelona or –tp barcelona-64

–Mfprelaxed[=option]
instructs the compiler to use relaxed precision in the calculation of some intrinsic functions. Can result in improved performance at the expense of numerical accuracy. The possible values for option are:

div
Perform divide using relaxed precision.
noorder
Do not allow expression reordering or factoring.

order
Allow expression reordering, including factoring.

rsqrt
Perform reciprocal square root (1/sqrt) using relaxed precision.

sqrt
Perform square root with relaxed precision.

With no options, –Mfprelaxed generates relaxed precision code for those operations that generate a significant performance improvement, depending on the target processor.

–Mnofprelaxed
(default) instructs the compiler not to use relaxed precision in the calculation of intrinsic functions.

–Mi4
(pgf77 and pgf95 only) the compiler treats INTEGER variables as INTEGER*4.

–Mipa=

Examples of correct nosquash_nid syntax are 192.168.1.1@tcp and 4@elan8. There is no syntax checking for nosquash_nid. In case of a syntax error, nosquash_nid is set to the default value, LNET_NID_ANY.

The syntax for rootsquash is ':'. Examples of correct rootsquash syntax are 500:501, 500 and :501. In case of a syntax error, Lustre commands handle the error in different ways:

¦ mkfs.lustre and tunefs.lustre commands ignore the syntax error, and set the parameter to the default value, 0:0.
¦ lctl set_param and lctl conf_param commands preserve the current rootsquash value. In the case of the lctl conf_param command, the incorrect parameter takes effect when the MDS restarts. The parameter is set to the default value, 0:0.

CHAPTER 27
Lustre Operating Tips

This chapter describes tips to improve Lustre operations and includes the following sections:

¦ Adding an OST to a Lustre File System
¦ A Simple Data Migration Script
¦ Adding Multiple SCSI LUNs on Single HBA
¦ Failures Running a Client and OST on the Same Machine
¦ Improving Lustre Metadata Performance While Using Large Directories

Lustre 1.8 Operations Manual • December 2010

27.1 Adding an OST to a Lustre File System

To add an OST to an existing Lustre file system:

1. Add a new OST by running the following commands:

$ mkfs.lustre --fsname=spfs --ost --mgsnode=mds16@tcp0 /dev/sda
$ mkdir -p /mnt/test/ost0
$ mount -t lustre /dev/sda /mnt/test/ost0

2. Migrate the data (possibly).

The file system is quite unbalanced when new empty OSTs are added. New file creations are automatically balanced. If this is a scratch file system or files are pruned at a regular interval, then no further work may be needed.

Files existing prior to the expansion can be rebalanced with an in-place copy, which can be done with a simple script. The basic method is to copy existing files to a temporary file, then move the temp file over the old one.
This should not be attempted with files which are currently being written to by users or applications. This operation redistributes the stripes over the entire set of OSTs. For a sample data migration script, see A Simple Data Migration Script.

A very clever migration script would do the following:

¦ Examine the current distribution of data.
¦ Calculate how much data should move from each full OST to the empty ones.
¦ Search for files on a given full OST (using lfs getstripe).
¦ Force the new destination OST (using lfs setstripe).
¦ Copy only enough files to address the imbalance.

If a Lustre administrator wants to explore this approach further, per-OST disk-usage statistics can be found under /proc/fs/lustre/osc/*/rpc_stats

27.2 A Simple Data Migration Script

#!/bin/bash
# set -x
# A script to copy and check files.
# To avoid allocating objects on one or more OSTs, they should be
# deactivated on the MDS via "lctl --device {device_number} deactivate",
# where {device_number} is from the output of "lctl dl" on the MDS.
# To guard against corruption, the file is chksum'd
# before and after the operation.
#
CKSUM=${CKSUM:-md5sum}

usage() {
    echo "usage: $0 [-O <OST>] <directory>" 1>&2
    echo "       -O can be specified multiple times" 1>&2
    exit 1
}

while getopts "O:" opt "$@"; do
    case $opt in
        O) OST_PARAM="$OST_PARAM -O $OPTARG";;
        \?) usage;;
    esac
done
shift $((OPTIND - 1))

MVDIR=$1
if [ $# -ne 1 -o ! -d "$MVDIR" ]; then
    usage
fi

# $OST_PARAM is deliberately unquoted so that it expands to separate arguments.
lfs find -type f $OST_PARAM "$MVDIR" | while read OLDNAME; do
    echo -n "$OLDNAME: "
    if [ ! -w "$OLDNAME" ]; then
        echo "No write permission, skipping"
        continue
    fi

    OLDCHK=$($CKSUM "$OLDNAME" | awk '{print $1}')
    if [ -z "$OLDCHK" ]; then
        echo "checksum error - exiting" 1>&2
        exit 1
    fi

    NEWNAME=$(mktemp "$OLDNAME.tmp.XXXXXX")
    if [ $? -ne 0 -o -z "$NEWNAME" ]; then
        echo "unable to create temp file - exiting" 1>&2
        exit 2
    fi

    cp -a "$OLDNAME" "$NEWNAME"
    if [ $? -ne 0 ]; then
        echo "copy error - exiting" 1>&2
        rm -f "$NEWNAME"
        exit 4
    fi

    NEWCHK=$($CKSUM "$NEWNAME" | awk '{print $1}')
    if [ -z "$NEWCHK" ]; then
        echo "'$NEWNAME' checksum error - exiting" 1>&2
        exit 6
    fi

    if [ "$OLDCHK" != "$NEWCHK" ]; then
        echo "'$NEWNAME' bad checksum - '$OLDNAME' not moved, exiting" 1>&2
        rm -f "$NEWNAME"
        exit 8
    else
        mv "$NEWNAME" "$OLDNAME"
        if [ $? -ne 0 ]; then
            echo "rename error - exiting" 1>&2
            rm -f "$NEWNAME"
            exit 12
        fi
    fi
    echo "done"
done

27.3 Adding Multiple SCSI LUNs on Single HBA

The configuration of the kernels packaged by the Lustre group is similar to that of the upstream RedHat and SuSE packages. Currently, RHEL does not enable CONFIG_SCSI_MULTI_LUN because it can cause problems with SCSI hardware.

To enable this, set the scsi_mod max_scsi_luns=xx option (typically, xx is 128) in either modprobe.conf (2.6 kernel) or modules.conf (2.4 kernel).

To pass this option as a kernel boot argument (in grub.conf or lilo.conf), compile the kernel with CONFIG_SCSI_MULTI_LUN=y.

27.4 Failures Running a Client and OST on the Same Machine

There are inherent problems if a client and OST share the same machine (and the same memory pool).
An effort by the client to relieve memory pressure requires memory to be available to the OST. If the client is experiencing memory pressure, then the OST is as well, because both draw on one memory pool. The OST may not get the memory it needs to help the client free memory, and this results in deadlock.

Running a client and an OST on the same machine can cause these failures:

¦ If the client holds dirty file system pages in memory and is under memory pressure, a kernel thread flushes the dirty pages to the file system, which writes to the local OST. To complete the write, the OST needs to allocate memory. The allocation then blocks while waiting for the same kernel thread to complete the write and free up some memory. This is a deadlock condition.
¦ If the node with both a client and OST crashes, then the OST waits for the mounted client on that node to recover. However, since the client is now in a crashed state, the OST considers it to be a new client and blocks it from mounting until the recovery completes.

As a result, running an OST and a client on the same machine can cause a double failure and prevent a complete recovery.

27.5 Improving Lustre Metadata Performance While Using Large Directories

To improve metadata performance while using large directories, follow these tips:

¦ Increase RAM on the MDS – On the MDS, more memory translates into bigger caches, thereby increasing the metadata performance.
¦ Patch the core kernel on the MDS with the 3G/1G patch (if not running a 64-bit kernel), which increases the available kernel address space.
This translates into support for bigger caches on the MDS.PART V Reference This part includes reference information on Lustre user utilities, configuration files and module parameters, programming interfaces, system configuration utilities, and system limits.28-1 C H A P T E R 28 User Utilities (man1) This chapter describes user utilities and includes the following sections: ¦ lfs ¦ lfs_migrate ¦ lfsck ¦ Filefrag ¦ Mount ¦ Handling Timeouts28-2 Lustre 1.8 Operations Manual • December 2010 28.1 lfs The lfs utility can be used for user configuration routines and monitoring. With lfs you can create a new file with a specific striping pattern, determine the striping pattern of existing files, and gather the extended attributes (object numbers and location) of a specific file. Synopsis lfs lfs check lfs df [-i] [-h] [path] lfs find [[!] --atime|-A [-+]N] [[!] --mtime|-M [-+]N] [[!] --ctime|-C [-+]N] [--maxdepth|-D N] [--name|-n ] [--print|-p] [--print0|-P] [[!] --obd|-O ] [[!] --size|-S [+-]N[kMGTPE]] --type |-t {bcdflpsD}] [[!] --gid|-g|--group|-G |] [[!] --uid|-u|--user|-U |] lfs osts [path] lfs getstripe [--obd|-O ] [--quiet|-q] [--verbose|-v] [--count|-c] [--index|-i | --offset|-o] [--size|-s] [--pool|-p] [--directory|-d] [--recursive|-r] lfs setstripe [--size|-s stripe_size] [--count|-c stripe_cnt] [--index|-i [--offset|-o start_ost_index] [--pool|-p ] lfs setstripe -d lfs poollist ] | lfs quota [-q] [-v] [-o obd_uuid|-I ost_idx|-i mdt_idx] [-u|-g |uid|gname|gid>] lfs quota -t <-u|-g> lfs quotacheck [-ugf] lfs quotachown [-i] lfs quotaon [-ugf] lfs quotaoff [-ug] lfs quotainv [-ug] [-f] Chapter 28 User Utilities (man1) 28-3 lfs setquota <-u|--user|-g|--group> [--block-softlimit

] [--block-hardlimit

] [--inode-softlimit

] [--inode-hardlimit

]

lfs setquota <-u|--user|-g|--group> [-b

] [-B

] [-i

] [-I

]

lfs setquota -t <-u|-g> [--block-grace

] [--inode-grace

]

lfs setquota -t <-u|-g> [-b

] [-i

]

lfs help Note – In the above example, the parameter refers to the mount point of the Lustre file system. The default mount point is /mnt/lustre. Note – The old lfs quota output was very detailed and contained cluster-wide quota statistics (including cluster-wide limits for a user/group and cluster-wide usage for a user/group), as well as statistics for each MDS/OST. Now, lfs quota has been updated to provide only cluster-wide statistics, by default. To obtain the full report of cluster-wide limits, usage and statistics, use the -v option with lfs quota. Description The lfs utility is used to create a new file with a specific striping pattern, determine the default striping pattern, gather the extended attributes (object numbers and location) for a specific file, find files with specific attributes, list OST information or set quota limits. It can be invoked interactively without any arguments or in a non-interactive mode with one of the supported arguments.28-4 Lustre 1.8 Operations Manual • December 2010 Options The various lfs options are listed and described below. For a complete list of available options, type help at the lfs prompt. Option Description check Displays the status of the MDS or OSTs (as specified in the command) or all servers (MDS and OSTs). df [-i] [-h] [--pool|-p [.] [path] Reports file system disk space usage or inode usage (with -i) of each MDT/OST or a subset of OSTs if a pool is specified with -p. By default, prints the usage of all mounted Lustre file systems. Otherwise, if the path is specified, prints only the usage of that file system. If -h is given, the output is printed in human-readable format, using SI base-2 suffixes for Mega-, Giga-, Tera-, Peta- or Exabyte. find Searches the directory tree rooted at the given directory/filename for files that match the given parameters. The --print and --print0 options print the full filename, followed by a new line or NUL character correspondingly. Using ! 
before an option negates its meaning (files NOT matching the parameter). Using + before a numeric value means files with the parameter OR MORE. Using - before a numeric value means files with the parameter OR LESS. --atime File was last accessed N*24 hours ago. (There is no guarantee that atime is kept coherent across the cluster.) OSTs store a transient atime that is updated when clients do read requests. Permanent atime is written to the MDS when the file is closed. On-disk atime is only updated if it is more than 60 seconds old (/proc/fs/lustre/mds/*/max_atime_diff). Lustre considers the latest atime from all OSTs. If a setattr is set by user, then it is updated on both the MDS and OST, allowing atime to go backward. --ctime File status was last changed N*24 hours ago. --mtime File status was last modified N*24 hours ago. --obdChapter 28 User Utilities (man1) 28-5 File has an object on a specific OST(s). --size File has a size in bytes or kilo-, Mega-, Giga-, Tera-, Peta- or Exabytes if a suffix is given. --type File has a type (block, character, directory, pipe, file, symlink, socket or Door [for Solaris]). --uid File has a specific numeric user ID. --user File is owned by a specific user (numeric user ID is allowed). --gid File has a specific group ID. --group File belongs to a specific group (numeric group ID allowed). --maxdepth Limits find to descend at most N levels of the directory tree. osts Lists all OSTs for all mounted file systems. getstripe Lists the striping information for a given filename or directory. By default, the stripe count, stripe size and offset are returned. If you only want specific striping information, then the options of --count,--size,--index or --offset, plus various combinations of these options can be used to retrieve specific information. --obd Lists only files that have an object on a specific OST. --quiet Lists only information about a file’s object ID. --verbose Prints additional striping information. 
--count Lists the stripe count (how many OSTs to use). Option Description28-6 Lustre 1.8 Operations Manual • December 2010 --size Lists the stripe size (how much data to write to one OST before moving to the next OST). --index Lists the index for each OST in the file system. --offset Lists the OST index on which file striping starts. --pool Lists the pools to which a file belongs. --directory Lists entries about a specific directory instead of its contents (in the same manner as ls -d). --recursive Recurses into all subdirectories. setstripe Creates a new file or sets the directory default with specific striping parameters. † --size stripe_size * Number of bytes to store on an OST before moving to the next OST. A stripe size of 0 uses the file system’s default stripe size, 1MB. Can be specified with k (KB), m (MB), or g (GB), respectively. --count stripe_cnt Number of OSTs over which to stripe a file. A stripe count of 0 uses the file system-wide default stripe count (1). A stripe count of -1 stripes over all available OSTs. --offset start_ost The OST index (base 10, starting at 0) on which to start striping for this file. A start_ost of -1 allows the MDS to choose the starting index. This is the default, and it means that the MDS selects the starting OST as it wants. It has no relevance on whether the MDS will use round-robin or QoS weighted allocation for the remaining stripes in the file. We strongly recommend selecting this default value, as it allows space and load balancing to be done by the MDS as needed. Option DescriptionChapter 28 User Utilities (man1) 28-7 --pool Name of the pre-defined pool of OSTs (see lctl) that will be used for striping. The stripe_cnt, stripe_size and start_ost_index values are used as well. The start-ost value must be part of the pool or an error is returned. setstripe -d Deletes default striping on the specified directory. poollist [.] | Lists pools in the file system or pathname or OSTs in the filesystem.pool. 
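The way the setstripe stripe size and stripe count interact can be illustrated with a short sketch. It assumes the round-robin layout implied by the descriptions above (stripe_size bytes are written to one OST object before moving to the next) and only computes which of the file's stripes a byte offset falls in. Which physical OST backs each stripe is chosen by the MDS and is not modeled here; the function name locate_offset is hypothetical.

```python
def locate_offset(offset, stripe_size, stripe_count):
    """Return (stripe_index, offset_within_object) for a file byte offset.

    Assumes a simple round-robin layout: chunk 0 goes to stripe 0,
    chunk 1 to stripe 1, and so on, wrapping after stripe_count chunks.
    """
    if stripe_size <= 0 or stripe_count <= 0:
        raise ValueError("stripe_size and stripe_count must be positive")
    chunk = offset // stripe_size             # which stripe-sized chunk of the file
    stripe_index = chunk % stripe_count       # which of the file's objects holds it
    full_rounds = chunk // stripe_count       # complete passes over all stripes so far
    object_offset = full_rounds * stripe_size + offset % stripe_size
    return stripe_index, object_offset


if __name__ == "__main__":
    KB = 1024
    # Layout comparable to: lfs setstripe -s 128k -c 2 <file>
    for off in (0, 128 * KB, 256 * KB, 300 * KB):
        print(off, "->", locate_offset(off, 128 * KB, 2))
```

In this model, with a 128 KB stripe size and a stripe count of 2, the first 128 KB of the file land in stripe 0, the next 128 KB in stripe 1, and the third 128 KB wrap back to stripe 0.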
quota [-q] [-v] [-o obd_uuid|-i mdt_idx|-I ost_idx] [-u|-g ] Displays disk usage and limits, either for the full file system or for objects on a specific OBD. A user or group name or an ID can be specified. If both user and group are omitted, quotas for the current UID/GID are shown. The -q option disables printing of additional descriptions (including column titles). It also fills in blank spaces in the ''grace'' column with zeros (when there is no grace period set), to ensure that the number of columns is consistent. The -v option provides more verbose (with per-OBD statistics) output. quota -t <-u|-g> Displays block and inode grace times for user (-u) or group (-g) quotas. quotacheck [-ugf] Scans the specified file system for disk usage, and creates or updates quota files. Options specify quota for users (-u), groups (-g), and force (-f). quotachown [-i] Changes the file’s owner and group on OSTs of the specified file system. quotaon [-ugf] Turns on file system quotas. Options specify quota for users (-u), groups (-g), and force (-f). quotaoff [-ugf] Turns off file system quotas. Options specify quota for users (-u), groups (-g), and force (-f). Option Description28-8 Lustre 1.8 Operations Manual • December 2010 quotainv [-ug] [-f] Clears quota files (administrative quota files if used without -f, operational quota files otherwise), all of their quota entries for users (-u) or groups (-g). After running quotainv, you must run quotacheck before using quotas. CAUTION: Use extreme caution when using this command; its results cannot be undone. setquota <-u|-g> ||| [--block-softlimit

] [--block-hardlimit

] [--inode-softlimit

] [--inode-hardlimit

]

Sets file system quotas for users or groups. Limits can be specified with --{block|inode}-{softlimit|hardlimit} or their short equivalents -b, -B, -i, -I. Users can set 1, 2, 3 or 4 limits. ‡ Also, limits can be specified with special suffixes, -b, -k, -m, -g, -t, and -p to indicate units of 1, 2^10, 2^20, 2^30, 2^40 and 2^50, respectively. By default, the block limits unit is 1 kilobyte (1,024), and block limits are always kilobyte-grained (even if speci?ed in bytes). See Examples. setquota -t <-u|-g> [--block-grace

] [--inode-grace

]

Sets file system quota grace times for users or groups. Grace time is specified in “XXwXXdXXhXXmXXs” format or as an integer seconds value. See Examples. help Provides brief help on various lfs arguments. exit/quit Quits the interactive lfs session. * The default stripe-size is 0. The default stripe-start is -1. Do NOT confuse them! If you set stripe-start to 0, all new file creations occur on OST 0 (seldom a good idea). † The file cannot exist prior to using setstripe. A directory must exist prior to using setstripe. ‡ The old setquota interface is supported, but it may be removed in a future Lustre release. Option DescriptionChapter 28 User Utilities (man1) 28-9 Examples $ lfs setstripe -s 128k -c 2 /mnt/lustre/file1 Creates a file striped on two OSTs with 128 KB on each stripe. $ lfs setstripe -d /mnt/lustre/dir Deletes a default stripe pattern on a given directory. New files use the default striping pattern. $ lfs getstripe -v /mnt/lustre/file1 Lists the detailed object allocation of a given file. $ lfs setstripe --pool my_pool -c 2 /mnt/lustre/file Creates a file striped on two OSTs from the pool my_pool $ lfs poollist /mnt/lustre/ Lists the pools defined for the mounted Lustre file system /mnt/lustre $ lfs poollist my_fs.my_pool Lists the OSTs which are members of the pool my_pool in file system my_fs $ lfs getstripe -v /mnt/lustre/file1 Lists the detailed object allocation of a given file. $ lfs find /mnt/lustre Efficiently lists all files in a given directory and its subdirectories. $ lfs find /mnt/lustre -mtime +30 -type f -print Recursively lists all regular files in a given directory more than 30 days old.28-10 Lustre 1.8 Operations Manual • December 2010 $ lfs find --obd OST2-UUID /mnt/lustre/ Recursively lists all files in a given directory that have objects on OST2-UUID. The lfs check servers command checks the status of all servers (MDT and OSTs). $ lfs find /mnt/lustre --pool poolA Finds all directories/files associated with poolA. 
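The grace-time format accepted by setquota -t (either an integer number of seconds or a string in “XXwXXdXXhXXmXXs” form) can be sketched as a small converter. This is an illustrative parser, not the one lfs uses: the function name is hypothetical, and the requirement that units appear in w, d, h, m, s order is an assumption based on the format string.

```python
import re

# Seconds per unit; "w" is taken to mean a 7-day week.
_UNIT_SECONDS = {"w": 7 * 24 * 3600, "d": 24 * 3600, "h": 3600, "m": 60, "s": 1}

# Assumption: units are optional but must appear in w, d, h, m, s order.
_GRACE_RE = re.compile(r"(?:(\d+)w)?(?:(\d+)d)?(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?")


def grace_to_seconds(text):
    """Convert a grace-time string such as '1w4d' or '1000' to seconds."""
    if not text:
        raise ValueError("empty grace time")
    if text.isdigit():                      # plain integer: already in seconds
        return int(text)
    match = _GRACE_RE.fullmatch(text)
    if match is None:
        raise ValueError("unrecognized grace time: %r" % text)
    total = 0
    for unit, value in zip("wdhms", match.groups()):
        if value is not None:
            total += int(value) * _UNIT_SECONDS[unit]
    return total


if __name__ == "__main__":
    print(grace_to_seconds("1000"))   # plain seconds
    print(grace_to_seconds("1w4d"))   # 1 week and 4 days
```

Under these assumptions, 1w4d works out to 7 × 86400 + 4 × 86400 = 950400 seconds.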
$ lfs find /mnt//lustre --pool "" Finds all directories/files not associated with a pool. $ lfs find /mnt/lustre ! --pool "" Finds all directories/files associated with pool. $ lfs check servers Checks the status of all servers (MDT, OST) $ lfs osts Lists all OSTs in the file system. $ lfs df -h Lists space usage per OST and MDT in human-readable format. $ lfs df -i Lists inode usage per OST and MDT. $ lfs df --pool [.] | List space or inode usage for a specific OST pool. $ lfs quotachown -i /mnt/lustre Changes file owner and group.Chapter 28 User Utilities (man1) 28-11 $ lfs quotacheck -ug /mnt/lustre Checks quotas for user and group. Turns on quotas after making the check. $ lfs quotaon -ug /mnt/lustre Turns on quotas of user and group. $ lfs quotaoff -ug /mnt/lustre Turns off quotas of user and group. $ lfs setquota -u bob --block-softlimit 2000000 --block-hardlimit 1000000 /mnt/lustre Sets quotas of user ‘bob’, with a 1 GB block quota hardlimit and a 2 GB block quota softlimit. $ lfs setquota -t -u --block-grace 1000 --inode-grace 1w4d /mnt/lustre Sets grace times for user quotas: 1000 seconds for block quotas, 1 week and 4 days for inode quotas. $ lfs quota -u bob /mnt/lustre List quotas of user ‘bob’. $ lfs quota -t -u /mnt/lustre Show grace times for user quotas on /mnt/lustre. $ lfs setstripe --pool my_pool /mnt/lustre/dir Associates a directory with the pool my_pool, so all new files and directories are created in the pool. $ lfs find /mnt/lustre --pool poolA Finds all directories/files associated with poolA.28-12 Lustre 1.8 Operations Manual • December 2010 $ lfs find /mnt//lustre --pool "" Finds all directories/files not associated with a pool. $ lfs find /mnt/lustre ! --pool "" Finds all directories/files associated with pool.Chapter 28 User Utilities (man1) 28-13 28.2 lfs_migrate The lfs_migrate utility is a simple tool to migrate files between Lustre OSTs. Synopsis lfs_migrate [-c|-s] [-h] [-l] [-n] [-y] [file|directory ...] 
Description

The lfs_migrate utility is a simple tool to assist migration of files between Lustre OSTs. It copies each specified file to a new file, verifies that the file contents have not changed, and then renames the new file back to the original filename. This allows balancing space usage between OSTs, and moving files off OSTs that are starting to show hardware problems (though are still functional) or that will be discontinued.

Because lfs_migrate is not closely integrated with the MDS, it cannot determine whether a file is currently open and/or in use by other applications or nodes. That makes it UNSAFE for use on files that might be modified by other applications, since the migrated file is only a copy of the current file. The old file becomes an open-unlinked file, and any modifications to that file are lost.

Files to be migrated can be specified as command-line arguments. If a directory is specified on the command line, then all files within that directory are migrated. If no files are specified on the command line, then a list of files is read from the standard input, making lfs_migrate suitable for use with lfs(1) find to locate files on specific OSTs and/or matching other file attributes.

The current file allocation policies on the MDS dictate where the new files are placed, taking into account whether specific OSTs have been disabled on the MDS via lctl(8) (preventing new files from being allocated there), whether some OSTs are overly full (reducing the number of files placed on those OSTs), or if there is a specific default file striping for the target directory (potentially changing the stripe count, stripe size, OST pool, or OST index of a new file).

Options

Options supporting lfs_migrate are described below.

Option  Description
-c  Compares file data after migrate (default value, use -s to disable).
-s  Skips file data comparison after migrate (use -c to enable).
-h  Displays help information.
-l  Migrates files with hard links (skipped by default). Files with multiple hard links are split into multiple separate files by lfs_migrate, so they are skipped by default to avoid breaking the hard links.
-n  Only prints the names of files to be migrated.
-q  Runs quietly (does not print filenames or status).
-y  Answers 'y' to usage warning without prompting (for scripts).

Examples

$ lfs_migrate /mnt/lustre/dir
Rebalances all files within /mnt/lustre/dir.

$ lfs find /test -obd test-OST004 -size +4G | lfs_migrate -y
Migrates files within the /test file system on OST004 that are larger than 4 GB in size.

Known Bugs

Hard links could be handled correctly in Lustre 2.0 by using lfs(1) fid2path. Eventually, this functionality will be integrated into lfs(1) itself and will integrate with the MDS layout locking to make it safe in the presence of opened files and ongoing file I/O.

Availability

lfs_migrate is part of the Lustre(7) file system package, and was added in the 1.8.4 release.

See Also

lfs

28.3 lfsck

The lfsck utility ensures that objects are not referenced by multiple MDS files, that there are no orphan objects on the OSTs (objects that do not have any file on the MDS which references them), and that all of the objects referenced by the MDS exist. Under normal circumstances, Lustre maintains such coherency by distributed logging mechanisms, but under exceptional circumstances that may fail (e.g. disk failure, file system corruption leading to e2fsck repair). To avoid lengthy downtime, you can also run lfsck once Lustre is already started.

The e2fsck utility is run on each of the local MDS and OST device file systems and verifies that the underlying ldiskfs is consistent. After e2fsck is run, lfsck does distributed coherency checking for the Lustre file system.
In most cases, e2fsck is sufficient to repair any file system issues and lfsck is not required.

Synopsis

lfsck [-c|--create] [-d|--delete] [-f|--force] [-h|--help] [-l|--lostfound] [-n|--nofix] [-v|--verbose] --mdsdb mds_database_file --ostdb ost1_database_file [ost2_database_file...] <filesystem>

Note – As shown, the <filesystem> parameter refers to the Lustre file system mount point. The default mount point is /mnt/lustre.

Note – For lfsck, database filenames must be provided as absolute pathnames. Relative paths do not work; the databases cannot be properly opened.

Description

The lfsck utility is used to check and repair the distributed coherency of a Lustre file system. If an MDS or an OST becomes corrupt, run a distributed check on the file system to determine what sort of problems exist. Use lfsck to correct any defects found.

For more information on using e2fsck and lfsck, including examples, see Recovering from Errors or Corruption on a Backing File System. For information on resolving orphaned objects, see Working with Orphaned Objects.

Options

Options supporting lfsck are described below.

Option  Description
-c  Creates (empty) missing OST objects referenced by MDS inodes.
-d  Deletes orphaned objects from the file system. Since objects on the OST are often only one of several stripes of a file, it can be difficult to compile multiple objects together in a single, usable file.
-h  Prints a brief help message.
-l  Puts orphaned objects into a lost+found directory in the root of the file system.
-n  Performs a read-only check; does not repair the file system.
-v  Verbose operation; specify the option multiple times for more verbosity.
--mdsdb mds_database_file  MDS database file created by running e2fsck --mdsdb mds_database_file on the MDS backing device. This is required.
--ostdb ost1_database_file [ost2_database_file...]  OST database files created by running e2fsck --ostdb ost_database_file on each of the OST backing devices. These are required unless an OST is unavailable, in which case all objects thereon are considered missing.

28.4 Filefrag

The e2fsprogs package contains the filefrag tool which reports the extent of file fragmentation.

Synopsis

filefrag [ -belsv ] [ files... ]

Description

The filefrag utility reports the extent of fragmentation in a given file. Initially, filefrag attempts to obtain extent information using the FIEMAP ioctl, which is efficient and fast. If FIEMAP is not supported, then filefrag uses FIBMAP.

Note – Lustre only supports the FIEMAP ioctl. The FIBMAP ioctl is not supported.

In default mode, filefrag returns the number of physically discontiguous extents in the file. (The default mode is faster than the verbose/extent mode.) In extent or verbose mode, each extent is printed with details. For Lustre, the extents are printed in device offset order, not logical offset order.

Options

The options and descriptions for the filefrag utility are listed below.

Option  Description
-b  Uses the 1024-byte blocksize for the output. By default, this blocksize is used by Lustre, since OSTs may use different block sizes.
-e  Uses the extent mode when printing the output.
-l  Displays extents in LUN offset order.
-s  Synchronizes the file before requesting the mapping.
-v  Uses the verbose mode when checking file fragmentation.

Examples

Lists default output.

$ filefrag /mnt/lustre/foo
/mnt/lustre/foo: 6 extents found

Lists verbose output in extent format.

$ filefrag -ve /mnt/lustre/foo
Checking /mnt/lustre/foo
Filesystem type is: bd00bd0
Filesystem cylinder groups is approximately 5
File size of /mnt/lustre/foo is 157286400 (153600 blocks)
ext:device_logical:start..end physical: start..end:length: device:flags:
0: 0.. 49151: 212992.. 262144: 49152: 0: remote
1: 49152.. 73727: 270336.. 294912: 24576: 0: remote
2: 73728.. 76799: 24576.. 27648: 3072: 0: remote
3: 0.. 57343: 196608.. 253952: 57344: 1: remote
4: 57344.. 65535: 139264.. 147456: 8192: 1: remote
5: 65536.. 76799: 163840.. 175104: 11264: 1: remote
/mnt/lustre/foo: 6 extents found

28.5 Mount

Lustre uses the standard mount(8) Linux command. When mounting a Lustre file system, mount(8) executes the /sbin/mount.lustre command to complete the mount. The mount command supports these Lustre-specific options:

Server options  Description
abort_recov  Aborts recovery when starting a target
nosvc  Starts only MGS/MGC servers
exclude  Starts with a dead OST

Client options  Description
flock  Enables/disables flock support
user_xattr/nouser_xattr  Enables/disables user-extended attributes
retry=  Number of times a client will retry to mount the file system

28.6 Handling Timeouts

Timeouts are the most common cause of hung applications. After a timeout involving an MDS or failover OST, applications attempting to access the disconnected resource wait until the connection is established.

When a client performs any remote operation, it gives the server a reasonable amount of time to respond. If a server does not reply, whether due to a down network, a hung server, or any other reason, a timeout occurs which requires a recovery.

If a timeout occurs, a message similar to this one appears on the console of the client and in /var/log/messages:

LustreError: 26597:(client.c:810:ptlrpc_expire_one_request()) @@@ timeout req@a2d45200 x5886/t0 o38->mds_svc_UUID@NID_mds_UUID:12 lens 168/64 ref 1 fl RPC:/0/0 rc 0

Chapter 29 Lustre Programming Interfaces (man2)

This chapter describes public programming interfaces to control various aspects of Lustre from userspace. These interfaces are generally not guaranteed to remain unchanged over time, although we will make an effort to notify the user community well in advance of major changes.
This chapter includes the following section:
• User/Group Cache Upcall

29.1 User/Group Cache Upcall

This section describes user and group upcall.

Note – For information on a universal UID/GID, see Environmental Requirements.

29.1.1 Name

Use /proc/fs/lustre/mds/mds-service/group_upcall to look up a given user’s group membership.

29.1.2 Description

The group upcall file contains the path to an executable that, when properly installed, is invoked to resolve a numeric UID to a group membership list. This utility should complete the mds_grp_downcall_data data structure (see Data structures) and write it to the /proc/fs/lustre/mds/mds-service/group_info pseudo-file.

For a sample upcall program, see lustre/utils/l_getgroups.c in the Lustre source distribution.

29.1.2.1 Primary and Secondary Groups

The mechanism for the primary/secondary group is as follows:
• The MDS issues an upcall (set per MDS) to map the numeric UID to the supplementary group(s).
• If there is no upcall, or if there is an upcall and it fails, supplementary groups will be added as supplied by the client (as they are now).
• The default upcall is /usr/sbin/l_getgroups, which uses the Lustre group-supplied upcall. It looks up the UID in /etc/passwd, and if it finds the UID, it looks for supplementary groups in /etc/group for that username. You are free to enhance l_getgroups to look at an external database for supplementary groups information.
• The default group upcall is set by mkfs.lustre. To set the upcall, use echo {path} > /proc/fs/lustre/mds/{mdsname}/group_upcall or tunefs.lustre --param.
• To avoid repeated upcalls, the MDS caches supplemental group information. Use /proc/fs/lustre/mds/{mdsname}/group_expire to set the cache time (default is 300 seconds). The kernel waits for the upcall to complete (at most, 5 seconds) and takes the "failure" behavior as described. Set the wait time in /proc/fs/lustre/mds/{mdsname}/group_acquire_expire.
Cached entries are flushed by writing to /proc/fs/lustre/mds/{mdsname}/group_flush.

29.1.3 Parameters

• Name of the MDS service
• Numeric UID

29.1.4 Data structures

    #include <lustre/lustre_user.h>

    #define MDS_GRP_DOWNCALL_MAGIC 0x6d6dd620

    struct mds_grp_downcall_data {
        __u32 mgd_magic;
        __u32 mgd_err;
        __u32 mgd_uid;
        __u32 mgd_gid;
        __u32 mgd_ngroups;
        __u32 mgd_groups[0];
    };

Chapter 30 Setting Lustre Properties (man3)

This chapter describes how to use llapi to set Lustre file properties.

30.1 Using llapi

Several llapi commands are available to set and query Lustre properties. These commands are described in the following sections:

llapi_file_create
llapi_file_get_stripe
llapi_file_open
llapi_quotactl
llapi_path2fid

30.1.1 llapi_file_create

Use llapi_file_create to set Lustre properties for a new file.

Synopsis

    #include <lustre/liblustreapi.h>
    #include <lustre/lustre_user.h>

    int llapi_file_create(char *name, long stripe_size, int stripe_offset,
                          int stripe_count, int stripe_pattern);

Description

The llapi_file_create() function sets a file descriptor’s Lustre striping information. The file descriptor is then accessed with open().

Option  Description
llapi_file_create()  If the file already exists, the function returns EEXIST. If the stripe parameters are invalid, the function returns EINVAL.
stripe_size  This value must be an even multiple of system page size, as shown by getpagesize(). The default Lustre stripe size is 4MB.
stripe_offset  Indicates the starting OST for this file.
stripe_count  Indicates the number of OSTs that this file will be striped across.
stripe_pattern  Indicates the RAID pattern.

Note – Currently, only RAID 0 is supported. To use the system defaults, set these values: stripe_size = 0, stripe_offset = -1, stripe_count = 0, stripe_pattern = 0

Examples

    char *tfile = TESTFILE;
    /* System default size is 4 MB. */
    int stripe_size = 65536;
    /* Start at the default OST. */
    int stripe_offset = -1;
    /* Set a single stripe for this example. */
    int stripe_count = 1;
    /* Currently, only RAID 0 is supported. */
    int stripe_pattern = 0;
    int rc, fd;

    rc = llapi_file_create(tfile, stripe_size, stripe_offset,
                           stripe_count, stripe_pattern);

The result code is a negated error number; you may get EINVAL or an ioctl error.

    if (rc) {
        fprintf(stderr, "llapi_file_create failed: %d (%s)\n",
                rc, strerror(-rc));
        return -1;
    }

llapi_file_create closes the file descriptor. You must re-open the descriptor. To do this, run:

    fd = open(tfile, O_CREAT | O_RDWR | O_LOV_DELAY_CREATE, 0644);
    if (fd < 0) {
        fprintf(stderr, "Can't open %s file: %s\n", tfile, strerror(errno));
        return -1;
    }

30.1.2 llapi_file_get_stripe

Use llapi_file_get_stripe to get striping information.

Synopsis

    int llapi_file_get_stripe(const char *path, struct lov_user_md *lum)

Description

The llapi_file_get_stripe function returns the striping information to the caller. If it returns a zero (0), the operation was successful; a negative number means there was a failure.

Option  Description
path  The path of the file.
lum  The returned striping information.
return  A value of zero (0) means the operation was successful. A negative value means there was a failure.
stripe_count  Indicates the number of OSTs that this file will be striped across.
stripe_pattern  Indicates the RAID pattern.

30.1.3 llapi_file_open

The llapi_file_open command opens or creates a file with the specified striping parameters.
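Both llapi_file_create and llapi_file_open take the same striping parameters, so a caller may want to check them before making either call. The following is a minimal sketch of such a check, based only on the rules stated in this chapter: stripe_size is 0 (system default) or a multiple of the page size, stripe_offset is -1 (default) or an OST index, and stripe_pattern is 0 (RAID 0). The function name stripe_params_ok is hypothetical and is not part of llapi, and "even multiple" is assumed here to mean an exact multiple. The sketch uses sysconf(_SC_PAGESIZE), which reports the same page size as getpagesize().

```c
#include <assert.h>
#include <unistd.h>

/* Hypothetical pre-flight check (not part of llapi).
 * Returns 1 if the parameters satisfy the rules quoted in this chapter,
 * 0 otherwise. It does not contact Lustre and cannot tell whether the
 * requested OST index actually exists. */
int stripe_params_ok(unsigned long stripe_size, int stripe_offset,
                     int stripe_pattern)
{
    long page = sysconf(_SC_PAGESIZE);

    assert(page > 0);

    /* 0 selects the system default; otherwise require a page multiple. */
    if (stripe_size != 0 && stripe_size % (unsigned long)page != 0)
        return 0;

    /* -1 selects the default starting OST; other negatives are rejected. */
    if (stripe_offset < -1)
        return 0;

    /* Only RAID 0 (pattern 0) is described as supported. */
    if (stripe_pattern != 0)
        return 0;

    return 1;
}
```

A caller would typically run this check first and only then pass the same values on to the llapi call, treating a 0 result as it would an EINVAL return.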
Synopsis

    int llapi_file_open(const char *name, int flags, int mode,
                        unsigned long stripe_size, int stripe_offset,
                        int stripe_count, int stripe_pattern)

Description

The llapi_file_open function opens or creates a file with the specified striping parameters. If it returns a zero (0), the operation was successful; a negative number means there was a failure.

Option  Description
name  The name of the file.
flags  The open flags.
mode  The open mode.
stripe_size  The stripe size of the file.
stripe_offset  The stripe offset (stripe_index) of the file.
stripe_count  The stripe count of the file.
stripe_pattern  The stripe pattern of the file.

30.1.4 llapi_quotactl

Use llapi_quotactl to manipulate disk quotas on a Lustre file system.

Synopsis

    #include <liblustre.h>
    #include <lustre/lustre_idl.h>
    #include <lustre/liblustreapi.h>
    #include <lustre/lustre_user.h>

    int llapi_quotactl(char *mnt, struct if_quotactl *qctl)

    struct if_quotactl {
        __u32 qc_cmd;
        __u32 qc_type;
        __u32 qc_id;
        __u32 qc_stat;
        struct obd_dqinfo qc_dqinfo;
        struct obd_dqblk qc_dqblk;
        char obd_type[16];
        struct obd_uuid obd_uuid;
    };

    struct obd_dqblk {
        __u64 dqb_bhardlimit;
        __u64 dqb_bsoftlimit;
        __u64 dqb_curspace;
        __u64 dqb_ihardlimit;
        __u64 dqb_isoftlimit;
        __u64 dqb_curinodes;
        __u64 dqb_btime;
        __u64 dqb_itime;
        __u32 dqb_valid;
        __u32 padding;
    };

    struct obd_dqinfo {
        __u64 dqi_bgrace;
        __u64 dqi_igrace;
        __u32 dqi_flags;
        __u32 dqi_valid;
    };

    struct obd_uuid {
        char uuid[40];
    };

Description

The llapi_quotactl() command manipulates disk quotas on a Lustre file system mount. qc_cmd indicates a command to be applied to UID qc_id or GID qc_id.

Option  Description
LUSTRE_Q_QUOTAON  Turns on quotas for a Lustre file system. qc_type is USRQUOTA, GRPQUOTA or UGQUOTA (both user and group quota). The quota files must exist. They are normally created with the llapi_quotacheck(3) call. This call is restricted to the super user privilege.
LUSTRE_Q_QUOTAOFF  Turns off quotas for a Lustre file system. qc_type is USRQUOTA, GRPQUOTA or UGQUOTA (both user and group quota). This call is restricted to the super user privilege.
LUSTRE_Q_GETQUOTA  Gets disk quota limits and current usage for user or group qc_id. qc_type is USRQUOTA or GRPQUOTA. UUID may be filled with an OBD UUID string to query quota information from a specific node. dqb_valid may be set nonzero to query information only from the MDS. If UUID is an empty string and dqb_valid is zero, then cluster-wide limits and usage are returned. On return, obd_dqblk contains the requested information (block limits unit is kilobyte). Quotas must be turned on before using this command.
LUSTRE_Q_SETQUOTA  Sets disk quota limits for user or group qc_id. qc_type is USRQUOTA or GRPQUOTA. dqb_valid must be set to QIF_ILIMITS, QIF_BLIMITS or QIF_LIMITS (both inode limits and block limits), depending on which limits are being updated. obd_dqblk must be filled with limit values (as set in dqb_valid; block limits unit is kilobyte). Quotas must be turned on before using this command.
LUSTRE_Q_GETINFO  Gets information about quotas. qc_type is either USRQUOTA or GRPQUOTA. On return, dqi_igrace is the inode grace time (in seconds), dqi_bgrace is the block grace time (in seconds), and dqi_flags is not used by the current Lustre version.
LUSTRE_Q_SETINFO  Sets quota information (like grace times). qc_type is either USRQUOTA or GRPQUOTA. dqi_igrace is the inode grace time (in seconds), dqi_bgrace is the block grace time (in seconds), and dqi_flags is not used by the current Lustre version and must be zeroed.

Return Values

llapi_quotactl() returns:
0 on success
-1 on failure and sets error number to indicate the error

llapi Errors

llapi errors are described below.

Errors  Description
EFAULT  qctl is invalid.
ENOSYS  Kernel or Lustre modules have not been compiled with the QUOTA option.
ENOMEM  Insufficient memory to complete operation.
ENOTTY  qc_cmd is invalid.
EBUSY  Cannot process during quotacheck.
ENOENT  UUID does not correspond to OBD or mnt does not exist.
EPERM  The call is privileged and the caller is not the super user.
ESRCH  No disk quota is found for the indicated user. Quotas have not been turned on for this file system.

30.1.5 llapi_path2fid

Use llapi_path2fid to get the FID from the pathname.

Synopsis

    #include <lustre/liblustreapi.h>
    #include <lustre/lustre_user.h>

    int llapi_path2fid(const char *path, unsigned long long *seq,
                       unsigned long *oid, unsigned long *ver)

Description

The llapi_path2fid function returns the FID (sequence : object ID : version) for the pathname.

Return Values

llapi_path2fid returns:
0 on success
non-zero value on failure

Chapter 31 Configuration Files and Module Parameters (man5)

This chapter describes configuration files and module parameters and includes the following sections:
• Introduction
• Module Options

31.1 Introduction

LNET network hardware and routing are now configured via module parameters. Parameters should be specified in the /etc/modprobe.conf file, for example:

    alias lustre llite
    options lnet networks=tcp0,elan0

The above option specifies that this node should use all the available TCP and Elan interfaces.

Module parameters are read when the module is first loaded. Type-specific LND modules (for instance, ksocklnd) are loaded automatically by the LNET module when LNET starts (typically upon modprobe ptlrpc).

Under Linux 2.6, LNET configuration parameters can be viewed under /sys/module/; generic and acceptor parameters under LNET, and LND-specific parameters under the name of the corresponding LND.

Under Linux 2.4, sysfs is not available, but the LND-specific parameters are accessible via equivalent paths under /proc.

Important: All old (pre v.1.4.6) Lustre configuration lines should be removed from the module configuration files and replaced with the following.
Make sure that CONFIG_KMOD is set in your linux.config so LNET can load the following modules it needs. The basic module files are:

modprobe.conf (for Linux 2.6)

    alias lustre llite
    options lnet networks=tcp0,elan0

modules.conf (for Linux 2.4)

    alias lustre llite
    options lnet networks=tcp0,elan0

For the following parameters, default option settings are shown in parentheses. Changes to parameters marked with a W affect running systems. (Unmarked parameters can only be set when LNET loads for the first time.) Changes to parameters marked with Wc only have effect when connections are established (existing connections are not affected by these changes).

31.2 Module Options

• With routed or other multi-network configurations, use ip2nets rather than networks, so all nodes can use the same configuration.
• For a routed network, use the same “routes” configuration everywhere. Nodes specified as routers automatically enable forwarding and any routes that are not relevant to a particular node are ignored. Keep a common configuration to guarantee that all nodes have consistent routing tables.
• A separate modprobe.conf.lnet included from modprobe.conf makes distributing the configuration much easier.
• If you set config_on_load=1, LNET starts at modprobe time rather than waiting for Lustre to start. This ensures routers start working at module load time.

    # lctl
    # lctl> net down

• Remember the lctl ping {nid} command; it is a handy way to check your LNET configuration.

31.2.1 LNET Options

This section describes LNET options.

31.2.1.1 Network Topology

Network topology module parameters determine which networks a node should join, whether it should route between these networks, and how it communicates with non-local networks.

Here is a list of various networks and the supported software stacks:

Note – Lustre ignores the loopback interface (lo0), but Lustre uses any IP addresses aliased to the loopback (by default).
When in doubt, explicitly specify networks.

Network  Software Stack
openib  OpenIB gen1/Mellanox Gold
iib  Silverstorm (Infinicon)
vib  Voltaire
o2ib  OpenIB gen2
cib  Cisco
mx  Myrinet MX
gm  Myrinet GM-2
elan  Quadrics QSNet

ip2nets ("") is a string that lists globally-available networks, each with a set of IP address ranges. LNET determines the locally-available networks from this list by matching the IP address ranges with the local IPs of a node. The purpose of this option is to be able to use the same modules.conf file across a variety of nodes on different networks. The string has the following syntax.

    <ip2nets>    :== <net-match> [ <comment> ] { <net-sep> <net-match> }
    <net-match>  :== [ <w> ] <net-spec> <w> <ip-range> { <w> <ip-range> } [ <w> ]
    <net-spec>   :== <network> [ "(" <iface-list> ")" ]
    <network>    :== <nettype> [ <number> ]
    <nettype>    :== "tcp" | "elan" | "openib" | ...
    <iface-list> :== <interface> [ "," <iface-list> ]
    <ip-range>   :== <r-expr> "." <r-expr> "." <r-expr> "." <r-expr>
    <r-expr>     :== <number> | "*" | "[" <r-list> "]"
    <r-list>     :== <range> [ "," <r-list> ]
    <range>      :== <number> [ "-" <number> [ "/" <number> ] ]
    <comment>    :== "#" { <non-net-sep-chars> }
    <net-sep>    :== ";" | "\n"
    <w>          :== <whitespace-chars> { <whitespace-chars> }
<net-spec> contains enough information to uniquely identify the network and load an appropriate LND. The LND determines the missing "address-within-network" part of the NID based on the interfaces it can use.

<iface-list> specifies which hardware interface the network can use. If omitted, all interfaces are used. LNDs that do not support the <iface-list> syntax cannot be configured to use particular interfaces and just use what is there. Only a single instance of these LNDs can exist on a node at any time, and <iface-list> must be omitted.

<net-match> entries are scanned in the order declared to see if one of the node's IP addresses matches one of the <ip-range> expressions. If there is a match, <net-spec> specifies the network to instantiate. Note that it is the first match for a particular network that counts. This can be used to simplify the match expression for the general case by placing it after the special cases. For example:

    ip2nets="tcp(eth1,eth2) 134.32.1.[4-10/2]; tcp(eth1) *.*.*.*"

Four nodes on the 134.32.1.* network have two interfaces (134.32.1.{4,6,8,10}), but all the rest have one.

    ip2nets="vib 192.168.0.*; tcp(eth2) 192.168.0.[1,7,4,12]"

This describes an IB cluster on 192.168.0.*. Four of these nodes also have IP interfaces; these four could be used as routers.

Note that match-all expressions (for instance, *.*.*.*) effectively mask all other <net-match> entries specified after them. They should be used with caution.

Here is a more complicated situation; the routes parameter is explained below. We have:
• Two TCP subnets
• One Elan subnet
• One machine set up as a router, with both TCP and Elan interfaces
• IP over Elan configured, but only IP will be used to label the nodes.

    options lnet ip2nets="tcp 198.129.135.* 192.128.88.98; \
        elan 198.128.88.98 198.129.135.3;" \
        routes="tcp 1022@elan # Elan NID of router; \
        elan 198.128.88.98@tcp # TCP NID of router"

31.2.1.2 networks ("tcp")

This is an alternative to "ip2nets" which can be used to specify the networks to be instantiated explicitly. The syntax is a simple comma-separated list of <net-spec>s (see above). The default is only used if neither "ip2nets" nor "networks" is specified.

31.2.1.3 routes ("")

This is a string that lists networks and the NIDs of routers that forward to them. It has the following syntax (<w> is one or more whitespace characters):

    <routes> :== <route> { ; <route> }
    <route>  :== <net> [ <w> <hopcount> ] <w> <nid> { <w> <nid> }

So a node on the network tcp1 that needs to go through a router to get to the Elan network would use:

    options lnet networks=tcp1 routes="elan 1 192.168.2.2@tcp1"

The hopcount is used to help choose the best path between multiply-routed configurations.

A simple but powerful expansion syntax is provided, both for target networks and router NIDs, as follows.

    <expansion>     :== "[" <entry> { "," <entry> } "]"
    <entry>         :== <numeric range> | <non-numeric item>
    <numeric range> :== <number> [ "-" <number> [ "/" <number> ] ]

The expansion is a list enclosed in square brackets. Numeric items in the list may be a single number, a contiguous range of numbers, or a strided range of numbers. For example, routes="elan 192.168.1.[22-24]@tcp" says that network elan0 is adjacent (hopcount defaults to 1) and is accessible via 3 routers on the tcp0 network (192.168.1.22@tcp, 192.168.1.23@tcp and 192.168.1.24@tcp).

routes="[tcp,vib] 2 [8-14/2]@elan" says that 2 networks (tcp0 and vib0) are accessible through 4 routers (8@elan, 10@elan, 12@elan and 14@elan). The hopcount of 2 means that traffic to both these networks will traverse 2 routers: first one of the routers specified in this entry, then one more.

Duplicate entries, entries that route to a local network, and entries that specify routers on a non-local network are ignored.

Equivalent entries are resolved in favor of the route with the shorter hopcount. The hopcount, if omitted, defaults to 1 (the remote network is adjacent).

It is an error to specify routes to the same destination with routers on different local networks.

If the target network string contains no expansions, then the hopcount defaults to 1 and may be omitted (that is, the remote network is adjacent). In practice, this is true for most multi-network configurations. It is an error to specify an inconsistent hop count for a given target network. This is why an explicit hopcount is required if the target network string specifies more than one network.

31.2.1.4 forwarding ("")

This is a string that can be set either to "enabled" or "disabled" for explicit control of whether this node should act as a router, forwarding communications between all local networks.

A standalone router can be started by simply starting LNET ("modprobe ptlrpc") with appropriate network topology options.
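The numeric range form used in the routes expansion syntax (and in ip2nets address ranges) can be illustrated with a small stand-alone sketch. It expands one item of the form N, N-M, or N-M/S into the list of numbers it denotes. The function name expand_range and its return convention are hypothetical; this is not LNET's own parser, which this manual does not show, and the sketch ignores non-numeric items and arithmetic overflow.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* Parses an unsigned decimal number at *p and advances *p past it.
 * Returns 1 on success, 0 if *p does not start with a digit. */
static int parse_num(const char **p, long *value)
{
    char *end;

    if (!isdigit((unsigned char)**p))
        return 0;
    *value = strtol(*p, &end, 10);
    *p = end;
    return 1;
}

/* Hypothetical helper: expands "N", "N-M", or "N-M/S" into out[].
 * Returns the number of values written, or -1 if the item is malformed,
 * the range is reversed, the stride is below 1, or out[] is too small. */
int expand_range(const char *item, long *out, int max_out)
{
    long lo, hi, stride = 1, v;
    int n = 0;
    const char *p = item;

    assert(item != NULL && out != NULL);

    if (!parse_num(&p, &lo))
        return -1;
    hi = lo;                      /* a single number is a range of one */
    if (*p == '-') {
        p++;
        if (!parse_num(&p, &hi))
            return -1;
        if (*p == '/') {          /* optional stride */
            p++;
            if (!parse_num(&p, &stride))
                return -1;
        }
    }
    if (*p != '\0' || hi < lo || stride < 1)
        return -1;

    for (v = lo; v <= hi; v += stride) {
        if (n == max_out)
            return -1;
        out[n++] = v;
    }
    return n;
}
```

With this sketch, "8-14/2" yields 8, 10, 12 and 14, which matches the four routers named in the [8-14/2]@elan example, and "22-24" yields 22, 23 and 24.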
Variable Description acceptor The acceptor is a TCP/IP service that some LNDs use to establish communications. If a local network requires it and it has not been disabled, the acceptor listens on a single port for connection requests that it redirects to the appropriate local network. The acceptor is part of the LNET module and configured by the following options: • secure - Accept connections only from reserved TCP ports (< 1023). • all - Accept connections from any TCP port. NOTE: this is required for liblustre clients to allow connections on non-privileged ports. • none - Do not run the acceptor. accept_port (988) Port number on which the acceptor should listen for connection requests. All nodes in a site configuration that require an acceptor must use the same port. accept_backlog (127) Maximum length that the queue of pending connections may grow to (see listen(2)). accept_timeout (5, W) Maximum time in seconds the acceptor is allowed to block while communicating with a peer. accept_proto_version Version of the acceptor protocol that should be used by outgoing connection requests. It defaults to the most recent acceptor protocol version, but it may be set to the previous version to allow the node to initiate connections with nodes that only understand that version of the acceptor protocol. The acceptor can, with some restrictions, handle either version (that is, it can accept connections from both 'old' and 'new' peers). For the current version of the acceptor protocol (version 1), the acceptor is compatible with old peers if it is only required by a single local network.31-8 Lustre 1.8 Operations Manual • December 2010 31.2.2 SOCKLND Kernel TCP/IP LND The SOCKLND kernel TCP/IP LND (socklnd) is connection-based and uses the acceptor to establish communications via sockets with its peers. It supports multiple instances and load balances dynamically over multiple interfaces. 
If no interfaces are specified by the ip2nets or networks module parameter, all non-loopback IP interfaces are used. The address-within-network is determined by the address of the first IP interface an instance of the socklnd encounters.

Consider a node on the "edge" of an InfiniBand network, with a low-bandwidth management Ethernet (eth0), IP over IB configured (ipoib0), and a pair of GigE NICs (eth1,eth2) providing off-cluster connectivity. This node should be configured with "networks=vib,tcp(eth1,eth2)" to ensure that the socklnd ignores the management Ethernet and IPoIB.

Variable  Description

timeout (50,W)
    Time (in seconds) that communications may be stalled before the LND completes them with failure.
nconnds (4)
    Sets the number of connection daemons.
min_reconnectms (1000,W)
    Minimum connection retry interval (in milliseconds). After a failed connection attempt, this is the time that must elapse before the first retry. As connection attempts fail, this time is doubled on each successive retry up to a maximum of 'max_reconnectms'.
max_reconnectms (6000,W)
    Maximum connection retry interval (in milliseconds).
eager_ack (0 on linux, 1 on darwin,W)
    Boolean that determines whether the socklnd should attempt to flush sends on message boundaries.
typed_conns (1,Wc)
    Boolean that determines whether the socklnd should use different sockets for different types of messages. When clear, all communication with a particular peer takes place on the same socket. Otherwise, separate sockets are used for bulk sends, bulk receives and everything else.
min_bulk (1024,W)
    Determines when a message is considered "bulk".
tx_buffer_size, rx_buffer_size (8388608,Wc)
    Socket buffer sizes. Setting this option to zero (0) allows the system to auto-tune buffer sizes. WARNING: Be very careful changing this value as improper sizing can harm performance.
nagle (0,Wc)
    Boolean that determines if nagle should be enabled.
    It should never be set in production systems.
keepalive_idle (30,Wc)
    Time (in seconds) that a socket can remain idle before a keepalive probe is sent. Setting this value to zero (0) disables keepalives.
keepalive_intvl (2,Wc)
    Time (in seconds) to repeat unanswered keepalive probes. Setting this value to zero (0) disables keepalives.
keepalive_count (10,Wc)
    Number of unanswered keepalive probes before pronouncing socket (hence peer) death.
enable_irq_affinity (0,Wc)
    Boolean that determines whether to enable IRQ affinity. The default is zero (0). When set, socklnd attempts to maximize performance by handling device interrupts and data movement for particular (hardware) interfaces on particular CPUs. This option is not available on all platforms. This option requires an SMP system and produces best performance with multiple NICs. Systems with multiple CPUs and a single NIC may see an increase in performance with this parameter disabled.
zc_min_frag (2048,W)
    Determines the minimum message fragment that should be considered for zero-copy sends. Increasing it above the platform's PAGE_SIZE disables all zero-copy sends. This option is not available on all platforms.

31.2.3 QSW LND

The QSW LND (qswlnd) is connection-less and, therefore, does not need the acceptor. It is limited to a single instance, which uses all Elan "rails" that are present and dynamically load balances over them. The address-within-network is the node's Elan ID. A specific interface cannot be selected in the "networks" module parameter.

Variable  Description

tx_maxcontig (1024)
    Integer that specifies the maximum message payload (in bytes) to copy into a pre-mapped transmit buffer.
mtxmsgs (8)
    Number of "normal" message descriptors for locally-initiated communications that may block for memory (callers block when this pool is exhausted).
nnblk_txmsg (512 with a 4K page size, 256 otherwise)
    Number of "reserved" message descriptors for communications that may not block for memory. This pool must be sized large enough so it is never exhausted.
nrxmsg_small (256)
    Number of "small" receive buffers to post (typically everything apart from bulk data).
ep_envelopes_small (2048)
    Number of message envelopes to reserve for the "small" receive buffer queue. This determines a breakpoint in the number of concurrent senders. Below this number, communication attempts are queued, but above this number, the pre-allocated envelope queue will fill, causing senders to back off and retry. This can have the unfortunate side effect of starving arbitrary senders, who continually find the envelope queue is full when they retry. This parameter should therefore be increased if envelope queue overflow is suspected.
nrxmsg_large (64)
    Number of "large" receive buffers to post (typically for routed bulk data).
ep_envelopes_large (256)
    Number of message envelopes to reserve for the "large" receive buffer queue. For more information on message envelopes, see the ep_envelopes_small option (above).
optimized_puts (32768,W)
    Smallest non-routed PUT that will be RDMA'd.
optimized_gets (1,W)
    Smallest non-routed GET that will be RDMA'd.

31.2.4 RapidArray LND

The RapidArray LND (ralnd) is connection-based and uses the acceptor to establish connections with its peers. It is limited to a single instance, which uses all (both) RapidArray devices present. It load balances over them using the XOR of the source and destination NIDs to determine which device to use for communication. The address-within-network is determined by the address of the single IP interface that may be specified by the "networks" module parameter. If this is omitted, then the first non-loopback IP interface that is up is used instead.

Variable  Description

n_connd (4)
    Sets the number of connection daemons.
min_reconnect_interval (1,W)
    Minimum connection retry interval (in seconds). After a failed connection attempt, this sets the time that must elapse before the first retry. As connection attempts fail, this time is doubled on each successive retry, up to a maximum of the max_reconnect_interval option.
max_reconnect_interval (60,W)
    Maximum connection retry interval (in seconds).
timeout (30,W)
    Time (in seconds) that communications may be stalled before the LND completes them with failure.
ntx (64)
    Number of "normal" message descriptors for locally-initiated communications that may block for memory (callers block when this pool is exhausted).
ntx_nblk (256)
    Number of "reserved" message descriptors for communications that may not block for memory. This pool must be sized large enough so it is never exhausted.
fma_cq_size (8192)
    Number of entries in the RapidArray FMA completion queue to allocate. It should be increased if the ralnd starts to issue warnings that the FMA CQ has overflowed. This is only a performance issue.
max_immediate (2048,W)
    Size (in bytes) of the smallest message that will be RDMA'd, rather than being included as immediate data in an FMA. All messages greater than 6912 bytes must be RDMA'd (FMA limit).

31.2.5 VIB LND

The VIB LND is connection-based, establishing reliable queue-pairs over InfiniBand with its peers. It does not use the acceptor. It is limited to a single instance, using a single HCA that can be specified via the "networks" module parameter. If this is omitted, it uses the first HCA in numerical order it can open. The address-within-network is determined by the IPoIB interface corresponding to the HCA used.

Variable  Description

service_number (0x11b9a2)
    Fixed IB service number on which the LND listens for incoming connection requests. NOTE: All instances of the viblnd on the same network must have the same setting for this parameter.
arp_retries (3,W)
    Number of times the LND will retry ARP while it establishes communications with a peer.
min_reconnect_interval (1,W)
    Minimum connection retry interval (in seconds). After a failed connection attempt, this sets the time that must elapse before the first retry. As connection attempts fail, this time is doubled on each successive retry, up to a maximum of the max_reconnect_interval option.
max_reconnect_interval (60,W)
    Maximum connection retry interval (in seconds).
timeout (50,W)
    Time (in seconds) that communications may be stalled before the LND completes them with failure.
ntx (32)
    Number of "normal" message descriptors for locally-initiated communications that may block for memory (callers block when this pool is exhausted).
ntx_nblk (256)
    Number of "reserved" message descriptors for communications that may not block for memory. This pool must be sized large enough so it is never exhausted.
concurrent_peers (1152)
    Maximum number of queue pairs and, therefore, the maximum number of peers that the instance of the LND may communicate with.
hca_basename ("InfiniHost")
    Used to construct HCA device names by appending the device number.
ipif_basename ("ipoib")
    Used to construct IPoIB interface names by appending the same device number as is used to generate the HCA device name.
local_ack_timeout (0x12,Wc)
    Low-level QP parameter. Only change it from the default value if so advised.
retry_cnt (7,Wc)
    Low-level QP parameter. Only change it from the default value if so advised.
rnr_cnt (6,Wc)
    Low-level QP parameter. Only change it from the default value if so advised.
rnr_nak_timer (0x10,Wc)
    Low-level QP parameter. Only change it from the default value if so advised.
fmr_remaps (1000)
    Controls how often FMR mappings may be reused before they must be unmapped.
    Only change it from the default value if so advised.
cksum (0,W)
    Boolean that determines if messages (NB not RDMAs) should be check-summed. This is a diagnostic feature that should not normally be enabled.

31.2.6 OpenIB LND

The OpenIB LND is connection-based and uses the acceptor to establish reliable queue-pairs over InfiniBand with its peers. It is limited to a single instance that uses only IB device '0'. The address-within-network is determined by the address of the single IP interface that may be specified by the "networks" module parameter. If this is omitted, the first non-loopback IP interface that is up is used instead.

Variable  Description

n_connd (4)
    Sets the number of connection daemons. The default value is 4.
min_reconnect_interval (1,W)
    Minimum connection retry interval (in seconds). After a failed connection attempt, this sets the time that must elapse before the first retry. As connection attempts fail, this time is doubled on each successive retry, up to a maximum of 'max_reconnect_interval'.
max_reconnect_interval (60,W)
    Maximum connection retry interval (in seconds).
timeout (50,W)
    Time (in seconds) that communications may be stalled before the LND completes them with failure.
ntx (64)
    Number of "normal" message descriptors for locally-initiated communications that may block for memory (callers block when this pool is exhausted).
ntx_nblk (256)
    Number of "reserved" message descriptors for communications that may not block for memory. This pool must be sized large enough so it is never exhausted.
concurrent_peers (1024)
    Maximum number of queue pairs and, therefore, the maximum number of peers that the instance of the LND may communicate with.
cksum (0,W)
    Boolean that determines whether messages (NB not RDMAs) should be check-summed.
    This is a diagnostic feature that should not normally be enabled.

31.2.7 Portals LND (Linux)

The Portals LND Linux (ptllnd) can be used as an interface layer to communicate with Sandia Portals networking devices. This version is intended to work on Cray XT3 Linux nodes that use Cray Portals as a network transport.

Message Buffers

When ptllnd starts up, it allocates and posts sufficient message buffers to allow all expected peers (set by concurrent_peers) to send one unsolicited message. The first message that a peer actually sends is a (so-called) "HELLO" message, used to negotiate how much additional buffering to set up (typically 8 messages). If 10000 peers actually exist, then enough buffers are posted for 80000 messages.

The maximum message size is set by the max_msg_size module parameter (default value is 512). This parameter sets the bulk transfer breakpoint. Below this breakpoint, payload data is sent in the message itself. Above this breakpoint, a buffer descriptor is sent and the receiver gets the actual payload.

The buffer size is set by the rxb_npages module parameter (default value is 1). The default conservatively avoids allocation problems due to kernel memory fragmentation. However, increasing this value to 2 is probably not risky.

The ptllnd also keeps an additional rxb_nspare buffers (default value is 8) posted to account for full buffers being handled.

Assuming a 4K page size with 10000 peers, each buffer holds 8 messages, so 1258 buffers can be expected to be posted at startup, increasing to a maximum of 10008 as peers actually connect. By doubling rxb_npages and halving max_msg_size, this number can be reduced by a factor of 4.

ME/MD Queue Length

The ptllnd uses a single portal set by the portal module parameter (default value of 9) for both message and bulk buffers.
Message buffers are always attached with PTL_INS_AFTER and match anything sent with "message" matchbits. Bulk buffers are always attached with PTL_INS_BEFORE and match only specific matchbits for that particular bulk transfer.

This scheme assumes that the majority of ME / MDs posted are for "message" buffers, and that the overhead of searching through the preceding "bulk" buffers is acceptable. Since the number of "bulk" buffers posted at any time is also dependent on the bulk transfer breakpoint set by max_msg_size, this seems like an issue worth measuring at scale.

TX Descriptors

The ptllnd has a pool of so-called "tx descriptors", which it uses not only for outgoing messages, but also to hold state for bulk transfers requested by incoming messages. This pool should scale with the total number of peers.

To enable the building of the Portals LND (ptllnd.ko) configure with this option:

    ./configure --with-portals=

Variable  Description

ntx (256)
    Total number of messaging descriptors.
concurrent_peers (1152)
    Maximum number of concurrent peers. Peers that attempt to connect beyond the maximum are not allowed.
peer_hash_table_size (101)
    Number of hash table slots for the peers. This number should scale with concurrent_peers. The size of the peer hash table is set by the module parameter peer_hash_table_size, which defaults to a value of 101. This number should be prime to ensure the peer hash table is populated evenly. It is advisable to increase this value to 1001 for ~10000 peers.
cksum (0)
    Set to non-zero to enable message (not RDMA) checksums for outgoing packets. Incoming packets are always check-summed if necessary, independent of this value.
timeout (50)
    Amount of time (in seconds) that a request can linger in a peer's active queue before the peer is considered dead.
portal (9)
    Portal ID to use for the ptllnd traffic.
rxb_npages (64 * #cpus)
    Number of pages in an RX buffer.

31.2.8 Portals LND (Catamount)

The Portals LND Catamount (ptllnd) can be used as an interface layer to communicate with Sandia Portals networking devices. This version is intended to work on the Cray XT3 Catamount nodes using Cray Portals as a network transport.

To enable the building of the Portals LND configure with this option:

    ./configure --with-portals=

The following PTLLND tunables are currently available:

credits (128)
    Maximum total number of concurrent sends that are outstanding to a single peer at a given time.
peercredits (8)
    Maximum number of concurrent sends that are outstanding to a single peer at a given time.
max_msg_size (512)
    Maximum immediate message size. This MUST be the same on all nodes in a cluster. A peer that connects with a different max_msg_size value will be rejected.

Variable  Description

PTLLND_DEBUG (boolean, dflt 0)
    Enables or disables debug features.
PTLLND_TX_HISTORY (int, dflt debug?1024:0)
    Sets the size of the history buffer.
PTLLND_ABORT_ON_PROTOCOL_MISMATCH (boolean, dflt 1)
    Calls abort action on connecting to a peer running a different version of the ptllnd protocol.
PTLLND_ABORT_ON_NAK (boolean, dflt 0)
    Calls abort action when a peer sends a NAK. (Example: When it has timed out this node.)
PTLLND_DUMP_ON_NAK (boolean, dflt debug?1:0)
    Dumps peer debug and the history on receiving a NAK.

The following environment variables can be set to configure the PTLLND's behavior.

PTLLND_WATCHDOG_INTERVAL (int, dflt 1)
    Sets intervals to check some peers for timed out communications while the application blocks for communications to complete.
PTLLND_TIMEOUT (int, dflt 50)
    The communications timeout (in seconds).
PTLLND_LONG_WAIT (int, dflt debug?5:PTLLND_TIMEOUT)
    The time (in seconds) after which the ptllnd prints a warning if it blocks for a longer time during connection establishment, cleanup after an error, or cleanup during shutdown.

Variable  Description

PTLLND_PORTAL (9)
    The portal ID (PID) to use for the ptllnd traffic.
PTLLND_PID (9)
    The virtual PID on which to contact servers.
PTLLND_PEERCREDITS (8)
    The maximum number of concurrent sends that are outstanding to a single peer at any given instant.
PTLLND_MAX_MESSAGE_SIZE (512)
    The maximum message size. This MUST be the same on all nodes in a cluster.
PTLLND_MAX_MSGS_PER_BUFFER (64)
    The number of messages in a receive buffer. Each receive buffer is allocated with a size of PTLLND_MAX_MSGS_PER_BUFFER times PTLLND_MAX_MESSAGE_SIZE.
PTLLND_MSG_SPARE (256)
    Additional receive buffers posted to portals.
PTLLND_PEER_HASH_SIZE (101)
    Number of hash table slots for the peers.
PTLLND_EQ_SIZE (1024)
    Size of the Portals event queue (that is, maximum number of events in the queue).

31.2.9 MX LND

MXLND supports a number of load-time parameters using Linux's module parameter system. The following variables are available:

Variable  Description

n_waitd
    Number of completion daemons.
max_peers
    Maximum number of peers that may connect.
cksum
    Enables small message (< 4 KB) checksums if set to a non-zero value.
ntx
    Number of total tx message descriptors.
credits
    Number of concurrent sends to a single peer.
board
    Index value of the Myrinet board (NIC).
ep_id
    MX endpoint ID.
polling
    Use zero (0) to block (wait). A value > 0 will poll that many times before blocking.
hosts
    IP-to-hostname resolution file.

Of the described variables, only hosts is required. It must be the absolute path to the MXLND hosts file. For example:

    options kmxlnd hosts=/etc/hosts.mxlnd

The file format for the hosts file is:

    IP HOST BOARD EP_ID

The values must be space and/or tab separated where:

    IP is a valid IPv4 address
    HOST is the name returned by `hostname` on that machine
    BOARD is the index of the Myricom NIC (0 for the first card, etc.)
    EP_ID is the MX endpoint ID

To obtain the optimal performance for your platform, you may want to vary the remaining options.

n_waitd (1) sets the number of threads that process completed MX requests (sends and receives).

max_peers (1024) tells MXLND the upper limit of machines that it will need to communicate with.
This affects how many receives it will pre-post and each receive will use one page of memory. Ideally, on clients, this value will be equal to the total number of Lustre servers (MDS and OSS). On servers, it needs to equal the total number of machines in the storage system.

cksum (0) turns on small message checksums. It can be used to aid in troubleshooting. MX also provides an optional checksumming feature which can check all messages (large and small). For details, see the MX README.

ntx (256) is the number of total sends in flight from this machine. In actuality, MXLND reserves half of them for connect messages, so make this value twice as large as you want for the total number of sends in flight.

credits (8) is the number of in-flight messages for a specific peer. This is part of the flow-control system in Lustre. Increasing this value may improve performance but it requires more memory because each message requires at least one page.

board (0) is the index of the Myricom NIC. Hosts can have multiple Myricom NICs and this identifies which one MXLND should use. This value must match the board value in your MXLND hosts file for this host.

ep_id (3) is the MX endpoint ID. Each process that uses MX is required to have at least one MX endpoint to access the MX library and NIC. The ID is a simple index starting at zero (0). This value must match the endpoint ID value in your MXLND hosts file for this host.

polling (0) determines whether this host will poll or block for MX request completions. A value of 0 blocks and any positive value will poll that many times before blocking.
Since polling increases CPU usage, we suggest that you set this to zero (0) on the client and experiment with different values for servers.

Chapter 32 System Configuration Utilities (man8)

This chapter describes the system configuration utilities and includes the following sections:

• mkfs.lustre
• tunefs.lustre
• lctl
• mount.lustre
• Additional System Configuration Utilities

32.1 mkfs.lustre

The mkfs.lustre utility formats a disk for a Lustre service.

Synopsis

mkfs.lustre

[options] device

where is one of the following:

Option  Description

--ost
    Object Storage Target (OST)
--mdt
    Metadata Storage Target (MDT)
--mgs
    Configuration Management Service (MGS), one per site. This service can be combined with one --mdt service by specifying both types.

Description

mkfs.lustre is used to format a disk device for use as part of a Lustre file system. After formatting, a disk can be mounted to start the Lustre service defined by this command.

When the file system is created, parameters can simply be added as a --param option to the mkfs.lustre command. See Setting Parameters with mkfs.lustre.

Option  Description

--backfstype=fstype
    Forces a particular format for the backing file system (such as ext3, ldiskfs).
--comment=comment
    Sets a user comment about this disk, ignored by Lustre.
--device-size=KB
    Sets the device size for loop and non-loop devices.
--dryrun
    Only prints what would be done; it does not affect the disk.
--failnode=nid,...
    Sets the NID(s) of a failover partner. This option can be repeated as needed.
--fsname=filesystem_name
    The Lustre file system of which this service/node will be a part. The default file system name is "lustre". NOTE: The file system name is limited to 8 characters.
--index=index
    Forces a particular OST or MDT index.
--mkfsoptions=opts
    Format options for the backing file system. For example, ext3 options could be set here.
--mountfsoptions=opts
    Sets permanent mount options. This is equivalent to the setting in /etc/fstab.
--mgsnode=nid,...
    Sets the NIDs of the MGS node, required for all targets other than the MGS.
--param key=value
    Sets the permanent parameter key to value. This option can be repeated as desired. Typical options might include:
    --param sys.timeout=40            System obd timeout.
    --param lov.stripesize=2M         Default stripe size.
    --param lov.stripecount=2         Default stripe count.
    --param failover.mode=failout     Returns errors instead of waiting for recovery.
--quiet
    Prints less information.
--reformat
    Reformats an existing Lustre disk.
--stripe-count-hint=stripes
    Used to optimize the MDT's inode size.
--verbose
    Prints more information.

Examples

Creates a combined MGS and MDT for file system testfs on node cfs21:

    mkfs.lustre --fsname=testfs --mdt --mgs /dev/sda1

Creates an OST for file system testfs on any node (using the above MGS):

    mkfs.lustre --fsname=testfs --ost --mgsnode=cfs21@tcp0 /dev/sdb

Creates a standalone MGS on, e.g., node cfs22:

    mkfs.lustre --mgs /dev/sda1

Creates an MDT for file system myfs1 on any node (using the above MGS):

    mkfs.lustre --fsname=myfs1 --mdt --mgsnode=cfs22@tcp0 /dev/sda2

32.2 tunefs.lustre

The tunefs.lustre utility modifies configuration information on a Lustre target disk.

Synopsis

tunefs.lustre [options] device

Description

tunefs.lustre is used to modify configuration information on a Lustre target disk. This includes upgrading old (pre-Lustre 1.8) disks. This does not reformat the disk or erase the target information, but modifying the configuration information can result in an unusable file system.

Caution – Changes made here affect a file system when the target is mounted the next time.

With tunefs.lustre, parameters are "additive": new parameters are specified in addition to old parameters; they do not replace them. To erase all old tunefs.lustre parameters and just use newly-specified parameters, run:

    $ tunefs.lustre --erase-params --param=

The tunefs.lustre command can be used to set any parameter that is settable in a /proc/fs/lustre file and that has its own OBD device, so it can be specified as ..=. For example:

    $ tunefs.lustre --param mdt.group_upcall=NONE /dev/sda1

Options

The tunefs.lustre options are listed and explained below.
Option  Description

--comment=comment
    Sets a user comment about this disk, ignored by Lustre.
--dryrun
    Only prints what would be done; does not affect the disk.
--erase-params
    Removes all previous parameter information.
--failnode=nid,...
    Sets the NID(s) of a failover partner. This option can be repeated as needed.
--fsname=filesystem_name
    The Lustre file system of which this service will be a part. The default file system name is "lustre".
--index=index
    Forces a particular OST or MDT index.
--mountfsoptions=opts
    Sets permanent mount options; equivalent to the setting in /etc/fstab.
--mgs
    Adds a configuration management service to this target.
--mgsnode=nid,...
    Sets the NID(s) of the MGS node; required for all targets other than the MGS.
--nomgs
    Removes a configuration management service from this target.
--quiet
    Prints less information.
--verbose
    Prints more information.
--writeconf
    Erases all configuration logs for the file system to which this MDT belongs, and regenerates them. This is very dangerous. All clients and servers should be stopped. All targets must then be restarted to regenerate the logs. No clients should be started until all targets have restarted. In general, this command should only be executed on the MDT, not the OSTs.

Examples

Changing the MGS's NID address. (This should be done on each target disk, since they should all contact the same MGS.)

    tunefs.lustre --erase-params --mgsnode= --writeconf /dev/sda

Adding a failover NID location for this target.

    tunefs.lustre --param="failover.node=192.168.0.13@tcp0" /dev/sda

32.3 lctl

The lctl utility is used for root control and configuration. With lctl you can directly control Lustre via an ioctl interface, allowing various configuration, maintenance and debugging features to be accessed.
Synopsis

lctl
lctl --device

Description

The lctl utility can be invoked in interactive mode by issuing the lctl command. After that, commands are issued as shown below. The most common lctl commands are:

    dl
    device
    network
    list_nids
    ping {nid}
    help
    quit

For a complete list of available commands, type help at the lctl prompt. To get basic help on command meaning and syntax, type help command.

For non-interactive use, use the second invocation, which runs the command after connecting to the device.

Setting Parameters with lctl

Lustre parameters are not always accessible using the procfs interface, as it is platform-specific. As a solution, lctl {get,set}_param has been introduced as a platform-independent interface to the Lustre tunables. Avoid direct references to /proc/{fs,sys}/{lustre,lnet}. For future portability, use lctl {get,set}_param.

When the file system is running, temporary parameters can be set using the lctl set_param command. These parameters map to items in /proc/{fs,sys}/{lnet,lustre}. The lctl set_param command uses this syntax:

    lctl set_param [-n] ..=

For example:

    $ lctl set_param ldlm.namespaces.*osc*.lru_size=$((NR_CPU*100))

Many permanent parameters can be set with the lctl conf_param command. In general, the lctl conf_param command can be used to specify any parameter settable in a /proc/fs/lustre file, with its own OBD device. The lctl conf_param command uses this syntax:

    ..=)

For example:

    $ lctl conf_param testfs-MDT0000.mdt.group_upcall=NONE
    $ lctl conf_param testfs.llite.max_read_ahead_mb=16

Caution – The lctl conf_param command permanently sets parameters in the file system configuration.

To get current Lustre parameter settings, use the lctl get_param command with this syntax:

    lctl get_param [-n] ..

For example:

    $ lctl get_param -n ost.*.ost_io.timeouts

Network Configuration

Option  Description

network |
    Starts or stops LNET.
    Or, select a network type for other lctl LNET commands.
list_nids
    Prints all NIDs on the local node. LNET must be running.
which_nid
    From a list of NIDs for a remote node, identifies the NID on which interface communication will occur.
ping {nid}
    Checks LNET connectivity via an LNET ping. This uses the fabric appropriate to the specified NID.
interface_list
    Prints the network interface information for a given network type.
peer_list
    Prints the known peers for a given network type.
conn_list
    Prints all the connected remote NIDs for a given network type.
active_tx
    This command prints active transmits. It is only used for the Elan network type.

Device Operations

Option  Description

lctl get_param [-n|-N|-F] [param_path ...]
    Gets the value of a Lustre or LNET parameter from the specified path.
    NOTE: Lustre tunables are not always accessible using the procfs interface, as it is platform-specific. As a solution, lctl {get,set,list}_param has been introduced as a platform-independent interface to Lustre tunables. Avoid direct references to /proc/{fs,sys}/{lustre,lnet}. For future portability, use lctl {get,set,list}_param.
    -n  Prints only the parameter value and not the parameter name.
    -N  Prints only matched parameter names and not the values; especially useful when using patterns.
    -F  When -N is specified, adds '/', '@' or '=' for directories, symlinks and writeable files, respectively.
lctl set_param [-n] [param_path ...]
    Sets the value of a Lustre or LNET parameter at the specified path. The NOTE under lctl get_param applies here as well.
    -n  Disables printing of the key name when printing values.
lctl list_param [-F|-R] [param_path ...]
    Lists the Lustre or LNET parameter name.
    -F  Adds '/', '@' or '=' for directories, symlinks and writeable files, respectively.
    -R  Recursively lists all parameters under the specified path. If param_path is null, all parameters are shown.
conf_param [-d] .=
    Sets a permanent configuration parameter for any device via the MGS. This command must be run on the MGS node.
activate
    Re-activates an import after the de-activate operation.
deactivate
    Running lctl deactivate on the MDS stops new objects from being allocated on the OST. Running lctl deactivate on Lustre clients causes them to return -EIO when accessing objects on the OST instead of waiting for recovery.
abort_recovery
    Aborts the recovery process on a re-starting MDT or OST device.

Virtual Block Device Operations

Lustre can emulate a virtual block device upon a regular file. This emulation is needed when you are trying to set up a swap space via the file.

Option  Description

blockdev_attach
    Attaches a regular Lustre file to a block device. If the device node does not exist, lctl creates it. We recommend that you let lctl create the device node, since the emulator uses a dynamic major number.
blockdev_detach
    Detaches the virtual block device.
blockdev_info
    Provides information on which Lustre file is attached to the device node.

Debug

Option  Description

debug_daemon
    Starts and stops the debug daemon, and controls the output filename and size.
debug_kernel [file] [raw]
    Dumps the kernel debug buffer to stdout or a file.
debug_file [output]
    Converts the kernel-dumped debug log from binary to plain text format.
clear
    Clears the kernel debug buffer.
mark
    Inserts marker text in the kernel debug buffer.

Options

Use the following options to invoke lctl.
Option    Description

--device
  Device to be used for the operation (specified by name or number). See device_list.

--ignore_errors | ignore_errors
  Ignores errors during script processing.

Examples

lctl

    $ lctl
    lctl > dl
    0 UP mgc MGC192.168.0.20@tcp bfbb24e3-7deb-2ffa-eab0-44dffe00f692 5
    1 UP ost OSS OSS_uuid 3
    2 UP obdfilter testfs-OST0000 testfs-OST0000_UUID 3
    lctl > dk /tmp/log
    Debug log: 87 lines, 87 kept, 0 dropped.
    lctl > quit
    $ lctl conf_param testfs-MDT0000 sys.timeout=40
    $ lctl conf_param testfs-MDT0000.lov.stripesize=2M
    $ lctl conf_param testfs-OST0000.osc.max_dirty_mb=29.15
    $ lctl conf_param testfs-OST0000.ost.client_cache_seconds=15

get_param

    $ lctl
    lctl > get_param obdfilter.lustre-OST0000.kbytesavail
    obdfilter.lustre-OST0000.kbytesavail=249364
    lctl > get_param -n obdfilter.lustre-OST0000.kbytesavail
    249364
    lctl > get_param timeout
    timeout=20
    lctl > get_param -n timeout
    20
    lctl > get_param obdfilter.*.kbytesavail
    obdfilter.lustre-OST0000.kbytesavail=249364
    obdfilter.lustre-OST0001.kbytesavail=249364
    lctl >
    # lctl get_param ost.*
    ost.OSS
    ost.num_refs
    # lctl get_param -n debug timeout
    super warning dlmtrace error emerg ha rpctrace vfstrace config console
    20
    # lctl get_param -N ost.* debug
    ost.OSS
    ost.num_refs
    debug

Note – lctl "get_param -NF" is the same as "list_param -F".

set_param

    $ lctl > set_param obdfilter.*.kbytesavail=0
    obdfilter.lustre-OST0000.kbytesavail=0
    obdfilter.lustre-OST0001.kbytesavail=0
    lctl > set_param -n obdfilter.*.kbytesavail=0
    $ lctl > set_param fail_loc=0
    fail_loc=0
    # lctl set_param fail_loc=0 timeout=20
    fail_loc=0
    timeout=20
    # lctl set_param -n fail_loc=0 timeout=20
    0
    20

list_param

    # lctl list_param ost.*
    ost.OSS
    ost.num_refs
    # lctl list_param -F ost.* debug
    ost.OSS/
    ost.num_refs
    debug=
    # lctl list_param -R mdt
    mdt
    mdt.lustre-MDT0000
    mdt.lustre-MDT0000.capa
    mdt.lustre-MDT0000.capa_count
    mdt.lustre-MDT0000.capa_key_timeout
    mdt.lustre-MDT0000.capa_timeout
    mdt.lustre-MDT0000.commit_on_sharing
    mdt.lustre-MDT0000.evict_client

Note – lctl list_param -R shows all parameters.

32.4 mount.lustre

The mount.lustre utility starts a Lustre client or target service.

Synopsis

    mount -t lustre [-o options] directory

Description

The mount.lustre utility starts a Lustre client or target service. This program should not be called directly; rather, it is a helper program invoked through mount(8), as shown above. Use the umount(8) command to stop Lustre clients and targets.

There are two forms for the device option, depending on whether a client or a target service is started:

Option    Description

:/
  Mounts the named Lustre file system on the given pathname by contacting the Management Service at the given MGS specification. The format of the MGS specification is defined below. A mounted file system appears in fstab(5) and is usable, like any local file system, providing a full POSIX-compliant interface.

(physical disk)
  Starts the target service defined by the mkfs.lustre command on the given physical disk. A mounted target service file system is only useful for df(1) operations and appears in fstab(5) to show the device is in use.

Option    Description

:=[:]
  The MGS specification may be a colon-separated list of nodes.

:=[,]
  Each node may be specified by a comma-separated list of NIDs.

Options

In addition to the standard mount options, Lustre understands the following client-specific options:

Option    Description

flock
  Enables flock support (slower, performance impact for use, coherent between nodes).

localflock
  Enables local flock support using only client-local flock (faster, for applications that require flock, but do not run on multiple nodes).

noflock
  Disables flock support entirely. Applications calling flock get an error.
  It is up to the administrator to choose either localflock (fastest, low impact, not coherent between nodes) or flock (slower, performance impact for use, coherent between nodes).

user_xattr
  Enables get/set of extended attributes by regular users.

nouser_xattr
  Disables use of extended attributes by regular users. Root and system processes can still use extended attributes.

acl
  Enables ACL support.

noacl
  Disables ACL support.

In addition to the standard mount options and backing disk type (e.g. ext3) options, Lustre understands the following server-specific options:

Option    Description

nosvc
  Starts only the MGC (and MGS, if co-located) for a target service, not the actual service.

nomgs
  Starts only the MDT (with a co-located MGS), without starting the MGS.

exclude=
  Starts a client or MDT with a colon-separated list of known inactive OSTs.

abort_recov
  Aborts client recovery and immediately starts the target service.

md_stripe_cache_size
  Sets the stripe cache size for server-side disk with a striped RAID configuration. The default value is 16 KB.

recovery_time_soft=
  Allows the given number of seconds for clients to reconnect for recovery after a server crash. This timeout is incrementally extended if it is about to expire and the server is still handling new connections from recoverable clients. The default soft recovery timeout is 300 seconds (5 minutes).

recovery_time_hard=
  The server is allowed to incrementally extend its timeout, up to a hard maximum of the given number of seconds. The default hard recovery timeout is 900 seconds (15 minutes).

Examples

Starts a client for the Lustre file system testfs at mount point /mnt/myfilesystem. The Management Service is running on a node reachable from this client via the cfs21@tcp0 NID.

    mount -t lustre cfs21@tcp0:/testfs /mnt/myfilesystem

Starts the Lustre target service on /dev/sda1.

    mount -t lustre /dev/sda1 /mnt/test/mdt

Starts the testfs-MDT0000 service (using the disk label), but aborts the recovery process.

    mount -t lustre -L testfs-MDT0000 -o abort_recov /mnt/test/mdt

Note – In Lustre 1.8.3 and earlier releases, if the Service Tags tool (from the sun-servicetag package) can be found in /opt/sun/servicetag/bin/stclient, an inventory service tag is created reflecting the Lustre service being provided. If this tool cannot be found, mount.lustre silently ignores it and no service tag is created. The stclient(1) tool only creates the local service tag. No information is sent to the asset management system until you run the Registration Client to collect the tags and then upload them to the inventory system. Service tags have been discontinued in Lustre 1.8.4 and later releases. For more information, see Service Tags.

32.5 Additional System Configuration Utilities

This section describes additional system configuration utilities that were added in Lustre 1.6.

32.5.1 lustre_rmmod.sh

The lustre_rmmod.sh utility removes all Lustre and LNET modules (assuming no Lustre services are running). It is located in /usr/bin.

Note – The lustre_rmmod.sh utility does not work if Lustre modules are being used or if you have manually run the lctl network up command.

32.5.2 e2scan

The e2scan utility is an inode scan program for ext2 file systems. It uses libext2fs to find inodes with ctime or mtime newer than a given time and prints out their pathnames. Use e2scan to efficiently generate lists of files that have been modified. The e2scan tool is included in the e2fsprogs package, located at:

http://downloads.lustre.org/public/tools/e2fsprogs/

Synopsis

    e2scan [options] [-f file] block_device

Description

When invoked, the e2scan utility iterates all inodes on the block device, finds modified inodes, and prints their inode numbers. A similar iterator, using libext2fs(5), builds a table (called parent database) which lists the parent node for each inode.
With a lookup function, you can reconstruct modified pathnames from root.

Options

Option    Description

-b inode buffer blocks
  Sets the readahead inode blocks to get excellent performance when scanning the block device.

-o output file
  If an output file is specified, modified pathnames are written to this file. Otherwise, modified pathnames are written to stdout.

-t inode | pathname
  Sets the e2scan type. If the type is inode, the e2scan utility prints modified inode numbers to stdout. By default, the type is pathname, and the e2scan utility lists modified pathnames based on modified inode numbers.

-u
  Rebuilds the parent database from scratch. Otherwise, the current parent database is used.

32.5.3 Utilities to Manage Large Clusters

The following utilities are located in /usr/bin.

lustre_config.sh
  The lustre_config.sh utility helps automate the formatting and setup of disks on multiple nodes. An entire installation is described in a comma-separated file and passed to this script, which then formats the drives, updates modprobe.conf and produces high-availability (HA) configuration files.

lustre_createcsv.sh
  The lustre_createcsv.sh utility generates a CSV file describing the currently-running installation.

lustre_up14.sh
  The lustre_up14.sh utility grabs client configuration files from old MDTs. When upgrading Lustre from 1.4.x to 1.6.x, if the MGS is not co-located with the MDT or the client name is non-standard, this utility is used to retrieve the old client log. For more information, see Upgrading and Downgrading Lustre.

32.5.4 Application Profiling Utilities

The following utilities are located in /usr/bin.

lustre_req_history.sh
  The lustre_req_history.sh utility (run from a client) assembles as much Lustre RPC request history as possible from the local node and from the servers that were contacted, providing a better picture of the coordinated network activity.
llstat.sh
  The llstat.sh utility (improved in Lustre 1.6) handles a wider range of /proc files, and has command line switches to produce more graphable output.

plot-llstat.sh
  The plot-llstat.sh utility plots the output from llstat.sh using gnuplot.

32.5.5 More /proc Statistics for Application Profiling

The following utilities provide additional statistics.

vfs_ops_stats
  The client vfs_ops_stats utility tracks Linux VFS operation calls into Lustre for a single PID, PPID, GID or everything.

    /proc/fs/lustre/llite/*/vfs_ops_stats
    /proc/fs/lustre/llite/*/vfs_track_[pid|ppid|gid]

extents_stats
  The client extents_stats utility shows the size distribution of I/O calls from the client (cumulative and by process).

    /proc/fs/lustre/llite/*/extents_stats, extents_stats_per_process

offset_stats
  The client offset_stats utility shows the read/write seek activity of a client by offsets and ranges.

    /proc/fs/lustre/llite/*/offset_stats

Lustre 1.6 included per-client and improved MDT statistics:

• Per-client statistics tracked on the servers
  Each MDT and OST now tracks LDLM and operations statistics for every connected client, for comparisons and simpler collection of distributed job statistics.

    /proc/fs/lustre/mds|obdfilter/*/exports/

• Improved MDT statistics
  More detailed MDT operations statistics are collected for better profiling.

    /proc/fs/lustre/mds/*/stats

32.5.6 Testing / Debugging Utilities

Lustre offers the following test and debugging utilities.

loadgen

The Load Generator (loadgen) is a test program designed to simulate large numbers of Lustre clients connecting and writing to an OST. The loadgen utility is located at lustre/utils/loadgen (in a build directory) or at /usr/sbin/loadgen (from an RPM).

Loadgen offers the ability to run this test:

1. Start an arbitrary number of (echo) clients.
2. Start and connect to an echo server, instead of a real OST.
3. Create/bulk_write/delete objects on any number of echo clients simultaneously.

Currently, the maximum number of clients is limited by MAX_OBD_DEVICES and the amount of memory available.

Usage

The loadgen utility can be run locally on the OST server machine or remotely from any LNET host. The device command can take an optional NID as a parameter; if unspecified, the first local NID found is used.

The obdecho module must be loaded by hand before running loadgen.

    # cd lustre/utils/
    # insmod ../obdecho/obdecho.ko
    # ./loadgen
    loadgen> h
    This is a test program used to simulate large numbers of clients. The echo obds are used, so the obdecho module must be loaded.
    Typical usage would be:
    loadgen> dev lustre-OST0000     set the target device
    loadgen> start 20               start 20 echo clients
    loadgen> wr 10 5                have 10 clients do simultaneous brw_write tests 5 times each
    Available commands are:
    device dl echosrv start verbose wait write help exit quit
    For more help type: help command-name
    loadgen>
    loadgen> device lustre-OST0000 192.168.0.21@tcp
    Added uuid OSS_UUID: 192.168.0.21@tcp
    Target OST name is 'lustre-OST0000'
    loadgen>
    loadgen> st 4
    start 0 to 4
    ./loadgen: running thread #1
    ./loadgen: running thread #2
    ./loadgen: running thread #3
    ./loadgen: running thread #4
    loadgen> wr 4 5
    Estimate 76 clients before we run out of grant space (155872K / 2097152)
    1: i0
    2: i0
    4: i0
    3: i0
    1: done (0)
    2: done (0)
    4: done (0)
    3: done (0)
    wrote 25MB in 1.419s (17.623 MB/s)
    loadgen>

The loadgen utility prints periodic status messages; message output can be controlled with the verbose command.

To ensure that a file can be written to (a requirement of write cache), OSTs reserve ("grant") chunks of space for each newly-created file. A grant may cause an OST to report that it is out of space, even though there is plenty of space on the disk, because the space is "reserved" by other files.
The loadgen utility estimates the number of simultaneous open files as the disk size divided by the grant size and reports that number when the write tests are first started.

Echo Server

The loadgen utility can start an echo server. On another node, loadgen can specify the echo server as the device, thus creating a network-only test environment.

    loadgen> echosrv
    loadgen> dl
    0 UP obdecho echosrv echosrv 3
    1 UP ost OSS OSS 3

On another node:

    loadgen> device echosrv cfs21@tcp
    Added uuid OSS_UUID: 192.168.0.21@tcp
    Target OST name is 'echosrv'
    loadgen> st 1
    start 0 to 1
    ./loadgen: running thread #1
    loadgen> wr 1
    start a test_brw write test on X clients for Y iterations
    usage: write []
    loadgen> wr 1 1
    loadgen>
    1: i0
    1: done (0)
    wrote 1MB in 0.029s (34.023 MB/s)

Scripting

The threads all perform their actions in non-blocking mode; use the wait command to block for the idle state. For example:

    #!/bin/bash
    ./loadgen << EOF
    device lustre-OST0000
    st 1
    wr 1 10
    wait
    quit
    EOF

Feature Requests

The loadgen utility is intended to grow into a more comprehensive test tool; feature requests are encouraged. The current feature requests include:

• Locking simulation
  • Many (echo) clients cache locks for the specified resource at the same time.
  • Many (echo) clients enqueue locks for the specified resource simultaneously.
• obdsurvey functionality
  • Fold the Lustre I/O kit's obdsurvey script functionality into loadgen

llog_reader

The llog_reader utility translates a Lustre configuration log into human-readable form.

lr_reader

The lr_reader utility translates a last received (last_rcvd) file into human-readable form.

The following utilities are part of the Lustre I/O kit. For more information, see Lustre I/O Kit.

sgpdd_survey

The sgpdd_survey utility tests 'bare metal' performance, bypassing as much of the kernel as possible.
The sgpdd_survey tool does not require Lustre, but it does require the sgp_dd package.

Caution – The sgpdd_survey utility erases all data on the device.

obdfilter_survey

The obdfilter_survey utility is a shell script that tests performance of isolated OSTs, the network via echo clients, and an end-to-end test.

ior-survey

The ior-survey utility is a script used to run the IOR benchmark. Lustre includes IOR version 2.8.6.

ost_survey

The ost_survey utility is an OST performance survey that tests client-to-disk performance of the individual OSTs in a Lustre file system.

stats-collect

The stats-collect utility contains scripts used to collect application profiling information from Lustre clients and servers.

32.5.7 Flock Feature

Lustre now includes the flock feature, which provides file locking support. Flock describes classes of file locks known as 'flocks'. Flock can apply or remove a lock on an open file as specified by the user. However, a single file may not, simultaneously, have both shared and exclusive locks.

By default, the flock utility is disabled on Lustre. Two modes are available.

local mode
  In this mode, locks are coherent on one node (a single-node flock), but not across all clients. To enable it, use -o localflock. This is a client-mount option.
  NOTE: This mode does not impact performance and is appropriate for single-node databases.

consistent mode
  In this mode, locks are coherent across all clients. To enable it, use -o flock. This is a client-mount option.
  CAUTION: This mode affects the performance of the file being flocked and may affect stability, depending on the Lustre version used. Consider using a newer Lustre version which is more stable. If the consistent mode is enabled and no applications are using flock, then it has no effect.

A call to use flock may be blocked if another process is holding an incompatible lock. Locks created using flock are applicable for an open file table entry. Therefore, a single process may hold only one type of lock (shared or exclusive) on a single file. Subsequent flock calls on a file that is already locked convert the existing lock to the new lock mode.

32.5.7.1 Example

    $ mount -t lustre -o flock mds@tcp0:/lustre /mnt/client

You can check it in /etc/mtab. It should look like this:

    mds@tcp0:/lustre /mnt/client lustre rw,flock 0 0

32.5.8 l_getgroups

The l_getgroups utility handles Lustre user / group cache upcall.

Synopsis

    l_getgroups [-v] [-d | mdsname] uid
    l_getgroups [-v] -s

Options

Option    Description

-d
  Debug - prints values to stdout instead of Lustre.

-s
  Sleep - mlocks memory in core and sleeps forever.

-v
  Verbose - logs start/stop to syslog.

mdsname
  MDS device name.

Description

The group upcall file contains the path to an executable file that, when properly installed, is invoked to resolve a numeric UID to a group membership list. This utility should complete the mds_grp_downcall_data structure and write it to the /proc/fs/lustre/mds/mds-service/group_info pseudo-file.

The l_getgroups utility is the reference implementation of the user or group cache upcall.

Files

The l_getgroups files are located at:

    /proc/fs/lustre/mds/mds-service/group_upcall

32.5.9 llobdstat

The llobdstat utility displays OST statistics.

Synopsis

    llobdstat ost_name [interval]

Description

The llobdstat utility displays a line of OST statistics for a given OST at specified intervals (in seconds).
Option    Description

ost_name
  Name of the OBD for which statistics are requested.

interval
  Time interval (in seconds) after which statistics are refreshed.

Example

    # llobdstat liane-OST0002 1
    /usr/bin/llobdstat on /proc/fs/lustre/obdfilter/liane-OST0002/stats
    Processor counters run at 2800.189 MHz
    Read: 1.21431e+07, Write: 9.93363e+08, create/destroy: 24/1499, stat: 34, punch: 18
    [NOTE: cx: create, dx: destroy, st: statfs, pu: punch ]
    Timestamp   Read-delta  ReadRate  Write-delta  WriteRate
    --------------------------------------------------------
    1217026053  0.00MB      0.00MB/s  0.00MB       0.00MB/s
    1217026054  0.00MB      0.00MB/s  0.00MB       0.00MB/s
    1217026055  0.00MB      0.00MB/s  0.00MB       0.00MB/s
    1217026056  0.00MB      0.00MB/s  0.00MB       0.00MB/s
    1217026057  0.00MB      0.00MB/s  0.00MB       0.00MB/s
    1217026058  0.00MB      0.00MB/s  0.00MB       0.00MB/s
    1217026059  0.00MB      0.00MB/s  0.00MB       0.00MB/s  st:1

Files

The llobdstat files are located at:

    /proc/fs/lustre/obdfilter//stats

32.5.10 llstat

The llstat utility displays Lustre statistics.

Synopsis

    llstat [-c] [-g] [-i interval] stats_file

Description

The llstat utility displays statistics from any of the Lustre statistics files that share a common format and are updated at a specified interval (in seconds). To stop statistics printing, type CTRL-C.

Options

Option    Description

-c
  Clears the statistics file.

-i
  Specifies the interval polling period (in seconds).

-g
  Specifies graphable output format.

-h
  Displays help information.
stats_file
  Specifies either the full path to a statistics file or a shorthand reference, mds or ost.

Example

To monitor /proc/fs/lustre/ost/OSS/ost/stats at 1 second intervals, run:

    llstat -i 1 ost

Files

The llstat files are located at:

    /proc/fs/lustre/mdt/MDS/*/stats
    /proc/fs/lustre/mds/*/exports/*/stats
    /proc/fs/lustre/mdc/*/stats
    /proc/fs/lustre/ldlm/services/*/stats
    /proc/fs/lustre/ldlm/namespaces/*/pool/stats
    /proc/fs/lustre/mgs/MGS/exports/*/stats
    /proc/fs/lustre/ost/OSS/*/stats
    /proc/fs/lustre/osc/*/stats
    /proc/fs/lustre/obdfilter/*/exports/*/stats
    /proc/fs/lustre/obdfilter/*/stats
    /proc/fs/lustre/llite/*/stats

32.5.11 lst

The lst utility starts LNET self-test.

Synopsis

    lst

Description

LNET self-test helps site administrators confirm that Lustre Networking (LNET) has been correctly installed and configured. The self-test also confirms that LNET, the network software and the underlying hardware are performing as expected.

Each LNET self-test runs in the context of a session. A node can be associated with only one session at a time, to ensure that the session has exclusive use of the nodes on which it is running. A single node creates, controls and monitors a single session. This node is referred to as the self-test console.

Any node may act as the self-test console. Nodes are named and allocated to a self-test session in groups. This allows all nodes in a group to be referenced by a single name.

Test configurations are built by describing and running test batches. A test batch is a named collection of tests, with each test composed of a number of individual point-to-point tests running in parallel. These individual point-to-point tests are instantiated according to the test type, source group, target group and distribution specified when the test is added to the test batch.
Modules

To run LNET self-test, load the following modules: libcfs, lnet, lnet_selftest and any one of the klnds (ksocklnd, ko2iblnd...). To load all necessary modules, run modprobe lnet_selftest, which recursively loads the modules on which lnet_selftest depends.

There are two types of nodes for LNET self-test: console and test. Both node types require all previously-specified modules to be loaded. (The userspace test node does not require these modules.)

Test nodes can either be in kernel or in userspace. A console user can invite a kernel test node to join the test session by running lst add_group NID, but the user cannot actively add a userspace test node to the test session. However, the console user can passively accept a test node to the test session while the test node runs lstclient to connect to the console.

Utilities

LNET self-test includes two user utilities, lst and lstclient.

lst is the user interface for the self-test console (run on the console node). It provides a list of commands to control the entire test system, such as create session, create test groups, etc.

lstclient is the userspace self-test program which is linked with userspace LNDs and LNET. A user can invoke lstclient to join a self-test session:

    lstclient --sesid CONSOLE_NID --group NAME

Example

This is an example of an LNET self-test script which simulates the traffic pattern of a set of Lustre servers on a TCP network, accessed by Lustre clients on an IB network (connected via LNET routers), with half the clients reading and half the clients writing.
    #!/bin/bash
    export LST_SESSION=$$
    lst new_session read/write
    lst add_group servers 192.168.10.[8,10,12-16]@tcp
    lst add_group readers 192.168.1.[1-253/2]@o2ib
    lst add_group writers 192.168.1.[2-254/2]@o2ib
    lst add_batch bulk_rw
    lst add_test --batch bulk_rw --from readers --to servers brw read check=simple size=1M
    lst add_test --batch bulk_rw --from writers --to servers brw write check=full size=4K
    # start running
    lst run bulk_rw
    # display server stats for 30 seconds
    lst stat servers & sleep 30; kill $!
    # tear down
    lst end_session

32.5.12 plot-llstat

The plot-llstat utility plots Lustre statistics.

Synopsis

    plot-llstat results_filename [parameter_index]

Options

Option    Description

results_filename
  Output generated by llstat.

parameter_index
  Value of parameter_index can be:
  1 - count per interval
  2 - count per second (default setting)
  3 - total count

Description

The plot-llstat utility generates a CSV file and instruction files for gnuplot from llstat output. Since llstat is generic in nature, plot-llstat is also a generic script. The value of parameter_index can be 1 for count per interval, 2 for count per second (default setting) or 3 for total count.

The plot-llstat utility creates a .dat (CSV) file using the number of operations specified by the user. The number of operations equals the number of columns in the CSV file. The values in those columns are equal to the corresponding value of parameter_index in the output file.

The plot-llstat utility also creates a .scr file that contains instructions for gnuplot to plot the graph. After generating the .dat and .scr files, the plot-llstat tool invokes gnuplot to display the graph.

Example

    llstat -i2 -g -c lustre-OST0000 > log
    plot-llstat log 3

32.5.13 routerstat

The routerstat utility prints Lustre router statistics.

Synopsis

    routerstat [interval]

Description

The routerstat utility watches LNET router statistics.
If no interval is specified, then statistics are sampled and printed only one time. Otherwise, statistics are sampled and printed at the specified interval (in seconds).

Options

The routerstat output includes the following fields:

Field    Description

M    msgs_alloc(msgs_max)
E    errors
S    send_length/send_count
R    recv_length/recv_count
F    route_length/route_count
D    drop_length/drop_count

Files

Routerstat extracts statistics data from:

    /proc/sys/lnet/stats

32.5.14 ll_recover_lost_found_objs

The ll_recover_lost_found_objs utility helps recover Lustre OST objects (file data) from a lost and found directory back to their correct locations.

Running the ll_recover_lost_found_objs tool is not strictly necessary to bring an OST back online; it only avoids losing access to objects that were moved to the lost and found directory due to directory corruption.

Synopsis

    $ ll_recover_lost_found_objs [-hv] -d directory

Description

The first time Lustre writes to an object, it saves the MDS inode number and the objid as an extended attribute on the object, so that the objects can be recovered if the OST directory becomes corrupted. Running e2fsck fixes the corrupted OST directory, but it puts all of the objects into a lost and found directory, where they are inaccessible to Lustre. Use the ll_recover_lost_found_objs utility to recover all (or at least most) objects from a lost and found directory back to their place in the O/0/d* directories.

To use ll_recover_lost_found_objs, mount the file system locally (using the mount -t ldiskfs command), run the utility and then unmount it again. The OST must not be mounted by Lustre when ll_recover_lost_found_objs is run.
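The mount, recover, unmount sequence described above can be sketched as a small wrapper. This is a minimal sketch under stated assumptions, not part of Lustre: the function names run and recover_ost_objects, the DRY_RUN switch, and the device and mount-point arguments are hypothetical. Only mount -t ldiskfs, the -d option and the lost+found directory come from the text. With DRY_RUN=1 (the default here) the commands are printed rather than executed.

```shell
#!/bin/bash
# Hypothetical wrapper around the documented procedure; not a Lustre tool.
# DRY_RUN=1 (default) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# recover_ost_objects <ost_device> <mount_point>
# The OST must not be mounted by Lustre while this runs.
recover_ost_objects() {
    local dev="$1" mnt="$2" rc
    if [ -z "$dev" ] || [ -z "$mnt" ]; then
        echo "usage: recover_ost_objects <ost_device> <mount_point>" >&2
        return 1
    fi
    # 1. Mount the OST backing file system locally as ldiskfs.
    run mount -t ldiskfs "$dev" "$mnt" || return 1
    # 2. Move objects from lost+found back to their O/0/d* directories.
    run ll_recover_lost_found_objs -d "$mnt/lost+found"
    rc=$?
    # 3. Unmount again, even if the recovery step reported an error.
    run umount "$mnt" || return 1
    return "$rc"
}
```

For example, `recover_ost_objects /dev/sdb1 /mnt/ost` (both arguments are placeholders) prints the three commands in order; setting DRY_RUN=0 would execute them and requires root privileges.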
Options

Field    Description

-h
  Prints a help message

-v
  Increases verbosity

-d directory
  Sets the lost and found directory path

Example

    ll_recover_lost_found_objs -d /mnt/ost/lost+found

CHAPTER 33

System Limits

This chapter describes various limits on the size of files and file systems. These limits are imposed by either the Lustre architecture or the Linux VFS and VM subsystems. In a few cases, a limit is defined within the code and could be changed by re-compiling Lustre. In those cases, the selected limit is supported by Lustre testing and may change in future releases. This chapter includes the following sections:

• Maximum Stripe Count
• Maximum Stripe Size
• Minimum Stripe Size
• Maximum Number of OSTs and MDTs
• Maximum Number of Clients
• Maximum Size of a File System
• Maximum File Size
• Maximum Number of Files or Subdirectories in a Single Directory
• MDS Space Consumption
• Maximum Length of a Filename and Pathname
• Maximum Number of Open Files for Lustre File Systems
• OSS RAM Size

33.1 Maximum Stripe Count

The maximum stripe count is 160. This limit is hard-coded, but is near the upper limit imposed by the underlying ext3 file system. It may be increased in future releases. Under normal circumstances, the stripe count is not affected by ACLs.

33.2 Maximum Stripe Size

For a 32-bit machine, the product of stripe size and stripe count (stripe_size * stripe_count) must be less than 2^32. The ext3 limit of 2 TB for a single file applies for a 64-bit machine. (Lustre can support 160 stripes of 2 TB each on a 64-bit system.)

33.3 Minimum Stripe Size

Due to the 64 KB PAGE_SIZE on some 64-bit machines, the minimum stripe size is set to 64 KB.

33.4 Maximum Number of OSTs and MDTs

You can set the maximum number of OSTs by a compile option. The limit of 1020 OSTs in Lustre release 1.4.7 is increased to a maximum of 8150 OSTs in 1.6.0.
Testing is in progress to move the limit to 4000 OSTs.

The maximum number of MDSs will be determined after accomplishing MDS clustering.

33.5 Maximum Number of Clients

Currently, the number of clients is limited to 131072. We have tested up to 22000 clients.

33.6 Maximum Size of a File System

For i386 systems with 2.6 kernels, the block devices are limited to 16 TB. Each OST or MDT can have a file system up to 16 TB, regardless of whether 32-bit or 64-bit kernels are on the server.

You can have multiple OST file systems on a single node. Currently, the largest production Lustre file system has 448 OSTs in a single file system. There is a compile-time limit of 8150 OSTs in a single file system, giving a theoretical file system limit of nearly 64 PB.

Several production Lustre file systems have around 200 OSTs in a single file system. The largest file system in production is at least 1.3 PB (184 OSTs). All these facts indicate that Lustre would scale well if more hardware were made available.

33.7 Maximum File Size

Individual files have a hard limit of nearly 16 TB on 32-bit systems, imposed by the kernel memory subsystem. On 64-bit systems this limit does not exist, so file sizes can use the full 64 bits.

Lustre imposes an additional limit: a file can be no larger than its number of stripes multiplied by 2 TB, the maximum size of each stripe.

A single file can have a maximum of 160 stripes, which gives an upper single file limit of 320 TB for 64-bit systems. The actual amount of data that can be stored in a file depends upon the amount of free space in each OST on which the file is striped.

33.8 Maximum Number of Files or Subdirectories in a Single Directory

Lustre uses the ext3 hashed directory code, which has a limit of about 25 million files. On reaching this limit, the directory grows to more than 2 GB depending on the length of the filenames.
The limit on subdirectories is the same as the limit on regular files in all later versions of Lustre due to a small ext3 format change.

In fact, Lustre is tested with ten million files in a single directory. On a properly-configured dual-CPU MDS with 4 GB RAM, random lookups in such a directory are possible at a rate of 5,000 files / second.

33.9 MDS Space Consumption

A single MDS imposes an upper limit of 4 billion inodes. By default, the number of inodes is slightly less than the device size divided by 4 KB, which means about 512 M inodes for a file system with an MDS of 2 TB. This can be increased initially, at the time of MDS file system creation, by specifying the --mkfsoptions='-i 2048' option on the --add mds config line for the MDS.

For newer releases of e2fsprogs, you can specify '-i 1024' to create 1 inode for every 1 KB of disk space. You can also specify '-N {num inodes}' to set a specific number of inodes. The inode size (-I) should not be larger than half the inode ratio (-i). Otherwise, mke2fs will spin trying to write more inodes than can fit into the device.

For more information, see Options for Formatting the MDT and OSTs.

33.10 Maximum Length of a Filename and Pathname

This limit is 255 bytes for a single filename, the same as in an ext3 file system. The Linux VFS imposes a full pathname length of 4096 bytes.

33.11 Maximum Number of Open Files for Lustre File Systems

Lustre does not impose a maximum number of open files; in practice, the number depends on the amount of RAM on the MDS. There are no "tables" for open files on the MDS, as they are only linked in a list to a given client's export. Each client process probably has a limit of several thousand open files, which depends on the ulimit.

33.12 OSS RAM Size

For a single OST, there is no strict rule to size the OSS RAM. However, as a guideline for Lustre 1.8 installations, 2 GB per OST is a reasonable RAM size.
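Several figures quoted in this chapter follow from simple arithmetic, which can be checked with shell integer arithmetic. The following is a minimal sketch: the function names are illustrative and are not Lustre tools, and the inputs are the values stated in the sections above (160 stripes of 2 TB, one inode per 4 KB on a 2 TB MDS, 2 GB of RAM per OST).

```shell
#!/bin/bash
# Illustrative helpers only; none of these are Lustre utilities.

# Upper file size limit in TB: stripe count times per-stripe limit in TB.
max_file_size_tb() { echo $(( $1 * $2 )); }

# Approximate default inode count: device size in bytes / bytes per inode.
mds_inode_count() { echo $(( $1 / $2 )); }

# OSS RAM guideline in GB: OST count times GB per OST (2 GB if omitted).
oss_ram_gb() { echo $(( $1 * ${2:-2} )); }

TB=$(( 1024 ** 4 ))

max_file_size_tb 160 2               # 320 (TB, section 33.7)
mds_inode_count $(( 2 * TB )) 4096   # 536870912, i.e. 512 * 1024 * 1024 (section 33.9)
oss_ram_gb 8                         # 16 (GB for a hypothetical OSS serving 8 OSTs)
```

The inode figure is an upper bound for the default ratio; the text notes that the real default is slightly lower.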
For details on determining the memory needed for an OSS node, see OSS Memory Requirements.

Glossary

A

ACL
Access Control List. An extended attribute associated with a file which contains authorization directives.

Administrative OST failure
A configuration directive given to a cluster to declare that an OST has failed, so errors can be immediately returned.

C

CFS
Cluster File Systems, Inc., a United States corporation founded in 2001 by Peter J. Braam to develop, maintain and support Lustre.

CMD
Clustered metadata, a collection of metadata targets implementing a single file system namespace.

Completion Callback
An RPC made by an OST or MDT to another system, usually a client, to indicate that the lock request is now granted.

Configlog
An llog file used in a node, or retrieved from a management server over the network, with configuration instructions for Lustre systems at startup time.

Configuration Lock
A lock held by every node in the cluster to control configuration changes. When callbacks are received, the nodes quiesce their traffic, cancel the lock and await configuration changes, after which they reacquire the lock before resuming normal operation.

D

Default stripe pattern
Information in the LOV descriptor that describes the default stripe count used for new files in a file system. This can be amended by using a directory stripe descriptor or a per-file stripe descriptor.

Direct I/O
A mechanism which can be used during read and write system calls. It bypasses the kernel I/O cache, avoiding memory copies of data between kernel and application memory address spaces.

Directory stripe descriptor
An extended attribute that describes the default stripe pattern for files underneath that directory.

E

EA
Extended Attribute. A small amount of data which can be retrieved through a name associated with a particular inode.
Lustre uses EAs to store striping information (location of file data on OSTs). Examples of extended attributes are ACLs, striping information, and crypto keys.

Eviction
The process of eliminating server state for a client that is not returning to the cluster after a timeout or if server failures have occurred.

Export
The state held by a server for a client that is sufficient to transparently recover all in-flight operations when a single failure occurs.

Extent Lock
A lock used by the OSC to protect an extent in a storage object for concurrent control of read/write, file size acquisition and truncation operations.

F

Failback
The failover process in which the default active server regains control over the service.

Failout OST
An OST which is not expected to recover if it fails to answer client requests. A failout OST can be administratively failed, thereby enabling clients to return errors when accessing data on the failed OST without making additional network requests.

Failover
The process by which a standby computer server system takes over for an active computer server after a failure of the active node. Typically, the standby computer server gains exclusive access to a shared storage device between the two servers.

FID
Lustre File Identifier. A collection of integers which uniquely identify a file or object. The FID structure contains a sequence, identity and version number.

Fileset
A group of files that are defined through a directory that represents a file system's start point.

FLDB
FID Location Database. This database maps a sequence of FIDs to a server which is managing the objects in the sequence.

Flight Group
A group of I/O transfer operations initiated in the OSC that are simultaneously in transit between two endpoints. Tuning the flight group size correctly leads to a full pipe.
G

Glimpse callback
An RPC made by an OST or MDT to another system, usually a client, to indicate that an extent lock it is holding should be surrendered if it is not in use. If the system is using the lock, then the system should report the object size in the reply to the glimpse callback. Glimpses were introduced to optimize the acquisition of file sizes.

Group Lock

Group upcall

I

Import
The state held by a client to fully recover a transaction sequence after a server failure and restart.

Intent Lock
A special locking operation introduced by Lustre into the Linux kernel. An intent lock combines a request for a lock with the full information to perform the operation(s) for which the lock was requested. This offers the server the option of granting the lock or performing the operation and informing the client of the operation result without granting a lock. The use of intent locks enables metadata operations (even complicated ones) to be implemented with a single RPC from the client to the server.

IOV
I/O vector. A buffer destined for transport across the network which contains a collection (also known as a vector) of blocks with data.

K

Kerberos
An authentication mechanism, optionally available in an upcoming Lustre version as a GSS backend.

L

LBUG
A bug that Lustre writes into a log indicating a serious system failure.

LDLM
Lustre Distributed Lock Manager.

lfs
The Lustre File System configuration tool for end users to set/check file striping, etc. See lfs.

lfsck
Lustre File System Check. A distributed version of a disk file system checker. Normally, lfsck does not need to be run, except when file systems are damaged through multiple disk failures and other means that cannot be recovered using file system journal recovery.

liblustre
Lustre library. A user-mode Lustre client linked into a user program for Lustre fs access.
liblustre clients cache no data, do not need to give back locks on time, and can recover safely from an eviction. They should not participate in recovery.

Llite
Lustre lite. This term is in use inside the code and module names to indicate that code elements are related to the Lustre file system.

Llog
Lustre log. A log of entries used internally by Lustre. An llog is suitable for rapid transactional appends of records and cheap cancellation of records through a bitmap.

Llog Catalog
Lustre log catalog. An llog with records that each point at an llog. Catalogs were introduced to give llogs almost infinite size. llogs have an originator which writes records and a replicator which cancels records (usually through an RPC) when the records are not needed.

LMV
Logical Metadata Volume. A driver that hides from the Lustre client the fact that it is working with a metadata cluster instead of a single metadata server.

LND
Lustre Network Driver. A code module that enables LNET support over a particular transport, such as TCP and various kinds of InfiniBand, Elan or Myrinet.

LNET
Lustre Networking. A message passing network protocol capable of running and routing through various physical layers. LNET forms the underpinning of LNETrpc.

Load-balancing MDSs
A cluster of MDSs that perform load balancing of system requests.

Lock Client
A module that makes lock RPCs to a lock server and handles revocations from the server.

Lock Server
A system that manages locks on certain objects. It also issues lock callback requests while servicing lock requests and, for objects that are already locked, completes the lock requests.

LOV
Logical Object Volume. The object storage analog of a logical volume in a block device volume management system, such as LVM or EVMS. The LOV is primarily used to present a collection of OSTs as a single device to the MDT and client file system drivers.
LOV descriptor
A set of configuration directives which describes which nodes are OSS systems in the Lustre cluster, providing names for their OSTs.

Lustre
The name of the project chosen by Peter Braam in 1999 for an object-based storage architecture. Now the name is commonly associated with the Lustre file system.

Lustre client
An operating instance with a mounted Lustre file system.

Lustre file
A file in the Lustre file system. The implementation of a Lustre file is through an inode on a metadata server which contains references to a storage object on OSSs.

Lustre lite
A preliminary version of Lustre developed for LLNL in 2002. With the release of Lustre 1.0 in late 2003, Lustre Lite became obsolete.

Lvfs
A library that provides an interface between Lustre OSD and MDD drivers and file systems; this avoids introducing file system-specific abstractions into the OSD and MDD drivers.

M

Mballoc
Multi-Block-Allocate. Lustre functionality that enables the ldiskfs file system to allocate multiple blocks with a single request to the block allocator. Normally, an ldiskfs file system allocates only one block per request.

MDC
MetaData Client. Lustre client component that sends metadata requests via RPC over LNET to the Metadata Target (MDT).

MDD
MetaData Disk Device. Lustre server component that interfaces with the underlying Object Storage Device to manage the Lustre file system namespace (directories, file ownership, attributes).

MDS
MetaData Server. Server node that is hosting the Metadata Target (MDT).

MDT
Metadata Target. A metadata device made available through the Lustre meta-data network protocol.

Metadata Write-back Cache
A cache of metadata updates (mkdir, create, setattr, other operations) which an application has performed, but which have not yet been flushed to a storage device or server.

MGS
Management Service. A software module that manages the startup configuration and changes to the configuration.
Also, the server node on which this system runs.

Mountconf
The Lustre configuration protocol (introduced in version 1.6) which formats disk file systems on servers with the mkfs.lustre program, and prepares them for automatic incorporation into a Lustre cluster.

N

NAL
An older, obsolete term for LND.

NID
Network Identifier. Encodes the type, network number and network address of a network interface on a node for use by Lustre.

NIO API
A subset of the LNET RPC module that implements a library for sending large network requests, moving buffers with RDMA.

O

OBD
Object Device. The base class of layering software constructs that provides Lustre functionality.

OBD API
See Storage Object API.

OBD type
Module that can implement the Lustre object or metadata APIs. Examples of OBD types include the LOV, OSC and OSD.

Obdfilter
An older name for the OSD device driver.

Object device
An instance of an object that exports the OBD API.

Object storage
Refers to a storage-device API or protocol involving storage objects. The two most well known instances of object storage are the T10 iSCSI storage object protocol and the Lustre object storage protocol (a network implementation of the Lustre object API). The principal difference between the Lustre and T10 protocols is that Lustre includes locking and recovery control in the protocol and is not tied to a SCSI transport layer.

opencache
A cache of open file handles. This is a performance enhancement for NFS.

Orphan objects
Storage objects for which there is no Lustre file pointing at them. Orphan objects can arise from crashes and are automatically removed by an llog recovery. When a client deletes a file, the MDT gives back a cookie for each stripe. The client then sends the cookie and directs the OST to delete the stripe. Finally, the OST sends the cookie back to the MDT to cancel it.

Orphan handling
A component of the metadata service which allows for recovery of open, unlinked files after a server crash.
The implementation of this feature retains open, unlinked files as orphan objects until it is determined that no clients are using them.

OSC
Object Storage Client. The client unit talking to an OST (via an OSS).

OSD
Object Storage Device. A generic industry term for storage devices with a more extended interface than block-oriented devices, such as disks. Lustre uses this name to describe a software module that implements an object storage API in the kernel. Lustre also uses this name to refer to an instance of an object storage device created by that driver. The OSD device is layered on a file system, with methods that mimic create, destroy and I/O operations on file inodes.

OSS
Object Storage Server. A server OBD that provides access to local OSTs.

OST
Object Storage Target. An OSD made accessible through a network protocol. Typically, an OST is associated with a unique OSD which, in turn, is associated with a formatted disk file system on the server containing the storage objects.

P

Pdirops
A locking protocol introduced in the VFS by CFS to allow for concurrent operations on a single directory inode.

pool
OST pools allow the administrator to associate a name with an arbitrary subset of OSTs in a Lustre cluster. A group of OSTs can be combined into a named pool with unique access permissions and stripe characteristics.

Portal
A concept used by LNET. LNET messages are sent to a portal on a NID. Portals can receive packets when a memory descriptor is attached to the portal. Portals are implemented as integers. Examples of portals are the portals on which certain groups of object, metadata, configuration and locking requests and replies are received.

PTLRPC
An RPC protocol layered on LNET. This protocol deals with stateful servers and has exactly-once semantics and built-in support for recovery.
R

Recovery
The process that re-establishes the connection state when a client that was previously connected to a server reconnects after the server restarts.

Replay
The concept of re-executing a server request after the server lost information in its memory caches and shut down. The replay requests are retained by clients until the server(s) have confirmed that the data is persistent on disk. Only requests for which a client has received a reply are replayed.

Re-sent request
A request that has seen no reply can be re-sent after a server reboot.

Revocation Callback
An RPC made by an OST or MDT to another system, usually a client, to revoke a granted lock.

Rollback
The concept that server state is lost in a crash because it was cached in memory and not yet persistent on disk.

Root squash
A mechanism whereby the identity of a root user on a client system is mapped to a different identity on the server to avoid root users on clients gaining broad permissions on servers. Typically, for management purposes, at least one client system should not be subject to root squash.

routing
LNET routing between different networks and LNDs.

RPC
Remote Procedure Call. A network encoding of a request.

S

Storage Object API
The API that manipulates storage objects. This API is richer than that of block devices and includes the create/delete of storage objects, read/write of buffers from and to certain offsets, set attributes and other storage object metadata.

Storage Objects
A generic concept referring to data containers, similar/identical to file inodes.

Stride
A contiguous, logical extent of a Lustre file written to a single OST.

Stride size
The maximum size of a stride, typically 4 MB.

Stripe count
The number of OSTs holding objects for a RAID0-striped Lustre file.

Striping metadata
The extended attribute associated with a file that describes how its data is distributed over storage objects. See also default stripe pattern.
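The entries for stride, stride size and stripe count together describe a RAID0-style layout. A minimal sketch of how a file offset could map onto a stripe under that layout is shown here; the function name is hypothetical, plain round-robin placement is an assumption made for illustration (it is not Lustre's allocator or API), and the 4 MB default is the "typical" stride size quoted above.

```python
def locate_offset(offset: int, stripe_count: int,
                  stride_size: int = 4 * 1024 * 1024) -> tuple[int, int]:
    """Map a file byte offset to (stripe index, byte offset within that stripe's object).

    Assumes plain round-robin RAID0: stride k of the file lands on
    stripe k % stripe_count.
    """
    if offset < 0 or stripe_count < 1 or stride_size < 1:
        raise ValueError("offset must be >= 0; stripe_count and stride_size must be >= 1")
    stride_number, within_stride = divmod(offset, stride_size)
    stripe_index = stride_number % stripe_count
    # Every complete pass over all stripes adds one stride to each object.
    object_offset = (stride_number // stripe_count) * stride_size + within_stride
    return stripe_index, object_offset
```

With four stripes, for example, the first 4 MB of the file falls on stripe 0, the next 4 MB on stripe 1, and the fifth 4 MB wraps around to stripe 0 again, starting 4 MB into that stripe's object.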
T

T10 object protocol
An object storage protocol tied to the SCSI transport layer. Lustre does not use T10.

W

Wide striping
Strategy of using many OSTs to store stripes of a single file. This obtains maximum bandwidth to a single file through parallel utilization of many OSTs.

Cray Linux Environment™ (CLE) 4.0 Software Release Overview

S–2425–40

© 2011 Cray Inc. All Rights Reserved. This document or parts thereof may not be reproduced in any form unless permitted by contract or by written permission of Cray Inc.

U.S. GOVERNMENT RESTRICTED RIGHTS NOTICE

The Computer Software is delivered as "Commercial Computer Software" as defined in DFARS 48 CFR 252.227-7014. All Computer Software and Computer Software Documentation acquired by or for the U.S. Government is provided with Restricted Rights. Use, duplication or disclosure by the U.S. Government is subject to the restrictions described in FAR 48 CFR 52.227-14 or DFARS 48 CFR 252.227-7014, as applicable. Technical Data acquired by or for the U.S. Government, if any, is provided with Limited Rights. Use, duplication or disclosure by the U.S. Government is subject to the restrictions described in FAR 48 CFR 52.227-14 or DFARS 48 CFR 252.227-7013, as applicable.
Cray, LibSci, and PathScale are federally registered trademarks and Active Manager, Cray Apprentice2, Cray Apprentice2 Desktop, Cray C++ Compiling System, Cray CX, Cray CX1, Cray CX1-iWS, Cray CX1-LC, Cray CX1000, Cray CX1000-C, Cray CX1000-G, Cray CX1000-S, Cray CX1000-SC, Cray CX1000-SM, Cray CX1000-HN, Cray Fortran Compiler, Cray Linux Environment, Cray SHMEM, Cray X1, Cray X1E, Cray X2, Cray XD1, Cray XE, Cray XEm, Cray XE5, Cray XE5m, Cray XE6, Cray XE6m, Cray XK6, Cray XMT, Cray XR1, Cray XT, Cray XTm, Cray XT3, Cray XT4, Cray XT5, Cray XT5h, Cray XT5m, Cray XT6, Cray XT6m, CrayDoc, CrayPort, CRInform, ECOphlex, Gemini, Libsci, NodeKARE, RapidArray, SeaStar, SeaStar2, SeaStar2+, The Way to Better Science, Threadstorm, and UNICOS/lc are trademarks of Cray Inc.

AMD, AMD Opteron, and Opteron are trademarks of Advanced Micro Devices, Inc. DDN is a trademark of DataDirect Networks. GNU is a trademark of The Free Software Foundation. GPFS and IBM are trademarks of International Business Machines Corporation. InfiniBand is a trademark of InfiniBand Trade Association. Intel is a trademark of Intel Corporation or its subsidiaries in the United States and other countries. LSF, Platform, Platform Computing, and Platform LSF are trademarks of Platform Computing Corporation. LSI is a trademark of LSI Logic Corporation. Linux is a trademark of Linus Torvalds. Moab and TORQUE are trademarks of Adaptive Computing Enterprises, Inc. Lustre, MySQL, MySQL Pro, NFS, and Solaris are trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Novell, SUSE, and openSUSE are trademarks of Novell, Inc. PBS Professional is a trademark of Altair Grid Technologies. PGI is a trademark of The Portland Group Compiler Technology, STMicroelectronics, Inc. PanFS is a trademark of Panasas, Inc. QLogic is a trademark of QLogic Corporation. UNIX is a trademark of The Open Group. All other trademarks are the property of their respective owners.
RECORD OF REVISION

S–2425–40 Published June 2011 Supports the 4.0 release of the Cray Linux Environment (CLE) operating system running on Cray XE systems.

3.1 Published June 2010 Supports the 3.1 release of the Cray Linux Environment (CLE) operating system running on Cray XT and Cray XE systems.

3.0 Published March 2010 Supports the 3.0 release of the Cray Linux Environment (CLE) operating system running on Cray XT6 systems.

Cray Linux Environment™ (CLE) 4.0 Software Release Overview

Contents

Introduction [1]
1.1 Emphasis for the CLE 4.0 Release
1.2 Supported System Configurations
1.3 Description of the CLE 4.0 Release Software
1.4 CLE 4.0 Support Policy

Software Enhancements [2]
2.1 Software Enhancements in CLE 4.0
2.1.1 Cray XK6 Hardware Support (Deferred Implementation until CLE 4.0.UP01)
2.1.2 SUSE Linux Enterprise Server (SLES) 11 SP (Service Pack) 1 Upgrade
2.1.3 Cluster Compatibility Mode (CCM) Platform LSF support (Deferred Implementation)
2.1.4 Lustre Upgraded to 1.8.4
2.2 Software Enhancements in CLE 3.1.UP03
2.2.1 dumpd Daemon to Initiate Automatic Dump and Reboot of Nodes
2.2.2 CCM (Cluster Compatibility Mode) Enhancements in CLE 3.1.UP03
2.2.3 Configuring a Virtual Local Area Network Interface
2.2.4 Enhanced Node Placement Scheme for Gemini Systems
2.2.5 RSIP Daemon Log File (syslog) Logging
2.2.6 New Cray OpenSM Startup Scripts
2.2.7 New Options for the lustre_control.sh Utility
2.2.8 CLEinstall Program Enhancements
2.2.9 Application Level Placement Scheduler (ALPS) Enhancements
2.2.10 DVS bulk_rw Mode and DVS-specific ioctl() Commands
2.3 Software Enhancements in CLE 3.1.UP02
2.3.1 DVS POSIX Atomicity
2.3.2 Changes to GPCD (Gemini Performance Counters Device) to Improve Performance when Accessing Memory Mapped Registers Using CrayPat
2.3.3 Repurposed Compute Node Support for PBS Professional
2.3.4 New xtverifydefaults Command
2.3.5 Core Specialization is NUMA-aware
2.4 Software Enhancements in CLE 3.1.UP01
2.4.1 Gemini Network Resiliency and Congestion Management
2.4.2 Repurposed Compute Nodes
2.4.3 Lustre Upgraded to Version 1.8.2
2.4.4 Application Completion Reporting
2.4.5 Topology and NID Ordering on Cray XE Systems

Compatibilities and Differences [3]
3.1 Binary Compatibility
3.2 Changes to the Application Level Placement Scheduler (ALPS)
3.2.1 Configuration of ALPS Shared Directory
3.2.2 Extraction of Some 3rd Party Software from ALPS
3.3 Commands Removed from the Release
3.4 Software Packages/Releases That Must be Reinstalled
3.5 lustre_control.sh -c mount/unmount No Longer Requires Passwordless SSH to Mount/Unmount Clients
3.6 Installation and Configuration Changed Functionality for System Administrators
3.6.1 Supported Upgrade Path
3.6.2 System Management Workstation (SMW) Upgrade Requirements
3.6.3 Installation Time Required
3.6.4 Changes to the CLEinstall.conf Installation Configuration File

Documentation [4]
4.1 Accessing Product Documentation
4.2 Cray-developed Books Provided with This Release
4.2.1 Additional Cray-developed Release Documents
4.3 Third-party Books Provided with This Release
4.4 Changes to Man Pages
4.4.1 Removed Cray Man Pages
4.4.2 Changed Cray Man Pages in CLE 4.0
4.5 Other Related Documents Available
4.6 Additional Documentation Resources

Release Contents [5]
5.1 Hardware Requirements
5.2 Software Requirements
5.2.1 Release Level Requirements for Other Cray Software Products
5.2.2 Third-party Software Requirements
5.3 Supported Upgrade Path
5.4 Contents of the Release Package
5.4.1 CLE 4.0 Software Components
5.5 Licensing

Tables
Table 1. CLE 4.0 Installation Support by Hardware Platform
Table 2. Books Provided with This Release
Table 3. Other Related Documents Available
Table 4. Additional Documentation Resources
Table 5. Minimum Release Level Requirements for Other Software Products with CLE 4.0
Table 6. Minimum Release Level Requirements for Third-party Compilers with CLE 4.0
Table 7. Third-party Batch System Software Products Available for Cray Systems

Introduction [1]

This document provides an overview of the Cray Linux Environment (CLE) 4.0 operating system release package and highlights new functionality and changes from previous CLE releases. The CLE 4.0 release supports Cray XE systems. Throughout this document, any reference to Cray systems includes all supported Cray systems unless otherwise noted. For a complete description of hardware platforms supported with the CLE 4.0 release, see Table 1.

Chapter 2, Software Enhancements, and Chapter 3, Compatibilities and Differences, describe changes made since the 3.1 version of the CLE operating system software. This information is provided as a service to users and administrators who are familiar with the CLE 3.1 release.

This document is focused on the differences between the CLE 3.1 and CLE 4.0 releases. If you have a Cray XE system running a CLE 3.1 update package, not all features and differences described in this document are new to you. The features from the 3.1 update packages are listed starting with Software Enhancements in CLE 3.1.UP03.

This document does not describe hardware, software, or installation of related products, such as the Cray Compiling Environment or products that Cray does not provide. To determine the release levels of other software products that are compatible with CLE 4.0, see Software Requirements.
1.1 Emphasis for the CLE 4.0 Release

The CLE 4.0 release provides the following key enhancements:

• Cray XK6 Hardware Support (Deferred Implementation until CLE 4.0.UP01) A future update package will provide CLE support for Cray XK6 blades featuring NVIDIA Tesla-based GPGPU (General Purpose Graphics Processing Unit) accelerators that serve as highly threaded coprocessors within compute nodes on Cray XK6 blades.

• SUSE Linux Enterprise Server (SLES) 11 Service Pack 1 (SP1) Upgrade. (4.0 Change) Cray's customized version of the Linux operating system is upgraded to SLES 11 SP1 from SLES 11.

• Cluster Compatibility Mode (CCM) Platform LSF support (Deferred implementation) CCM is modified to support the Platform LSF workload management system.

• Lustre File System Upgrade. The Lustre file system from Oracle is upgraded to version 1.8.4.

Note: The Lustre file system upgrade was introduced in CLE 3.1.UP03; however, it is emphasized here as a high-impact feature for those upgrading from an earlier release level.

The SMW 6.0 release also provides several key enhancements that directly impact the operation of your Cray system running the CLE 4.0 release. For more information, see the README provided with the SMW Release Software.

Note: The CLE 4.0.UP00 release supports Cray XE systems only.

1.2 Supported System Configurations

Cray Linux Environment (CLE) release 4.0 supports Cray XE systems. The base release (CLE 4.0.UP00) supports initial, upgrade, and migration software installations on the following platforms: Cray XE6, Cray XE6m, Cray XE5, and Cray XE5m systems. CLE 4.0 update packages will support additional Cray hardware platforms.

Installation types in Table 1 are defined as follows:

Initial
A new or fresh software installation involves installing and configuring the entire system and is generally performed for new hardware.
If an initial installation is performed on an existing system, the previous configuration is lost.

Upgrade
A software upgrade installation involves moving to the next release of a software package. In certain cases, such as when CLE requires a newer version of SLES, upgrading involves installing both the new CLE release and the underlying operating system (that is, a migration that includes an upgrade).

Migration
A CLE migration installation involves moving to a new release level of the CLE software package. A migration is inclusive of an upgrade. A migration is required when the target release includes a newer level of SLES. For example, upgrading to CLE 4.0 from a system that is running CLE 3.1 requires a migration from SLES 11 to SLES 11 SP1.

Update
A software update installation involves applying an update package for a major release that is already running on your system.

Table 1. CLE 4.0 Installation Support by Hardware Platform

Hardware Platform              Installation Type               Target Availability
Cray XE6 or Cray XE6m          Initial or Migration            June 2011 (base release)
Cray XE5 or Cray XE5m          Initial or Migration            June 2011 (base release)
Cray XK6 blades and systems    Initial, Migration, or Update   September 2011 (update package)

1.3 Description of the CLE 4.0 Release Software

CLE is a Linux-based operating system that runs on Cray systems. The CLE 4.0 release includes Cray's customized version of the SLES 11 SP1 operating system. All software is installed by means of scripts and RPM Package Manager (RPM) files. RPMs include related security fixes.

Important: The base CLE 4.0 release supports initial software installations or migrations and upgrades from CLE 3.1 and its update packages running on Cray XE systems.

For complete information about the release package, including detailed information about prerequisites for other Cray software products and the supported upgrade path, see Chapter 5, Release Contents.
1.4 CLE 4.0 Support Policy

Cray continually enhances the Cray Linux Environment (CLE) with new releases and periodically discontinues support for older releases. Our current policy is to support the latest major release of CLE and the previous major release.

Note: The previous major release for Cray XE systems was CLE 3.1.

• During the 12 months after initial release, CLE software is supported with update packages at approximately three- to six-month intervals, depending on need.

• Cray will provide patches for available critical and urgent bug fixes for a period of 18 months following an initial (generally available) CLE release.

• Beyond 18 months, support is limited to critical fixes on a best-effort basis.

All applicable recommended and security-related SUSE Linux updates released by Novell are included in the CLE releases and update packages. Security-related patches are also available through Field Notices (FNs). Contact your Cray representative for information about current software availability and release schedules.

Software Enhancements [2]

This chapter describes the software enhancements made to the Cray Linux Environment (CLE) since the CLE 3.1 base release. This information is provided as a service to users and administrators who are familiar with earlier CLE versions. For information about issues that you may encounter when using, installing, or maintaining CLE 4.0 (when compared to previous CLE releases), see Chapter 3, Compatibilities and Differences. In addition to the documentation noted in each feature description, see Cray-developed Books Provided with This Release.

2.1 Software Enhancements in CLE 4.0

2.1.1 Cray XK6 Hardware Support (Deferred Implementation until CLE 4.0.UP01)

Who will use this feature?

End users, programmers, site analysts, system administrators

What does this feature do?
A future update package of CLE will provide support for Cray XK6 systems. The Cray XK6 system is a "hybrid" massively parallel processing system. Each Cray XK6 blade consists of four compute nodes with up to 64 integer cores per blade. Each compute node has an AMD Opteron 6200 Series processor with 16 or 32 GB of memory and an NVIDIA Tesla-based GPGPU (General Purpose Graphics Processing Unit) or GPU processor with 6 GB of memory.

Cray XK6 blades can be used on Cray XE systems. For optimal use of compute node resources in mixed Cray XE systems with Cray XK6 compute blades, the system administrator can elect to assign Cray XK6 compute nodes to a batch queue, allowing users to make reservations for either scalar-only or accelerator-based compute node pools.

Initially, Cray will provide programming environment support with compilers from NVIDIA that support the CUDA (Compute Unified Device Architecture) and OpenCL (Open Computing Language) programming models. Cray will also provide NVIDIA's CUDA Toolkit, which includes some GPU-optimized libraries relevant to scientific computing, profiling tools, and a debugger. In future programming environment releases, Cray will provide compiler and language support in addition to libraries optimized for use with accelerators that could provide greater performance for applications that target accelerators.

How does this feature benefit customers?

Running applications on a Cray XK6 allows programmers and end users to potentially enhance the performance of their applications when they adapt their code to use the NVIDIA GPUs.

Does this feature provide any performance improvements?

Yes. Provided that the application is ported to use the GPUs, there is a possibility of significant performance improvements for certain applications.
Customer-visible software and hardware requirements:

GPU accelerators for Cray systems will be supported with the CLE 4.0.UP01 and SMW 6.0.UP01 update packages. Additional software and hardware requirements will be given with those release packages.

2.1.2 SUSE Linux Enterprise Server (SLES) 11 SP (Service Pack) 1 Upgrade

How does this feature benefit customers?

This helps the CLE software stack keep pace with the SLES product life-cycle and ensures that CLE releases include relevant security and bug fixes. Additionally, some new features supported by the CLE 4.0 release require specific functionality that is available with SLES 11 SP1.

What does this feature do?

The update to SLES 11 SP1 aligns the CLE software stack with newer versions of the SLES operating system. The CLE 4.0 software stack is based on the SLES 11 SP1 version of the Linux operating system and a Linux 2.6.32 kernel. Cray-specific kernel features, the user-level environment, software subsystems, and the installation/build system are now based on SLES 11 SP1.

Where can I find more information about this feature?

For information about the contents of SLES 11 SP1 and Linux in general, refer to the following third-party and open-source websites:

• SLES 11 Documentation: see http://www.novell.com/linux

• Release notes specific to SLES 11 SP1: see http://www.novell.com/linux/releasenotes/x86_64/SUSE-SLES/11-SP1/

• The Linux Documentation Project: see http://www.tldp.org

Updated Linux man pages are included with the CLE 4.0 release. For complete information regarding changes to specific commands due to the upgrade to SLES 11 SP1, see the associated man pages. To access Linux man pages, use the man command on a login node.

2.1.3 Cluster Compatibility Mode (CCM) Platform LSF support (Deferred Implementation)

Deferred Implementation: This feature is deferred to a future CLE 4.0 update package.

How does this feature benefit customers?
Sites that use Platform LSF as a workload management system for their Cray system can now use Cluster Compatibility Mode (CCM).

What does this feature do?

Cluster Compatibility Mode (CCM) allows ISV (independent software vendor) cluster applications to run on Cray's MPP architectures. CCM is tightly coupled to the batch system. The user running an ISV cluster application makes a reservation request with the batch system for a CCM application and then runs the application using ccmrun. Initially, CCM supported Moab with TORQUE and PBS Professional. It is now modified to work with Platform LSF.

Where can I find more information?

Note: Standard Cray publications do not currently document this feature. Please see the CLE 4.0 README and the LSF-README.txt for more information on configuration of Platform LSF for CCM.

The following documentation is provided with Platform LSF software:

• Administering Platform LSF Guide

• Platform LSF Command Reference Guide

Also see http://www.platform.com for more information.

2.1.4 Lustre Upgraded to 1.8.4

How does this feature benefit customers?

A large number of existing bugs are fixed with this new version of Lustre.

What does this feature do?

When you update or upgrade your system to CLE 4.0, your Lustre file system software is automatically upgraded to version 1.8.4. This change is transparent to both users and administrators. You can view version-specific change logs at http://wiki.lustre.org/index.php/Change_Log_1.8.

Note: Lustre version 1.8.4 was introduced in CLE 3.1.UP03, so if you are upgrading from that CLE release level, your Lustre version will not change.

2.2 Software Enhancements in CLE 3.1.UP03

2.2.1 dumpd Daemon to Initiate Automatic Dump and Reboot of Nodes

Who will use this feature?

System administrators and site analysts.

How can this dumpd enhancement help me?
System administrators can use the new dumpd functionality to decrease compute node down time and, if necessary, access evidence that can help identify the problem that caused the node to be unhealthy.

What does this enhancement do?

When a compute node fails a node health test, Node Health Checker (NHC) can request a dump or reboot of the node depending on the action associated with that test in the node health configuration file. The following are new node health actions, with descriptions of the steps taken for each action:

dump
Sets the compute node's state to admindown and requests a dump from the SMW, in accordance with the maxdumps configuration variable.

reboot
Sets the compute node's state to unavail and requests a reboot from the SMW.

dumpreboot
Sets the compute node's state to unavail and requests a dump and reboot from the SMW.

Requests for dumps and reboots can be made of the dumpd daemon in two ways:

• Node Health Checker (NHC) can automatically call dumpd (if so configured in the NHC and dumpd configuration files).

• System administrators can use the dumpd-request script on the SMW or the shared root, provided dumpd is enabled in the /etc/opt/cray-xt-dumpd/dumpd.conf file on the SMW. You can define additional actions in this configuration file.

In each case, dumpd must be enabled in the /etc/opt/cray-xt-dumpd/dumpd.conf file; otherwise, it will not handle any requests.

The dumpd daemon waits for requests from NHC (or from some other entity using the dumpd-request tool on the shared root). When dumpd gets a request, it creates an entry for the request in the mznhc database and calls the script /opt/cray-xt-dumpd/default/bin/executor on the SMW, which reads the configuration file /etc/opt/cray-xt-dumpd/dumpd.conf and performs the requested actions.

System administrators can use the dumpd-dbadmin script to view or delete entries in the mznhc database in a convenient manner.
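The three node health actions listed earlier in this section can be summarized as a small lookup table. The following Python sketch is only an illustration of that mapping: the names NODE_HEALTH_ACTIONS and plan_for are hypothetical and are not part of NHC or dumpd, and the real behavior is controlled by the NHC and dumpd configuration files.

```python
# Illustrative model of the node health actions described in this section.
# NODE_HEALTH_ACTIONS and plan_for() are hypothetical names, not part of
# NHC or dumpd; actual behavior is set in the NHC and dumpd configuration
# files (for example, dumps are also subject to the maxdumps variable).

NODE_HEALTH_ACTIONS = {
    # action: (state the compute node is set to, requests made to the SMW)
    "dump": ("admindown", ("dump",)),
    "reboot": ("unavail", ("reboot",)),
    "dumpreboot": ("unavail", ("dump", "reboot")),
}


def plan_for(action):
    """Return (node_state, smw_requests) for a node health action."""
    try:
        return NODE_HEALTH_ACTIONS[action]
    except KeyError:
        raise ValueError("unknown node health action: %r" % (action,))


if __name__ == "__main__":
    for name in ("dump", "reboot", "dumpreboot"):
        state, requests = plan_for(name)
        print("%-10s state=%s requests=%s" % (name, state, ", ".join(requests)))
```

Keeping the mapping in one table makes it easy to see that only the dump action leaves the node in the admindown state, while both reboot variants mark it unavail.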
The dumpd-dbadmin tool can be found in /opt/cray-xt-dumpd/default/bin.

The NHC and dumpd configuration files enable administrators to completely control when and how dumpd responds to nodes set to admindown by NHC, as well as how many nodes are dumped.

Where can I find more information?

For more information, see Managing System Software for Cray XE and Cray XT Systems and the intro_NHC(8), dumpd(8), dumpd-request(8), and dumpd-dbadmin(8) man pages. In addition, see the configuration files described in the intro_NHC(8) man page for specific system variables and examples.

2.2.2 CCM (Cluster Compatibility Mode) Enhancements in CLE 3.1.UP03

Who will use CCM enhancements?

End users, application developers, and system administrators.

How can the CCM enhancements help me?

You can use new command line options to override default behavior when you launch your cluster-based job. Additional enhancements in CCM improve performance and functionality for ISV applications.

What do the CCM enhancements do?

• The ccmrun command includes several new options to enable and disable the following functionality: SSH daemon, portmap and xinetd daemons for rsh, name service caching daemon, and RSIP (Realm Specific Internet Protocol). Most options describe a default behavior for CCM, but allow you to override that behavior with non-default configurations or environments.

• CCM makes the head node of a job the first processing element (PE[0]) in the associated node list for the reservation. This is a requirement of many ISV applications.

• ccmlogin propagates the exact login node environment settings to the CCM interactive environment via SSH.

• CCM handles application termination and job cleanup more efficiently.

• CCM and installation configuration files include parameters to customize your Workload Management (WLM).

Where can I find more information?
The changes visible to the end user are documented in the ccmrun(1) and ccmlogin(1) man pages. Administrators should refer to Installing and Configuring Cray Linux Environment (CLE) Software and Managing System Software for Cray XE and Cray XT Systems for information on how to install and complete the setup of CCM.

2.2.3 Configuring a Virtual Local Area Network Interface

Who will use this feature?

System administrators and site analysts.

How does this feature help me?

A Virtual Local Area Network (VLAN) enables you to set up a LAN to have different broadcast domains, which often improves performance by reducing unnecessary network traffic. VLANs enable you to set up virtual workgroups, making it easier to move users from one network to another. VLANs may also provide better network security.

What does this feature do?

CLE 3.1 now supports the 802.1Q VLAN standard. A new procedure is provided to help you configure an 802.1Q VLAN interface for your Cray system.

Where can I find more information?

The new procedure is located in Managing System Software for Cray XE and Cray XT Systems (S–2393–3103).

2.2.4 Enhanced Node Placement Scheme for Gemini Systems

Who will use this feature?

All users and administrators on Cray XE systems.

How does this feature help me?

A wide variety of applications perform better than they did on earlier versions of CLE 3.1.

What does this feature do?

The script that orders the nodes for allocation by ALPS improves node placement on a Gemini-based system interconnection network. This script is invoked only when the ALPS_NIDORDER configuration variable is set to -O2. Most sites with Cray XE systems should use this configuration option. The CLE 3.1.UP03 release also includes modifications to the node ordering script to support Cray XE6m and Cray XE5m systems larger than three cabinets.

Where can I find more information?
The apbridge(8) man page and Managing System Software for Cray XE and Cray XT Systems describe ALPS_NIDORDER options.

2.2.5 RSIP Daemon Log File (syslog) Logging

Who will use this feature?

System administrators and site analysts.

How can the rsipd log file help me?

Logging RSIP daemon (rsipd) messages to the syslog makes this information easier for the system administrator to access. The default log level (as described in the rsipd.conf file) is also reduced, so fewer rsipd messages are entered in the syslog, which eliminates considerable "spam".

What does this feature do?

During the RSIP daemon (rsipd) startup process (before disassociating from a controlling terminal), messages are logged to both the syslog and to stderr. This action now includes log file (syslog) logging. While the rsipd daemon is running, you can raise or lower the log level by sending a SIGUSR1 or a SIGUSR2 signal, respectively. The former use of the SIGUSR1 signal, to have the rsipd daemon dump the server state to the default status file /var/run/rsipd.stat, is replaced by the SIGWINCH signal.

Where can I find more information?

The rsipd(8) man page is updated in the CLE 3.1.UP03 release.

2.2.6 New Cray OpenSM Startup Scripts

Note: This feature was originally announced as deferred. It is now included in CLE.

Who will use this feature?

System administrators.

How can this feature help me?

A new package, cray-opensm-init, supports starting two subnet managers (opensm) for dual-port Host Channel Adapter (HCA) direct-attached storage configurations. Site modifications of the base OpenFabrics Enterprise Distribution (OFED) Subnet Manager script (opensmd) are no longer required.

Note: This package is only intended for dual-port direct-attached storage configurations. It is not recommended as a general solution for starting two subnet managers on the same host.

What does this feature do?
The new package provides two startup scripts, opensmd-port1 and opensmd-port2, which are pre-configured with the appropriate log file, cache directory, and temporary directory locations. These scripts can be modified as needed by changing their configuration files: /etc/sysconfig/opensm-port1 and /etc/sysconfig/opensm-port2. The original /etc/init.d/opensmd and /etc/sysconfig/opensm are not changed by the new package. To use this new feature, system administrators must disable opensmd and then enable opensmd-port1 and opensmd-port2 by using the chkconfig command.

Where can I find more information?

For more information, contact your Cray service representative.

2.2.7 New Options for the lustre_control.sh Utility

Who will use this feature?

System administrators.

How can this feature help me?

By using new lustre_control.sh options, you can now mount or unmount Lustre across all compute nodes, specify a list of nodes, and specify a mount point.

What does this feature do?

The -c, -m, and -n options to lustre_control.sh add functionality to the mount_clients and umount_clients actions. The -c option sends mount and umount commands strictly to compute node Lustre clients. However, this option only works if passwordless ssh is enabled on your system. The -m option allows you to specify a mount point other than the MOUNT_POINT variable from the filesystem.fs_defs file. The -n option enables you to specify a list of nodes that receive the mount_clients and umount_clients actions.

Where can I find more information?

For more information, see the lustre_control.sh(8) man page or Managing Lustre for the Cray Linux Environment (CLE).

2.2.8 CLEinstall Program Enhancements

Who will use this feature?

System administrators.

How can these enhancements to CLEinstall help me?
For all types of CLE software installations and upgrades, CLEinstall provides additional verification of configuration information and automation for some types of configuration changes, including new features and some hardware configuration changes.

What does this feature do?

• The CLEinstall program verifies and updates the /etc/hosts file for Cray hostnames and aliases. Host name aliases in /etc/hosts are assigned based on the class name and order of NIDs that are specified by each node_class[idx] parameter.

Note: CLEinstall modifies Cray system entries in /etc/hosts each time you update or upgrade your CLE software.

• For initial installations, CLEinstall creates the /etc/opt/cray/sdb/node_classes file based on the node_class[idx] parameters you specified in CLEinstall.conf. For update or upgrade installations, CLEinstall verifies that they match.

• Each time you upgrade or update CLE software, CLEinstall checks for new service nodes. Any new nodes are initialized in the shared root so that you can use xtopview -n NID to customize the nodes, as needed.

• When you add or remove hardware (service nodes, cabinets, or chassis within cabinets), you can run CLEinstall to modify your software configuration for the changes. Use the --xthwinv option with CLEinstall to apply the new hardware component information to the specified system set and modify other relevant configuration files. You can do this without changing the operating system release level.

Note: This is in addition to functionality that already existed to change CLEinstall.conf parameters without changing the OS level.

• If your system runs CLE 3.1 and has been hardware upgraded from a Cray XT system to a Cray XE system, the CLEinstall program performs a number of additional configuration steps to update some system configuration files for the new hardware.

Where can I find more information?
These changes are documented in the CLEinstall.conf(5) man page or Installing and Configuring Cray Linux Environment (CLE) Software. The procedures for Adding or removing cabinets or chassis within cabinets and Adding or removing a service node in Managing System Software for Cray XE and Cray XT Systems (S–2393–3103) are updated to take advantage of CLEinstall enhancements.

2.2.9 Application Level Placement Scheduler (ALPS) Enhancements

Who will use these ALPS Enhancements?

End users and application developers.

How can the ALPS enhancements help me?

ALPS provides an output environment variable, ALPS_APP_DEPTH, which may assist application programmers when launching multiple programs within a job. The aprun(1) man page language under the MPMD section was modified to clarify that the -m option should be specified in the first program executable segment and that this value is inherited by subsequent programming segments.

What do the ALPS enhancements do?

ALPS_APP_DEPTH can assist application programmers when using MPMD mode. For each programming segment within the ALPS job, there may be varying values for depth (-d). Program executables, or libraries linked against these executables, may make a getenv() call requesting the value of ALPS_APP_DEPTH to determine how many threads were requested locally and modify program behavior appropriately.

Where can I find more information?

The changes are documented in the aprun(1) man page.

2.2.10 DVS bulk_rw Mode and DVS-specific ioctl() Commands

Who will use Cray Data Virtualization Service (Cray DVS) Enhancements?

End users and application developers.

How can the DVS enhancements help me?

DVS provides ioctl commands to user space applications, which can query DVS configuration data for a specific file that is visible within the name space of a DVS mount point. DVS now allows a bulk read/write option, bulk_rw, providing potential performance improvements.
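Returning briefly to the ALPS_APP_DEPTH variable from section 2.2.9, the lookup it describes can be sketched as follows. This is a minimal Python illustration, not Cray code: the helper name requested_depth and the fallback value of 1 (used when the variable is unset or not a positive integer, for example when the program runs outside ALPS) are assumptions made for the example. A compiled application would make the equivalent getenv() call.

```python
import os


def requested_depth(environ=None, default=1):
    """Return the depth (-d) value exported by ALPS for this program segment.

    `default` is an assumed fallback for runs where ALPS_APP_DEPTH is
    missing or does not hold a positive integer.
    """
    env = os.environ if environ is None else environ
    raw = env.get("ALPS_APP_DEPTH")
    if raw is None:
        return default
    try:
        depth = int(raw)
    except ValueError:
        return default
    return depth if depth > 0 else default


if __name__ == "__main__":
    # A library could size its thread pool from this value.
    print("depth requested for this segment: %d" % requested_depth())
```

Passing the environment in as a parameter keeps the helper easy to exercise without launching the program under aprun.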
What do the DVS enhancements do?

DVS ioctl commands are defined in dvs_ioctl.h. To use these DVS-specific ioctl() commands, include dvs_ioctl.h in your source and load the DVS module in your compiling environment. The following commands are available to application users:

DVS_GET_REMOTE_FS_MAGIC
Returns the magic value of the underlying file system.

DVS_GET_FILE_BLK_SIZE
Returns the DVS block size in bytes for a file that is visible on the DVS mount point.

DVS_GET_FILE_STRIPE_WIDTH
Returns the DVS stripe width for a file that is visible on the DVS mount point.

bulk_rw allows DVS to execute read and write operations without temporary data transfer buffers to or from DVS servers. bulk_rw performs RDMA (Remote Direct Memory Access) operations directly to or from file pages in the Linux kernel. There is also an environment variable, DVS_BULK_RW, which overrides the behavior of the -o bulk_rw or -o nobulk_rw mount options as specified in /etc/fstab for compute node images.

Where can I find more information?

The changes are documented in the dvs(5) man page.

2.3 Software Enhancements in CLE 3.1.UP02

2.3.1 DVS POSIX Atomicity

Who will use DVS atomic stripe parallel mode?

System administrators.

How can atomic stripe parallel mode help me?

These enhancements can prevent interleaving of data, which could occur when an application performs I/O on a shared file without using file locking on a DVS stripe parallel mount point.

What does atomic stripe parallel mode do?

Atomic stripe parallel, a Cray DVS mode, adheres to POSIX read/write atomicity rules while still allowing for possible parallelism within a file. It is similar to stripe parallel mode in that the server used to perform the read, write, or metadata operation is selected using an internal hash involving the underlying file or directory inode number and the offset of data into the file relative to the DVS block size.
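To illustrate how selecting a server from the inode number and the block-relative offset spreads one file across several servers, here is a small Python sketch. The actual DVS hash is internal and is not described in this document; the function select_server and its simple modular arithmetic are stand-ins chosen only for illustration, and all parameter names are assumptions.

```python
def select_server(inode, offset, block_size, num_servers):
    """Pick a server index for an I/O request.

    Stand-in for the internal DVS hash: it combines the inode number with
    the index of the DVS block that contains `offset`.
    """
    if block_size <= 0 or num_servers <= 0:
        raise ValueError("block_size and num_servers must be positive")
    if offset < 0:
        raise ValueError("offset must not be negative")
    block_index = offset // block_size  # offset relative to the DVS block size
    return (inode + block_index) % num_servers


if __name__ == "__main__":
    inode, block_size, servers = 1234, 1 << 20, 4
    for offset in (0, 512 * 1024, 1 << 20, 3 << 20):
        server = select_server(inode, offset, block_size, servers)
        print("offset %8d -> server %d" % (offset, server))
```

In this sketch, two requests that fall inside the same block of the same file always map to the same server, while requests to different blocks can land on different servers. That is one plausible way to picture how per-block ordering and parallelism within a file can coexist; it is not a statement about how DVS enforces atomicity internally.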
Where can I find more information?
For information on Cray DVS, see Introduction to Cray Data Virtualization Service. For information on how to set up Cray DVS, see Installing and Configuring Cray Linux Environment (CLE) Software and Managing System Software for Cray XE and Cray XT Systems.

2.3.2 Changes to GPCD (Gemini Performance Counters Device) to Improve Performance when Accessing Memory Mapped Registers Using CrayPat

Who will use this enhancement?
End users.

How can this enhancement help me?
CLE has been modified to decrease MMR (Memory Mapped Register) access time, thus giving a significant performance improvement when using CrayPat. Users should be able to make traces that involve MMR data throughout their program with less overhead than in previous releases. Another direct end user benefit of this improvement is that Gemini network counter event collection can be done more frequently during application execution. The user can now monitor a half-dozen counters without drastically degrading performance. As a reference, when tracing function enter/return, the overhead to count one Gemini event is approximately equivalent to four processor counter events.

What is modified to increase performance?
The Gemini application specific integrated circuit contains MMRs, also known as network performance counters, that are used within CrayPat to capture network performance data during trace experiments. The performance improvement introduced in this update package decreases the access time experienced when using CrayPat to view Gemini MMR data. This is done via a modification to a kernel-level interface, gpcd (Gemini Performance Counters Device); thus, end users do not need to take any further action to take advantage of this benefit.

Where can I find more information?
For more information on Gemini MMRs used as performance counters, see Using the Cray Gemini Hardware Counters.
For more information on CrayPat, see Cray Performance Analysis Tools Release Overview and Installation Guide and Using Cray Performance Analysis Tools.

2.3.3 Repurposed Compute Node Support for PBS Professional

Who will use repurposed compute node support for PBS Professional?
System administrators.

How can this help me?
You can potentially improve performance or maintainability of PBS Professional batch system services by moving the MOM node to a dedicated node instead of sharing resources with other services on an existing service node.

What is repurposed compute node support for PBS?
Cray tests and supports repurposing a compute node as a PBS Professional MOM node. CLE 3.1.UP01 introduced new functionality to repurpose compute nodes as service nodes. For the initial release of this feature, support for repurposing compute nodes as PBS Professional MOM nodes was not available. For more information on repurposed compute nodes, see Section 2.4.2, Repurposed Compute Nodes.

Where can I find more information?
For information on how to set up and configure compute nodes as PBS Professional MOM nodes, see Repurposing Compute Nodes as Service Nodes on Cray XE and Cray XT Systems and PBS Professional documentation available from Altair Engineering, Inc. at http://www.altair.com.

2.3.4 New xtverifydefaults Command

Who will use this command?
System administrators, automatically via the xtopview utility.

How can this command help me?
The xtopview utility automatically invokes a new command, xtverifydefaults, to verify that the default software version for Cray packages is the same on all service nodes. In rare cases, the shared root links to Cray packages can become inconsistent so that not all service nodes point to the same default software version. This situation can be difficult to detect.
By default, the xtopview utility invokes the xtverifydefaults command during exit processing to correct any inconsistent links that may have been generated during the xtopview session. System administrators can also run the command manually from within xtopview.

What does this command do?
The xtverifydefaults command verifies and optionally fixes the default links in the shared root for Cray system software packages. If the version of software a link points to is different from the version that the default view link of the same name points to, the link is considered inconsistent. The xtverifydefaults command flags and optionally fixes inconsistent links. All of the node and class view default links for Cray packages (e.g., those that start with "cray-") are forced to match the default view default links.

Where can I find more information?
See the xtverifydefaults(8) man page.

2.3.5 Core Specialization is NUMA-aware

Who will use NUMA-aware core specialization?
End users.

How can NUMA-aware core specialization help me?
This enhancement may help to "even out" the distribution of the application-generated service processes when the aprun -r core specialization option is greater than 1.

What is NUMA-aware core specialization?
Core specialization may improve application performance by binding system processes and daemons to a set of cores in each processor. Initially, only one specialized core per node was allowed. In CLE 3.1.UP01, this limitation was removed. This feature improves support for core specialization when more than one specialized core is requested. ALPS allocates specialized cores round-robin across the NUMA nodes on a node (starting with the highest-numbered core on the highest-numbered NUMA node), unless the user specifies the aprun -cc cpu_list option. In that case, specialized cores are allocated from the highest-numbered cores, avoiding the cores in cpu_list.
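The allocation order described above can be sketched in a few lines of Python. This is only an illustration of the stated rule, not ALPS code: the function name pick_specialized_cores, the list-of-lists representation of NUMA nodes, and the choice to keep taking the highest remaining core within each NUMA node on later rounds are assumptions.

```python
def pick_specialized_cores(numa_nodes, count, cpu_list=None):
    """Return `count` core IDs to reserve for system services.

    numa_nodes: list of lists of core IDs, one inner list per NUMA node,
                ordered by NUMA node number (hypothetical layout).
    cpu_list:   optional cores requested by the user (aprun -cc cpu_list).
    """
    all_cores = [core for node in numa_nodes for core in node]
    if cpu_list is not None:
        # With -cc cpu_list: take the highest-numbered cores not in cpu_list.
        reserved = set(cpu_list)
        free = sorted((c for c in all_cores if c not in reserved), reverse=True)
        if count > len(free):
            raise ValueError("not enough cores outside cpu_list")
        return free[:count]

    # Otherwise: round-robin across NUMA nodes, starting with the
    # highest-numbered core on the highest-numbered NUMA node.
    queues = [sorted(node, reverse=True) for node in reversed(numa_nodes)]
    picked = []
    while len(picked) < count:
        progressed = False
        for queue in queues:
            if queue and len(picked) < count:
                picked.append(queue.pop(0))
                progressed = True
        if not progressed:
            raise ValueError("not enough cores on this node")
    return picked


# Hypothetical node with two NUMA nodes of six cores each.
layout = [[0, 1, 2, 3, 4, 5], [6, 7, 8, 9, 10, 11]]
round_robin = pick_specialized_cores(layout, 3)
avoiding = pick_specialized_cores(layout, 2, cpu_list=[8, 9, 10, 11])
```

With this hypothetical layout, three specialized cores are taken alternately from the two NUMA nodes rather than all from one, which is the "evening out" effect the section describes.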
Where can I find more information?
For more information on how to use core specialization, see the aprun(1) man page and Workload Management and Application Placement for the Cray Linux Environment.

2.4 Software Enhancements in CLE 3.1.UP01

2.4.1 Gemini Network Resiliency and Congestion Management

Who will use these enhancements?
End users, system administrators, and site analysts.

How can these enhancements help me?
Network congestion is inherent in any High Performance Computing (HPC) high speed interconnect. Cray's software and hardware are designed to minimize the effects of network congestion. This release includes system software enhancements, administrative recommendations, and user documentation, which together help to prevent network congestion or to protect user jobs and data if congestion occurs.

What are the network resiliency and congestion management enhancements?
The CLE 3.1.UP01 and SMW 5.1.UP01 releases provide the following software enhancements and documentation to improve network resiliency and manage network congestion on Cray XE systems with a Gemini based system interconnection network.

• New user documentation describes some cases where congestion can occur and how to modify PGAS and SHMEM applications to minimize the potential for inducing network congestion.
• The Hardware Supervisory System (HSS) includes new software that monitors the Gemini network and, if necessary, throttles traffic to alleviate the complications due to congestion.
• System administrators can use new information about Gemini High Speed Network (HSN) routing to identify and potentially modify unrouteable network configurations.
• To improve network resiliency on the HSN, system software on the L0 automatically brings up lanes that have been downed.
• System software changes include selective link speed increases and improvements to the HSN routing algorithm (zone routing).
These adjustments can improve global HSN communication performance by as much as 15 to 25 percent. For more information, contact your Cray service representative.

2.4.2 Repurposed Compute Nodes

Who will use this feature?
System administrators and site analysts.

How can repurposed compute nodes help me?
You can improve performance or maintainability of some Cray system services by moving the service to a dedicated node instead of sharing resources with other services on an existing service node. You do not need to make hardware configuration changes (swapping modules) to configure additional nodes with a service node role; you can implement a flexible approach to service node configuration on an as-needed basis. The addition of service nodes can increase throughput for some functions. For example, multiple DVS servers working in parallel can increase I/O throughput; multiple MOM nodes can allow additional jobs to run in parallel. For some services, configuring additional service nodes can increase redundancy and minimize the negative impact of a failed service node.

What does this feature do?
The CLE 3.1.UP01 release includes new functionality that supports booting compute node hardware with service node images. By using this functionality to change the role of a compute node, you can add additional service nodes for services that do not require external connectivity. Some services on Cray systems have resource requirements or limitations (for example, memory, processing power or response time) that you can address by configuring a dedicated service node, such as a Cray Data Virtualization Service (Cray DVS) node or a batch system management (MOM) node. On Cray systems, service I/O node hardware (on a service blade) is equipped with Peripheral Component Interconnect (PCI) protocol card slots to support external devices. Compute node hardware (on a compute blade) does not have PCI slots.
For services that do not require external connectivity, you can configure the service to run on a single, dedicated compute node and avoid using traditional service I/O node hardware. When you configure a node on a compute blade to boot a service node image and perform a service node role, that node is referred to as a repurposed compute node.

Cray tests and supports the following services on repurposed compute nodes: DVS servers, Moab with TORQUE MOM nodes, and PBS Professional MOM nodes.

Note: Support for repurposing a compute node as a PBS Professional MOM node was originally deferred but is now supported.

Where can I find more information?
A new Cray publication, Repurposing Compute Nodes as Service Nodes on Cray XE and Cray XT Systems (S–0029–3101), describes the feature and provides implementation procedures.

Note: The information in this new document supersedes procedures to manually configure compute nodes as compute node root servers for DSL (using DVS) in the following documentation:

• Installing and Configuring Cray Linux Environment (CLE) Software (S–2444–31)
• Managing System Software for Cray XE and Cray XT Systems (S–2393–31)

2.4.3 Lustre Upgraded to Version 1.8.2

Who will use this feature?
All users and administrators on Cray systems that use the Lustre file system.

How can this new version of Lustre help me?
A large number of existing problems are fixed with this new version of Lustre.

What does upgrading to Lustre 1.8.2 do?
When you update a CLE 3.0 or CLE 3.1 system to CLE 3.1.UP01, your Lustre file system software is automatically upgraded from version 1.8.1 to version 1.8.2. This change is transparent to both users and administrators.

Where can I find more information?
Refer to http://wiki.lustre.org. For information about upgrades to Lustre 1.8 from Lustre 1.6, see Cray Linux Environment (CLE) 3.1 Software Release Overview (S–2425–31).
2.4.4 Application Completion Reporting

Who will use this feature?
System administrators and site analysts.

How can ACR help me?
Application Completion Reporting (ACR) provides a way to track, manage, and report application data, which gives system administrators an effortless way to see how system resources are being allocated. Data provided by ACR reports can make it easier to see where system resources are being used so that recommendations can be made for application improvement, different allocation schemes, or use of different time slices.

What does ACR do?
ACR extends the data persistence schema from Cray Management Services (CMS) and provides three commands to examine and report application data.

• Use the mzjob command to examine reservation data; you can specify the format of the search output so that you can use it in other programs or scripts.
• Use the mzreport command to examine application completion status information; the command supports claims (applications), jobs (reservations), or user completion status.
• Use mz2attr to read and display node attributes.

ACR requires CLE 3.1.UP01 or later and SMW 5.1.UP01 or later; support for this feature was deferred in earlier SMW and CLE releases.

Where can I find more information?
See Using Cray Management Services (CMS) (S–2484–5101) and the mzjob(8), mzreport(8) and mz2attr(8) man pages.

2.4.5 Topology and NID Ordering on Cray XE Systems

Who will use this feature?
End users, site analysts, and system administrators.

How can this feature help me?
The Application Level Placement Scheduler (ALPS) "xyz-by2" node ordering feature provides performance improvements for larger applications that run on Cray XE systems with no user interaction. System administrators can implement the new ordering with only a single configuration option change to the /etc/sysconfig/alps file on the boot node.
What does this feature do?
The node ordering functionality directs ALPS to use a node ordering option that results in better application performance for larger applications run on Cray XE systems. Cray's first method for application placement on nodes along the interconnect torus was to use the node ID (NID) number list and place applications on nodes in this numerical order. Later, the simple "xyz" placement method was added, which reordered the sequence of NID numbers used in assigning placements into a natural 3D communications sequence (that is, following the torus, not the chassis and cabinet layout). The original numerical ordering method provides good "packing" for small numbers of nodes, but for large applications, there are "clumps" of nodes that spread across the machine, which causes inter-application traffic contention on the interconnect. With the "xyz" placement, available with previous releases of CLE, the applications are better isolated, but smaller applications are spread more thinly, losing some of the advantage. New with this release, the "xyz-by2" NID reordering method combines the incidental small node packing of the simple NID number method with the inter-application interaction reduction of the "xyz" method.

To implement this node ordering, system administrators must set the ALPS_NIDORDER="-O2" parameter in the /etc/sysconfig/alps configuration file and make identical changes to the file of the same name on the shared root. If configured, a custom ordering is created when ALPS is started on the boot node, triggered in the /etc/init.d/alps script file, and used by the apbridge component of ALPS. You should change the ALPS_NIDORDER parameter only at system reboot time.

Previously, this feature was supplied to selected customers via a software patch. While the internal mechanism used to implement this feature remains the same, the configuration method has changed.
The preferred method is to set the ALPS_NIDORDER parameter in the /etc/sysconfig/alps file. The configuration instructions provided with the patch are obsolete and should no longer be used.

Where can I find more information?
The apbridge(8) man page was updated for the 3.1.UP01 release.

Compatibilities and Differences [3]

This chapter compares Cray Linux Environment (CLE) 4.0 with CLE 3.1 and lists compatibility issues and functionality changes.

Note: The CLE 4.0 Limitations document describes temporary limitations of the release. The CLE 4.0 Errata document describes any installation and configuration changes identified after documentation for this release was packaged; it also includes a list of customer-filed critical and urgent bug reports that are closed with this release. These documents are included with the release package and are also available from your Cray representative.

3.1 Binary Compatibility

Binary compatibility is maintained from CLE 3.1 to CLE 4.0 for dynamically linked binaries. Applications targeted for a Cray XT system that has a SeaStar based system interconnection network will not run on a Cray XE system that has a Gemini based system interconnection network.

For applications that use static libraries, some CLE 3.1 binaries will fail when run with CLE 4.0, due to differences between SLES 11 and SLES 11 SP1. Cray tested a significant number of applications that were compiled and statically linked under CLE 3.1 and successfully ran them under CLE 4.0. However, this does not guarantee that all statically linked applications from CLE 3.1 are compatible with CLE 4.0.

Note: While relinking or recompiling may not be required, doing so may result in improved application performance. It is recommended that you recompile applications that are written in the C++ programming language or Partitioned Global Address Space (PGAS) programming models when you migrate applications from previous release versions.
3.2 Changes to the Application Level Placement Scheduler (ALPS)

3.2.1 Configuration of ALPS Shared Directory

Since /etc/sysconfig is for the exclusive use of init.d scripts, ALPS runtime components no longer parse files in this directory for the location of the ALPS shared directory. This setting, sharedDir, is located in the apsched section of the /etc/alps.conf configuration file on the shared root. This change is reflected in the apsched(8) man page.

3.2.2 Extraction of Some 3rd Party Software from ALPS

As part of a standardization process, some software has been extracted from the ALPS shared libraries, and this may affect consumers of these libraries. Developers who use ALPS shared libraries should be aware that:

• ALPS RPMs no longer include the library xmlrpc.
• ALPS RPMs now depend on the library xmlrpc-epi, which is found in the new RPM cray-libxmlrpc-epi0.
• The associated development RPM for xmlrpc-epi is called cray-libxmlrpc-epi-devel.
• Any reference in an application's build system to xmlrpc or -lxmlrpc should be changed to xmlrpc-epi or -lxmlrpc-epi.

Note: As a part of this standardization process, ALPS will be moving from /usr to /opt/cray/alps in a future release. Additionally, the ALPS configuration file /etc/alps.conf will be relocated to /etc/opt/cray/alps/alps.conf. ALPS locations remain unchanged with the CLE 4.0 release.

3.3 Commands Removed from the Release

The xtps, xtwho, and xtkill commands were previously deprecated and have been removed from the CLE release. Comparable functionality for these commands is available using pdsh and the equivalent Linux command. For more information, see the pdsh(1) man page.

The xtuname command was previously deprecated and has been removed from the CLE release. Comparable functionality for xtuname is provided by the xtnce and rca-helper commands.
rca-helper may require the user to load an rca module (e.g. module load rca), and xtnce must be run from the boot node. The equivalent command options are as follows:

-N (node)          % rca-helper -i
-C (class)         % xtnce `rca-helper -i`
-S (boot string)   % cat /proc/cmdline

3.4 Software Packages/Releases That Must be Reinstalled

If you are migrating from CLE 3.1, the following programming tools must be reinstalled to ensure compatibility with CLE 4.0. This is due to changes in CLE 4.0 included with the upgrade to SLES 11 SP1.

• STAT 1.0 and later versions
• PGI Compilers
• GCC compilers
• Intel compilers
• Cray Performance Measurement and Analysis Tools (CPMAT) 5.2.1 or later

Note: Prior to migrating CLE you must uninstall previous performance tools components and Pathscale compilers. After you finish your CLE 4.0 installation, you should reinstall the most current Pathscale Compiler and CPMAT.

3.5 lustre_control.sh -c mount/unmount No Longer Requires Passwordless SSH to Mount/Unmount Clients

Previously (CLE 3.1.UP03), the -c option of the mount_clients and umount_clients lustre_control.sh actions required passwordless ssh to execute properly. This is no longer necessary.

3.6 Installation and Configuration Changed Functionality for System Administrators

In addition to the feature information described in Chapter 2, Software Enhancements, system administrators should also note the following compatibility issues and differences that are associated with the CLE 4.0 release.

For detailed initial and update installation procedures, see Installing and Configuring Cray Linux Environment (CLE) Software. For temporary limitations of this release and changes identified after the documentation for this release was packaged, see the CLE 4.0 Limitations document provided with the release package. Additional information may be included in the CLE 4.0 README document provided with update packages.
3.6.1 Supported Upgrade Path

The CLE 4.0.UP00 release supports initial system installations and migration/upgrade installations from CLE 3.1.

Note: You must be running release version CLE 3.1 or later on your Cray XE system to migrate to the CLE 4.0 release.

3.6.2 System Management Workstation (SMW) Upgrade Requirements

You must install or upgrade the System Management Workstation (SMW) to the SMW 6.0 release before you install or migrate and upgrade to CLE 4.0. For information about the content of the SMW 6.0 release, see the SMW README document included in the SMW release package.

3.6.3 Installation Time Required

The time required to install the CLE 4.0 release depends on a large number of site-specific variables. As with past releases, much of the installation or upgrade requires a dedicated system. The time required to install CLE 4.0 is comparable to installation times experienced with CLE 3.1. Migrating from CLE 3.1 to CLE 4.0 takes approximately the same time as an initial installation. This is due to the migration of the base operating system from SLES 11 to SLES 11 SP1.

3.6.4 Changes to the CLEinstall.conf Installation Configuration File

The CLEinstall.conf configuration file is modified to allow system administrators installing CLE to tune the USE_KERNEL_NFSD_NUMBER parameter. This may be necessary at some sites to prevent RPC timeouts on service nodes during boot. However, in most cases, the default value used by the CRAYCLEinstall.sh script is appropriate. For more information, please contact your Cray service representative. Any changes to the CLEinstall.conf file are also reflected in the CLEinstall.conf(5) man page.

Documentation [4]

This chapter describes the documentation that supports the Cray Linux Environment (CLE) 4.0 release.

4.1 Accessing Product Documentation

With each software release, Cray provides books and man pages, and in some cases, third-party documentation.
These documents are provided in the following ways:

CrayPort
CrayPort is the external Cray website for registered users that offers documentation for each product. CrayPort has portal pages for each product that contain links to all of the documents associated with that product. CrayPort enables you to quickly access and search Cray books, man pages, and in some cases, third-party documentation. You access CrayPort by using the following URL: http://crayport.cray.com

CrayDoc
CrayDoc is the Cray documentation delivery system. CrayDoc enables you to quickly access and search Cray books, man pages, and in some cases, third-party documentation. Access the HTML and PDF documentation via CrayDoc at the following locations.
• The local network location defined by your system administrator
• The CrayDoc public website: http://docs.cray.com

Man pages
Man pages are textual help files available from the command line on Cray machines. To access man pages, enter the man command followed by the name of the man page. For more information about man pages, see the man(1) man page by entering:
% man man

Third-party documentation
Third-party documentation that is not provided through CrayPort or CrayDoc is included with the third-party product.

4.2 Cray-developed Books Provided with This Release

The books provided with this release are listed in Table 2, which also indicates whether each book was updated. Books are provided in HTML and PDF formats.

Table 2. Books Provided with This Release

Book Title | Most Recent Document | Updated
Cray Linux Environment (CLE) Software Release Overview (this document) | S–2425–40 | Yes
Installing and Configuring Cray Linux Environment (CLE) Software | S–2444–40 | Yes
Managing System Software for Cray XE and Cray XT Systems | S–2393–3103 | No
Managing Lustre for the Cray Linux Environment (CLE) | S–0010–40 | Yes
Introduction to Cray Data Virtualization Service | S–0005–3103 | No
Writing a Node Health Checker (NHC) Plugin Test | S–0023–40 | Yes
Workload Management and Application Placement for the Cray Linux Environment | S–2496–3103 | No
Using the GNI and DMAPP APIs | S–2446–3103 | No
CrayDoc Installation and Administration Guide | S–2340–411 | No
Repurposing Compute Nodes as Service Nodes on Cray XE and Cray XT Systems | S–0029–3101 | No

4.2.1 Additional Cray-developed Release Documents

Two additional documents are provided with the CLE 4.0 release package. These documents are also available from your Cray representative.

CLE 4.0 Limitations
Describes temporary limitations of the release.

CLE 4.0 Errata
Describes any installation and configuration changes that were identified after documentation for this release was packaged; also includes a list of customer-filed critical and urgent bug reports closed with this release.

You should also contact your Cray representative about CLE-related information addressed in Field Notices (FNs).

4.3 Third-party Books Provided with This Release

The CLE 4.0 release package includes the following book from Oracle: Lustre Operations Manual (S–6540–1813)

4.4 Changes to Man Pages

Updated Linux man pages are included with the CLE 4.0 release. For complete information regarding changes to specific commands due to the migration to SLES 11 SP1, see the associated man pages. To access Linux man pages, use the man command on a login node.
4.4.1 Removed Cray Man Pages

The following Cray man pages were removed with this release:

• xtps(1)
• xtuname(1)
• xtkill(1)
• xtwho(1)

4.4.2 Changed Cray Man Pages in CLE 4.0

Most Cray-specific man pages have been updated to reflect changes in file locations. Source for Cray-specific man pages is included in the associated RPM and is installed in a new location, /opt/cray/share/man. The following Cray man pages have additional updates and enhancements:

• apsched(8): Reflects changes to options that specify the location of the ALPS shared directory.
• intro_NHC(8), xtcheckhealth(8), and xtcleanup_after(8): Reflect the NHC enhancements for this release.
• lustre_control.sh(1): Removes the requirement for passwordless sshd sessions when invoked with the -c option.

4.5 Other Related Documents Available

The following publications contain additional information that may be helpful in setting up your Cray system; they are not provided with this release but are supplied with other products purchased from Cray. You can access these publications from the CrayPort website. You can also order the printed form of release overviews and installation guides from Cray.

Table 3. Other Related Documents Available

Book Title | Number
Installing Cray System Management Workstation (SMW) Software | S–2480–60
Using Cray Management Services (CMS) | S–2484–5101
Using and Configuring System Environment Data Collections (SEDC) | S–2491–60
Cray Application Developer's Environment Installation Guide | S–2465
Cray Compiling Environment Release Overview and Installation Guide | S–5212

4.6 Additional Documentation Resources

Table 4 lists additional resources for obtaining documentation not included with this release package.

Table 4. Additional Documentation Resources

Product | Documentation Source
Linux | Documentation for SLES and Linux is at http://www.novell.com/linux and documentation for the Linux Documentation Project is at http://www.tldp.org
Lustre | Additional Lustre documentation is available at http://wiki.lustre.org/index.php/Lustre_Documentation and http://www.oracle.com/us/products/servers-storage/storage/storage-software
MySQL | MySQL documentation is available at http://www.mysql.com/documentation
RPM | RPM documentation is available at http://www.rpm.org
PBS Professional | Documentation for the PBS Professional work load manager system software is available from Altair Engineering, Inc. at http://www.altair.com
Moab with TORQUE | Documentation for Moab workload manager and TORQUE resource manager software is available from Adaptive Computing: http://www.adaptivecomputing.com/
Platform LSF | Documentation for Platform LSF software is available from Platform Computing Corporation at http://www.platform.com/
Berkeley Lab Checkpoint/Restart (BLCR) | Documentation for BLCR is available from Berkeley Lab at http://upc-bugs.lbl.gov/blcr/doc/html/

Release Contents [5]

5.1 Hardware Requirements

The Cray Linux Environment (CLE) 4.0 release supports new or initial installations on Cray XE6, Cray XE6m, Cray XE5, and Cray XE5m systems. Upgrade installations from CLE 3.1 are supported for Cray XE6, Cray XE6m, Cray XE5, and Cray XE5m systems.

5.2 Software Requirements

The following sections list the required or recommended release levels for products that run on Cray systems but are released separately from CLE 4.0.

5.2.1 Release Level Requirements for Other Cray Software Products

The product versions listed in Table 5 are the minimum release level required for verified compatibility with CLE 4.0. Support for these products is provided in the form of updates to the latest released version only.
Unless otherwise noted in the associated release documentation, Cray recommends that you continue to upgrade these releases as updates become available.

Table 5. Minimum Release Level Requirements for Other Software Products with CLE 4.0

Product | Minimum Release Level | Release Information
System Management Workstation (SMW) | Release 6.0 or later. | SMW README
Cray Application Developer's Environment (CADE) | Release 6.0 or later. | Cray Application Developer's Environment Installation Guide (S–2465)
Cray Performance, Measurement, and Analysis Tools (CPMAT) | Release 5.2.1 or later is required. | Cray Performance Analysis Tools Release Overview and Installation Guide (S–2474)
Cray Compiling Environment (CCE) | Release 7.4.0 or later. | Cray Compiling Environment Release Overview and Installation Guide (S–5212)

5.2.2 Third-party Software Requirements

Third-party compiler products are available for Cray systems as noted in Table 6. The release level indicated has been tested with CLE 4.0. Cray recommends that you continue to upgrade these products as updates become available.

Table 6. Minimum Release Level Requirements for Third-party Compilers with CLE 4.0

Product | Minimum Release Level | Release/Ordering Information
PGI Compiler | Release 11.6 or later. | Contact your Cray representative for licensing/purchasing information. For product information see The Portland Group, Inc.: http://www.pgroup.com
PathScale Compiler | 4.0.9 or later. | Contact your Cray representative for licensing/purchasing information. http://www.pathscale.com
Intel Compiler | Release 12.0.174 or later. | Intel Corporation. See: http://software.intel.com
GCC (GNU Compiler Collection) | Release 4.5.3 or later. | The GNU Project. See: http://gcc.gnu.org

Batch system software products are available for Cray systems as indicated in Table 7.
Information regarding supported and certified batch system software release levels is available on the CrayPort website at http://crayport.cray.com. Click on 3rd Party Batch SW in the menu bar.

Table 7. Third-party Batch System Software Products Available for Cray Systems

• Moab and TORQUE
  Minimum release level: Moab Version 5.3.4 or later. TORQUE 2.3.4 or later.
  Release/ordering information: Contact your Cray representative for licensing/purchasing information. For product information see Adaptive Computing: http://www.adaptivecomputing.com/
• PBS Professional
  Minimum release level: Release 10.2 or later.
  Release/ordering information: Contact your Cray representative for licensing/purchasing information. For product information see Altair Engineering, Inc.: http://www.altair.com/
• Platform LSF
  Minimum release level: Release 7.0.4
  Release/ordering information: Contact your Cray representative for version/licensing/purchasing information. For product information contact Platform Computing Corporation. See: http://www.platform.com/

5.3 Supported Upgrade Path

The CLE 4.0.UP00 release supports new system installations and migration installations from CLE 3.1 or its update packages.

Note: You must be running release version CLE 3.1 or later on your Cray XE system to migrate to the CLE 4.0 release. The System Management Workstation (SMW) must be running the SMW 6.0 release before you install the CLE 4.0 operating system release package.
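
The migration prerequisites in this section amount to two version comparisons. The following is a minimal sketch, assuming that release levels are written as dotted numeric strings. The names parse_release and can_migrate_to_cle40 are hypothetical and are not part of any Cray tool, and the sketch treats the SMW requirement as "6.0 or later", following Table 5.

```python
def parse_release(text):
    """Turn a dotted release string such as "3.1" into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


# Minimum levels taken from the Supported Upgrade Path section.
# Treating SMW 6.0 as a lower bound is an assumption based on Table 5.
MIN_CLE = parse_release("3.1")
MIN_SMW = parse_release("6.0")


def can_migrate_to_cle40(cle_release, smw_release):
    """Return True when both migration prerequisites are met (hypothetical helper)."""
    return (parse_release(cle_release) >= MIN_CLE
            and parse_release(smw_release) >= MIN_SMW)


if __name__ == "__main__":
    print(can_migrate_to_cle40("3.1", "6.0"))
    print(can_migrate_to_cle40("3.0", "6.0"))
```

Comparing tuples of integers rather than raw strings keeps a release such as 3.10 ordered after 3.9.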
5.4 Contents of the Release Package

The release package includes:

• All necessary RPMs and installation utilities for the components listed in CLE 4.0 Software Components on page 43
• CrayDoc software suite and the documentation, described in Chapter 4, Documentation on page 37
• A printed copy of this release overview
• A printed copy of the Installing and Configuring Cray Linux Environment (CLE) Software
• A printed copy of the CLE 4.0 Limitations
• A printed copy of the CLE 4.0 Errata
• A printed copy of the CLE 4.0 README

5.4.1 CLE 4.0 Software Components

The CLE 4.0 release includes, but is not limited to, the following system software products:

• Cray's customized version of the SLES 11 operating system with a Linux 2.6.32 kernel
• CNL compute node operating system
• Lustre file system (Version 1.8.4) from Oracle
• Application Level Placement Scheduler (ALPS)
• Cray Data Virtualization Service (Cray DVS)
• Checkpoint/Restart (CPR)
• Cluster Compatibility Mode (CCM)
• Comprehensive System Accounting (CSA)
• Cray Audit
• Dynamic Shared Objects and Libraries (DSL)
• Linux ldump and lcrash Utilities
• MySQL Pro
• Node Health Checker (NHC)
• OpenFabrics InfiniBand
• Realm-Specific Internet Protocol (RSIP)

5.5 Licensing

The CLE release is covered under a software license agreement for Cray software. Upgrades to this product are provided only when a software support agreement for this Cray software is in place.

Cray licenses the following as separate products for Cray systems under a Cray license agreement:

• Cray XE OS binary (which provides rights to the CLE operating system and its components)
  Note: Source Code Option: The Cray XE OS license for Cray XE systems is binary by default. Certain U.S. customers may be eligible to obtain a buildable OS source license on Cray XE systems for an additional fee. For more information regarding source code, contact your sales representative.
• Lustre Parallel File System

For more information about licensing and pricing, contact your Cray sales representative, or send e-mail to crayinfo@cray.com.

CrayDoc Glossary

blade
Product: Cray XMT

1) A field-replaceable physical entity. A Cray XMT service blade consists of AMD Opteron sockets, memory, Cray SeaStar chips, PCI-X or PCIe cards, and a blade control processor. A Cray XMT compute blade consists of Threadstorm processors, memory, Cray SeaStar chips, and a blade control processor. 2) From a system management perspective, a logical grouping of nodes and blade control processor that monitors the nodes on that blade.

blade
Product: Cray XE series, Cray XT series

1) A field-replaceable physical entity. A service blade consists of AMD Opteron sockets, memory, Cray network application-specific integrated circuit (ASIC) chips, PCI cards, and a blade control processor. A compute blade consists of AMD Opteron sockets, memory, Cray network application-specific integrated circuit (ASIC) chips, and a blade control processor. 2) From a system management perspective, a logical grouping of nodes and blade control processor that monitors the nodes on that blade.

blade control processor
Product: Cray X2, Cray XMT, Cray XT series, Cray XE series

A microprocessor on a blade that communicates with a cabinet control processor through the HSS network to monitor and control the nodes on the blade. See also blade, L0 controller, Hardware Supervisory System (HSS).

compute blade
Product: Cray XT3

See blade.

Last changed: 10-20-2011

assign(3f)

NAME
    ASSIGN, ASNUNIT, ASNFILE, ASNRM - Provides library interface to assign processing

SYNOPSIS
    CALL ASSIGN(cmd, ier)
    CALL ASNUNIT(iunit, astring, ier)
    CALL ASNFILE(fname, astring, ier)
    CALL ASNRM(ier)

IMPLEMENTATION
    Cray Linux Environment (CLE)

DESCRIPTION
    ASSIGN provides an interface to assign(1) processing from Fortran. ASNUNIT and ASNFILE assign attributes to units and files, respectively.
    As with the assign(1) command, these attributes are examined only during OPEN processing. Setting or changing these attributes does not have an effect on Fortran units or files that are already open. ASNRM removes all entries currently in the assign environment.

    All arguments must be of default kind unless documented otherwise. The default kind is KIND=4 for integer, real, complex, and logical arguments.

    These routines support the following arguments:

    cmd
        A Fortran character variable containing a complete assign(1) command in the format also acceptable to ishell(3f). The -V option cannot be processed by the ASSIGN routine.
    ier
        A KIND=4 integer variable that is assigned the exit status on return. 0 indicates a normal return; >0 indicates a specific error status.
    iunit
        A KIND=4 integer variable or constant containing the unit number to which attributes are assigned.
    astring
        A Fortran character variable containing any attribute options and option values that could be passed to assign(1). Control options -I, -O, and -R can also be passed.
    fname
        A character variable or constant containing the file name to which attributes are assigned.

    Users are encouraged to use the ASSIGN library routines rather than a shell command for the assign command.

EXAMPLES
    Example 1. The following is equivalent to assign -s unblocked f:file

        INTEGER(KIND=4) IER
        CALL ASSIGN('assign -s unblocked f:file', IER)

    Example 2. The following has the same effect as assign -I -n 2 u:99

        INTEGER(KIND=4) IUN, IER
        IUN = 99
        CALL ASNUNIT(IUN, '-I -n 2', IER)

SEE ALSO
    asnctl(3f), asnqfile(3f), asnqunit(3f)
    assign(1)

Cray XT™ System Overview

S–2423–22

© 2004–2009 Cray Inc. All Rights Reserved. This document or parts thereof may not be reproduced in any form unless permitted by contract or by written permission of Cray Inc.

U.S. GOVERNMENT RESTRICTED RIGHTS NOTICE

The Computer Software is delivered as "Commercial Computer Software" as defined in DFARS 48 CFR 252.227-7014.
All Computer Software and Computer Software Documentation acquired by or for the U.S. Government is provided with Restricted Rights. Use, duplication or disclosure by the U.S. Government is subject to the restrictions described in FAR 48 CFR 52.227-14 or DFARS 48 CFR 252.227-7014, as applicable.

Technical Data acquired by or for the U.S. Government, if any, is provided with Limited Rights. Use, duplication or disclosure by the U.S. Government is subject to the restrictions described in FAR 48 CFR 52.227-14 or DFARS 48 CFR 252.227-7013, as applicable.

Cray, LibSci, and UNICOS are federally registered trademarks and Active Manager, Cray Apprentice2, Cray Apprentice2 Desktop, Cray C++ Compiling System, Cray CX1, Cray Fortran Compiler, Cray Linux Environment, Cray SeaStar, Cray SeaStar2, Cray SeaStar2+, Cray SHMEM, Cray Threadstorm, Cray X1, Cray X1E, Cray X2, Cray XD1, Cray XMT, Cray XR1, Cray XT, Cray XT3, Cray XT4, Cray XT5, Cray XT5h, Cray XT5m, CrayDoc, CrayPort, CRInform, ECOphlex, Libsci, NodeKARE, RapidArray, UNICOS/lc, UNICOS/mk, and UNICOS/mp are trademarks of Cray Inc.

AMD is a trademark of Advanced Micro Devices, Inc. Copyrighted works of Sandia National Laboratories include: Catamount/QK, Compute Processor Allocator (CPA), and xtshowmesh. GCC is a trademark of the Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. Linux is a trademark of Linus Torvalds. Lustre was developed and is maintained by Cluster File Systems, Inc. under the GNU General Public License. MySQL is a trademark of MySQL AB. Opteron is a trademark of Advanced Micro Devices, Inc. PBS Pro and PBS Professional are trademarks of Altair Grid Technologies. SUSE is a trademark of SUSE LINUX Products GmbH, a Novell business. The Portland Group and PGI are trademarks of The Portland Group Compiler Technology, STMicroelectronics, Inc.
TotalView is a trademark of TotalView Technologies, LLC. All other trademarks are the property of their respective owners.

Version 1.0 (published December 2004): Draft documentation to support Cray XT3 early-production systems.
Version 1.1 (published June 2005): Supports Cray XT3 systems running the Cray XT3 Programming Environment 1.1, System Management Workstation (SMW) 1.1, and UNICOS/lc 1.1 releases.
Version 1.2 (published August 2005): Supports Cray XT3 systems running the Cray XT3 Programming Environment 1.2, System Management Workstation (SMW) 1.2, and UNICOS/lc 1.2 releases.
Version 1.3 (published November 2005): Supports Cray XT3 systems running the Cray XT3 Programming Environment 1.3, System Management Workstation (SMW) 1.3, and UNICOS/lc 1.3 releases.
Version 1.4 (published April 2006): Supports Cray XT3 systems running the Cray XT3 Programming Environment 1.4, System Management Workstation (SMW) 1.4, and UNICOS/lc 1.4 releases.
Version 1.5 (published October 2006): Supports general availability (GA) release of Cray XT systems running the Cray XT Programming Environment 1.5, UNICOS/lc 1.5, and System Management Workstation 1.5 releases.
Version 2.0 (published October 2007): Supports general availability (GA) release of Cray XT systems running the Cray XT Programming Environment 2.0, UNICOS/lc 2.0, and System Management Workstation 3.0 releases.
Version 2.1 (published November 2008): Supports general availability (GA) release of Cray XT systems running the Cray XT Programming Environment, Cray Linux Environment (CLE) 2.1, and System Management Workstation 3.1 releases.
Version 2.2 (published July 2009): Supports general availability (GA) release of Cray XT systems running the Cray XT Programming Environment, Cray Linux Environment (CLE) 2.2, and System Management Workstation 4.0 releases.

Contents

Introduction [1]
  1.1 Cray XT Features
  1.2 Related Publications
    1.2.1 Publications for Application Developers
    1.2.2 Publications for System Administrators
Hardware Overview [2]
  2.1 Basic Hardware Components
    2.1.1 AMD Opteron Processor
    2.1.2 DIMM Memory
    2.1.3 Cray SeaStar Chip
    2.1.4 System Interconnection Network
    2.1.5 RAID Disk Storage Subsystems
  2.2 Nodes
    2.2.1 Compute Nodes
    2.2.2 Service Nodes
  2.3 Blades, Chassis, and Cabinets
    2.3.1 Blades
    2.3.2 Chassis and Cabinets
Software Overview [3]
  3.1 CLE Operating System
    3.1.1 CNL
  3.2 Lustre File System
  3.3 Cray Data Virtualization Service (Cray DVS)
  3.4 Development Environment
    3.4.1 User Environment
    3.4.2 Compiling Programs
      3.4.2.1 Cray Compiler Commands
      3.4.2.2 PGI Compiler Commands
      3.4.2.3 GCC Compiler Commands
      3.4.2.4 PathScale Compiler Commands
    3.4.3 Using Library Functions
    3.4.4 Linking Applications
    3.4.5 Running Applications
      3.4.5.1 Running Applications Interactively
      3.4.5.2 Running Batch Jobs
    3.4.6 Debugging Applications
    3.4.7 Monitoring and Managing Applications
    3.4.8 Measuring Performance
      3.4.8.1 Performance API
      3.4.8.2 CrayPat
      3.4.8.3 Cray Apprentice2
    3.4.9 Optimizing Applications
    3.4.10 Using Data Visualization Tools
  3.5 System Administration
    3.5.1 System Management Workstation
    3.5.2 Shared-root File System
    3.5.3 Lustre File System Administration
    3.5.4 Configuration and Source Files
    3.5.5 System Log
    3.5.6 CMS Log Manager
    3.5.7 Service Database
    3.5.8 System Accounting
    3.5.9 System Activity Reports
Cray Hardware Supervisory System (HSS) [4]
  4.1 HSS Hardware
    4.1.1 HSS Network
    4.1.2 System Management Workstation
    4.1.3 Hardware Controllers
  4.2 HSS Software
    4.2.1 Software Monitors
    4.2.2 HSS Administrator Interfaces
  4.3 HSS Actions
    4.3.1 System Startup and Shutdown
    4.3.2 Event Probing
    4.3.3 Event Logging
    4.3.4 Event Handling
Glossary

Figures

Figure 1. Cray XT5 System
Figure 2. Single-core Processor
Figure 3. Dual-core Processor
Figure 4. Quad-core Processor
Figure 5. Hex-core Processor
Figure 6. Cray SeaStar Chip
Figure 7. Cray XT3 and Cray XT4 Compute Nodes
Figure 8. Cray XT5 Compute Node
Figure 9. Service Node
Figure 10. Chassis and Cabinet (front view)
Figure 11. Launching Applications Interactively
Figure 12. Running Batch Jobs
Figure 13. HSS Components

Introduction [1]

This document provides an overview of Cray XT systems. The intended audiences are application developers and system administrators. Prerequisite knowledge is a familiarity with the concepts of high-performance computing and the architecture of parallel processing systems.

The UNICOS/lc operating system was renamed Cray Linux Environment (CLE). The transition to the new CLE name began in version 2.1 and is complete with this release.
Note: Functionality marked as deferred in this documentation is planned to be implemented in a later release.

1.1 Cray XT Features

Cray XT supercomputer systems are massively parallel processing (MPP) systems. Cray has combined commodity and open source components with custom-designed components to create a system that can operate efficiently at an immense scale.

Cray XT systems are based on the Red Storm technology that was developed jointly by Cray Inc. and the U.S. Department of Energy Sandia National Laboratories. Cray XT systems are designed to run applications that require large-scale processing, high network bandwidth, and complex communications. Typical applications are those that create detailed simulations in both time and space, with complex geometries that involve many different material components. These long-running, resource-intensive applications require a system that is programmable, scalable, reliable, and manageable.

The Cray XT series consists of Cray XT3, Cray XT4, and Cray XT5 systems. The primary differences among the systems are the type and speed of their compute node components (see Compute Nodes on page 20).

The major features of Cray XT systems are scalability and resiliency:

• Cray XT systems are designed to scale from fewer than 100 to more than 250,000 processors. The ability to scale to such proportions stems from the design of system components:
  – The basic component is the node. There are two types of nodes. Service nodes provide support functions, such as managing the user's environment, handling I/O, and booting the system. Compute nodes run user applications. Because processors are inserted into standard sockets, customers can upgrade nodes as faster processors become available.
  – Cray XT systems use a simple memory model. Every instance of a distributed application has its own processors and local memory. Remote memory is the memory on the nodes running the associated application instances. There is no shared memory.
  – The system interconnection network links compute and service nodes. This is the data-routing resource that Cray XT systems use to maintain high communication rates as the number of nodes increases. Most Cray XT systems use a full 3D torus network topology.
• Cray XT resiliency features include:
  – The Node Health Checker (NHC), which performs tests to determine if compute nodes allocated to an application are healthy enough to support running subsequent applications. If not, NHC removes any nodes incapable of running an application from the resource pool.
  – Tools that assist administrators in recovering from system or node failures, including a hot backup utility, boot node failover, and single or multiple compute node reboots.
  – Error correction code (ECC) technology, which detects and corrects multiple-bit data transfer errors.
  – Lustre failover. When administrators enable Lustre automatic failover, Lustre services switch to standby services when the primary node fails or when Lustre services are temporarily shut down for maintenance.
  – Cray XT system cabinets with only one moving part (a blower that cools the components) and redundant power supplies, reducing the likelihood of cabinet failure.
  – Cray XT system processor boards (called blades) with redundant voltage regulator modules (VRMs) or VRMs with redundant circuitry. VRMs are the solid state components most likely to fail.
  – Diskless nodes. The availability of a node is not tied to the availability of a moving part.
  – Multiple redundant RAID controllers, with automatic failover capability and multiple Fibre Channel connections to disk storage.
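
The 3D torus topology named among the scalability features above can be illustrated with a short sketch. It shows how wraparound links give every node six neighbors and cap the hop count between any two nodes. The function names and the 4 x 4 x 4 dimensions are hypothetical, chosen only for illustration; the sketch does not model Cray SeaStar routing or any real system configuration.

```python
def torus_neighbors(node, dims):
    """Return the six neighbors of `node` in a 3D torus of size `dims`.

    Coordinates wrap around at the edges, so every node has a neighbor
    in the minus and plus direction of each of the three axes.
    """
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            coord = list(node)
            coord[axis] = (coord[axis] + step) % dims[axis]
            neighbors.append(tuple(coord))
    return neighbors


def torus_hops(a, b, dims):
    """Minimum hop count between two nodes, counting wraparound links."""
    hops = 0
    for x, y, size in zip(a, b, dims):
        direct = abs(x - y)
        # Going "the other way around" the ring may be shorter.
        hops += min(direct, size - direct)
    return hops


if __name__ == "__main__":
    dims = (4, 4, 4)  # hypothetical torus dimensions
    print(torus_neighbors((0, 0, 0), dims))
    print(torus_hops((0, 0, 0), (3, 3, 3), dims))
```

Because of the wraparound links, the corner nodes (0, 0, 0) and (3, 3, 3) in this example are three hops apart rather than nine, which is one reason a torus can keep communication distances short as the node count grows.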
The major components of Cray XT systems are:

• Application development tools, comprising:
  – Cray Application Development Environment (CADE), comprising:
    • Message Passing Toolkit (MPI, SHMEM)
    • Math and science libraries (LibSci, PETSc, ACML, FFTW, Fast_mv)
    • Data modeling and management tools (NetCDF, HDF5)
    • GNU debugger (lgdb)
    • GCC C, C++, and Fortran compilers
    • Java (for developing service node programs)
    • Checkpoint/restart
    • CrayPat performance analysis tool
  – Cray Compiling Environment (CCE), comprising the Cray C, C++ (Deferred implementation), and Fortran compilers
  – Optional products, comprising:
    • C, C++, and Fortran compilers from PGI and PathScale
    • glibc library (the compute node subset)
    • UPC and OpenMP parallel programming models
    • aprun application launch utility
    • Workload management systems (PBS Professional, Moab/Torque)
    • TotalView debugger
    • Cray Apprentice2 performance data visualization tool
    • Cray Application Development Supplement (CADES) for stand-alone Linux application development platforms
• Operating system services. The Cray XT operating system, CLE, is tailored to the requirements of service and compute nodes. A full-featured SUSE Linux operating system runs on service nodes, and a lightweight kernel, CNL, runs on compute nodes.
• Parallel file system. The Cray XT parallel file system, Lustre, scales to thousands of clients and petabytes of data.
• System management and administration tools:
  – System Management Workstation (SMW), the single point of control for system administration.
  – Hardware Supervisory System (HSS), which monitors the system and handles component failures. HSS is independent of computation and service hardware components and has its own network.
  – Cray Management Services (CMS), which provides the infrastructure to the Application Level Placement Scheduler (ALPS) for a fast cache of node attributes, reservations, and claims.
  – Comprehensive System Accounting (CSA), a software package that performs standard system accounting processing. CSA is open-source software that includes changes to the Linux kernel so that CSA can collect more types of system resource usage data than standard Fourth Berkeley Software Distribution (BSD) process accounting. An additional CSA interface allows the project database to use customer-supplied user, account, and project information residing on a separate Lightweight Directory Access Protocol (LDAP) server.

Figure 1. Cray XT5 System

1.2 Related Publications

The Cray XT system runs with a combination of Cray proprietary, third-party, and open source products, as documented in the following publications.

1.2.1 Publications for Application Developers

• Cray XT System Overview (this manual)
• Cray XT Programming Environment User's Guide
• Cray Application Developer's Environment Installation Guide
• Cray XT System Software Release Overview
• Cray C and C++ Reference Manual
• Cray Fortran Reference Manual
• Cray compiler command options man pages (craycc(1), crayftn(1))
• PGI User's Guide
• PGI Tools Guide
• PGI Fortran Reference
• PGI compiler command options man pages: pgcc(1), pgCC(1), pgf95(1)
• GCC manuals: http://gcc.gnu.org/onlinedocs/
• GCC compiler command options man pages: gcc(1), g++(1), gfortran(1)
• PathScale manuals: http://www.pathscale.com/docs.html
• PathScale compiler command options man pages: pathcc(1), pathCC(1), path95(1), eko(7)
• Cray XT compiler driver commands man pages: cc(1), CC(1), ftn(1)
• Modules utility man pages: module(1), modulefile(4)
• Application launch command man page: aprun(1)
• Parallel programming models:
  – Cray MPICH2 man pages (read the intro_mpi(3) man page first)
  – Cray SHMEM man pages (read the intro_shmem(3) man page first)
  – OpenMP documentation: http://www.openmp.org/
  – Cray UPC man pages (read the intro_upc(3c) man page first). Unified Parallel C (UPC) documents: Berkeley UPC website (http://upc.lbl.gov/docs/) and Intrepid UPC website (http://www.intrepid.com/upc/cray_xt3_upc.html).
• Cray scientific library, XT-LibSci, documentation:
  – Basic Linear Algebra Subroutines (BLAS) man pages
  – LAPACK linear algebra man pages
  – ScaLAPACK parallel linear algebra man pages
  – Basic Linear Algebra Communication Subprograms (BLACS) man pages
  – Iterative Refinement Toolkit (IRT) man pages (read the intro_irt(3) man page first)
  – SuperLU sparse solver routines guide (SuperLU Users' Guide)
• AMD Core Math Library (ACML) manual
• FFTW 2.1.5 and 3.1.1 man pages (read the intro_fftw2(3) or intro_fftw3(3) man page first)
• Portable, Extensible Toolkit for Scientific Computation (PETSc) library, an open source library of sparse solvers. See the intro_petsc(3) man page and http://www-unix.mcs.anl.gov/petsc/petsc-as/index.html
• NetCDF documentation (http://www.unidata.ucar.edu/software/netcdf/)
• HDF5 documentation (http://www.hdfgroup.org/HDF5/whatishdf5.html)
• Lustre lfs(1) man page
• PBS Professional 9.0 User's Guide
• PBS Professional man pages (qsub(1B), qstat(1B), and qdel(1B))
• Moab/Torque documentation (http://www.clusterresources.com/)
• TotalView documentation (http://www.totalviewtech.com/)
• GNU debugger documentation (see the lgdb(1) man page and the GDB User Manual at http://www.gnu.org/software/gdb/documentation/).
• PAPI man pages (read the intro_papi(3) man page first)
• PAPI manuals (see http://icl.cs.utk.edu/papi/)
• Using Cray Performance Analysis Tools
• CrayPat man pages (read the intro_craypat(1) man page first)
• Cray Apprentice2 man page (app2(1))
• CLE man pages
• SUSE LINUX man pages
• Linux documentation (see the Linux Documentation Project at http://www.tldp.org and SUSE documentation at http://www.suse.com)

1.2.2 Publications for System Administrators

• Cray XT System Overview (this manual)
• Cray XT System Software Release Overview
• Cray Application Developer's Environment Installation Guide
• Cray XT System Software Installation and Configuration Guide
• Cray System Management Workstation (SMW) Software Installation Guide
• Cray XT System Management manual
• Using Cray Management Services (CMS)
• CLE man pages
• SUSE LINUX man pages
• HSS man pages (read the xtcli(8) man page first)
• Lustre documentation (see Managing Lustre on a Cray XT System and http://manual.lustre.org)
• Linux documentation (see the Linux Documentation Project at http://www.tldp.org and SUSE documentation at http://www.suse.com)

Hardware Overview [2]

Cray XT system hardware consists of computation components, service components, the system interconnection network, RAID disk storage systems, and HSS components. This chapter describes all hardware components except HSS hardware, which is described in Chapter 4, Cray Hardware Supervisory System (HSS) on page 43.

2.1 Basic Hardware Components

Cray XT systems include the following hardware components:

• AMD Opteron processors
• Dual in-line memory modules (DIMMs)
• System interconnection network including Cray SeaStar chips
• RAID disk storage subsystems

2.1.1 AMD Opteron Processor

Cray XT systems use AMD Opteron processors. Each Cray XT3 compute node has one single- or dual-core processor. Each Cray XT4 compute node has one dual- or quad-core processor.
Each Cray XT5 compute node has two quad- or hex-core processors, connected by HyperTransport links. All Cray XT service nodes use the same processors and memory as a Cray XT3 compute node.

Opteron processors feature:

• Full support of the x86 instruction set.
• Full compatibility with AMD Socket 940 design (Cray XT3 systems), AMD Socket AM2 design (Cray XT4 systems), and AMD Socket F design (Cray XT5 systems).
• Out-of-order execution and the ability to issue instructions simultaneously.
• Registers and a floating-point unit that support full 64-bit IEEE floating-point operations.
• An integer processing unit that performs full 64-bit integer arithmetic.
• Performance counters that can be used to monitor the number or duration of processor events, such as the number of data cache misses or the time it takes to return data from memory after a cache miss.
• A memory controller that uses error correction code (ECC) for memory protection.
• A HyperTransport interface that connects to the SeaStar chip.

Multicore processors have two, four, or six computation engines (referred to as cores or CPUs). Each core has its own execution pipeline and the resources required to run without blocking resources needed by other processes. Because multicore processors can run more tasks simultaneously, they can increase overall system performance. The trade-offs are that the cores share local memory bandwidth and system interconnection bandwidth.

The following figures show the components of Opteron processors.

Figure 2. Single-core Processor (block diagram: core, L2 cache, system request queue, crossbar, HyperTransport, memory controller)

Figure 3. Dual-core Processor (block diagram: cores 0 and 1, each with an L2 cache; system request queue, crossbar, HyperTransport, memory controller)

Figure 4. Quad-core Processor (block diagram: cores 0 to 3, each with an L2 cache; L3 cache, system request queue, crossbar, HyperTransport, memory controller)

Figure 5. Hex-core Processor (block diagram: cores 0 to 5, each with an L2 cache; L3 cache, system request queue, crossbar, HyperTransport, memory controller)

2.1.2 DIMM Memory

Cray XT systems use Double Data Rate Dual In-line Memory Modules (DDR DIMMs). Cray XT3 systems use 1 GB, 2 GB, or 4 GB DDR1 DIMMs. Cray XT4 systems use 1 GB or 2 GB DDR2 DIMMs. With four DIMM slots per processor, the maximum physical memory is 16 GB per node on Cray XT3 systems and 8 GB per node on Cray XT4 systems.

Cray XT5 systems use 1 GB, 2 GB, or 4 GB DDR2 DIMMs. With four DIMM slots per processor and eight DIMM slots per node, the maximum physical memory is 32 GB per node.

The minimum amount of memory for service nodes is 2 GB. Service nodes use the same type of memory as a Cray XT3 compute node.

2.1.3 Cray SeaStar Chip

The Cray SeaStar application-specific integrated circuit (ASIC) chip is the system's message handler, offloading communications functions from the AMD Opteron processors. Cray XT3 compute nodes use SeaStar 1 chips. Cray XT4 and Cray XT5 compute nodes use SeaStar 2 chips. Service nodes can use either SeaStar 1 or SeaStar 2 chips.

A SeaStar chip contains:

• HyperTransport Links, which connect SeaStar to the AMD Opteron processor.
• A Direct Memory Access (DMA) engine, which manages the movement of data to and from node memory. The DMA engine is controlled by an on-board processor.
• A router, which together with the other SeaStar routers connects the chip to the system interconnection network. For more information, see System Interconnection Network on page 19.
• A low-level message passing interface called Portals, which provides a data path from an application to memory. Portions of the interface are implemented in Cray SeaStar firmware, which transfers data directly to and from user memory. The firmware runs on the embedded processor and RAM within the SeaStar chip.
• A link to a blade control processor (also known as an L0 controller). Blade control processors are used for booting, monitoring, and maintenance (see Hardware Controllers on page 45).
• A Remote Memory Access (RMA) engine for use in Cray XMT compute nodes. This engine provides an interface to the remote shared memory framework for that architecture.

Figure 6 illustrates the hardware components of the Cray SeaStar chip.

Figure 6. Cray SeaStar Chip (components shown: router, RMA engine, HyperTransport link, RAM, processor, DMA engine, link to L0 controller)

2.1.4 System Interconnection Network
The system interconnection network is the communications center of the Cray XT system. The network consists of the Cray SeaStar routers, links, and the cables that connect the compute and service nodes.
The network uses a Cray proprietary protocol to provide fast node-to-node message passing and fast I/O to and from a global, shared file system. The network enables the system to achieve an appropriate balance between processor speed and interconnection bandwidth.

2.1.5 RAID Disk Storage Subsystems
Cray XT systems use two types of RAID subsystems for data storage. System RAID stores the boot image and system files. File system RAID is managed by the Lustre parallel file system; it holds root and user files.
Data on system RAID is not globally accessible. Data on file system RAID is globally accessible by default.

2.2 Nodes
Cray XT processing components combine to form a node. The Cray XT system has two types of nodes: compute nodes and service nodes. Each node is a logical grouping of processor(s), memory, and a data routing resource.

2.2.1 Compute Nodes
Compute nodes run application programs. A Cray XT3 compute node consists of a single- or dual-core AMD Opteron processor, DDR1 DIMM memory, and a Cray SeaStar 1 chip.
A Cray XT4 compute node consists of a dual- or quad-core AMD Opteron processor, DDR2 DIMM memory, and a Cray SeaStar 2 chip. A Cray XT5 compute node consists of two quad- or hex-core NUMA nodes and one Cray SeaStar 2 chip. Each NUMA node has a quad- or hex-core processor and NUMA-node-local DDR2 DIMM memory. They are referred to as NUMA nodes because accessing NUMA-node-remote memory takes slightly longer than accessing NUMA-node-local memory.

Figure 7. Cray XT3 and Cray XT4 Compute Nodes (components shown: compute node with an AMD Opteron processor, DIMMs, HyperTransport link, and Cray SeaStar chip)
Figure 8. Cray XT5 Compute Node (components shown: NUMA node 0 and NUMA node 1, each with an AMD Opteron processor and DIMMs; HyperTransport links; Cray SeaStar chip)

2.2.2 Service Nodes
Service nodes handle system functions such as user login, I/O, and network management. Each service node contains a single- or dual-core processor, DDR1 DIMM memory, and a SeaStar 1 or SeaStar 2 chip. In addition, each service node contains two PCI-X or PCIe slots for optional interface cards.

Cray XT systems include several types of service nodes, defined by the function they perform.
• Login nodes. Users log in to the system through login nodes. Each login node includes one or two Ethernet network interface cards that connect to an Ethernet-based local area network, to which user workstations are connected.
• Network service nodes. Each network service node contains a PCI-X or PCIe card that can be connected to customer network storage devices.
• I/O nodes. Each I/O node uses one or two fibre channel cards to connect to Lustre-managed RAID storage.
• Boot nodes. Each system requires one boot node. A boot node contains one fibre channel card, which is either PCI-X or PCIe. The fibre channel card connects to the RAID subsystem, and an Ethernet network interface card connects to the System Management Workstation (SMW). Most systems have two boot nodes: a primary and a backup.
• Service database (SDB) nodes. Each SDB node contains a fibre channel card to connect to the SDB file system. The SDB node manages the state of the Cray XT system.

For a description of the types of service nodes, see CLE Operating System on page 25.

Figure 9. Service Node (components shown: AMD Opteron processor, DIMM, PCI-X or PCIe bridge, two PCI-X or PCIe slots, HyperTransport link, Cray SeaStar chip)

2.3 Blades, Chassis, and Cabinets
This section describes the main physical components of the Cray XT system and their configurations.

2.3.1 Blades
Whereas the node is the logical building block of the Cray XT system, the basic physical component and field-replaceable unit is the blade. Any Cray XT blade has a mezzanine, also a field-replaceable unit, containing four Cray SeaStar chips.
There are two types of blades: compute blades and service blades.
A compute blade consists of four compute nodes, voltage regulator modules, and an L0 controller. The L0 controller is an HSS component; for more information, see Chapter 4, Cray Hardware Supervisory System (HSS) on page 43.
A service blade consists of two service nodes, voltage regulator modules, PCI-X or PCIe cards, and an L0 controller. Although a service blade has two nodes, it has four SeaStar chips to allow for a common board design and to simplify the interconnect configurations. Several different PCI-X or PCIe cards are available to provide Fibre Channel interfaces to file system RAID, GigE interfaces to user workstations, and 10 GigE interfaces to external networks.

2.3.2 Chassis and Cabinets
Each cabinet contains three vertically stacked chassis (or cages), and each chassis contains eight vertically mounted blades. A cabinet can contain compute blades, service blades, or a combination of compute and service blades. There are different cabinet types: Cray XT3, Cray XT4, Cray XT5, Cray XT5-HE air cooled, and Cray XT5-HE liquid cooled.
The primary difference is increased power and cooling capacity for newer blade types.
Customer-provided three-phase power is supplied to the cabinet Power Distribution Unit (PDU). The PDU routes power to the cabinet's power supplies, which distribute 48 VDC to each of the chassis in the cabinet.
All cabinets have redundant power supplies. The PDU, power supplies, and the cabinet control processor (L1 controller) are located at the rear of the cabinet.
Figure 10 shows the basic components of an air-cooled cabinet.

Figure 10. Chassis and Cabinet (front view) (labels shown: fan; chassis 0, 1, and 2; compute or service blades in slots 0 through 7, each marked with hot swap, L0, and console 9600:8N1)

Software Overview [3]
Cray XT systems run a combination of Cray-developed software, third-party software, and open source software. The software is optimized for applications that have fine-grain synchronization requirements, large processor counts, and significant communication requirements.
This chapter provides an overview of the Cray Linux Environment (CLE) operating system, the Lustre file system, the application development environment, and system administration tools. For a description of HSS software, see Chapter 4, Cray Hardware Supervisory System (HSS) on page 43.

3.1 CLE Operating System
The Cray XT operating system, Cray Linux Environment (CLE), is a distributed system of service-node and compute-node components. The CLE compute node kernel, CNL, supports all compute node processes.
Service nodes perform the functions needed to support users, administrators, and applications running on compute nodes. There are five basic types of service nodes: login, network, I/O, boot, and SDB. Service nodes run a full-featured version of SUSE LINUX. Above the operating system level are specialized daemons and applications that perform functions unique to each service node.

3.1.1 CNL
CNL is the Cray XT lightweight compute node kernel. It includes a run time environment based on the SUSE Linux Enterprise Server (SLES) distribution and the SLES kernel with Cray-specific modifications. Cray has configured the kernel to eliminate device drivers for hardware not supported on Cray XT systems. Other features and services not required by Cray XT applications have also been configured out of the kernel.

CNL features:
• Scalability. Only the features required to run high performance applications are available on the compute nodes.
Other features and services are available from service nodes.
• Minimized OS jitter. Cray has configured and tuned the kernel to minimize processing delays caused by inefficient synchronization.
• Minimized memory footprint. Cray has configured the kernel and the ramfs-based root file system to use a minimum amount of memory on each node in order to maximize the amount of memory available for applications.
• Non-Uniform Memory Access (NUMA). NUMA architecture is particularly beneficial to applications that run on Cray XT5 compute nodes (see Compute Nodes on page 20). NUMA is much more efficient than symmetric multiprocessing (SMP) because data is tightly coupled between processor and memory pairs.
• Application networking (sockets).
• POSIX system calls.
• POSIX threads functions.

A separate service, the Application Level Placement Scheduler (ALPS), handles application launch, monitoring, and signaling and coordinates batch job processing with a workload management system.

3.2 Lustre File System
Lustre is the parallel file system for Cray XT applications. Lustre features high performance, scalability, and POSIX compliance. I/O nodes host Lustre. Lustre is implemented as a set of Linux-loadable modules and uses Portals and an object-oriented architecture for storing and retrieving data.
Lustre separates file metadata from data objects. Each instance of a Lustre file system consists of Object Storage Servers (OSSs) and a Metadata Server (MDS). Each OSS hosts one or more Object Storage Targets (OSTs). Lustre OSTs are backed by RAID storage. Applications store data on OSTs; files can be striped across multiple OSTs. Cray XT systems implement Lustre automatic failover and administrator-supported failback for MDSs and OSTs.
CNL supports I/O to Lustre file systems. For the standard streams, the application launch utility, aprun, forwards standard input to the application.
An application's standard output and standard error messages are sent from the compute nodes back to aprun for display. Files local to the compute node, such as /proc or /tmp files, can be accessed by a CNL application.
Lustre's I/O operations are transparent to the application developer. The I/O functions available to the application developer (Fortran, C, and C++ I/O calls; C and C++ stride I/O calls; and system I/O calls) are converted to Lustre driver calls by the virtual file system switch (Linux VFS).
For further information about Lustre, see Managing Lustre on a Cray XT System and http://manual.lustre.org.

3.3 Cray Data Virtualization Service (Cray DVS)
Cray DVS is a distributed network service that gives compute node applications transparent access to file systems on I/O nodes. Applications can read and write data to the user's home directory, and users can access files over a network as if they were local. Cray DVS supports access to VFS-based, POSIX-compliant file systems. However, DVS is not a file system but an I/O forwarding service. Cray DVS provides I/O scalability to large numbers of nodes, far beyond the typical number of clients supported by a single NFS server.
For additional information, see the Cray XT System Software Installation and Configuration Guide and Introduction to Cray Data Virtualization Service.

3.4 Development Environment
Application development software is the set of software products and services that programmers use to build and run applications on compute nodes.

3.4.1 User Environment
The user environment is similar to the environment on a typical Linux workstation. Users log in to a Cray XT login node or a standalone Linux workstation and compile and link their applications. They run their applications on Cray XT compute nodes.
The Cray Application Developer's Environment Supplement (CADES) contains the additional components required to install and use CADE on standalone Linux systems.
Before starting to develop applications, the user:
1. Sets up a secure shell. The Cray XT system uses ssh and ssh-enabled applications for secure, password-free remote access to login nodes. Before using ssh commands, the user needs to generate an RSA authentication key.
2. Loads the appropriate modules. The Cray XT system uses the Modules utility to support multiple versions of software, such as compilers, and to create integrated software packages. As new versions of the supported software become available, they are added automatically to the Programming Environment, and earlier versions are retained to support legacy applications. For details, see the Cray XT Programming Environment User's Guide and the module(1) and modulefile(4) man pages.

3.4.2 Compiling Programs
The Cray XT system Programming Environment includes Cray compilers and compiler suites from The Portland Group (PGI), the GNU Compiler Collection (GCC), and PathScale. The compilers translate C, C++, and Fortran source programs into Cray XT object files. Developers can use interlanguage communication functions to create Fortran programs that call C or C++ routines and C or C++ programs that call Fortran routines.
The command used to invoke a compiler, called a compilation driver, can be used to apply options at the compilation unit level. Fortran directives and C or C++ pragmas apply options to selected portions of code or alter the effects of command-line options.
In addition to the Cray, PGI, GCC, and PathScale compilers, the Cray XT Programming Environment includes the Java compiler for developing applications to run on service nodes. For details, see http://java.sun.com/javase/6/docs/.
3.4.2.1 Cray Compiler Commands
The following Cray compiler commands are available:

Cray Compiler                    Command
C                                cc
C++ (Deferred implementation)    CC
Fortran 90/95                    ftn

See the cc(1), CC(1), or ftn(1) man page for information about the compiler driver command options. See the craycc(1), crayCC(1) (Deferred implementation), or crayftn(1) man page for details about Cray compiler options. For further information, see the Cray C and C++ Reference Manual, Cray Fortran Reference Manual, and Cray XT Programming Environment User's Guide.

3.4.2.2 PGI Compiler Commands
The following PGI compiler commands are available:

PGI Compiler     Command
C                cc
C++              CC
Fortran 90/95    ftn

Note: Users should not invoke a PGI compiler directly using the pgcc, pgCC, or pgf95 command. The resulting executable will not run on the Cray XT system.
The cc(1), CC(1), and ftn(1) man pages contain information about the compiler driver commands, whereas the pgcc(1), pgCC(1), and pgf95(1) man pages describe the PGI compiler command options. For further information, see the Cray XT Programming Environment User's Guide.

3.4.2.3 GCC Compiler Commands
The following GCC compiler commands are available:

GCC Compiler     Command
C                cc
C++              CC
Fortran 90       ftn

Note: Users should not invoke a GCC compiler directly using the gcc, g++, or gfortran command. The resulting executable will not run on the Cray XT system.
The cc(1), CC(1), and ftn(1) man pages contain information about the compiler driver commands, whereas the gcc(1), g++(1), and gfortran(1) man pages describe the GCC compiler command options. For further information, see the Cray XT Programming Environment User's Guide.

3.4.2.4 PathScale Compiler Commands
The following PathScale compiler commands are available:

PathScale Compiler    Command
C                     cc
C++                   CC
Fortran 90/95         ftn

Note: Users should not invoke a PathScale compiler directly using the pathcc, pathCC, or path95 command.
The resulting executable will not run on the Cray XT system.
The cc(1), CC(1), and ftn(1) man pages contain information about the compiler driver commands, whereas the pathcc(1), pathCC(1), path95(1), and eko(7) man pages describe the PathScale compiler command options. For further information, see the Cray XT Programming Environment User's Guide.

3.4.3 Using Library Functions
Developers can use C, C++, and Fortran library functions and functions from the following libraries:
• GNU C Language Runtime Library (glibc) functions.
• Cray MPICH2, Cray SHMEM, OpenMP, and UPC functions. MPICH2 and SHMEM use Portals functions for message passing; the Portals interface is transparent to the application programmer.
MPICH2 is an implementation of MPI-2 by the Argonne National Laboratory Group. The dynamic process (spawn) functions in Cray MPICH2 are not supported at this time, but otherwise the libraries are fully MPI 2.0 compliant.
Cray SHMEM routines are similar to the Cray MPICH2 routines; they pass data between cooperating parallel processes. Cray SHMEM routines can be used in programs that perform computations in separate address spaces and explicitly pass data to and from different processing elements in the program.
OpenMP is an industry-standard, portable model for shared-memory parallel programming. In addition to library routines, OpenMP provides Fortran directives and C and C++ pragmas. OpenMP applications can be used in hybrid MPI/OpenMP applications but cannot cross node boundaries. For further information, see the Cray XT Programming Environment User's Guide and the OpenMP Application Program Interface at http://www.openmp.org/.
The Cray, PGI, and GCC C compilers support UPC. The Cray C compiler supports the UPC Language Specification 1.2 and Cray-specific functions.
The PGI and GCC C compilers support Cray XT-UPC, which contains the following front ends:
– Berkeley UPC translator, a UPC-to-C translator based on Open64.
– Intrepid GCCUPC, a UPC-to-assembly compiler based on GNU GCC.
Both front ends generate code that is linked with the Berkeley UPC run time library (UPCR) and communication system from Berkeley.
• Cray XT LibSci scientific libraries. XT-LibSci contains:
– Basic Linear Algebra Subroutines (BLAS) and LAPACK linear algebra routines
– ScaLAPACK and Basic Linear Algebra Communication Subprograms (BLACS) routines
– Iterative Refinement Toolkit (IRT), a library of factorization routines, solvers, and tools that can be used to solve systems of linear equations more efficiently than the full-precision solvers in Cray XT-LibSci or ACML.
– SuperLU, a set of routines that solve large, sparse, nonsymmetric systems of linear equations. XT-LibSci library routines are written in C but can be called from Fortran, C, or C++ programs.
– CRay Adaptive Fast Fourier Transform (CRAFFT), a library of Fortran subroutines that compute the discrete Fourier transform in one, two, or three dimensions; of arbitrary input size; and of both real and complex data. CRAFFT provides a simplified interface to FFT and allows the FFT library itself to choose the fastest FFT kernel.
• Portable, Extensible Toolkit for Scientific Computation (PETSc), an open source library of sparse solvers.
• AMD Core Math Library (ACML), which includes:
– A suite of Fast Fourier Transform (FFT) routines for single-precision, double-precision, single-precision complex, and double-precision complex data types.
– Fast scalar, vector, and array math transcendental library routines optimized for high performance.
– A comprehensive random number generator suite.
• The Programming Environment includes the 2.1.5 and 3.1.1 releases of FFTW.
FFTW is a C subroutine library with Fortran interfaces for computing the discrete Fourier transform in one or more dimensions, of arbitrary input size, and of both real and complex data (as well as of even/odd data, such as the discrete cosine/sine transforms). The Fast Fourier Transform (FFT) algorithm is applied for many problem sizes. Distributed memory parallel FFTs are available only in FFTW 2.1.5. For further information, see the intro_fftw2(3) and intro_fftw3(3) man pages.
• Fast_mv, a library of high-performance math intrinsic functions. The functions can be used in PGI and PathScale applications. For further information, see the intro_fast_mv(3) man page.

3.4.4 Linking Applications
After correcting compilation errors, the developer again invokes the compilation driver, this time specifying the application's object files (filename.o) and libraries (filename.a) as required. The linker extracts the library modules that the program requires and creates the executable file (named a.out by default).

3.4.5 Running Applications
There are two methods of running applications: interactively and through application launch commands in batch job scripts. The user can run an application interactively by:
• Using the aprun command to launch applications on compute nodes the administrator has configured for interactive use.
• Using the qsub -I command to initiate an interactive session and then running an application interactively using the aprun command.

The developer uses a workload management system (WMS) to run batch jobs. PBS Professional, Moab/Torque, and Platform LSF (Deferred implementation) are networked subsystems for submitting, monitoring, and controlling batch jobs. A batch job is typically a shell script and attributes that provide resource and control information about the job. Batch jobs are scheduled for execution at a time chosen by the subsystem according to a defined policy and the availability of resources.
A WMS server maintains job queues.
Each queue holds a group of jobs available for execution, and each job has a set of user-specified resource requirements. The WMS scheduler is the policy engine that examines the ready-to-run jobs and selects the next job to run based on a set of criteria. Once the required compute nodes are reserved, the WMS processes the job script and transfers control to aprun.
Developers who want to launch an application only on nodes with certain attributes can use the cnselect command. Among the attributes are node ID, number of cores per node (1, 2, 4, or 8), amount of node memory, page size, and CPU clock rate. The cnselect utility uses the service database (SDB) to generate a candidate list of nodes. Developers can include this list or a subset of it on aprun or qsub commands to launch applications on compute nodes with those attributes.

3.4.5.1 Running Applications Interactively
To launch an application interactively, the user enters the aprun command, specifying the application executables and the compute node resources they require. The aprun client sends the request to the apsys server, which forwards it to the apsched agent running on a service node. The apsched agent gets the compute node placement list, reserves the nodes needed for the application, and relays the placement list to aprun.
On Cray XT4 and Cray XT5 compute nodes, non-uniform processing element (PE) placement is supported. This means that apsched attempts to place the maximum number of PEs per compute node. This potentially improves performance by allowing systems with both types of compute blades to exploit all resources rather than leave some nodes under-utilized.
The aprun client then sends the placement list and the executable binary data to the apinit daemon running on the first node assigned to the application.
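As a rough picture of how a placement list of this kind could be built, the following Python sketch packs processing elements onto nodes greedily, filling each node up to its core count before moving to the next. It is an illustration only and not the apsched algorithm, which this overview does not specify; the function name place_pes, the node IDs, and the core counts are hypothetical.

```python
def place_pes(num_pes, cores_per_node):
    """Greedily assign PEs to nodes, filling each node before the next.

    cores_per_node maps a (hypothetical) node ID to its core count.
    Returns a placement list of (node_id, pes_on_node) pairs.
    """
    if num_pes < 0:
        raise ValueError("num_pes must be non-negative")
    placement = []
    remaining = num_pes
    for node_id, cores in cores_per_node.items():
        if remaining == 0:
            break
        pes_here = min(cores, remaining)
        if pes_here > 0:
            placement.append((node_id, pes_here))
            remaining -= pes_here
    if remaining:
        raise ValueError(f"not enough cores: {remaining} PEs left unplaced")
    return placement


if __name__ == "__main__":
    # Two quad-core nodes and two eight-core nodes (made-up node IDs).
    nodes = {20: 4, 21: 4, 40: 8, 41: 8}
    print(place_pes(18, nodes))
```

With mixed core counts, the last node used may be only partly filled, while every earlier node carries as many PEs as it has cores.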
On each node allocated to the application, an apinit daemon creates an application shepherd to manage the processes of the application on that node. The application shepherd on the first assigned node propagates the node placement list and the executable to the compute nodes using a fan-out tree and passes control to the application.
While the application is running, application shepherds monitor the application processes. If the aprun client or an application instance catches a signal, the signal is propagated to all processing elements. Applications rely on aprun to manage the standard input, standard output, and standard error streams and handle signal management.

Figure 11. Launching Applications Interactively (labels shown: Cray XT user, login node, aprun, apsys, apsched, SDB node, compute node placement list, compute node, apinit, apshepherd, user application, fan out application)

3.4.5.2 Running Batch Jobs
The Cray XT system uses a workload management system to launch batch jobs. The user creates a script containing aprun commands, then enters the qsub command to submit the job to a WMS server. The WMS server uses the apbasil interface to reserve the nodes required for the job, then processes the job scripts. When the WMS encounters the aprun command(s) in the script, control is transferred to ALPS for application propagation and launch.
For further information about the ALPS/WMS interface, see the apbasil(1) and basil(7) man pages.

Figure 12. Running Batch Jobs (labels shown: Cray XT user, login node, qsub, login shell, aprun, apsched, SDB node, compute node placement list, compute node, WMS, apinit, apshepherd, apbasil, user application, fan out application)

3.4.6 Debugging Applications
The Cray XT system supports the TotalView debugger for single-process and multiprocess debugging and the GNU lgdb debugger for single-process applications.
The TotalView debugger, available from TotalView Technologies, LLC, provides source-level debugging of applications. It is compatible with the PGI, GCC, and PathScale compilers.
TotalView can debug applications running on 1 to 4096 compute nodes, providing visibility and control into each process individually or by groups. It also supports access to MPI-specific data, such as the message queues.
TotalView typically is run interactively. To debug a program using TotalView, the developer invokes TotalView either from the graphical interface (totalview) or the command line (totalviewcli). TotalView parses the command to get the number of nodes requested, then makes a node allocation request. TotalView directs aprun to load but not start the application. The aprun utility loads the application onto the compute nodes, after which TotalView can perform initial setup before instructing aprun to start the application.
For more information about TotalView, see the Cray XT Programming Environment User's Guide, the totalview(1) man page, and TotalView documentation at http://www.totalviewtech.com/Documentation/. To find out what levels of the compilers TotalView supports, see the TotalView Platforms and System Requirements document at the TotalView website.
For more information about lgdb, see the Cray XT Programming Environment User's Guide and the lgdb(1) man page.

3.4.7 Monitoring and Managing Applications
ALPS provides commands for monitoring and managing applications. The apstat command reports the status of applications, including the number of processing elements, number of threads, a map of the application's address space, and a map showing the placement of team members. The apkill command sends a kill signal to an application team. For more information, see the apstat(1) and apkill(1) man pages.

3.4.8 Measuring Performance
The Cray XT system provides tools for collecting, analyzing, and displaying performance data.
3.4.8.1 Performance API
The Performance API (PAPI) from the University of Tennessee and Oak Ridge National Laboratory is a standard interface for accessing hardware performance counters. A PAPI event set maps AMD Opteron processor hardware counters to a list of events, such as Level 1 data cache misses, data translation lookaside buffer (TLB) misses, and cycles stalled waiting for memory accesses. Developers can use the API to collect data on those events.

3.4.8.2 CrayPat
CrayPat is a performance analysis tool. The developer can use CrayPat to perform sampling and tracing experiments on an instrumented application and analyze the results of those experiments.
Sampling experiments capture information at user-defined time intervals or when a predetermined event occurs, such as the overflow of a user-specified hardware performance counter. Tracing experiments capture information related to both predefined and user-defined function entry points, such as the number of times a particular MPI function is called and the amount of time the program spends performing that function.
The developer uses the pat_build command to instrument programs. No recompilation is needed to produce the instrumented program. Alternatively, the developer can use the pat_hwpc command to instrument the program for collecting predefined hardware performance counter information, such as cache usage data.
After instrumenting a program, the developer sets environment variables to control run time data collection, runs the instrumented program, then uses the pat_report command to either generate a report or export the data for use in Cray Apprentice2 or other applications.

3.4.8.3 Cray Apprentice2
Cray Apprentice2 is an interactive X Window System tool for displaying performance analysis data captured during program execution.
Cray Apprentice2 identifies many potential performance problem areas, including the following conditions:
• Load imbalance
• Excessive serialization
• Excessive communication
• Network contention
• Poor use of the memory hierarchy
• Poor functional unit use

Cray Apprentice2 has the following capabilities:
• It is a post-execution performance analysis tool that provides information about a program by examining data files that were created during program execution. It is not a debugger or a simulator.
• Cray Apprentice2 displays many types of performance data contingent on the data that was captured during execution.
• It reports time statistics for all processing elements and for individual routines.
• It shows total execution time, synchronization time, time to execute a subroutine, communication time, and the number of calls to a subroutine.

3.4.9 Optimizing Applications
Two types of operations on multicore compute nodes, remote-NUMA-node memory references and process migration, can affect performance.
On Cray XT5 systems, processes accessing remote-NUMA-node memory can reduce performance. To restrict applications to local-NUMA-node memory, developers can use aprun memory affinity options.
On Cray XT multicore systems, the compute node kernel can dynamically distribute workload by migrating processes and threads from one CPU to another. In some cases, this migration reduces performance. Developers can bind a process or thread to a particular CPU or a subset of CPUs by using aprun CPU affinity options.
In addition to these optimization options, the PGI, GCC, and PathScale compilers provide compiler command options, directives, and pragmas that the developer can use to optimize code. For further information, see the PGI compiler documentation at http://www.pgroup.com, the GCC compiler documentation at http://gcc.gnu.org/, or the PathScale compiler documentation at http://www.pathscale.com/docs.html.
In addition, see the Software Optimization Guide for AMD64 Processors at http://www.amd.com/.

3.4.10 Using Data Visualization Tools

Cray XT systems support the VisIt data visualization and graphical analysis tool. VisIt was developed by the Department of Energy (DOE) Advanced Simulation and Computing Initiative (ASCI) and is distributed through Lawrence Livermore National Laboratory (http://www.llnl.gov/visit). VisIt provides an extensible interface for creating, manipulating, and animating 2D and 3D graphical representations of data sets ranging in size from kilobytes to terabytes.

3.5 System Administration

The system administration environment provides the tools that administrators use to manage system functions, view and modify the system state, and maintain system configuration files. System administration components are a combination of Cray XT system hardware, SUSE LINUX, Lustre, and Cray XT system utilities and resources.

Note: For information about standard SUSE LINUX administration, see http://www.tldp.org or http://www.novell.com/linux. For details about Lustre functions, see the Cray XT System Software Installation and Configuration Guide manual and http://www.lustre.org/ or http://www.sun.com/software/products/lustre/.

Many of the components used for system administration are also used for system monitoring and management (such as powering up and down and monitoring the health of hardware components). For details, see Chapter 4, Cray Hardware Supervisory System (HSS).

3.5.1 System Management Workstation

The System Management Workstation (SMW) is a server and display that provides a single-point interface to an administrator's environment. The administrator uses the SMW to perform tasks such as adding user accounts, changing passwords, and monitoring applications.

3.5.2 Shared-root File System

The Cray XT system has a shared-root file system in which the root directory is shared read-only on the service nodes.
All nodes have the same default directory structure. However, the /etc directory is specially mounted on each service node as a node-specific directory of symbolic links. The administrator can change the symbolic links in the /etc directory by the process of specialization, which changes the symbolic link to point to a non-default version of a file. The administrator can specialize files for individual nodes or for a class (type) of nodes.

The administrator's interface includes commands to view file layout from a specified node, determine the type of specialization, and create a directory structure for a new node or class based on an existing node or class. For details, see the Cray XT System Management manual.

3.5.3 Lustre File System Administration

The Lustre file system is optimized for large-scale, serial access typical of parallel programming applications.

When a file is created, the client contacts a metadata server (MDS). The MDS handles namespace operations, such as opening or closing a file, managing directory listings, and changing permissions. The MDS contacts Object Storage Servers (OSSs) to create data objects. The OSSs handle block allocation, enforce security for client access, and perform parallel I/O operations to transfer file data.

The administrator can create and mount more than one instance of Lustre. One MDS plus one or more OSSs make up a single instance of Lustre and are managed as such.

Objects allocated on Object Storage Targets (OSTs) hold the data associated with the file. Once a file is created, read and write operations take place directly between the client and the OSS, bypassing the MDS. The OSTs use the ldiskfs file system, a modified version of the ext3 file system, for backend storage. This file system is used to store Lustre file and metadata objects and is not directly visible to the user.

The administrator configures Lustre as a parallel file system by creating multiple OSSs and OSTs.
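To make the benefit of multiple OSTs concrete, the following is a minimal sketch of how a file's bytes can be spread across OST objects. It assumes a simple round-robin (RAID-0 style) layout; the function names and the stripe_size and stripe_count parameters are illustrative and are not taken from this manual. It is not Lustre code and does not contact a file system.

```python
from collections import defaultdict


def locate(offset, stripe_size, stripe_count):
    """Map a byte offset in a file to (OST slot, offset inside that slot's object).

    Assumption: plain round-robin striping; names are hypothetical.
    """
    stripe_index = offset // stripe_size         # which stripe-sized chunk of the file
    ost_slot = stripe_index % stripe_count       # chunks rotate across the OST slots
    round_number = stripe_index // stripe_count  # full rotations completed so far
    object_offset = round_number * stripe_size + offset % stripe_size
    return ost_slot, object_offset


def bytes_per_ost(file_size, stripe_size, stripe_count):
    """Count how many bytes of a file of the given size land on each OST slot."""
    totals = defaultdict(int)
    offset = 0
    while offset < file_size:
        chunk = min(stripe_size, file_size - offset)
        slot, _ = locate(offset, stripe_size, stripe_count)
        totals[slot] += chunk
        offset += chunk
    return dict(totals)


if __name__ == "__main__":
    MIB = 1 << 20
    # Byte 10 of the sixth 1 MiB chunk, with the file striped over 4 OST slots.
    print(locate(5 * MIB + 10, stripe_size=MIB, stripe_count=4))
    # Distribution of a 10 MiB file over the same 4 slots.
    print(bytes_per_ost(10 * MIB, stripe_size=MIB, stripe_count=4))
```

Because consecutive chunks fall on different OST slots, a large sequential read or write can be served by several OSSs at once, which is the motivation for configuring more than one OST.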
The file system optimizes I/O by striping files across many RAID storage devices. The administrator can configure a default system-wide striping pattern at file system creation time.

Cray provides Lustre control utilities to simplify configuration. The control utilities implement a centralized configuration file and provide a layer of abstraction to the standard Lustre configuration and mount utilities.

The developer can use the Lustre lfs utility to:

• Set quota policies
• Create a file with a specific striping pattern
• Find the striping pattern of existing files

Lustre configuration information is maintained in the service database (SDB). For details, see the Cray XT System Management manual.

3.5.4 Configuration and Source Files

The administrator uses the boot node to view files, maintain configuration files, and manage the processes of executing programs. Boot nodes connect to the SMW and are accessible through a login shell.

The xtopview utility runs on boot nodes and allows the administrator to view files as they would appear on any node. The xtopview utility also maintains a database of files to monitor as well as file state information such as checksum and modification dates. Messages about file changes are saved through a Revision Control System (RCS) utility.

3.5.5 System Log

Once the system is booted, console messages are sent to the system log and are written to the boot RAID system. System log messages generated by service node kernels and daemons are gathered by syslog daemons running on all service nodes. Kernel errors and panic messages are sent directly to the SMW via the HSS network.

The administrator can configure the syslog daemon to write the messages to different files, sorted by message generator or degree of importance.

3.5.6 CMS Log Manager

The log manager collects, analyzes, and displays messages from the system.
The administrator can use the log manager to collect syslog messages that are sent to the SMW and event log information from the event router. The administrator can also create log definitions to add, delete, ignore, archive, or notify (take an action) based on a message. For further information, see the Cray XT System Management manual.

3.5.7 Service Database

A database node hosts the Service Database (SDB), which is accessible from every service processor. The SDB, implemented in MySQL, contains the following information:

• Node attributes used by aprun to schedule jobs. Node attributes include the number of cores present on a processor, the processor clock speed, the amount of memory available to the processor, the architecture type of the node processor, and the type of kernel running on the node.
• System configuration tables that list and describe the configuration files.

3.5.8 System Accounting

GNU 6.4 process accounting is enabled for Cray XT service nodes. Comprehensive System Accounting (CSA) includes accounting utilities that perform standard types of system accounting processing on the CSA-generated accounting files. CSA uses open-source software with changes to the Linux kernel so that CSA can collect more types of system resource usage data than standard Fourth Berkeley Software Distribution (BSD) process accounting does. In addition, the project database used with CSA can use customer-supplied user, account, and project information that resides on a separate Lightweight Directory Access Protocol (LDAP) server.

3.5.9 System Activity Reports

The sar(1) command collects, reports, or saves system activity information for service nodes. For more information, see the sar(1) man page.
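As an illustration of the node attributes that Section 3.5.7 lists, the sketch below models them as a small record and filters nodes against a job's requirements. Everything here is an assumption made for illustration: the field names, the sample values, and the eligible_nodes helper are hypothetical and do not reflect the actual SDB schema or how aprun queries it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeAttributes:
    """Hypothetical record mirroring the attribute kinds named in Section 3.5.7."""
    node_id: int
    cores: int        # number of cores present on the processor
    clock_mhz: int    # processor clock speed
    memory_mb: int    # memory available to the processor
    arch: str         # architecture type of the node processor
    kernel: str       # type of kernel running on the node


def eligible_nodes(nodes, min_cores, min_memory_mb, kernel):
    """Return the IDs of nodes that satisfy a job's minimum requirements."""
    return [
        n.node_id
        for n in nodes
        if n.cores >= min_cores and n.memory_mb >= min_memory_mb and n.kernel == kernel
    ]


if __name__ == "__main__":
    # Sample values are invented for the example.
    nodes = [
        NodeAttributes(20, cores=2, clock_mhz=2600, memory_mb=4096, arch="xt", kernel="CNL"),
        NodeAttributes(21, cores=4, clock_mhz=2300, memory_mb=8192, arch="xt", kernel="CNL"),
        NodeAttributes(22, cores=4, clock_mhz=2300, memory_mb=8192, arch="xt", kernel="service"),
    ]
    print(eligible_nodes(nodes, min_cores=4, min_memory_mb=8000, kernel="CNL"))
```

The point of the sketch is only that scheduling decisions can be expressed as filters over per-node attributes; the real placement logic is described in the Cray XT System Management manual.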
Cray Hardware Supervisory System (HSS) [4]

The Cray Hardware Supervisory System (HSS) is an independent system of hardware and software that monitors system components, manages hardware and software failures, controls startup and shutdown processes, manages the system interconnection network, and displays the system state to the administrator.

Because the HSS is a completely separate system with its own processors and network, the services that it provides do not take resources from running applications. In addition, if a component fails, the HSS continues to provide fault identification and recovery services and enables the functioning parts of the system to continue operating.

For more information about the HSS, see the Cray XT System Management manual.

4.1 HSS Hardware

The hardware components of the HSS are the HSS network, the SMW, the blade control processors (L0 controllers), and the cabinet control processors (L1 controllers). HSS hardware monitors compute and service node components, operating system heartbeats, power supplies, cooling fans, voltage regulators, sensors, microcontrollers, and RAID systems.

Figure 13. HSS Components
(Diagram labels: System Management Workstation; HSS Network; three cabinets, each with an L1 controller and blades with L0 controllers.)

4.1.1 HSS Network

The HSS network consists of Ethernet connections between the SMW and the L1 and L0 microprocessors. The network's function is to provide an efficient means of collecting status from and broadcasting messages to system components. The HSS network is separate from the system interconnection network.

Traffic on the HSS network is normally low, with occasional peaks of activity when major events occur. There is a baseline level of traffic to and from the hardware controllers. All other traffic is driven by events, either those due to hardware or software failures or those initiated by the administrator.
The highest level of network traffic occurs during the initial booting of the entire system as console messages from the booting images are transmitted onto the network.

4.1.2 System Management Workstation

The SMW is the administrator's single-point interface for booting, monitoring, and managing system components. The SMW consists of a server and a display device. Multiple administrators can use the SMW locally or remotely over an internal LAN or WAN.

Note: The SMW is also used to perform system administration functions (see Section 3.5, System Administration).

4.1.3 Hardware Controllers

The L0 and L1 controllers monitor the hardware and software of the components on which they reside.

Every compute blade and service blade has a blade control processor (L0 controller). This processor monitors the following blade components: Opteron status registers, SeaStar status registers, and voltage regulation modules. L0 controllers also monitor board temperatures and the CLE heartbeat.

Each cabinet has a cabinet control processor (L1 controller) that communicates with the L0 controllers and monitors the power supplies and the temperature of the air cooling the blades. Each L1 controller also routes messages between the L0 controllers and the SMW.

4.2 HSS Software

HSS software consists of software monitors; the administrator's HSS interfaces; and event probes, loggers, and handlers. This section describes the software monitors and administrator interfaces. For a description of event probes, loggers, and handlers, see Section 4.3, HSS Actions.

4.2.1 Software Monitors

The System Environment Data Collection (SEDC) HSS manager monitors the system health and records environmental data (such as temperature) and the status of hardware components (such as power supplies, processors, and fans).
SEDC can be set to run at all times (automatic data collection) or only when a client is listening; set the INT:startup_action option in the SEDC configuration file to indicate your preference. For additional information, see the Cray XT System Management manual.

Resiliency communication agents (RCAs) run on all compute nodes and service nodes. RCAs are the primary communications interface between a node's operating environment and the HSS components external to the node. They monitor software services and the operating system instance on each node.

Through the RCA, the HSS and the system processes running on a node handle event notification, informational messages, information requests, and probing. The RCA also provides a subscription service for processes running on the nodes. This service notifies the current node of events on other nodes that may affect the current node or that require action by the current node or its functions.

Each RCA generates a periodic heartbeat message, enabling HSS to know when an RCA has failed. Failure of an RCA heartbeat is interpreted as a failure of CLE on that node.

RCA daemons running on each node start a system resiliency process called failover manager. If a service fails, the RCA daemon transmits a service-failed message to the HSS. Failover managers on other nodes subscribe to receive these messages. Each failover manager checks to determine if it is the backup for any failed services that relate to the message and, if it is, directs the RCA daemon on its node to locally restart the failed service.

4.2.2 HSS Administrator Interfaces

The HSS provides a command-line interface, xtcli. For details, see the xtcli(8) man page.

If any component of the system detects an error, it sends a message to the SMW. The message is logged and displayed for the administrator. HSS policy decisions determine how the fault is handled.
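The heartbeat check described in Section 4.2.1 can be pictured with the short sketch below. It is a simplified model, not HSS or RCA code: the HeartbeatMonitor class, the timeout value, and the node IDs are all hypothetical, and time is passed in explicitly so the example is deterministic.

```python
class HeartbeatMonitor:
    """Track the most recent heartbeat per node and report nodes that went silent.

    Hypothetical model: a node is treated as failed once no heartbeat has
    arrived for longer than timeout_s seconds.
    """

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}  # node ID -> time of the latest heartbeat

    def record(self, node_id, now):
        """Note that a heartbeat from node_id arrived at time now (seconds)."""
        self.last_seen[node_id] = now

    def failed_nodes(self, now):
        """Return the sorted IDs of nodes whose last heartbeat is older than the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout_s=10.0)
    monitor.record(12, now=0.0)
    monitor.record(13, now=0.0)
    monitor.record(12, now=8.0)   # node 12 keeps reporting, node 13 does not
    print(monitor.failed_nodes(now=15.0))
```

In this model a missed heartbeat only flags the node; what happens next (logging, isolating the node, restarting a service elsewhere) corresponds to the policy decisions and event handling that the manual attributes to the HSS.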
The SMW writes all information it receives from the system to the SMW disk to ensure the information is not lost due to component failures.

4.3 HSS Actions

The HSS manages the startup and shutdown processes and event probing, logging, and handling.

The HSS collects data about the system (event probing and logging) that is then used to determine which components have failed and in what manner. After determining that a component has failed, the HSS initiates certain actions (event handling) in response to detected failures that, if left unattended, could cause worse failures. The HSS also initiates actions to prevent failed components from interfering with the operations of other components.

4.3.1 System Startup and Shutdown

The administrator starts a Cray XT system by powering up the system and booting the software on the service nodes and compute nodes. Booting the system sets up the system interconnection network. A script, set up by the administrator, shuts the system down.

For logical machines, the administrator can boot, run diagnostics, run user applications, and power down without interfering with other logical machines as long as the HSS is running on the SMW and the machines have separate file systems.

For details about the startup and shutdown processes, see the Cray XT System Management manual.

4.3.2 Event Probing

HSS probes are the primary means of monitoring hardware and software components. Probes hosted on the SMW collect data from probes running on the L0 and L1 controllers and RCA daemons running on compute nodes. In addition to dynamic probing, the HSS provides an offline diagnostic suite that probes all HSS-controlled components.

4.3.3 Event Logging

The event logger preserves data that the administrator uses to determine the reason for reduced system availability.
It runs on the SMW and logs all status and event data generated by:

• HSS probes
• Processes communicating through the RCA interface on compute and service nodes
• Other HSS processes running on L0 and L1 controllers

4.3.4 Event Handling

The event handler evaluates messages from HSS probes and determines what to do about them. The HSS is designed to prevent single-point failures of either hardware or system software from interrupting the system. Examples of single-point failures that the HSS handles are:

• Compute node failure. A failing compute node is automatically isolated and shut down and the user job fails; the rest of the system continues running and servicing other applications.
• Power supply failure. Power supplies have an N+1 configuration for each chassis in a cabinet; failure of an individual power supply does not interrupt a compute node.

In addition, the HSS distributes failure events over the HSS network to the components that have subscribed to them so that each component can make a local decision about how to deal with the fault. For example, both the L0 and L1 controllers contain code to react to critical faults without administrator intervention.

Glossary

blade

1) A field-replaceable physical entity. A Cray XT service blade consists of AMD Opteron sockets, memory, the Cray SeaStar mezzanine FRU, PCI-X or PCIe cards, and a blade control processor. A Cray XT compute blade consists of AMD Opteron sockets, memory, the Cray SeaStar mezzanine FRU, and a blade control processor. A Cray X2 compute blade consists of eight Cray X2 chips (CPU and network access links), two voltage regulator modules (VRM) per CPU, 32 memory daughter cards, a blade controller for supervision, and a back panel connector. 2) From a system management perspective, a logical grouping of nodes and blade control processor that monitors the nodes on that blade.
blade control processor

A microprocessor on a blade that communicates with a cabinet control processor through the HSS network to monitor and control the nodes on the blade. See also blade, L0 controller, Hardware Supervisory System (HSS).

cabinet control processor

A microprocessor in the cabinet that communicates with the HSS via the HSS network to monitor and control the devices in a system cabinet. See also Hardware Supervisory System (HSS) and L1 controller.

class

A group of service nodes of a particular type, such as login or I/O. See also specialization.

CNL

CNL is the Cray XT compute node operating system. CNL provides system calls and many of the operating system functions available through the service nodes, although some functionality has been removed to improve performance and reduce memory usage by the system.

compute blade

See blade.

compute node

A node that runs application programs. A compute node performs only computation; system services cannot run on compute nodes. The compute node kernel, CNL, supports either scalar or vector applications. See also node; service node.

Cray Linux Environment (CLE)

The operating system for Cray XT systems.

Cray SeaStar chip

The component of the system interconnection network that provides message routing and communication services. See also system interconnection network.

CrayDoc

Cray's documentation system for accessing and searching Cray books, man pages, and glossary terms from a web browser.

deferred implementation

The label used to introduce information about a feature that will not be implemented until a later release.

dual-core processor

A processor that combines two independent execution engines ("cores"), each with its own cache and cache controller, on a single chip.

GNU Compiler Collection (GCC)

From The Free Software Foundation, a compiler that supports C, C++, Objective-C, Fortran, and Java code (see http://gcc.gnu.org/).
Hardware Supervisory System (HSS)

Hardware and software that monitors the hardware components of the system and proactively manages the health of the system. It communicates with nodes and with the management processors over the private Ethernet network.

heartbeat

A signal sent at regular intervals by software to show that it is still active.

L0 controller

See blade control processor.

L1 controller

See cabinet control processor.

logical machine

An administrator-defined portion of a physical Cray XT system, operating as an independent computing resource.

login node

The service node that provides a user interface and services for compiling and running applications.

metadata server (MDS)

The component of the Lustre file system that manages Metadata Targets (MDT) and handles requests for access to metadata residing on those targets.

module

See blade.

node

For Cray Linux Environment (CLE) systems, the logical group of processor(s), memory, and network components acting as a network end point on the system interconnection network. See also processing element.

node ID

A decimal number used to reference each individual node. The node ID (NID) can be mapped to a physical location.

NUMA node

A multicore processor and its local memory. Multisocket compute nodes have two or more NUMA nodes.

object storage server (OSS)

The component of the Lustre file system that manages Object Storage Targets and handles I/O requests for access to file data residing on those targets.

object storage target (OST)

The Lustre system component that represents an I/O device containing file data. This can be any LUN, RAID array, disk, disk partition, etc.

parallel processing

Processing in which multiple processors work on a single application simultaneously.

processing element

One instance of an executable propagated by the Application Level Placement Scheduler (ALPS).
quad-core processor

A processor that combines four independent execution engines ("cores"), each with its own cache and cache controller, on a single chip.

resiliency communication agent (RCA)

A communications interface between the operating environment and the HSS. Each RCA provides an interface between the HSS and the processes running on a node and supports event notification, informational messages, information requests, and probes. See also Hardware Supervisory System (HSS).

service blade

See blade.

service database (SDB)

The database that maintains the global system state.

service node

A node that performs support functions for applications and system services. Service nodes run SUSE LINUX and perform specialized functions. Predefined service node types include login, I/O, network, boot, and database.

specialization

The process of setting files on the shared-root file system so that unique files can exist for a node or for a class of nodes.

system interconnection network

The high-speed network that handles all node-to-node data transfers.

TLB

A content addressable memory in the processor that contains translations between the virtual and physical addresses of recently referenced pages of memory.

Cray X1™ Series System Overview
S–2346–25

© 2002-2004 Cray Inc. All Rights Reserved. This manual or parts thereof may not be reproduced in any form unless permitted by contract or by written permission of Cray Inc.

U.S. GOVERNMENT RESTRICTED RIGHTS NOTICE

The Computer Software is delivered as "Commercial Computer Software" as defined in DFARS 48 CFR 252.227-7014. All Computer Software and Computer Software Documentation acquired by or for the U.S. Government is provided with Restricted Rights. Use, duplication or disclosure by the U.S. Government is subject to the restrictions described in FAR 48 CFR 52.227-14 or DFARS 48 CFR 252.227-7014, as applicable. Technical Data acquired by or for the U.S. Government, if any, is provided with Limited Rights.
Use, duplication or disclosure by the U.S. Government is subject to the restrictions described in FAR 48 CFR 52.227-14 or DFARS 48 CFR 252.227-7013, as applicable. Autotasking, Cray, Cray Channels, Cray Y-MP, GigaRing, LibSci, MPP Apprentice, SuperCluster, UNICOS and UNICOS/mk are federally registered trademarks and Active Manager, CCI, CCMT, CF77, CF90, CFT, CFT2, CFT77, ConCurrent Maintenance Tools, COS, Cray Ada, Cray Animation Theater, Cray APP, Cray C++ Compiling System, Cray C90, Cray C90D, Cray CF90, Cray EL, Cray Fortran Compiler, Cray J90, Cray J90se, Cray J916, Cray J932, Cray MTA, Cray MTA-2, Cray MTX, Cray NQS, Cray Research, Cray RS, Cray SeaStar, Cray S-MP, Cray SSD-T90, Cray SuperCluster, Cray SV1, Cray SV1ex, Cray SX-5, Cray SX-6, Cray T3D, Cray T3D MC, Cray T3D MCA, Cray T3D SC, Cray T3E, Cray T90, Cray T916, Cray T932, Cray UNICOS, Cray X1, Cray X1E, Cray XT3, Cray XD1, Cray X-MP, Cray XMS, Cray Y-MP EL, Cray/REELlibrarian, Cray-1, Cray-2, Cray-3, CrayDoc, CrayLink, Cray-MP, CrayPacs, CraySoft, CrayTutor, CRI/TurboKiva, CRInform, CSIM, CVT, Delivering the power..., Dgauss, Docview, EMDS, HEXAR, HSX, IOS, ISP/Superlink, ND Series Network Disk Array, Network Queuing Environment, Network Queuing Tools, OLNET, RapidArray, RQS, SEGLDR, SMARTE, SSD, SUPERLINK, System Maintenance and Remote Testing Environment, Trusted UNICOS, TurboKiva, UNICOS MAX, and UNICOS/mp are trademarks of Cray Inc. Acrobat Reader and Adobe are trademarks of Adobe Systems, Inc. ADIC and StorNext are trademarks of Advanced Digital Information Corporation. Dinkum is a trademark of Dinkumware, Ltd. Etnus and TotalView are trademarks of Etnus LLC. GNU is a trademark of The Free Software Foundation. IRIX, SGI, and XFS are trademarks of Silicon Graphics, Inc. PBS Pro is a trademark of Altair Grid Technologies, L.L.C. NFS, Solaris, and Sun are trademarks of Sun Microsystems, Inc. 
Motif, UNIX, the “X device,” X Window System, and X/Open are trademarks of The Open Group in the United States and other countries. All other trademarks are the property of their respective owners. The UNICOS, UNICOS/mk, and UNICOS/mp operating systems are derived from UNIX System V. These operating systems are also based in part on the Fourth Berkeley Software Distribution (BSD) under license from The Regents of the University of California.Record of Revision Version Description 1.0 June 28, 2002 Draft printing to support the Cray X1 early production systems. 2.0 December 20, 2002 Draft printing to support Cray X1 systems running the UNICOS/mp 2.0, Cray Programming Environment 4.2, Cray MPT 2.1, Cray Workstation (CWS) 2.0, and CNS 1.0 releases. 2.1 March 17, 2003 Draft printing to support Cray X1 systems running the UNICOS/mp 2.1, Cray Programming Environment 4.3, and Cray MPT 2.1 releases. 2.2 June 2003 Supports Cray X1 systems running the Cray Programming Environment 5.0, Cray MPT 2.2, and UNICOS/mp 2.2 releases. 2.3 October 2003 Supports Cray X1 systems running the Cray Programming Environment 5.1, Cray MPT 2.2, and UNICOS/mp 2.3 releases. 2.4 March 2004 Supports Cray X1 systems running the Cray Programming Environment 5.2, Cray MPT 2.3, and UNICOS/mp 2.4 releases. 2.5 October 2004 Supports Cray X1 series systems running the Cray Programming Environment 5.3, Cray MPT 2.4, and UNICOS/mp 2.5 releases. S–2346–25 iContents Page Preface vii Accessing Product Documentation . . . . . . . . . . . . . . . . . . . vii Conventions . . . . . . . . . . . . . . . . . . . . . . . . . . viii Reader Comments . . . . . . . . . . . . . . . . . . . . . . . . ix Introduction [1] 1 Hardware Overview [2] 3 Cray X1 Series Mainframe . . . . . . . . . . . . . . . . . . . . . . 5 Compute Modules . . . . . . . . . . . . . . . . . . . . . . . 5 Processors . . . . . . . . . . . . . . . . . . . . . . . . . 6 Local Memory . . . . . . . . . . . . . . . . . . . . . . . . 
9 System Port Channel (SPC) I/O Ports . . . . . . . . . . . . . . . . . 10 Interconnection Network . . . . . . . . . . . . . . . . . . . . . 11 Cray X1 Series I/O Subsystem . . . . . . . . . . . . . . . . . . . . 11 I/O Drawer . . . . . . . . . . . . . . . . . . . . . . . . . 12 Cray Programming Environment Server (CPES) . . . . . . . . . . . . . . 12 Cray Network Subsystem (CNS) . . . . . . . . . . . . . . . . . . . 12 RAID Subsystem for Disk Storage . . . . . . . . . . . . . . . . . . . 13 Storage Area Network (SAN) File Sharing and Storage . . . . . . . . . . . . 13 Cray X1 Series Support System . . . . . . . . . . . . . . . . . . . . 13 Cray Workstation . . . . . . . . . . . . . . . . . . . . . . . 14 System Control Facility . . . . . . . . . . . . . . . . . . . . . . 14 Private Ethernet Subnetworks . . . . . . . . . . . . . . . . . . . . 14 System Partitioning . . . . . . . . . . . . . . . . . . . . . . . . 15 Hardware Models . . . . . . . . . . . . . . . . . . . . . . . . 15 Air-cooled Model . . . . . . . . . . . . . . . . . . . . . . . 15 S–2346–25 iiiCray X1™ Series System Overview Page Liquid-cooled Model . . . . . . . . . . . . . . . . . . . . . . 17 Development Environment Overview [3] 19 Development Tools . . . . . . . . . . . . . . . . . . . . . . . . 19 Accessing a Cray X1 Series System . . . . . . . . . . . . . . . . . . . 20 Nodes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 Writing Source Code . . . . . . . . . . . . . . . . . . . . . . . 23 Cray Fortran Programming Environment . . . . . . . . . . . . . . . . 23 Cray C and C++ Programming Environment . . . . . . . . . . . . . . . 23 Cray Libraries . . . . . . . . . . . . . . . . . . . . . . . . 24 Parallel Programming Memory Models . . . . . . . . . . . . . . . . . 25 Compiling Source Code . . . . . . . . . . . . . . . . . . . . . . 26 Vectorization and Multistreaming . . . . . . . . . . . . . . . . . . . 27 MSP-mode and SSP-mode Applications . . . . . . . . . . . . . . . . . 
28 Debugging Your Program . . . . . . . . . . . . . . . . . . . . . . 28 Loading and Linking Object Files . . . . . . . . . . . . . . . . . . . . 29 Executing Your Program . . . . . . . . . . . . . . . . . . . . . . 29 Distributing Work . . . . . . . . . . . . . . . . . . . . . . . 31 Managing Memory . . . . . . . . . . . . . . . . . . . . . . . 32 Checkpointing Your Program . . . . . . . . . . . . . . . . . . . . . 33 Monitoring Your Program . . . . . . . . . . . . . . . . . . . . . . 34 Analyzing Your Application’s Performance . . . . . . . . . . . . . . . . . 34 Optimizing Your Application . . . . . . . . . . . . . . . . . . . . . 35 Operations Overview [4] 37 Software Release Packages . . . . . . . . . . . . . . . . . . . . . . 37 Cray Workstation (CWS) Functionality . . . . . . . . . . . . . . . . . . 40 Cray Programming Environment Server (CPES) Functionality . . . . . . . . . . . 41 Trigger Environment . . . . . . . . . . . . . . . . . . . . . . . 42 Cray Network Subsystem (CNS) Functionality . . . . . . . . . . . . . . . . 42 UNICOS/mp Functionality . . . . . . . . . . . . . . . . . . . . . 42 iv S–2346–25Contents Page Key IRIX Functionality in the UNICOS/mp Operating System . . . . . . . . . . 43 Cray Added Functionality of Interest to All Users . . . . . . . . . . . . . . 45 Accelerated Execution for Distributed Memory Applications . . . . . . . . . . 45 Distributed Memory Message Passing . . . . . . . . . . . . . . . . 45 Node Allocation (Migration) for Accelerated Applications . . . . . . . . . . 45 Interactive and Batch Processing . . . . . . . . . . . . . . . . . . 45 Application Launch and Query Commands . . . . . . . . . . . . . . . 46 Large Pages . . . . . . . . . . . . . . . . . . . . . . . . 46 Application Monitoring . . . . . . . . . . . . . . . . . . . . . 46 Multiple Program, Multiple Data (MPMD) Programs . . . . . . . . . . . . 47 X Window System Client and Libraries . . . . . . . . . . . . . . . . 47 Motif Version 2.1 . . . . . . . . . . . . 
. . . . . . . . . . . 47 Access to Files via a Storage Area Network (SAN) . . . . . . . . . . . . . 47 Additional Cray Added Functionality of Interest to System Administrators . . . . . . 47 Software Installation . . . . . . . . . . . . . . . . . . . . . . 47 UNICOS/mp Storage . . . . . . . . . . . . . . . . . . . . . 48 UNICOS/mp File System Hierarchy . . . . . . . . . . . . . . . . . 48 Security . . . . . . . . . . . . . . . . . . . . . . . . . . 48 Network Routing . . . . . . . . . . . . . . . . . . . . . . . 48 Name Service Switch . . . . . . . . . . . . . . . . . . . . . . 48 Application Placement Scheduling Mechanism . . . . . . . . . . . . . . 49 UNICOS/mp Accounting . . . . . . . . . . . . . . . . . . . . 50 Resource Limits . . . . . . . . . . . . . . . . . . . . . . . 50 System Resiliency . . . . . . . . . . . . . . . . . . . . . . . 50 Partitioning a System . . . . . . . . . . . . . . . . . . . . . 51 System Maintenance Utilities . . . . . . . . . . . . . . . . . . . 52 S–2346–25 vCray X1™ Series System Overview Page Glossary 53 Index 65 Figures Figure 1. Cray X1 Series System Functional Diagram . . . . . . . . . . . . . 4 Figure 2. Cray X1 Series Compute Module Block Diagram . . . . . . . . . . . 5 Figure 3. Four SSPs per MSP . . . . . . . . . . . . . . . . . . . . 6 Figure 4. Cray X1 Systems, 1 MSP per MCM . . . . . . . . . . . . . . . . 7 Figure 5. Cray X1E Systems, 2 MSPs per MCM . . . . . . . . . . . . . . . 8 Figure 6. Cray X1 Series Air-cooled System . . . . . . . . . . . . . . . . 16 Figure 7. Cray X1 Series Liquid-cooled System . . . . . . . . . . . . . . . 18 Figure 8. Accessing a Cray X1 Series System . . . . . . . . . . . . . . . . 21 Figure 9. Cray X1 Series System Administration Components . . . . . . . . . . 38 Tables Table 1. Memory Models Supported on Cray X1 Series Systems . . . . . . . . . . 25 Table 2. Source and Object Files . . . . . . . . . . . . . . . . . . . 
Preface

The information in this preface is common to Cray documentation provided with this software release.

Accessing Product Documentation

With each software release, Cray provides books and man pages, and in some cases, third-party documentation. These documents are provided in the following ways:
• CrayDoc, the Cray documentation delivery system that allows you to quickly access and search Cray books, man pages, and in some cases, third-party documentation. Access this HTML and PDF documentation via CrayDoc at the following URLs:
  – The local network location defined by your system administrator
  – The CrayDoc public website: www.cray.com/craydoc/
• Man pages. Access man pages by entering the man command followed by the name of the man page. For more information about man pages, see the man(1) man page by entering:
  % man man
• Third-party documentation not provided through CrayDoc. Access this documentation, if any, according to the information provided with that product.

Conventions

These conventions are used throughout Cray documentation:

Convention    Meaning
command       This fixed-space font denotes literal items, such as file names, pathnames, man page names, command names, and programming language elements.
variable      Italic typeface indicates an element that you will replace with a specific value. For instance, you may replace filename with the name datafile in your program. It also denotes a word or concept being defined.
user input    This bold, fixed-space font denotes literal items that the user enters in interactive sessions. Output is shown in nonbold, fixed-space font.
[ ]           Brackets enclose optional portions of a syntax representation for a command, library routine, system call, and so on.
...           Ellipses indicate that a preceding element can be repeated.
name(N)       Denotes man pages that provide system and programming reference information. Each man page is referred to by its name followed by a section number in parentheses. Enter:
              % man man
              to see the meaning of each section number for your particular system.

Reader Comments

Contact us with any comments that will help us to improve the accuracy and usability of this document. Be sure to include the title and number of the document with your comments. We value your comments and will respond to them promptly. Contact us in any of the following ways:

E-mail: swpubs@cray.com
Telephone (inside U.S., Canada): 1–800–950–2729 (Cray Customer Support Center)
Telephone (outside U.S., Canada): +1–715–726–4993 (Cray Customer Support Center)
Mail: Software Publications, Cray Inc., 1340 Mendota Heights Road, Mendota Heights, MN 55120–1128 USA

Introduction [1]

Cray X1 series systems utilize powerful vector processors, shared memory, and a modernized vector instruction set in a highly scalable configuration to provide the computational power required for advanced scientific and engineering applications. Cray X1 series systems have high memory bandwidth and scalable system software, which are crucial to achieving peak and sustained performance.

This document provides a brief overview of Cray X1 series (Cray X1 and Cray X1E systems) hardware and software capabilities. It assumes the reader is familiar with UNICOS, UNICOS/mk, or UNIX systems. This overview is not intended to provide an exhaustive list of all software features and utilities included with Cray X1 series systems; instead, it includes references to other Cray documents that contain additional, detailed information. These other Cray documents reside on the CrayDoc documentation system, which you can access from a web browser using the network path defined by your system administrator.
If you are migrating from a UNICOS or UNICOS/mk system, the following manuals will also be of interest to you:
• Migrating Applications to the Cray X1 Series Systems
• Cray X1 User Environment Differences
• Cray X1 Series System Administration Differences

Hardware Overview [2]

A Cray X1 series system combines the single-processor performance and single shared address space of Cray parallel vector processor (PVP) systems with the high-bandwidth, low-latency, scalable interconnect and the scalable microprocessor-based architecture used in Cray T3E systems.

As shown in Figure 1, a Cray X1 series system contains the following major functional blocks:
• The Cray X1 series mainframe consists of compute modules and an interconnection network housed in one or more cabinets. Compute modules contain processors, globally addressable shared local memory, and System Port Channel (SPC) I/O ports. Compute modules communicate with each other through the interconnection network.
• The Cray X1 series I/O subsystem includes the following components housed in multiple cabinets:
  – I/O drawers (IODs) that convert the SPC protocol used by the compute modules to Fibre Channel protocol used by various peripheral devices
  – Cray Programming Environment Server (CPES) that runs the Programming Environment for Cray X1 series systems
  – Cray Network Subsystem (CNS) that connects a Cray X1 series system to the site’s networks
  – RAID subsystems that provide disk storage for a Cray X1 series system
• The Cray X1 series support system is used to boot, configure, troubleshoot, and monitor a Cray X1 series system. This support system consists of a Cray Workstation (CWS), the System Control Facility, and several private Ethernet subnetworks.

This chapter describes these three major functional blocks. It also provides information about partitioning a system and describes the two Cray X1 series model types: air-cooled (AC) and liquid-cooled (LC).
Figure 1. Cray X1 Series System Functional Diagram (diagram labels: support system with Cray Workstation, private Ethernet subnetworks, and optional remote support connection; mainframe cabinets with compute modules, routers, and interconnection network; System Port Channels; I/O and peripheral cabinets with I/O drawers, Fibre Channel loops, Cray Network Subsystems to the site networks, Cray Programming Environment Server, and RAID disk storage subsystems; optional SAN infrastructure with storage, Fibre Channel fabric, and customer SAN clients)

2.1 Cray X1 Series Mainframe

The primary functional building blocks of a Cray X1 series mainframe are compute modules and the interconnection network, which are housed in one or more cabinets. Compute modules communicate with each other through the interconnection network. Compute modules and the interconnection network are described in the following sections.

2.1.1 Compute Modules

A compute module is the physical, configurable building block for the Cray X1 series mainframe. As depicted in Figure 2, a compute module contains four multichip modules (MCMs), globally addressable shared local memory, and System Port Channel (SPC) I/O ports. A distributed set of routing switches controls all memory access within a compute module and access to the interconnection network. The system’s processing power, memory, and I/O bandwidth scale as compute modules are added to the system.

Figure 2. Cray X1 Series Compute Module Block Diagram (diagram labels: four MCMs, each a 12.8 GF (64-bit) CPU with an I/O port; local routing switches; 8 to 64 GBytes memory; 200 GB/s; interconnect network)

Each Cray X1 series compute module also contains two controllers (not shown) that are part of the System Control Facility.
The System Control Facility is used to configure, boot, troubleshoot, and monitor the system. The controllers communicate with the CWS through an Ethernet connection. (For more information about the CWS, see Section 2.3.1; for more information about the System Control Facility, see Section 2.3.2.) The following subsections define the major functional blocks of a compute module.

2.1.1.1 Processors

As shown in Figure 2, a compute module contains four multichip modules (MCMs). MCMs contain multistreaming processors (MSPs). As shown in Figure 3, each MSP has four internal single-streaming processors (SSPs). Each SSP contains both a superscalar processing unit and a two-pipe vector processing unit, depicted in the figure as S and V, respectively.

Figure 3. Four SSPs per MSP (each SSP is drawn with one S unit and two V pipes)

A Cray X1 compute module has one MSP per MCM. On Cray X1 MCMs, the four SSPs in an MSP share the 2-MB cache of the MSP, as shown in Figure 4.

Figure 4. Cray X1 Systems, 1 MSP per MCM (diagram labels: four SSPs, each with S and V V units; four 0.5 MB cache blocks; path to local memory and interconnection)

A Cray X1E compute module has two MSPs per MCM, an "upper" MSP and a "lower" MSP. On Cray X1E MCMs, the four SSPs in the lower MSP share the 2-MB cache of the lower MSP, and the four SSPs in the upper MSP share the 2-MB cache of the upper MSP, as shown in Figure 5.

Figure 5. Cray X1E Systems, 2 MSPs per MCM (diagram labels: eight SSPs, each with S and V V units; eight 0.5 MB cache blocks; path to local memory and interconnection)

The logical grouping of four MSPs and cache-coherent shared local memory is called a node. Cache coherency is maintained for the four MSPs in a node.
A Cray X1 compute module has one node and includes network and I/O ports. A Cray X1E compute module has two nodes that share the compute module’s network and I/O ports. In a Cray X1E system, one node treats the upper half of the compute module’s memory as its cache-coherent local memory, and the other node uses the lower half of memory. A Cray X1 series system requires a minimum of two nodes. Physically, all nodes are the same; software controls how a node is used.

Processors are designed so that an application can run in either MSP mode or SSP mode. In MSP mode, an MSP provides the user-programmable processor for typical parallel applications; each MSP tightly couples the interactions of its four constituent SSPs and automatically distributes the parallel parts of a multistreaming application to its SSPs. In SSP mode, each SSP runs independently of the others, executing its own stream of instructions. Applications can be built to run with one or more MSPs or with one or more SSPs, where the optimal choice depends on the algorithms used within the application.

Both 32-bit and 64-bit integer and floating-point arithmetic are supported. A single MSP provides 12.8 GF (gigaflops) of peak computational power for 64-bit data computation (theoretically, 32-bit computation is twice as fast as 64-bit computation; however, other constraints such as memory bandwidth will reduce this peak for typical application execution).

The SSP vector processing units implement Cray’s new vector (NV-1) instruction set architecture (ISA).
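As a quick check of the peak figures quoted in this section, the following Python sketch reproduces the arithmetic. It is illustrative only and is not part of any Cray software: the names msp_peak_gf and ssp_peak_gf are hypothetical, and dividing the MSP peak evenly among its four SSPs is an assumption made here, not a figure stated in this overview.

```python
# Illustrative peak-rate arithmetic for one MSP; all names are hypothetical.
MSP_PEAK_GF_64BIT = 12.8  # stated in this section: MSP peak for 64-bit data
SSPS_PER_MSP = 4          # stated in this section: four SSPs per MSP


def msp_peak_gf(bits: int = 64) -> float:
    """Theoretical MSP peak in gigaflops for 32-bit or 64-bit data."""
    if bits == 64:
        return MSP_PEAK_GF_64BIT
    if bits == 32:
        # Theoretically twice the 64-bit rate; memory bandwidth and other
        # constraints reduce this for typical application execution.
        return 2 * MSP_PEAK_GF_64BIT
    raise ValueError("only 32-bit and 64-bit arithmetic are supported")


def ssp_peak_gf(bits: int = 64) -> float:
    """Per-SSP share of the MSP peak, ASSUMING an even four-way split."""
    return msp_peak_gf(bits) / SSPS_PER_MSP


print(msp_peak_gf(64), msp_peak_gf(32), ssp_peak_gf(64))
```

These are theoretical peaks; sustained application performance will be lower, as noted above.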
The NV-1 instruction set features:
• Support for 32-bit and 64-bit two’s-complement integers
• Support for IEEE 754 floating-point format (both 32-bit and 64-bit formats)
• Fixed 32-bit width instructions with regular encoding
• Support for virtual memory and multistreaming processing
• Large register sets to reduce the number of memory accesses and to hide memory latency
• Hardware features to support improved vectorization
• Cache allocation control to support explicit communication and reduce cache pollution
• Relaxed memory ordering rules with mechanisms for explicit synchronization
• Ease of use and transition for current Cray vector processor users
• Simple data paths and simple instructions
• Support for decoupled scalar and vector execution for maximum performance

2.1.1.2 Local Memory

For both Cray X1 and Cray X1E systems, the coherency domain is the local memory for each node. For Cray X1E systems, there are two cache domains (upper and lower) per compute module; the cache domain for an "upper" MSP is the upper half of memory on its compute module, and the cache domain for a "lower" MSP is the lower half of memory on its compute module (the processors on the two nodes must share memory bandwidth).

Each memory word is 72 bits wide; 64 bits are used for data and the remaining 8 bits are used for single-error-correction, double-error-detection (SECDED) protection. Each 32-bit half of a 64-bit data word can be written separately to support 32-bit operations.

Local memory latency is completely flat for all processors within a single logical node. Local memory bandwidth supports network traffic and I/O without greatly impacting computational performance. Local memory bandwidth is the same for all Cray X1 series systems, except Cray X1E systems have twice as many MSPs sharing that bandwidth.

Each compute module contains a set of 16 memory controller chips and 32 memory daughter cards.
Daughter cards have either a DRAM capacity of 288 megabits per chip (256 megabits for data, 32 megabits for error correction), yielding a per-compute module memory capacity of 16 GB, or a DRAM capacity of 576 megabits per chip (512 megabits for data, 64 megabits for error correction), yielding a per-compute module memory capacity of 32 GB. On Cray X1E systems, the available per-node memory size is 8 GB or 16 GB, equivalent to 16 GB or 32 GB per compute module.

Local memory can operate in two degraded modes. First, half of the memory chips on a daughter card can be disabled to tolerate the loss of a memory chip. This degraded mode cuts the memory size in half with very little effect on memory bandwidth. Second, half of the daughter cards can be disabled to tolerate the failure of a daughter card. This degraded mode cuts both memory size and bandwidth in half. It is possible to degrade to a quarter of memory on a compute module. The degraded modes affect both nodes of a Cray X1E compute module.

2.1.1.3 System Port Channel (SPC) I/O Ports

Each Cray X1 series compute module has four Cray proprietary protocol SPCs that serve as I/O ports between the mainframe and an I/O drawer that resides in an I/O cabinet (for information about I/O drawers, see Section 2.2.1). I/O channel capacity scales with the compute module count. The peak channel bandwidth is 1.2 GBps per direction per SPC. The peak total I/O bandwidth per compute module is 4.8 GBps full duplex. Full duplex means that input and output channel bandwidth can be sustained at the same time.

SPCs are logically accessible by all processors but physically distributed among the compute modules. All I/O channel controllers are globally addressable and controllable. There is no special relationship between an I/O channel controller and its local compute module other than physical proximity. Data is routed to memory without going through the processors.
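The SPC bandwidth figures above follow from simple multiplication, sketched below in Python. The function name and constants are hypothetical labels introduced for illustration; only the values (four SPCs per compute module, 1.2 GBps peak per direction per SPC) come from this section.

```python
# Illustrative SPC bandwidth arithmetic; all names are hypothetical.
SPCS_PER_COMPUTE_MODULE = 4        # stated in this section
SPC_PEAK_GBPS_PER_DIRECTION = 1.2  # stated in this section


def peak_io_gbps_per_direction(compute_modules: int) -> float:
    """Aggregate peak SPC bandwidth in one direction, in GBps.

    Because the channels are full duplex, the same peak applies to input
    and output at the same time.
    """
    if compute_modules < 1:
        raise ValueError("need at least one compute module")
    spcs = compute_modules * SPCS_PER_COMPUTE_MODULE
    return spcs * SPC_PEAK_GBPS_PER_DIRECTION


# One compute module: 4 x 1.2 = 4.8 GBps in each direction.
print(round(peak_io_gbps_per_direction(1), 1))
```

Because I/O channel capacity scales with the compute module count, calling the function with a larger count gives the corresponding aggregate peak.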
2.1.2 Interconnection Network

Compute modules communicate with each other through the interconnection network, which consists of router logic on the compute modules, cables, and (except on small air-cooled systems) router modules. On small air-cooled systems (single-cabinet systems with up to four compute modules), the router logic on each compute module connects directly to the router logic on the others. For Cray X1 series systems with router modules, router modules connect the compute modules within a mainframe cabinet and, in larger configurations, connect compute modules from one mainframe cabinet to compute modules in other mainframe cabinets. The number of router modules scales with the number of mainframe cabinets in a system.

Each compute module provides the first-level routing function. The compute module determines whether a memory reference is local to the compute module or whether the reference is to a memory location on a remote compute module. Remote memory references are made through the interconnection network. Each compute module contains 32 network ports, and each port supports 1.6 GBps peak per direction.

Software-loaded configuration tables are used for data flow mapping across the interconnection network. These tables are loaded at system boot time. They can also be reloaded in the event of a hardware failure, thus providing a means to reconfigure the interconnection network around hardware failures.

2.2 Cray X1 Series I/O Subsystem

The Cray X1 series I/O subsystem includes the following components, which are housed in multiple cabinets:
• I/O drawers that convert the SPC protocol used by the compute modules to Fibre Channel protocol used by various peripheral devices. Each drawer supports four SPC channels to the mainframe and up to 16 Fibre Channel connections.
• Cray Programming Environment Server (CPES) that runs the tools used by programmers for program development.
• Cray Network Subsystem (CNS) that connects a Cray X1 series system to the customer’s networks.
• RAID subsystem that provides disk storage for the Cray X1 series system.

The I/O subsystem components are described in the following subsections.

2.2.1 I/O Drawer

An I/O drawer resides in an I/O cabinet (IOC) and converts SPC channel protocol to Fibre Channel protocol. Each drawer supports four SPC channels using four independent circuits. Each circuit contains an I/O Channel Adapter (IOCA). The IOCA converts one SPC channel to a dual-slotted PCI-X bus. Dual-port Fibre Channel host bus adapter (HBA) cards are inserted into each PCI-X slot. Thus, each SPC channel can drive four independent Fibre Channel connections (each I/O drawer can drive up to 16 Fibre Channel connections). Each channel connects to a single Fibre Channel device.

The PCI-X Fibre Channel cards can be used to connect directly to RAID storage located in a peripheral cabinet (PC-20) using the normal Fibre Channel Arbitrated Loop (FCAL) protocol. These cards can also be used to make a network connection to the CNS and the CPES using Internet Protocol (IP) over Fibre Channel. If an optional SAN is implemented, Fibre Channel HBAs in a Cray X1 series I/O drawer connect the Cray X1 series hardware to the SAN using Fibre Channel fabric protocol.

Each I/O drawer has controllers that are part of the System Control Facility, which is used to boot, configure, troubleshoot, and monitor the system.

2.2.2 Cray Programming Environment Server (CPES)

A Cray X1 series system includes a CPES. The Cray Programming Environment compilers, loader, and performance analysis tool reside on the CPES. Program compilations are invoked by user commands on the Cray X1 series mainframe and are executed transparently on the CPES. There is no user access (login) directly to the CPES. The CPES resides in a PC-20 cabinet. (For an overview of the development environment, see Chapter 3.)
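The Fibre Channel fan-out of an I/O drawer, described in Section 2.2.1, can be sketched in Python as follows. The constant and function names are hypothetical; the counts are taken from that section, and the sketch assumes that every PCI-X slot holds a dual-port HBA.

```python
# Illustrative I/O drawer fan-out; all names are hypothetical.
SPC_CHANNELS_PER_DRAWER = 4  # four independent circuits, one IOCA each
PCIX_SLOTS_PER_IOCA = 2      # each IOCA drives a dual-slotted PCI-X bus
PORTS_PER_HBA = 2            # dual-port Fibre Channel HBA in each slot


def fibre_channels_per_spc() -> int:
    """Fibre Channel connections driven by one SPC channel."""
    return PCIX_SLOTS_PER_IOCA * PORTS_PER_HBA


def fibre_channels_per_drawer() -> int:
    """Maximum Fibre Channel connections for a fully populated drawer."""
    return SPC_CHANNELS_PER_DRAWER * fibre_channels_per_spc()


print(fibre_channels_per_spc())     # 4
print(fibre_channels_per_drawer())  # 16
```

A drawer with unpopulated PCI-X slots would provide fewer connections than this upper bound.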
2.2.3 Cray Network Subsystem (CNS)

A Cray X1 series system includes a CNS for network routing. In addition to bridging from Fibre Channel connections to network protocols, the CNS provides IP packet management, minimizing the impact on the Cray X1 series processors. Multiple CNSs can be configured on a Cray X1 series system to scale to the number of network connections required. The CNS resides in a PC-20 cabinet.

2.2.4 RAID Subsystem for Disk Storage

A Cray X1 series system has one or more RAID (redundant array of independent drives) subsystems for disk storage. The RAID subsystems are configured using various RAID levels and consist of modular bricks with full built-in redundancy. A RAID subsystem resides in a PC-20 cabinet.

Each RAID subsystem includes one controller brick (C-brick) and its associated storage bricks (S-bricks), which are the scalable storage component. Full redundancy is built into the C-brick. If storage capacity is critical, fewer RAID systems with daisy-chained S-bricks may be used. If disk transfer speed is important, multiple RAID systems with minimal daisy-chaining may be used.

2.2.5 Storage Area Network (SAN) File Sharing and Storage

Cray Inc. has partnered with Advanced Digital Information Corporation (ADIC) to support file sharing and file storage for Cray X1 series systems through a Storage Area Network (SAN). This optional SAN installation includes disk storage as described in Section 2.2.4.

The ADIC StorNext File System (StorNext FS) is a high-performance, heterogeneous, shared SAN file system that is usable from Cray X1 series systems. The StorNext FS safely allows multiple computer systems to share disk storage at the file level. File system data is shared among these computers, which are called StorNext FS clients, across the Fibre Channel fabric at nearly the full Fibre Channel bandwidth.
Support for the Cray X1 series StorNext FS client and Fibre Channel fabric is provided with the UNICOS/mp software. Also offered is the StorNext Storage Manager, which is a hierarchical storage management (HSM) system that allows sites to define policies to automatically balance access needs with storage capacities. The Cray X1 series hardware attaches to the SAN through Fibre Channel HBAs in a Cray X1 series I/O drawer.

2.3 Cray X1 Series Support System

The Cray X1 series support system is used to boot, configure, troubleshoot, and monitor a Cray X1 series system. This support system consists of a Cray Workstation (CWS), the System Control Facility, and several private Ethernet subnetworks, which are described in the following subsections.

2.3.1 Cray Workstation

Each Cray X1 series system includes one CWS, which is used to load media for the CWS, CPES, and UNICOS/mp software and to configure, troubleshoot, monitor, and boot the system. The CWS communicates with all of the system components through various private Ethernet subnetworks.

The CWS provides the operational interface for all system maintenance and operator functions, including consolidating log messages from the CWS, System Control Facility, CPES, UNICOS/mp, and CNS as part of the centralized logging capability. The CWS includes the Cray Storage Management (CRAYSM) software for configuring RAID subsystems and the interface for remote maintenance via a modem (when allowed by site agreement with Cray).

2.3.2 System Control Facility

The System Control Facility (not shown in Figure 1) consists of controllers that reside on each Cray X1 series compute module, router module, I/O drawer, and the mainframe’s power/cooling equipment.
These controllers monitor the power and cooling subsystem and provide the operational interface to the system for initial system startup, driver interface communications to Cray X1 series compute and router modules, and mainframe hardware error logging.

2.3.3 Private Ethernet Subnetworks

There are several private Ethernet subnetworks in the Cray X1 series system; all connect to the CWS through Ethernet switches. A brief description of each subnetwork follows:
• System Control Facility Subnetwork: This subnetwork connects the CWS to the controllers in compute modules, router modules, mainframe power and cooling control circuits, and I/O drawers.
• Private Administration Subnetwork: This subnetwork connects the CWS to Ethernet ports on the CPES and CNSs.
• RAID Subnetwork: This subnetwork connects the CWS to Ethernet ports on RAID controller bricks. If the system includes a SAN, the SAN-attached RAID controller bricks also connect to this network. The RAID administration software uses this subnetwork to configure and to monitor the status of the RAID subsystems.
• Customer Service Subnetwork: This optional subnetwork allows Cray Customer Service to have remote access to the CWS. It is used for remote support when allowed by site agreement with Cray.

2.4 System Partitioning

A Cray X1 series system can be divided into multiple partitions. A partition is a set of compute modules that functions as an independent Cray X1 series system. Each Cray X1 series partition requires a minimum of two nodes: a Cray X1 partition requires a minimum of two compute modules; a Cray X1E partition requires a minimum of one compute module.

Note: A partition cannot have both Cray X1 and Cray X1E compute modules within a partition; however, a system can have a Cray X1 partition and a Cray X1E partition.

Partitioning is facilitated through the CWS. Each partition is independently booted, dumped, and so on without impacting other running partitions.
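The partition rules above can be expressed as a small check. This Python sketch only illustrates the stated constraints; it is not a CWS command or interface, and the function name validate_partition and the "X1"/"X1E" labels are hypothetical.

```python
# Illustrative partition check; names and labels are hypothetical.
from typing import Sequence

# Nodes per compute module (Section 2.1.1.1): one on Cray X1, two on Cray X1E.
NODES_PER_MODULE = {"X1": 1, "X1E": 2}
MIN_NODES_PER_PARTITION = 2


def validate_partition(module_kinds: Sequence[str]) -> int:
    """Return the partition's node count, or raise ValueError if invalid.

    module_kinds lists the type ("X1" or "X1E") of each compute module
    proposed for the partition.
    """
    kinds = set(module_kinds)
    if not kinds:
        raise ValueError("a partition needs at least one compute module")
    unknown = kinds - set(NODES_PER_MODULE)
    if unknown:
        raise ValueError(f"unknown module type(s): {sorted(unknown)}")
    if len(kinds) > 1:
        raise ValueError("a partition cannot mix Cray X1 and Cray X1E modules")
    nodes = len(module_kinds) * NODES_PER_MODULE[kinds.pop()]
    if nodes < MIN_NODES_PER_PARTITION:
        raise ValueError("a partition requires a minimum of two nodes")
    return nodes


print(validate_partition(["X1", "X1"]))  # 2 nodes from two Cray X1 modules
print(validate_partition(["X1E"]))       # 2 nodes from one Cray X1E module
```

A single Cray X1 compute module, or a mix of the two module types, raises ValueError.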
Users log on to a partition as if it were an independent Cray X1 series system. Because a Cray X1 series partition does not have access to I/O facilities on other partitions, each partition must have independent I/O drawers, CNSs, and RAID subsystems. If users wish to compile from a partition, that partition also requires either a separate CPES or an independent connection to a common CPES. However, all partitions connect to a common CWS.

2.5 Hardware Models

A Cray X1 series system is available as either an air-cooled (AC) or a liquid-cooled (LC) model. Both models are multicabinet systems that include one or more of the following cabinets: mainframe cabinet (MFC), I/O cabinet (IOC), and peripheral cabinet (PC-20). The mainframe cabinets differ between the model types. Both models use the same IOC and PC-20 cabinets. IOCs and PC-20s are always air cooled. The following subsections describe the two models in more detail.

2.5.1 Air-cooled Model

A Cray X1 series air-cooled system has an MFC-AC mainframe cabinet and a minimum of one IOC and one PC-20. Additional peripheral cabinets can be added, depending on the system configuration. All cabinets are air cooled; the system does not use facility water for cooling. Air-cooled systems do not require a raised computer room floor, but a raised floor is recommended.

Figure 6 shows a Cray X1 series air-cooled system with one MFC-AC mainframe cabinet, one IOC, and two PC-20s. The MFC-AC and IOC are physically attached to each other; cabling between the MFC-AC and IOCs is run through cable troughs at the top of the system. PC-20s do not use cable troughs; cabling between an IOC and PC-20s is typically run under the raised computer room floor.

Figure 6. Cray X1 Series Air-cooled System (diagram labels: MFC-AC, IOC, PC-20 cabinets)

A Cray X1 series MFC-AC cabinet holds up to 4 compute modules (up to 16 MSPs for Cray X1 systems, and up to 32 MSPs for Cray X1E systems) and their associated routers.
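The MSP counts quoted for an MFC-AC cabinet follow from the module layout in Section 2.1.1.1 (four MCMs per compute module; one MSP per MCM on Cray X1 systems, two on Cray X1E systems). A minimal Python sketch of that product, using hypothetical names, is:

```python
# Illustrative MSP-count arithmetic; all names are hypothetical.
MCMS_PER_COMPUTE_MODULE = 4
MSPS_PER_MCM = {"X1": 1, "X1E": 2}


def max_msps(system: str, cabinets: int, modules_per_cabinet: int) -> int:
    """Maximum MSP count for fully populated mainframe cabinets."""
    if system not in MSPS_PER_MCM:
        raise ValueError(f"unknown system type: {system!r}")
    modules = cabinets * modules_per_cabinet
    return modules * MCMS_PER_COMPUTE_MODULE * MSPS_PER_MCM[system]


# One MFC-AC cabinet holds up to 4 compute modules.
print(max_msps("X1", cabinets=1, modules_per_cabinet=4))   # 16
print(max_msps("X1E", cabinets=1, modules_per_cabinet=4))  # 32
```

The same product reproduces the liquid-cooled limits given in Section 2.5.2: 64 MFC-LC cabinets of 16 compute modules each yield 4096 MSPs on Cray X1 systems and 8192 MSPs on Cray X1E systems.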
An IOC houses I/O drawers (IODs) that provide a bridge between the MFC-AC and its various peripheral devices and a SAN. The MFC-AC communicates with an IOD using the SPC protocol. An IOD communicates with the Cray X1 series system’s various peripheral devices through Fibre Channel protocol. The Fibre Channels connect to and from the RAID disk storage, the CPES, and CNSs. Fibre Channel HBAs in a Cray X1 series I/O drawer can connect the Cray X1 series hardware to a SAN.

Each Cray X1 series air-cooled system requires a minimum of one IOD in the IOC. The number of drawers is dependent on the system’s mainframe and peripheral configuration. An IOC can support up to 32 SPC mainframe connections and up to 128 Fibre Channel connections. For Cray X1 series AC systems with 4 compute modules, the IOC uses up to 16 SPC mainframe connections and up to 64 Fibre Channel connections.

A PC-20 houses various peripherals, including the CPES, the CNS, and the RAID disk storage components. All Cray X1 series air-cooled systems require a minimum of one PC-20 cabinet. Additional PC-20 cabinets are added based on the system’s peripheral configuration.

2.5.2 Liquid-cooled Model

A Cray X1 series liquid-cooled system requires a minimum of one MFC-LC mainframe cabinet, one IOC, and one PC-20. Additional cabinets can be added, depending on the system configuration. An MFC-LC requires facility water for cooling. IOCs and PC-20s are air-cooled. Cray X1 series LC systems require a raised computer room floor for routing the facility water.

Figure 7 shows a Cray X1 series liquid-cooled system consisting of one MFC-LC, one IOC, and two PC-20s. All MFC-LCs and IOCs are physically attached to each other; cabling between the MFC-LCs and IOCs is run through cable troughs at the top of the system. PC-20s do not use cable troughs; cabling between the IOC and PC-20s is typically run under the raised computer room floor.
Each MFC-LC holds up to 16 compute modules and their associated routers. Cray X1 series LC systems can consist of 1 to 64 MFC-LC cabinets, yielding Cray X1 systems that can scale up to 4096 MSPs and Cray X1E systems that can scale up to 8192 MSPs.

An IOC houses I/O drawers (IODs) that provide a bridge between the MFC-LC and its various peripheral devices and a SAN. The MFC-LC communicates with an IOD using the SPC protocol. An IOD communicates with the Cray X1 series system’s various peripheral devices through Fibre Channel protocol. The Fibre Channels connect to and from the RAID disk storage, the CPES, and CNSs. Fibre Channel HBAs in a Cray X1 series I/O drawer can connect the Cray X1 series hardware to a SAN.

Figure 7. Cray X1 Series Liquid-cooled System (diagram labels: MFC-LC, IOC, PC-20 cabinets)

Each Cray X1 series liquid-cooled system requires a minimum of one IOC with two IODs. The number of cabinets and drawers is dependent on the system’s mainframe and peripheral configuration. An IOC can support up to 32 SPC mainframe connections and up to 128 Fibre Channel connections.

A PC-20 houses various peripherals, including the CPES, the CNS, and the RAID disk storage components. All Cray X1 series liquid-cooled systems require a minimum of one PC-20 cabinet. Additional PC-20 cabinets are added based on the system’s peripheral configuration.

Development Environment Overview [3]

This chapter presents an overview of the development environment for Cray X1 series systems. It describes the software that you can use to carry out the steps required to build and run optimized applications on a Cray X1 series system. If you are migrating from a UNICOS or UNICOS/mk system, also refer to Migrating Applications to the Cray X1 Series Systems and Cray X1 User Environment Differences. Detailed information about optimizing your applications is provided in Optimizing Applications on the Cray X1 Series Systems.
3.1 Development Tools

The development environment for Cray X1 series systems is the software that is provided with or that runs on top of the UNICOS/mp operating system. The UNICOS/mp operating system is based on the IRIX 6.5 operating system with Cray enhancements. It is designed to sustain the high-capacity, high-bandwidth throughput provided by Cray X1 series systems. A single UNICOS/mp kernel image runs on the entire system or on each partition of the system.

The software comprises:
• Commands, system calls, system libraries, special files, and protocols provided with the UNICOS/mp operating system. For information about UNICOS/mp operating system functionality, see Section 4.6.1.
• Software provided in or with the Cray Programming Environment products:
  – Cray Fortran Compiler and/or the Cray C and C++ Compilers
  – CrayTools, which includes the Cray loader and CrayPat, the performance analysis tool for Cray X1 series systems
  – Libraries for application programmers, including a comprehensive set of math and scientific library routines, X Window X11 System Libraries, and (if proper licensing is in place) Motif
  – The Modules and Trigger environment utilities
  – Cray Assembler for Cray X1 series systems
• Message Passing Toolkit (MPT) for Cray X1 series systems, which is available from Cray as an optional product; included in MPT are the Message Passing Interface (MPI) and shared memory access (SHMEM) libraries.
• The Etnus TotalView debugger for Cray X1 series systems, which is available from Cray as an optional product.
• PBS Pro batch workload system for Cray X1 series systems, which is available from Cray as an optional product.
• Cray Open Software (COS) package for Cray X1 series systems, which is available from Cray as an optional product. COS consists of widely available, public-domain programs.
Many of the COS utility programs are very similar to those provided in UNICOS/mp but provide options that are either not available in or behave differently than their UNICOS/mp counterparts.

3.2 Accessing a Cray X1 Series System

You access a Cray X1 series system from your network-connected system using standard telnet or rlogin commands, as permitted by your local site. In addition, your site administration may offer or require logins through the open secure shell command, ssh. Related commands for file transfers (ftp, rcp, sftp, and scp) are also supported. You enter commands in a standard terminal window. The system supports sh (defaults to ksh), csh, and tcsh user shells.

The Cray Network Subsystem provides a transparent connection between users on a network-connected system and the Cray X1 series system.

Figure 8. Accessing a Cray X1 Series System

When you execute one of a limited set of development environment commands, called trigger commands, the Cray X1 series mainframe initiates a process or process sequence on the CPES. The cc, CC, and ftn commands and certain tools like the CrayPat performance analysis tool are initiated by links to the trigger executable; trigger execution is transparent. When you compile and load your code, the compile and load commands actually trigger the compile and load processes to be performed on the CPES. Files are shared between the Cray X1 series mainframe and the CPES through NFS-mounted file systems.

System and user files are stored on disks on a Cray X1 series system.
The file management system for these files is the XFS file system, which is a mature journaling file system with support for very large individual files and file systems. Data files and temporary storage may be maintained on Cray X1 series disk storage or on disks mounted using NFS. In addition, Cray Inc., in partnership with Advanced Digital Information Corporation (ADIC), supports file sharing and file storage through a storage area network (SAN). The ADIC StorNext File System, a shared SAN system, allows multiple computer systems to share disk files.

The role of the Cray Programming Environment Server (CPES) is discussed in Section 4.3, page 41.

3.3 Nodes

As described in Section 2.1.1.1, page 6, a Cray X1 series system is made up of nodes. Software defines how a node is used. A node’s software-assigned “flavor” dictates the kind of processes and threads that can use its resources. The three flavors (assignable by the system administrator) are application, support, and OS. Application nodes are used to run user applications. Support nodes are used to run serial commands, such as user-written commands, shells, editors, and other user commands, such as ls. OS nodes provide kernel-level services, such as system calls, to all support nodes and application nodes.

For applications compiled and executed using defaults, the developer does not need to deal with nodes, node processors, or node memory. For multinode applications, however, it is important that the developer understand nodes to make informed decisions regarding trade-offs in application placement and memory usage.

3.4 Writing Source Code

Application developers can write programs in Fortran, C, C++, or Cray Assembly Language (CAL). Interlanguage communication functions enable developers to create Fortran programs that call C or C++ routines and C or C++ programs that call Fortran routines.
In addition, the Fortran 2003 C interoperability feature enables Fortran and C/C++ programs to share data and procedures.

You can sometimes avoid bottlenecks in programs by rewriting parts of a Fortran, C, or C++ program in assembly language, maximizing performance by selecting instructions to reduce machine cycles. You use the standard UNICOS/mp program linkage macros, eliminating the need to know the specific registers used. See Cray Assembly Language (CAL) for Cray X1 Systems Reference Manual for further information.

3.4.1 Cray Fortran Programming Environment

The Cray Fortran Compiler fully supports the Fortran language through the Fortran 95 Standard, which is International Organization for Standardization standard ISO/IEC 1539-1:1997. Selected features from the proposed Fortran 2003 Standard are also supported.

The Cray Fortran Compiler supports 8-, 16-, 32-, and 64-bit integer data types; 8-, 16-, 32-, and 64-bit logical data types; 32-, 64-, and 128-bit real data types; 64-, 128-, and 256-bit complex data types; and the character data type.

In addition, the Cray X1 series development environment extends standard Fortran via compiler directives, calls to Cray library routines and intrinsic functions, and parallel programming model constructs. The parallel programming model constructs comprise Message Passing Interface (MPI) library routines, shared memory access (SHMEM) library routines, Fortran language extensions for co-array Fortran (CAF), and OpenMP directives.

3.4.2 Cray C and C++ Programming Environment

The C++ compiler supports the C++ language in accordance with International Organization for Standardization standard ISO/IEC 14882:1998, with some exceptions. (See Cray C and C++ Reference Manual for a description of the exceptions.) The C compiler supports the C language per the standard ISO/IEC 9899:1999 (C99).
The C and C++ compilers and libraries support the LP64 data type model, which includes the 16-bit short, the 32-bit int, the 64-bit long, the 32-bit float, the 64-bit double, and 8-bit char data types.

In addition, the Cray X1 series development environment extends standard C and C++ via compiler directives, calls to Cray library routines and intrinsic functions, parallel programming model constructs, and a subset of the GNU Compiler Collection (GCC) C and C++ language extensions. The parallel programming model constructs comprise Message Passing Interface (MPI) library routines, shared memory access (SHMEM) library routines, the Unified Parallel C (UPC) language extension, and OpenMP directives.

3.4.3 Cray Libraries

Cray provides an extensive set of library routines. These library routines are provided with Cray X1 series systems:

• UNICOS/mp operating system routines. For details, see UNICOS/mp system calls and library routines man pages.

• Fortran library procedures. For details, see Fortran library procedures, Fortran intrinsic procedures, and Fortran application programmer’s I/O man pages.

• C and C++ library routines. For details, see C/C++ library functions, C/C++ intrinsic procedures, and Dinkum C++ library man pages.

• LibSci, the Cray X1 series optimized scientific library routines. LibSci provides support for MSP and SSP modes, 32- and 64-bit data types, and Fortran interfaces for all routines. The default 32-bit version of LibSci provides both single and double precision routines. The 64-bit version of LibSci contains single precision routines.
Included in LibSci are:
  – Fast Fourier transform (FFT), filter, and convolution routines
  – Basic Linear Algebra Subprograms (BLAS)
  – Version 3.0 Linear Algebra Package (LAPACK) (linear and eigensolvers)
  – Version 1.7 Scalable LAPACK (ScaLAPACK) (distributed memory parallel set of LAPACK routines)
  – Basic Linear Algebra Communication Subprograms (BLACS)
  – solvers for dense linear systems and eigensystems (LAPACK)
  – basic linear algebra subprograms (BLAS)
  – sparse direct solvers
For additional information, see the intro_libsci(3s) man page.

• Message Passing Interface (MPI) library routines. These routines comply with MPI-1, conform to the MPI 1.2 specification, and provide MPI I/O and MPI one-sided communications features specified in the MPI-2 standard. For details, see the MPI man pages.

• Shared memory access (SHMEM) library routines. For details, see the SHMEM man pages.

• X Window System (X11) and Motif library routines. Refer to the X Window System and Motif documentation for details.

3.4.4 Parallel Programming Memory Models

High-level parallelization of codes is supported on Cray X1 series systems by both shared memory and distributed memory models and hybrid combinations of certain models. Table 1 shows the memory models supported on Cray X1 series systems.

Table 1. Memory Models Supported on Cray X1 Series Systems

Distributed Memory Models          Shared Memory Models
Message Passing Interface (MPI)    POSIX threads (C and C++)
Unified Parallel C (UPC)           OpenMP
Co-array Fortran (CAF)
Shared Memory (SHMEM)

Distributed memory models provide anywhere from no visibility of address space (as with MPI standards) to full visibility (as with UPC and CAF) between tasks or threads of execution. Distributed memory model address space spans nodes. All distributed memory models require explicit definition and referencing of data objects not local to a task or thread of execution.
These distributed models are well suited for nonuniform memory access (NUMA) systems like Cray X1 series systems.

Shared memory models are shared in the sense that the address space is visible to all tasks or threads within that model. Shared memory model address space is limited to the memory on a single node. A single node supports uniform memory access (UMA), which provides optimal OpenMP performance. Shared memory applications using OpenMP may span the MSPs or SSPs within a node but may not run on more than one node unless nested within a distributed parallel programming model such as MPI.

It is possible to mix shared and distributed memory models or different distributed memory models in the same application. In such hybrid models, the shared memory model is seen as being applied to a single process of the distributed memory model. In other words, the distributed memory model application should be viewed as composed of multiple processes, each of which is composed of multiple shared memory threads. All threads of a single process must execute on the same Cray X1 series node.

Each model has advantages and disadvantages that the developer should understand. All higher-level parallelization models should be viewed as optional and secondary to vectorization and multistreaming.

3.5 Compiling Source Code

The Cray X1 series Fortran, C, and C++ compilers translate source programs into Cray X1 series object files. Directives (Section 3.4.1, page 23) enable the developer to apply options to selected portions of code. In contrast, the compiler command can be used to apply options at the program or multiprogram level.

Table 2 shows the source files, commands, and object files associated with the compilers.

Table 2.
Source and Object Files

Compiler        Source File    Compiler Command    Resulting Object File Name
Cray Fortran    file.ftn       ftn -c file.ftn     file.o
Cray C          file.c         cc -c file.c        file.o
Cray C++        file.C         CC -c file.C        file.o

The -c option tells the compiler to create object files without calling the loader.

The PrgEnv module file must be loaded before any compilations, assemblies, or loads can be executed. You create and load module files using commands from the Modules package. For further information, see the module(1) man page.

3.5.1 Vectorization and Multistreaming

The compilers use sophisticated analysis techniques to identify candidates for vectorization (such as array processing), multistreaming (such as Fortran do and C/C++ for loops), and other optimizations (such as inlining, that is, replacing a user procedure call with the procedure definition itself) and compile the code accordingly.

The Cray compilers automatically translate code to make optimal use of machine resources. Register allocation, scheduling, vectorization, and multistreaming are all performed by default, and they are carefully balanced to give optimal performance.

Compiler vectorization provides loop-level parallelization of operations and uses the high-performance vector processing hardware. Multistreaming code generation by the compiler permits tightly coupled execution across the four SSPs of an MSP on a block of code. Vectorization and multistreaming can be intermixed, and multistreaming often extends outside loop boundaries.

In some situations, the compiler does not vectorize or multistream portions of code because it does not have enough information to determine that the process would be safe. For example, the compiler assumes it is not safe to multistream a loop that contains a function call. However, the developer, knowing the function call to be safe, could use Cray Streaming Directives (CSDs) to direct the compiler to multistream the code.
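The loop shapes discussed in this section can be pictured with a small example. The following is a minimal sketch in portable C, not taken from the Cray manuals; the function names row_sums, scale, and scale_all are hypothetical. The first loop nest has a stride-1 inner loop (the kind of candidate described for vectorization) and an outer loop with independent iterations (the kind of candidate described for multistreaming). The second loop calls a function in its body, which is the case the text says the compiler treats as unsafe to multistream unless directed otherwise. No CSD is shown, because the directive syntax is defined in the compiler reference manuals rather than here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: a side-effect-free function. The developer may know it
 * is safe to call from concurrently executing iterations; the compiler may not. */
double scale(double x, double factor)
{
    return x * factor;
}

/* Sum each row of a row-major matrix (rows x cols) into out[rows].
 * Inner loop: stride-1 accesses along one row (vectorization candidate).
 * Outer loop: independent iterations over rows (multistreaming candidate). */
void row_sums(const double *a, size_t rows, size_t cols, double *out)
{
    assert(a != NULL && out != NULL);
    for (size_t i = 0; i < rows; i++) {
        double sum = 0.0;
        for (size_t j = 0; j < cols; j++) {
            sum += a[i * cols + j];
        }
        out[i] = sum;
    }
}

/* A loop whose body contains a function call. Per the text, the compiler
 * assumes such a loop is not safe to multistream; a developer who knows the
 * call is safe could direct the compiler with a CSD (syntax not shown). */
void scale_all(double *v, size_t n, double factor)
{
    assert(v != NULL);
    for (size_t i = 0; i < n; i++) {
        v[i] = scale(v[i], factor);
    }
}
```

Whether a given loop is actually vectorized or multistreamed depends on the compiler's analysis, so a loopmark listing remains the way to confirm it.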
You can determine what portions of your code the compiler vectorized and/or multistreamed by reviewing a loopmark listing. You create loopmark listings via the Fortran -rm option or the C/C++ -h list=m option.

Detailed information about the compiler commands and CSDs is available in the Cray Fortran Compiler Commands and Directives Reference Manual and the Cray C and C++ Reference Manual. See also the ftn(1) and CC(1) man pages for details about the compiler command options. For details about loopmark listings, see Optimizing Applications on the Cray X1 Series Systems.

3.5.2 MSP-mode and SSP-mode Applications

A program can be compiled and run as either an MSP-mode application (default) or an SSP-mode application.

In an MSP-mode application, the four SSPs of an MSP are tightly coupled to work together. The developer uses a parallel programming model to coordinate activities among multiple MSPs.

In an SSP-mode application, each SSP runs independently of the others, executing its own stream of instructions. The developer uses a parallel programming model to coordinate activities among multiple SSPs.

By default, an application is compiled to run in MSP mode. You designate SSP mode by using the Fortran -O ssp option or the C/C++ -h ssp option. To create an SSP-mode application, you must use this option when compiling and linking. Executables compiled for SSP mode can use only object files compiled in SSP mode.

You can compile routines to be single-streamed for MSP-mode applications, and you can call a single-streamed routine from a streamed region using Cray Streaming Directives.

Programs that have enough work to keep all four SSPs of an MSP busy are generally good candidates for MSP mode. Programs that do not have significant amounts of innate parallelism or that were designed to run on a large number of less powerful processors may be more appropriate for SSP mode.
However, it is often not possible to predetermine which mode works better. A careful analysis of MSP-mode versus SSP-mode performance is often required to make that choice.

Note: Commands created using the Fortran -O command option or the C/C++ -h command option run on a support node. Although they run on an SSP, they are not SSP-mode applications. SSP-mode applications run on application nodes.

For additional information, see the ftn(1) and CC(1) man pages, Cray Fortran Compiler Commands and Directives Reference Manual, Cray C and C++ Reference Manual, and Optimizing Applications on the Cray X1 Series Systems.

3.6 Debugging Your Program

You can debug programs by analyzing compile time and runtime messages and making the appropriate corrections. The explain ftn-msgnumber and explain CC-msgnumber commands provide explanations of error messages.

In addition, Cray offers the optional TotalView debugger from Etnus, LLC. Etnus TotalView is a scalable debugger that is designed to debug applications that use parallel programming models. Both the command line interface and the graphical user interface are supported on UNICOS/mp systems to control debugging sessions. The Etnus TotalView debugger can be used to debug MSP-mode or SSP-mode applications.

For information about the optional Etnus TotalView debugger for UNICOS/mp systems, see the TotalView Release Overview and Installation Guide.

3.7 Loading and Linking Object Files

The Cray X1 series loader combines relocatable object files (file.o) and library modules. It produces an output file (named a.out by default) that can be executed on Cray X1 series systems.

Cray recommends that you load and link programs using the compiler commands rather than the ld command. The compiler calls the loader with the appropriate default libraries. For further information about using the ftn, CC, or cc commands to invoke the Cray X1 series loader, see the ftn(1) or CC(1) man pages.
3.8 Executing Your Program

After the loading step is completed, your application is ready to execute on the Cray X1 series mainframe. To launch your application, you can use the aprun command (for example, aprun ./a.out) or the mpirun command (mpirun ./a.out), or you can use the auto aprun method (for example, ./a.out). You use mpirun for programs that use the MPI parallel programming model.

In addition, Cray enables you to launch multiple, interrelated applications with a single aprun or mpirun command. This feature, called Multiple Program Multiple Data (MPMD), applies to applications with the following attributes:

• The applications can use MPI, SHMEM, or CAF to perform application-to-application communications. Using UPC for application-to-application communication is not supported.

• Within each application, the supported programming models are MPI, SHMEM, CAF, and OpenMP.

• All applications must be of the same mode; that is, they must all be MSP-mode applications or all SSP-mode applications.

• If one or more of the applications in an MPMD job use a shared memory model (OpenMP or pthreads) and need a depth greater than the default of 1, then all of the applications will have the depth specified by the aprun or mpirun -d option, whether they need it or not.

An application can execute in one of two modes: accelerated or flexible. Applications executing in accelerated mode run in a predictable period of processor time, although their wall clock time may vary depending on I/O usage, network use, and/or whether any oversubscription of processors or memory occurs on the relevant nodes. Due to the characteristics of the memory address space, applications executing in accelerated mode must run on logically contiguous nodes.
Applications executing in flexible mode can run on noncontiguous nodes, and they run in a less predictable amount of processor time due to the flexible network topology resulting from using noncontiguous nodes.

Although both interactive and batch access are provided, the user environment is strongly oriented toward batch use. For running batch jobs, Cray offers the PBS Pro batch workload management system. The PBS Pro implementation on UNICOS/mp systems uses a command-line API with commands that will be familiar to NQS/NQE users.

Optionally, PBS Pro allows you to run your job interactively; the job is queued and scheduled as any PBS Pro batch job, but when executed, the standard input, output, and error streams of the job are connected through qsub to the terminal session in which qsub is running. When the job begins execution, all input to the job is from the terminal session in which qsub is running.

PBS Pro interacts with the application placement scheduler, Psched, and uses the Psched resource management and job placement capabilities. Either through the native capabilities of PBS Pro or in cooperation with Psched, PBS Pro maintains queues of jobs by resource type and priority and supports preemptive executions, job dependencies, and checkpoint/restart capabilities. The package enforces site policy limits for users and jobs.

For simple executions, the qsub and qstat PBS Pro commands may suffice. For additional information, see the qsub(1) and qstat(1) PBS Pro man pages and PBS Pro User Guide.

The major factors to consider in running applications are the manner in which the work is distributed among nodes, MSPs, and SSPs, and the way in which memory is managed.

3.8.1 Distributing Work

Compiler command and aprun and mpirun options provide a spectrum of alternatives for distributing work.

The software abstraction of a Cray X1 series execution engine is the processing element.
A processing element is either an MSP or an SSP, depending on how the program was compiled. By default, programs are compiled as MSP-mode applications.

By using compiler command and/or aprun or mpirun options, you can specify how and where you want your application to execute. You specify the number of processing elements needed for your application via the compiler command (-X npes), aprun (-n procs), or mpirun (-np procs) option. In addition, you can use the aprun or mpirun -N procs option to specify the number of processing elements per node. Psched mediates application placement on nodes, MSPs, and SSPs.

When you use defaults in compiling and loading programs, you create an MSP-mode application in which the work is distributed among the four SSPs on one MSP on an application node. Although a single block of code is compiled, the generated code forms a separate instruction sequence for each of the four SSPs. The sequence is approximately the same for each SSP. However, each SSP operates on a different subset of input and/or output data.

For example, consider source files x.ftn, y.ftn, and z.ftn that include loop structures that are good candidates for vectorizing and multistreaming. You compile the source files:

    % ftn -c x.ftn y.ftn z.ftn

The compiler analyzes the source code, vectorizes inner loops, multistreams outer loops, and creates the object files x.o, y.o, and z.o. You then load and link the object files:

    % ftn -o trio x.o y.o z.o

The loader loads and links the object files and required library routines, creating an MSP-mode application named trio. You launch the application via an explicit call to aprun:

    % aprun ./trio

or via the auto aprun method:

    % ./trio

The application trio executes on SSPs 0–3 on one MSP on the first available node. The application runs on SSP0 until it encounters a multistreamed region (an outer loop), at which point the work is distributed among SSPs 0–3.
The inner loops are processed on each SSP, with each SSP’s vector processor performing operations on arrays and its superscalar processor handling scalar operations. When the multistreamed work is completed, execution resumes on SSP0.

3.8.2 Managing Memory

On Cray X1 series systems, there are two aspects to managing memory: managing memory references in code and managing memory resources.

A memory reference is the transfer of an operand from a memory location or cache to a register (called a load or read) or the transfer of an operand from a register to cache or a memory location (called a store or write). The memory hierarchy (using the term memory in a broad sense) on a Cray X1 series system, from fastest to slowest, is as follows:

• registers
• Level 1 cache (D-cache) for scalar operations
• Level 2 cache (E-cache) for vector and scalar operations
• local node memory
• remote node memory

Effective use of memory references in applications can determine whether a loop can be vectorized or multistreamed and can have a significant effect on performance. Applications with a high degree of locality of reference and stride-1 loops in critical regions make the most effective use of memory by minimizing latency and maximizing bandwidth.

On Cray X1 series systems, there are two types of constraints on memory resources: restrictions on memory reservations and restrictions on the actual memory in use.

By default, maximum stack and heap sizes are set to 1 GB. You can modify these defaults by using environment variables. If your program tries to use more memory than was reserved based on default or explicit sizes, you get an error indication. Overflowing a stack results in a segmentation violation and possibly a core dump. Overflowing a heap results in a NULL return from an allocator routine.

Some applications are restricted by the amount of actual memory they can use.
Multinode applications are restricted to the real memory limit of each node on which the application is running. Single-node applications are restricted only by the amount of virtual memory available; they can use the node’s real memory plus the virtual memory swap space.

Movement of data from swap space to local node memory is handled via paging. A page is the unit of memory addressable through the Translation Lookaside Buffer (TLB). For Cray X1 series systems, the base page size is 64 KB, but larger page sizes (up to 4 GB) are also available.

Psched factors memory requirements into job placement; it may delay the start of an application until sufficient resources become available to permit execution.

For more information about memory management, see the aprun(1) and memory(7) man pages.

3.9 Checkpointing Your Program

UNICOS/mp provides a checkpoint/restart software management tool, cpr, to checkpoint a process or a set of processes that is executing and to restart it later. You must be the owner of a process or set of processes being checkpointed, although system administrators and operators may checkpoint/restart any process. You can checkpoint both MSP-mode and SSP-mode applications.

Options to the cpr command let you create, query, restart, and delete checkpoints. By using the cpr -p id[:type] command, you can specify the process or set of processes to checkpoint. The type of the id can be one of the following:

• PID (UNIX process ID; this is the default).

• GID (UNIX process group ID).

• SID (UNIX process session ID).

• HID (Process hierarchy (tree) rooted at that PID). This type checkpoints all associated shells and child processes of that PID.

• APT (ApTeam; application team ID). Applications launched via aprun or mpirun create an ApTeam of processes. An ApTeam can be a single execution thread or a large number of processes. The cpr tool works with Psched to restart ApTeams on the application nodes.
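The id[:type] form described above can be illustrated by a small helper that composes the argument string. This is only a sketch: the function name format_cpr_target and the enum are hypothetical, the type tokens are assumed to be spelled exactly as in the list above, and the code does not invoke cpr. The cpr(1) man page is the authority on what the command accepts.

```c
#include <assert.h>
#include <stdio.h>

/* Checkpoint target types, named after the list in the text (assumption:
 * the tokens passed to cpr are spelled exactly like these names). */
enum cpr_type { CPR_PID, CPR_GID, CPR_SID, CPR_HID, CPR_APT };

/* Compose "id" (PID is the default type, so it is omitted) or "id:TYPE"
 * into buf. Returns 0 on success, -1 on an invalid type or truncation. */
int format_cpr_target(char *buf, size_t buflen, long id, enum cpr_type type)
{
    static const char *const names[] = { "PID", "GID", "SID", "HID", "APT" };
    int n;

    assert(buf != NULL && buflen > 0);
    if ((int)type < (int)CPR_PID || (int)type > (int)CPR_APT) {
        return -1;
    }
    if (type == CPR_PID) {
        n = snprintf(buf, buflen, "%ld", id);
    } else {
        n = snprintf(buf, buflen, "%ld:%s", id, names[type]);
    }
    return (n < 0 || (size_t)n >= buflen) ? -1 : 0;
}
```

For instance, format_cpr_target(buf, sizeof buf, 1234, CPR_HID) fills buf with the string 1234:HID, which would follow the -p option on the command line.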
For additional information about using the UNICOS/mp checkpoint/restart tool, see the cpr(1) man page.

3.10 Monitoring Your Program

The psview command provides Psched status information on one or all applications running on the mainframe, including logical node numbers, the flavor of each node, the number of MSPs and SSPs, and the number of processing elements per node.

The apstat command provides application status, including the number of processing elements, number of threads, a map of the application’s address space, and a map showing the placement of ApTeam members.

The apkill command sends a kill signal to an ApTeam.

For more information about these commands, see the psview(1), apstat(1), and apkill(1) man pages.

You can also monitor activity on the CPES by executing certain commands that run remotely, such as the remps command, which is documented in the remps(1) man page.

3.11 Analyzing Your Application’s Performance

CrayPat is an integrated tool designed to help you analyze the performance of MSP-mode and SSP-mode applications. You can use CrayPat to perform profiling, sampling, and tracing experiments on an instrumented application and analyze the results of those experiments. In addition, CrayPat provides access to all hardware performance counters.

You use the pat_build and pat_hwpc commands to analyze performance. The pat_build command enables you to instrument your programs. pat_build supports two categories of experiments: trace experiments, which count some event such as the number of times a specific system call is executed, and asynchronous experiments, which capture values from the call stack or the program counter at specified intervals. All profiling and sampling experiments are asynchronous experiments. No recompilation is needed to produce the instrumented program.
After using the pat_build command to instrument a program, you set environment variables to control run time data collection, run the instrumented program, then use the pat_report command to view the resulting report.

The pat_hwpc utility collects hardware performance counter statistics for an executed command. pat_hwpc executes the original, non-instrumented program and writes the values of the specified hardware performance counters. Alternatively, a process that is already executing may have its counters captured.

For additional information about CrayPat, see the pat_build(1), pat_hwpc(1), pat_report(1), and counters(5) man pages and Optimizing Applications on the Cray X1 Series Systems. The pat_help program provides examples.

3.12 Optimizing Your Application

Loopmark listings and CrayPat reports will help you determine what portions of your application are the best candidates for optimization. In general, you will make the most efficient use of machine time and realize the best performance gains by addressing optimization in this order:

1. Optimize memory use by minimizing latency and maximizing bandwidth
2. Simplify I/O demands
3. Optimize single-processor performance
4. Optimize multiple-processor performance

For details about the techniques for improving the performance of your applications, see Optimizing Applications on the Cray X1 Series Systems.
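Step 1 in the list above, together with the earlier remark about stride-1 loops, can be illustrated with two ways of computing the column sums of a row-major matrix. This is a minimal sketch in portable C, not taken from the Cray manuals; the function names col_sums_strided and col_sums_stride1 are hypothetical. Both functions produce the same sums, but the second interchanges the loops so that the inner loop touches consecutive memory locations.

```c
#include <assert.h>
#include <stddef.h>

/* Column sums of a row-major rows x cols matrix.
 * The inner loop walks down one column, so consecutive accesses
 * are cols elements apart (stride = cols). */
void col_sums_strided(const double *a, size_t rows, size_t cols, double *out)
{
    assert(a != NULL && out != NULL);
    for (size_t j = 0; j < cols; j++) {
        double sum = 0.0;
        for (size_t i = 0; i < rows; i++) {
            sum += a[i * cols + j];
        }
        out[j] = sum;
    }
}

/* Same result with the loops interchanged: the inner loop walks along
 * one row, so consecutive accesses are adjacent in memory (stride = 1). */
void col_sums_stride1(const double *a, size_t rows, size_t cols, double *out)
{
    assert(a != NULL && out != NULL);
    for (size_t j = 0; j < cols; j++) {
        out[j] = 0.0;
    }
    for (size_t i = 0; i < rows; i++) {
        for (size_t j = 0; j < cols; j++) {
            out[j] += a[i * cols + j];
        }
    }
}
```

How much the interchange helps on a particular system is something to measure, for example with a CrayPat experiment, rather than assume.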
Operations Overview [4]

This chapter provides the following operations overview information for administrators of Cray X1 series systems:

• A list of the software release packages for Cray X1 series systems and the related administration documentation

• Brief descriptions of software functionality with references to Cray X1 series system administration documentation for detailed information

This chapter addresses system administrators who are experienced UNICOS or UNICOS/mk system administrators or have administered other systems based on the UNIX operating system. If you are migrating from a UNICOS or UNICOS/mk system, Cray X1 Series System Administration Differences will also be of interest to you.

4.1 Software Release Packages

As shown in Figure 9, Cray X1 series system administration encompasses the following components:

• Cray Workstation (CWS) software, which also includes RAID administration

• UNICOS/mp operating system software

• Cray Programming Environment Server (CPES) software

• Cray Programming Environment software, which runs on the Cray Programming Environment Server (CPES)

• Cray Network Subsystem (CNS) software

• Any optional software products running on your Cray X1 series system

Figure 9.
Cray X1 Series System Administration Components

As a system administrator, you will need to understand the software products that are provided in the following release packages for Cray X1 series systems:

• Cray Workstation (CWS) release package, which includes the following CWS administration documentation:
  – Cray Workstation (CWS) Release Overview
  – Cray Workstation (CWS) Installation Guide
  – Cray X1 Series System Configuration and CWS Administration
  – CWS man pages

• Cray Programming Environment Server (CPES) release package, which includes the following CPES administration documentation:
  – Cray Programming Environment Server Software Installation and Administration
  – CPES man pages

• Cray Fortran Programming Environment release package and/or the Cray C and C++ Programming Environment release package, which include administration (trigger) scripts, module files, the Cray Programming Environment Releases Overview and Installation Guide, and man pages related to triggering program execution on a Cray X1 series mainframe

• Modules release package (also included with the Cray Programming Environment release packages)

• Cray Network Subsystem (CNS) release package, which includes the following CNS administration documentation:
  – Cray Network Subsystem (CNS) Software Installation and Administration
  – CNS man pages

• UNICOS/mp release package, which includes the following UNICOS/mp administration documentation:
  – UNICOS/mp Release Overview
  – UNICOS/mp Installation Guide
  – UNICOS/mp System Administration
  – UNICOS/mp Disks and File Systems Administration
  – UNICOS/mp Networking Facilities Administration
  – UNICOS/mp man pages for user commands, system calls, system library routines, file formats and special files, and administrator commands

If your site also chooses to use any of the following products, you should also know the administration needs for these software packages:

• Message
Passing Toolkit (MPT) release package, which includes the Cray Message Passing Toolkit Release Overview, MPI man pages, and SHMEM man pages • Etnus TotalView debugger release package, which includes the TotalView Release Overview and Installation Guide and TotalView documentation from Etnus LLC • PBS Pro batch system release package for UNICOS/mp systems, which includes the PBS Pro Release Overview, Installation Guide, and Administration Addendum for Cray Systems and PBS Pro documentation from Altair Grid Technologies, L.L.C. • Cray Open Software release package, which includes the Cray Open Software Release Overview and Installation Guide and documentation about each included software package 4.2 Cray Workstation (CWS) Functionality You use the CWS to perform administration functions, such as configuring, operating, administering, monitoring, booting, halting, dumping, and diagnosing your Cray X1 series system. The CWS is also used to install CWS, CPES, and UNICOS/mp software using the Common Installation Tool (CIT). (All software release packages for your Cray X1 series system may be installed using CIT.) The UNICOS/mp system uses RAID (redundant array of independent drives) devices exclusively for storage. Configuring and managing UNICOS/mp RAID controllers is done primarily through Cray storage management (CRAYSM) tools on the CWS. (For additional information about RAID devices and disk storage, see Section 2.2.4, page 13.) 40 S–2346–25Operations Overview [4] The CWS has private network connections to the System Control Facility, the CPES, the CNS, the RAID disk storage, and an optional service network used for remote support by Cray Customer Service. (For additional information about the System Control Facility and the CWS private network connections, see Section 2.3, page 13.) In addition, there is a centralized logging capability; the CWS consolidates log messages from the CWS, System Control Facility, CPES, UNICOS/mp, and CNS. 
For detailed information about CWS functions and administration, see Cray X1 Series System Configuration and CWS Administration and UNICOS/mp Disks and File Systems Administration.

4.3 Cray Programming Environment Server (CPES) Functionality

The Cray Programming Environment compilers, loader, and performance analysis tool reside on the CPES. Program compilations are invoked by user (trigger) commands on the Cray X1 series mainframe and are executed on the CPES in a way that is transparent to the user. There is no user access (login) directly to the CPES.

The CPES has a Gigabit Ethernet connection so that file systems on other servers that are needed for compiling can be mounted on both the Cray X1 series mainframe and on the CPES.

The CPES resides in a PC-20 peripheral cabinet. It runs the Solaris operating system. Basic operations performed on the CPES are halting, booting, and restarting the CPES; setting up connections, user accounts, and group accounts; and backing up and restoring the file system.

For information about CPES functions and administration, see Cray Programming Environment Server Software Installation and Administration. For information about the Cray Programming Environment software that runs on the CPES, see the Cray Programming Environment Releases Overview and Installation Guide.

Note: The Cray Programming Environment (Fortran, C, and C++) also runs on other platforms, where it acts as a cross-compiler. If your site has the proper licensing in place, you might choose to use one of these other platforms so that your users have faster compile times and have access to the Cray Programming Environment when the Cray X1 series system is not available to them. Supported platforms are listed in the Cray Programming Environment Releases Overview and Installation Guide.

4.4 Trigger Environment

Included with the Programming Environment release package are the trigger environment commands.
When a user enters the name and options of a CPES-hosted command on the command line of the Cray X1 series mainframe, a trigger environment command executes, setting up an environment for the CPES-hosted command. This trigger environment command duplicates the portion of the current working environment on the Cray X1 series mainframe that relates to the Programming Environment. The user interacts with the system as if all elements of the Programming Environment are hosted on the Cray X1 series mainframe. Commands entered on a Cray X1 series system trigger the execution of the corresponding CPES-hosted commands that have the same names. A template trigexecd.cfg file is provided with the Programming Environment release, which the administrator needs to edit to ensure the trigger environment commands will function properly. For additional information, see the intro_trigger(8) man page. 4.5 Cray Network Subsystem (CNS) Functionality The CNS is a specialized router on the Transmission Control Protocol/Internet Protocol (TCP/IP) network. Access to that network from a Cray X1 series system is controlled entirely by the CNS software. The CNS connects the Cray X1 series mainframe to site networks while isolating the Cray X1 series mainframe from the small network packets commonly used on those networks. This improves Cray X1 series mainframe TCP network performance by allowing the Cray X1 series mainframe to operate with larger, more appropriate packet sizes. The CNS resides in a PC-20 peripheral cabinet. Multiple CNSs can be configured on a Cray X1 series system to scale to the number of network connections required. For detailed information about the CNS, see Cray Network Subsystem (CNS) Software Installation and Administration. 4.6 UNICOS/mp Functionality The UNICOS/mp operating system consists of a kernel that is based on the IRIX 6.5 operating system kernel with Cray enhancements that provide for scalability and resource scheduling. 
A single UNICOS/mp kernel image runs on the entire system.

The following sections briefly describe UNICOS/mp key functionality; detailed information is provided in the UNICOS/mp system administration documentation and UNICOS/mp man pages.

4.6.1 Key IRIX Functionality in the UNICOS/mp Operating System

Key IRIX functionality in the UNICOS/mp operating system includes:

• An environment that is based on IRIX, with command, utility, and operating system interfaces as defined by POSIX 1003.1–1990 and POSIX 1003.2–1992.
• A single-system image operating system.
• A checkpoint/restart software management tool, cpr, to suspend a process or a set of processes that is executing and restart it later. System administrators, operators, and the owner of a process or set of processes may checkpoint the targeted process or processes. For information about using the UNICOS/mp checkpoint/restart facility, see the cpr(1) man page.
  Note: The PBS Pro batch system product available from Cray supports checkpointing and restarting of batch jobs. Documentation is provided in the PBS Pro software package if your site is using this optional batch system.
• The XFS journaling file system and the XLV volume manager for defining and managing logical volumes (file systems) on a Cray X1 series system. Detailed information is provided in UNICOS/mp Disks and File Systems Administration.
• Disk quotas administration on the XFS file system using the quota, edquota, quot, quotaon, and repquota commands. For additional information, see the related man pages and UNICOS/mp Disks and File Systems Administration.
• Backup and restore of XFS file systems through the xfsdump and xfsrestore utilities, using local or remote drives. You can back up file systems, directories, and/or individual files, and then restore file systems, directories, and files independently of how they were backed up.
The xfsdump utility also allows you to back up "live" (mounted, in-use) file systems. Additional information is provided in the xfsdump(8) and xfsrestore(8) man pages and in UNICOS/mp System Administration.
• Network File System (NFS) to make mounted file systems available from the CPES or any other accessible server. A UNICOS/mp system can function as both an NFS client and an NFS server. Automatic mounting of NFS file systems is supported. Information about NFS is provided in UNICOS/mp Networking Facilities Administration.
• Remote Procedure Call (RPC) protocol, which NFS requires for session layer services.
• Domain Name System (DNS) client; the Internet uses DNS to map names to IP addresses. Information is provided in UNICOS/mp Networking Facilities Administration.
• A UNICOS/mp system can function as a Network Information Service (NIS) client. NIS is a network-based information service and an administrative tool. It allows centralized database administration and a distributed lookup service. NIS supports multiple databases based on regular text files. For information about NIS, see UNICOS/mp Networking Facilities Administration.
• Transmission Control Protocol/Internet Protocol (TCP/IP) support, including the socket interface for network communications, ftp, telnet, and rsh. Information about TCP/IP is provided in UNICOS/mp Networking Facilities Administration and in the related man pages.
• The Network Time Protocol (NTP), which provides the mechanisms to synchronize time and coordinate time distribution in a large, diverse internetwork. The NTP protocol is implemented on UNICOS/mp systems by the ntpd daemon process. For information about NTP, see UNICOS/mp Networking Facilities Administration and the ntpd(8) man page.
• Standard UNIX SVR4 accounting evaluation tools such as acctcom, acctcms, and acctmerg. These tools enable the collection of application-wide data.
Additional information is provided in UNICOS/mp System Administration and in the related man pages.
• System activity monitoring and reporting. System activity data is accessed on UNICOS/mp systems by using the sar and timex utilities. The sar (System Activity Reporter) command can be used to track the use of CPU time, number and type of processors, memory, and I/O to secondary storage and over network connections. Additional information is provided in UNICOS/mp System Administration and the sar(1) and timex(1) man pages.
• Security features in the following forms: the /etc/passwd file for storing user passwords and a protected /etc/shadow file for storing encrypted passwords; access control lists (ACLs); and a granular privilege mechanism, which divides the power of the superuser into discrete units of privilege called capabilities. Additional information is provided in UNICOS/mp System Administration and the passwd(5) and shadow(5) man pages.
• Source Code Control System (SCCS); for additional information, see the sccs(1) man page.

4.6.2 Cray Added Functionality of Interest to All Users

Cray has added the following key functionality to the UNICOS/mp operating system.

4.6.2.1 Accelerated Execution for Distributed Memory Applications

The UNICOS/mp operating system supports accelerated execution for distributed memory applications, which is called accelerated mode. This acceleration is supported by special Cray X1 series system memory mapping hardware through which all remote memory is completely mapped to minimize Translation Lookaside Buffer (TLB) misses and to guarantee peak application performance in tightly synchronized environments.

4.6.2.2 Distributed Memory Message Passing

Distributed memory message passing is managed through direct reads and writes as applications share a globally mapped address space across all nodes on which they execute. This feature is unique to UNICOS/mp systems.
4.6.2.3 Node Allocation (Migration) for Accelerated Applications

A requirement for accelerated applications is that all nodes allocated to the application are logically contiguous. This is not a requirement for applications run in flexible mode, but their performance should be improved if their nodes are contiguous. The placement process prefers contiguous allocation but will fragment a flexible application if that is necessary to find it a place to run.

The contiguity requirement of accelerated applications can lead to situations where the occupied nodes and available nodes are scattered in such a way that no waiting application can be placed anywhere in the domain. Fragmentation of this kind lowers the utilization of the machine by leaving portions of it effectively unavailable.

Migration is used to increase the size of contiguous free space. Migration moves applications within the domain to eliminate allocation holes. Application migration is managed by the UNICOS/mp application placement scheduler, Psched. For additional information, see UNICOS/mp System Administration.

4.6.2.4 Interactive and Batch Processing

UNICOS/mp systems support both interactive and batch processing.

Cray offers the PBS Pro product as the batch system for UNICOS/mp systems. PBS Pro is the professional version of the Portable Batch System (PBS), which is a flexible resource and workload management system. PBS Pro works with the UNICOS/mp application placement scheduler, Psched, so that you can initiate and manage the workload of your users’ computational jobs in accordance with your site scheduling policies.

PBS Pro software is licensed to Cray customers directly from Cray, and Cray provides the PBS Pro software package, documentation, and support directly to our customers.
Detailed information about the optional batch system offered by Cray is provided in the PBS Pro Release Overview, Installation Guide, and Administration Addendum for Cray Systems.

4.6.2.5 Application Launch and Query Commands

A user can use the ftn, cc, or CC command-line option -X npes to designate the number of processors needed to execute a program. A user can also use the aprun or mpirun command to launch a program; the mpirun command is used for programs that use the MPI memory model. For more information, see the ftn(1), cc(1), CC(1), aprun(1), and mpirun(1) man pages.

The psview command uses the psched daemon to display the status of applications. For detailed information, see the psview(1) man page.

4.6.2.6 Large Pages

A user can specify the page size by using the aprun or mpirun command. The larger the page size, the fewer TLB misses occur, which provides better application performance. UNICOS/mp starts forming free large pages of the size specified by the user as soon as the application is placed, before the application actually needs the large pages. For information about how to set the page size, see the aprun(1) or mpirun(1) man page.

4.6.2.7 Application Monitoring

The UNICOS/mp psview, apstat, and snflv commands provide information about running applications and node configuration. The apkill command is used to send a kill signal to an ApTeam. For additional information about these commands, see the apstat(1), psview(1), snflv(8), and apkill(1) man pages.

Activity on the CPES can be monitored by executing certain commands that run remotely, such as the remps command, which is documented in the remps(1) man page.

4.6.2.8 Multiple Program, Multiple Data (MPMD) Programs

UNICOS/mp provides the capability to run multiple programs as part of a single ApTeam and to specify which processing elements (PEs) should run which binaries. This is referred to as MPMD (Multiple Program, Multiple Data).
For additional information, see the aprun(1) man page.

4.6.2.9 X Window System Client and Libraries

UNICOS/mp supports the X Window System client (X11R6.6); the essential X Window System libraries are provided with the Cray Programming Environment releases. For additional information, see the Cray Programming Environment Releases Overview and Installation Guide.

4.6.2.10 Motif Version 2.1

Motif version 2.1 is supported and, if licensed, is provided with the Cray Programming Environment releases. For additional information, see the Cray Programming Environment Releases Overview and Installation Guide.

4.6.2.11 Access to Files via a Storage Area Network (SAN)

A storage area network (SAN) may be connected to a UNICOS/mp system for file sharing and file storage.

4.6.3 Additional Cray Added Functionality of Interest to System Administrators

This section describes additional software functionality that Cray includes for UNICOS/mp system administration.

4.6.3.1 Software Installation

The Common Installation Tool (CIT) is used to install UNICOS/mp from the CWS. UNICOS/mp installation information is provided in the UNICOS/mp Installation Guide.

4.6.3.2 UNICOS/mp Storage

The UNICOS/mp system uses RAID devices for storage. UNICOS/mp treats a RAID volume as a single logical volume and communicates with it only through a RAID controller. In effect, the entire physical disk infrastructure is invisible and under total control of the RAID subsystem; UNICOS/mp sees only virtual devices. The RAID subsystem configuration and management utilities are part of the Cray Workstation (CWS) release; for additional information, see UNICOS/mp Disks and File Systems Administration.

Note: The StorNext File System client software is included in the UNICOS/mp release. This feature supports file sharing and file storage for UNICOS/mp systems if your site chooses to use a storage area network (SAN).
4.6.3.3 UNICOS/mp File System Hierarchy

The UNICOS/mp file system hierarchy was implemented to closely comply with the Filesystem Hierarchy Standard (FHS) Version 2.2. The UNICOS/mp file system hierarchy is documented in UNICOS/mp System Administration.

4.6.3.4 Security

In addition to the security features listed in Section 4.6.1, page 43, the OpenSSH and OpenSSL optional security software products are supported on UNICOS/mp systems. These open-source products are included in the Cray Open Software (COS) package. For additional information, see Cray Open Software Release Overview and Installation Guide.

Also, UNICOS/mp 2.4.15 was successfully evaluated against criteria described in the Common Criteria for Information Technology Security Evaluation, Version 2.1. The evaluation assurance level for Cray X1 systems is EAL2, augmented with basic flaw remediation.

4.6.3.5 Network Routing

Network traffic is forwarded from a UNICOS/mp system to the Cray Network Subsystem (CNS) for network routing. The customer network uses the CNS as a gateway (router) to a UNICOS/mp system. For information about the CNS, see Cray Network Subsystem (CNS) Software Installation and Administration.

4.6.3.6 Name Service Switch

UNICOS/mp uses many databases of information about users, groups, and so forth. Data for these databases comes from a variety of sources. These sources and their lookup order can be specified in the /etc/nsswitch.conf file. For additional information, see UNICOS/mp Networking Facilities Administration and the nsswitch.conf(5) and nsdispatch(3) man pages.

4.6.3.7 Application Placement Scheduling Mechanism

The UNICOS/mp operating system employs an enhanced version of the UNICOS/mk Psched mechanism. The UNICOS/mp psched daemon supports placement of applications according to site policy, including time sharing of node memory and gang scheduling for high processor use.
In addition, the UNICOS/mk Global Resource Management functionality is included as part of the UNICOS/mp psched daemon. The UNICOS/mp psched daemon is configured through the /etc/psched.conf file as well as through the psmgr command.

The UNICOS/mp psched daemon schedules and places all applications on nodes; it includes the following major functions to support application workloads:

• Allocation of work to application nodes
• Load balancing work among the application nodes
• Gang scheduling (context switching) applications when nodes are oversubscribed

There is also a recovery feature to restore application placement scheduling if the daemon is killed or fails, and there is an interface to the checkpoint and restart facility so that a restarted application is properly allocated to the application nodes.

The psched daemon assigns applications to processing resources. The psched daemon adheres to the restrictions set by the configuration of gates and limits. When an application is launched using the aprun or mpirun command, Psched compares the aprun command options and requirements of the application to the configured limits and gates in order to appropriately place each application.

An administrative interface using the psmgr command allows configuration of the Psched functions. A set of displays for both administrators and users is provided by using the psview command.

Lower-level processor and memory scheduling is done by the UNICOS/mp kernel. The psched daemon provides information to the kernel through the apteamctl interface. Using this information, the kernel allocates the required resources to serve the applications.

Detailed information about the UNICOS/mp application placement scheduler is provided in UNICOS/mp System Administration.

4.6.3.8 UNICOS/mp Accounting

Cray has added accounting capabilities to the UNICOS/mp operating system that allow you to display account IDs and consolidate by account ID.
In addition, Cray provides ApTeam accounting records, which contain accounting information collected on a per-application basis. The ApTeam accounting record can be viewed by using the acctcom -A command. Additional information is provided in UNICOS/mp System Administration and the acctcom(1) and acct(5) man pages.

4.6.3.9 Resource Limits

The UNICOS/mp operating system allows a system administrator to dynamically assign process resource limits. Limits are defined in two dimensions: service provider type (batch or interactive) and placement class (command or application). Both maximum and initial values may be specified for each limit.

The limit_mkdb command is used to maintain the limits database. Each application is assigned a limit from the limits database when it is launched. The userlimit command is used to see the limits imposed on individual users. Additional information is provided in the limit_mkdb(8) and userlimit(1) man pages and in UNICOS/mp System Administration.

4.6.3.10 System Resiliency

The UNICOS/mp operating system provides the following system resiliency features; additional information is provided in the UNICOS/mp and CWS system administration documentation:

• System diagnostics. System diagnostics are provided for Cray X1 series systems.
• Running a system that has a failed SSP or MSP. A Cray X1 series system can be run in a degraded mode with failing SSPs and MSPs removed from the configuration.
• Disabling an SSP or an MSP on an application node. The mpadmin command allows a system administrator to disable an SSP or an MSP on an application node without requiring a system reboot.
• Capability to run with degraded memory. A Cray X1 series system allows memory on a node to be degraded in case of memory failures. Local memory can operate in two degraded modes. First, half of the memory chips on a daughter card can be disabled to tolerate the loss of a memory chip. This degraded mode cuts the memory size in half.
Second, half the daughter cards can be disabled to tolerate the failure of a daughter card. This degraded mode cuts both memory size and bandwidth in half. It is possible to degrade down to one quarter of memory on a compute module. The degraded modes affect both nodes of a Cray X1E compute module. • Automatic disk failover. A Cray X1 series system automatically uses the alternate path to a disk device if the first path fails. Manual intervention may be required to correct disk configuration in the event of a disk failover. • Path-managed disk devices. The path-managed disk device driver (pmd) manages individual devices (LUNs) and their underlying paths, combining dynamic path selection for performance with alternate path utilization for error recovery. The pm utility is used to monitor and control path-managed disk devices. • Network resiliency. The Bonded Fibre Channel facility allows two Fibre Channel links to be treated as a single, logical interface. Bonded Fibre Channel interfaces are managed by using the bfc command. 4.6.3.11 Partitioning a System It is possible to run a single system as two or more independent systems (partitions). Each partition is booted and shut down independently. A partition does not have access to I/O facilities on other partitions. Therefore, each partition must have independent I/O drawers, CNS(s), and RAID subsystems. Each partition also requires either a separate CPES or an independent connection to a common CPES. All partitions share a single CWS. Hardware and software failures in one partition do not affect other partitions. Partitioning or repartitioning requires a full system reboot. Each Cray X1 series partition requires a minimum of two nodes: a Cray X1 partition requires a minimum of two compute modules; a Cray X1E partition requires a minimum of one compute module. 
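The minimum-size rule for partitions can be checked with simple arithmetic. The following Python sketch is illustrative only and is not a Cray tool; the function and constant names are hypothetical. It assumes one node per Cray X1 compute module and two nodes per Cray X1E compute module, as given in the compute module definition in the glossary, and it also applies the restriction stated in this section that a partition cannot mix Cray X1 and Cray X1E compute modules.

```python
import math

# Nodes per compute module, per this guide's compute module definition.
NODES_PER_MODULE = {"X1": 1, "X1E": 2}

# Every Cray X1 series partition needs at least this many nodes.
MIN_NODES_PER_PARTITION = 2


def min_modules(module_type: str) -> int:
    """Smallest number of compute modules that satisfies the node minimum."""
    return math.ceil(MIN_NODES_PER_PARTITION / NODES_PER_MODULE[module_type])


def partition_is_valid(module_types: list[str]) -> bool:
    """Check one partition, given the type of each of its compute modules.

    Raises KeyError for a module type other than "X1" or "X1E".
    """
    kinds = set(module_types)
    if len(kinds) != 1:  # empty partition, or X1 and X1E modules mixed
        return False
    nodes = sum(NODES_PER_MODULE[t] for t in module_types)
    return nodes >= MIN_NODES_PER_PARTITION


print(min_modules("X1"), min_modules("X1E"))  # prints: 2 1
```

The result matches the text: two compute modules for a Cray X1 partition and one for a Cray X1E partition. The sketch does not model the other partition requirements (independent I/O drawers, CNSs, RAID subsystems, and CPES connectivity).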
Note: A partition cannot contain both Cray X1 and Cray X1E compute modules; however, a system may have a Cray X1 partition and a Cray X1E partition.

For additional information about partitioning a Cray X1 series system, see Cray X1 Series System Configuration and CWS Administration.

4.6.3.12 System Maintenance Utilities

The following system maintenance utilities are provided with UNICOS/mp systems:

• Log files. As part of the centralized logging capability, the CWS consolidates log messages from the CWS, System Control Facility, CPES, UNICOS/mp, and CNS.
• Hardware error reporting. UNICOS/mp has three utilities that provide different levels of system error warning capability: the watchlog utility, the x1wacs GUI tool, and the xl0control GUI tool. For additional information, see Cray X1 Series System Configuration and CWS Administration.
• System dump analysis. Use the dumpsys command to perform a system dump. The crashmp command is used to analyze mainframe system dumps. For more information about system dump analysis, see UNICOS/mp System Administration and the crashmp(8) and dumpsys(8) man pages.
• Tunable kernel parameters. The systune utility lets you view and change tunable kernel parameters. For additional information, see the systune(8) man page and UNICOS/mp System Administration.

Glossary

accelerated mode
One of two modes of execution for an application on UNICOS/mp systems; the other mode is flexible mode. Applications running in accelerated mode perform in a predictable period of processor time, though their wall clock time may vary depending on I/O usage, network use, and/or whether any oversubscription occurs on the relevant nodes. Due to the characteristics of the memory address space, accelerated applications must run on logically contiguous nodes. See also flexible mode.
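The contiguity requirement for accelerated applications can be illustrated with a toy model. The Python sketch below is not Psched and does not reflect its actual algorithm; it treats the nodes as a one-dimensional list (an assumption made purely for illustration), and all names are hypothetical. It shows how scattered free nodes can block an accelerated application even when enough nodes are free in total, and how compacting the running applications, as a stand-in for migration, restores a contiguous free block.

```python
from typing import Optional

Node = Optional[str]  # name of the application on the node, or None if free


def find_contiguous(nodes: list[Node], need: int) -> Optional[int]:
    """Return the start index of the first run of `need` free nodes, or None."""
    run = 0
    for i, owner in enumerate(nodes):
        run = run + 1 if owner is None else 0
        if run == need:
            return i - need + 1
    return None


def compact(nodes: list[Node]) -> list[Node]:
    """Toy migration: slide occupied nodes left so free nodes form one block."""
    used = [owner for owner in nodes if owner is not None]
    return used + [None] * (len(nodes) - len(used))


nodes = ["A", None, "B", None, "C", None, None, "D"]
print(find_contiguous(nodes, 3))           # None: 4 nodes free, longest run is 2
print(find_contiguous(compact(nodes), 3))  # 4: free nodes now fill indices 4-7
```

A flexible-mode application would not need the contiguous run at all; in this model it could be placed on any three of the four free nodes.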
air-cooled (AC)
For Cray X1 series systems, the description of the cabinet that uses a combination of Fluorinert liquid and forced air to cool the components on a compute module. The Fluorinert is air cooled within a heat exchange unit so that a chilled water supply is not required for AC Cray X1 series systems. An AC cabinet uses two blowers: one for the forced air cooling in the module cabinet, and the second for the Fluorinert heat exchange unit.

application node
For UNICOS/mp systems, a node that is used to run user applications. Application nodes are best suited for executing parallel applications and are managed by the strong application placement scheduling and gang scheduling mechanism Psched. See also node; node flavor.

ApTeam
Applications launched via the aprun or mpirun command create an ApTeam of processes. An ApTeam can be a single execution thread or a large number of processes and can use shared memory, distributed memory, or a hybrid model for parallel execution.

automatic mounting
Making a remote file system accessible (using NFS) when the file system is accessed.

brick
A Cray X1 series hardware term for four grouped compute modules. On the Cray X1 series mainframe cabinet, a brick is designated as either V or W. See also brick-pair; compute module.

brick-pair
For Cray X1 series systems, two bricks (the space required for 8 compute modules) in the front or back half (V-side or W-side for notational reference) of a liquid-cooled Cray X1 series mainframe cabinet. Air-cooled Cray X1 series systems have only the V-side brick. See also brick; compute module.

C interoperability
A Fortran 2003 feature that allows Fortran programs to call C functions and access C global objects and also allows C programs to call Fortran procedures and access Fortran global objects.

C-brick
For Cray X1 series systems, the modular controller component of a RAID subsystem, also called a controller brick. It is located in a PC-20.
A C-brick can have multiple associated RAID storage components (S-bricks).

cache domain
The subset of memory that is permitted to be cached. For Cray X1 systems, all the memory on the node where the multistreaming processor (MSP) resides. For Cray X1E systems, the cache domain for an "upper" MSP is the upper half of memory on its node, and the cache domain for a "lower" MSP is the lower half of memory on its node.

cache pollution
A delay that results from loading data into cache that will not be used before it is evicted, thereby displacing cached data that would have been used.

call stack
A software stack of functions and subroutines used by the executing program. The functions and subroutines are listed in the opposite order in which they were called. That is, the function at the bottom of the stack is the one currently executing. When function a immediately follows function b in the list, a was called by b.

co-array
A syntactic extension to Fortran that offers a method for programming data passing; a data object that is identically allocated on each image and can be directly referenced syntactically by any other image.

Common Installation Tool (CIT)
A graphical user interface (GUI) for loading software. The cit and setup commands invoke CIT.

compute module
For a Cray X1 series mainframe, the physical, configurable, scalable building block. Each compute module contains either one node with 4 MCMs/4 MSPs (Cray X1 modules) or two nodes with 4 MCMs/8 MSPs (Cray X1E modules). Sometimes referred to as a node module. See also node.

Cray Fortran Compiler
The compiler that translates Fortran programs into Cray object files. The Cray Fortran Compiler fully supports the Fortran language through the Fortran 95 Standard, ISO/IEC 1539-1:1997. Selected features from the proposed Fortran 2003 Standard are also supported.
Cray Network Subsystem (CNS) A specialized router (gateway) providing IP (Internet Protocol) network connectivity between a supported Cray mainframe and site networks. The CNS also provides Transmission Control Protocol (TCP) assist functionality that can enhance TCP performance in supported Cray systems. Cray Open Software (COS) A collection of public-domain software that is an optional product available for certain Cray systems. COS provides options that are either not available in or behave differently than Cray operating system utilities. Cray Programming Environment Server (CPES) A server for the Cray X1 series system that runs the Programming Environment software. S–2346–25 55Cray X1™ Series System Overview Cray Storage Management (CRAYSM) Software on the Cray Workstation (CWS) that is used for configuring the RAID subsystem. Cray streaming directives (CSDs) Nonadvisory directives that allow you to more closely control multistreaming for key loops. Cray Workstation (CWS) For Cray X1 series systems, the system operation, administration, and maintenance workstation. CrayDoc Cray’s documentation system for accessing and searching Cray books, man pages, and glossary terms from a web browser. CrayPat For Cray X1 series systems, the primary high-level tool for identifying opportunities for optimization. CrayPat allows you to perform profiling, sampling, and tracing experiments on an instrumented application and to analyze the results of those experiments; no recompilation is needed to produce the instrumented program. In addition, the CrayPat tool provides access to all hardware performance counters. D-cache For Cray X1 series systems, a small, high-speed random-access memory that stores frequently or recently accessed data (data cache). The second level of cache storage for scalar operations, it is located between the registers and the E-cache. 
distributed memory
The kind of memory in a parallel processor where each processor has fast access to its own local memory and must send a message via the interprocessor network to access another processor’s memory.

E-cache
For Cray X1 series systems, the second level of cache storage. It is located between node memory and a multistreaming processor (MSP). It is provided by the E-chips (for Cray X1 systems) or E+ chips (for Cray X1E systems) in a multichip module and shared by all processors in an MSP. All data that enters an MSP routes through the E-cache. Scalar, vector, and instruction data can be stored in the E-cache.

Etnus TotalView
For UNICOS/mp systems, a symbolic source-level debugger designed for debugging the multiple processes of parallel Fortran, C, or C++ programs.

explicit communication
A programming style in which communication is language independent and in which communication occurs through library calls. The message-passing and explicit shared memory programming methods use products such as MPI or SHMEM.

failover
The ability to define and manage alternate paths to a single disk device or part of a disk device.

flexible mode
One of two modes of execution for an application on UNICOS/mp systems; the other mode is accelerated mode. Applications running in flexible mode may run on noncontiguous nodes; they perform in a less predictable amount of processor time than applications running in accelerated mode due to the exclusive use of source processor address translation. See also accelerated mode.

granular privilege mechanism
A security feature that divides the power of the superuser into discrete units of privilege called capabilities.

I/O Channel Adapter (IOCA)
For Cray X1 series systems, a field programmable gate array, resident on an IOB, that converts proprietary System Port Channel transactions into standard PCI-X bus transactions.
I/O drawer (IOD)
For Cray X1 series systems, the hardware enclosure that contains two I/O boards (IOBs) and a Cray L1 controller assembly. Each I/O drawer supports System Port Channels (SPCs) to the Cray X1 series mainframe cabinet and Fibre Channel connections to I/O devices. (Also known as an I/O module.)

liquid-cooled (LC)
Describing a Cray X1 series system that has a chilled water supply. An LC system uses two chilled-water heat exchange systems: one to cool the Fluorinert liquid, and one to cool the forced air that cools compute modules.

load balancing
A process for ensuring that each processor in a job performs equal work.

loopmark listing
A listing that is generated by invoking the Cray Fortran Compiler with the -rm option. The loopmark listing displays what optimizations were performed by the compiler and tells you which loops were vectorized, streamed, unrolled, interchanged, and so on.

Message Passing Interface (MPI)
A widely accepted standard for communication among nodes that run a parallel program on a distributed-memory system. MPI is a library of routines that can be called from Fortran, C, and C++ programs.

Message Passing Toolkit (MPT)
A Cray product that consists of the Message Passing Interface and shared distributed memory (SHMEM) data-passing routines.

module file
A metafile that defines information specific to an application or collection of applications. (This term is not related to the module statement of the Fortran language; it is related to setting up the UNICOS/mp system environment.) For example, to define the paths, command names, and other environment variables to use the Programming Environment for the Cray X1 series systems, you use the module file PrgEnv, which contains the base information needed for application compilations. The module file mpt sets a number of environment variables needed for message passing and data passing application development.
Modules
A package on the UNICOS/mp system that allows you to dynamically modify your user environment by using module files. (This term is not related to the module statement of the Fortran language; it is related to setting up the UNICOS/mp system environment.) The user interface to this package is the module command, which provides a number of capabilities to the user, including loading a module file, unloading a module file, listing which module files are loaded, determining which module files are available, and others.

MSP mode (multistreaming mode)
One of two types of application modes for UNICOS/mp systems. Programs are compiled either as MSP-mode applications (default) or SSP-mode applications. MSP-mode applications run on one or more MSPs. For MSP-mode applications, each MSP coordinates the interactions of its associated four SSPs. See also SSP mode.

multichip module (MCM)
For Cray X1 series systems, the physical packaging that contains processor chips and cache chips. The chips implement either one multistreaming processor (Cray X1 MCM) or two multistreaming processors (Cray X1E MCM). See also MSP.

multistreaming processor (MSP)
For UNICOS/mp systems, a basic programmable computational unit. Each MSP is analogous to a traditional processor and is composed of four single-streaming processors (SSPs) and E-cache that is shared by the SSPs. See also node; SSP; MSP mode; SSP mode.

network file system (NFS)
A file system that is exported from a remote server and mounted on other hosts across a network.

Network Information Service (NIS)
A network-based information service and an administrative tool. It allows centralized database administration and a distributed lookup service. NIS supports multiple databases based on regular text files. For example, NIS databases can be generated from the hosts, passwd, group, and aliases files on the NIS master.
Network Time Protocol (NTP)
For Cray X1 series systems, the protocol that allows synchronizing the clocks in the System Controller and Cray mainframe by pointing the mainframe clocks to the CWS. NTP is used by the CWS to keep the Cray L1 controllers synchronized. This provides consistent time stamps among the various logs for use in diagnosing system problems.

node
For UNICOS/mp systems, the logical group of four multistreaming processors (MSPs), cache-coherent shared local memory, high-speed interconnections, and system I/O ports. A Cray X1 system has one node with 4 MSPs per compute module. A Cray X1E system has two nodes of 4 MSPs per node, providing a total of 8 MSPs on its compute module. Software controls how a node is used: as an OS node, application node, or support node. See also compute module; MCM; MSP; node flavor; SSP.

node flavor
For UNICOS/mp systems, software controls how a node is used. A node’s software-assigned flavor dictates the kind of processes and threads that can use its resources. The three assignable node flavors are application, OS, and support. See also application node; OS node; support node; system node.

OpenMP
An industry-standard, portable model for shared memory parallel programming.

OS node
For UNICOS/mp systems, the node that provides kernel-level services, such as system calls, to all support nodes and application nodes. See also node; node flavor.

page size
The unit of memory addressable through the Translation Lookaside Buffer (TLB). For a UNICOS/mp system, the base page size is 65,536 bytes, but larger page sizes (up to 4,294,967,296 bytes) are also available.

partitioning
Configuring a UNICOS/mp system into logical systems (partitions). Each partition is independently operated, booted, dumped, and so on without impact on other running partitions. Hardware and software failures in one partition do not affect other partitions.
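As a worked illustration of the page size entry above, the following C sketch computes how many pages an allocation spans at the base page size and at the largest page size quoted there. The helper name pages_needed and the two macro names are hypothetical and are not part of any UNICOS/mp interface; only the two byte counts come from the glossary.

```c
#include <stdint.h>

/* Page sizes quoted in the glossary entry (names are illustrative only). */
#define BASE_PAGE_BYTES UINT64_C(65536)       /* base page size      */
#define MAX_PAGE_BYTES  UINT64_C(4294967296)  /* largest page size   */

/*
 * Hypothetical helper: number of pages of size page_bytes needed to
 * hold nbytes, rounding up to a whole page. page_bytes must be nonzero.
 */
uint64_t pages_needed(uint64_t nbytes, uint64_t page_bytes)
{
    return nbytes / page_bytes + (nbytes % page_bytes != 0);
}
```

Under these assumptions, a 4,294,967,296-byte region spans 65,536 base pages but only one page of the largest size, so far fewer TLB entries are needed to map the same memory when larger pages are used.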
PC-20
For Cray X1 series systems, the peripheral cabinet that can contain either RAID C-bricks and RAID S-bricks or a Cray Programming Environment Server (CPES), peripherals for the CPES, and the Cray Network Subsystem (CNS).

Private Administration subnetwork
The Cray X1 series system internal Ethernet network that connects the Cray Workstation to the Cray Programming Environment Server and the Cray Network Subsystem. This network provides access to the hardware that is used to configure and operate the programming environment and a site’s private network.

program counter
A hardware element that contains the address of the instruction currently executing.

Psched
The UNICOS/mp application placement scheduling tool. The psched command can provide job placement, load balancing, and gang scheduling for all applications placed on application nodes.

RAID subnetwork
The Cray X1 series internal Ethernet that connects the CWS and the RAID subsystem. The Cray Storage Management (CRAYSM) software uses this network to configure and monitor the RAID subsystem.

router module
For Cray X1 series systems, the physical hardware that connects compute modules.

S-brick
For Cray X1 series systems, the modular, scalable storage component of a RAID subsystem, also called a storage brick. It is located in a PC-20. Each S-brick contains a number of Fibre Channel disk drives. The number of drives per S-brick and the size of each disk vary with the different S-brick modules.

SHMEM
A library of optimized functions and subroutines that take advantage of shared memory to move data between the memories of processors. The routines can either be used by themselves or in conjunction with another programming style such as Message Passing Interface. SHMEM routines can be called from Fortran, C, and C++ programs.

single-streaming processor (SSP)
For UNICOS/mp systems, a basic programmable computational unit. See also node; MSP; MSP mode; SSP mode.
SSP mode (single-streaming mode)
One of two types of application modes for UNICOS/mp systems. Programs are compiled either as MSP-mode applications (default) or SSP-mode applications. SSP-mode applications run on one or more SSPs. Each SSP runs independently of the others, executing its own stream of instructions. In contrast, compiler options enable the programmer to develop command-mode programs that run on an SSP on the support node. See also MSP mode.

stride
The relationship between the layout of an array’s elements in memory and the order in which those elements are accessed. A stride of 1 means that memory-adjacent array elements are accessed on successive iterations of an array-processing loop.

support node
For UNICOS/mp systems, the node that is used to run serial commands, such as shells, editors, and other user commands (ls, for example). See also node; node flavor.

system node
For UNICOS/mp systems, the node that is designated as both an OS node and a support node; this node is often called a system node; however, there is no node flavor of "system." See also node; node flavor.

System Port Channel (SPC)
The proprietary I/O channel that provides I/O access to the Cray X1 series system by connecting the I-chip to an I/O Channel Adapter on an I/O board.

thread
The active entity of execution. A sequence of instructions together with machine context (processor registers) and a stack. On a parallel system, multiple threads can be executing parts of a program at the same time.

trigger
A command that a user logged into a Cray X1 series system uses to launch Programming Environment components residing on the CPES. Examples of trigger commands are ftn, CC, and pat_build.

type
A means for categorizing data. Each intrinsic and user-defined data type has four characteristics: a name, a set of values, a set of operators, and a means to represent constant values of the type in a program.
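The stride entry above can be made concrete with a short C sketch. The function names sum_strided and sum_column are hypothetical and chosen only for illustration. Because C stores two-dimensional arrays in row-major order, walking along a row is a stride-1 access, while walking down a column has a stride equal to the number of columns; Fortran, which is column-major, behaves the other way around.

```c
#include <stddef.h>

/*
 * Sum every stride-th element of a[0..n-1], starting at index 0.
 * stride == 1 touches memory-adjacent elements on successive
 * iterations; larger strides skip over elements. stride must be nonzero.
 */
double sum_strided(const double *a, size_t n, size_t stride)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i += stride)
        total += a[i];
    return total;
}

/*
 * Sum one column of a rows-by-cols matrix stored in row-major order.
 * Successive iterations are cols elements apart in memory, so this
 * loop has a stride of cols rather than 1.
 */
double sum_column(const double *m, size_t rows, size_t cols, size_t col)
{
    double total = 0.0;
    for (size_t r = 0; r < rows; r++)
        total += m[r * cols + col];
    return total;
}
```

When a loop like sum_column dominates run time, interchanging loops or changing the array layout so that the innermost loop runs at stride 1 is a common first thing to try; whether it helps on a given system is something to measure rather than assume.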
UNICOS/mp
The operating system for Cray X1 series (Cray X1 and Cray X1E) systems.

Unified Parallel C (UPC)
An extension of the C programming language designed for high performance computing on large-scale parallel processing machines. UPC provides a uniform programming model for both shared and distributed memory hardware. Other parallel programming models include Message Passing Interface, SHMEM, Co-array Fortran, and OpenMP.

uniform memory access (UMA)
A system in which any memory element can be read from or written to in the same, constant time. All processing elements of a single-node application have uniform access to the node’s memory. Contrast to nonuniform memory access (NUMA).

vector
A series of values on which instructions operate; this can be an array or any subset of an array such as row, column, or diagonal. Applying arithmetic, logical, or memory operations to vectors is called vector processing. See also vector processing.

vector processing
A form of instruction-level parallelism in which the vector registers are used to perform iterative operations in parallel on the elements of an array, with each iteration producing up to 64 simultaneous results. See also vector.

vectorization
The process, performed by the compiler, of analyzing code to determine whether it contains vectorizable expressions and then producing object code that uses the vector unit to perform vector processing.

X1WACS
The Cray X1 series warning and control system, run by the x1wacs program, that enables the user to control power and monitor hardware status for one or more mainframe cabinets and one or more associated I/O cabinets.

XFS file system
The standard UNICOS/mp file system; a journaling, mature file system with support for very large individual files and file systems.
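To illustrate the vector processing and vectorization entries above, here is a minimal C sketch contrasting a loop whose iterations are independent with one that carries a dependence from one iteration to the next. The function names scaled_add and running_sum are hypothetical. Whether a particular compiler actually vectorizes the first loop depends on the compiler and its options; the sketch only shows the property the analysis looks for.

```c
#include <stddef.h>

/*
 * Independent iterations: y[i] depends only on x[i] and the old y[i].
 * The restrict qualifiers (C99) promise that x and y do not overlap,
 * which removes a possible dependence the compiler would otherwise
 * have to assume. This is a typical candidate for vectorization.
 */
void scaled_add(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/*
 * Loop-carried dependence: a[i] needs the value of a[i - 1] produced
 * by the previous iteration, so the iterations cannot simply be
 * executed as independent vector operations in this form.
 */
void running_sum(size_t n, float *a)
{
    for (size_t i = 1; i < n; i++)
        a[i] += a[i - 1];
}
```

A loopmark listing, as described earlier in this glossary, is the place to check which of these loops the Cray compiler reports as vectorized.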
Migrating Applications to the Cray X1™ Series Systems

S–2378–54

© 2003-2005 Cray Inc. All Rights Reserved. This manual or parts thereof may not be reproduced in any form unless permitted by contract or by written permission of Cray Inc.

U.S. GOVERNMENT RESTRICTED RIGHTS NOTICE

The Computer Software is delivered as "Commercial Computer Software" as defined in DFARS 48 CFR 252.227-7014. All Computer Software and Computer Software Documentation acquired by or for the U.S. Government is provided with Restricted Rights. Use, duplication or disclosure by the U.S. Government is subject to the restrictions described in FAR 48 CFR 52.227-14 or DFARS 48 CFR 252.227-7014, as applicable. Technical Data acquired by or for the U.S. Government, if any, is provided with Limited Rights. Use, duplication or disclosure by the U.S. Government is subject to the restrictions described in FAR 48 CFR 52.227-14 or DFARS 48 CFR 252.227-7013, as applicable.
Autotasking, Cray, Cray Channels, Cray Y-MP, GigaRing, LibSci, MPP Apprentice, SuperCluster, UNICOS and UNICOS/mk are federally registered trademarks and Active Manager, CCI, CCMT, CF77, CF90, CFT, CFT2, CFT77, ConCurrent Maintenance Tools, COS, Cray Ada, Cray Animation Theater, Cray APP, Cray Apprentice 2 , Cray C++ Compiling System, Cray C90, Cray C90D, Cray CF90, Cray EL, Cray Fortran Compiler, Cray J90, Cray J90se, Cray J916, Cray J932, Cray MTA, Cray MTA-2, Cray MTX, Cray NQS, Cray Research, Cray SeaStar, Cray S-MP, Cray SSD-T90, Cray SuperCluster, Cray SV1, Cray SV1ex, Cray SX-5, Cray SX-6, Cray T3D, Cray T3D MC, Cray T3D MCA, Cray T3D SC, Cray T3E, Cray T90, Cray T916, Cray T932, Cray UNICOS, Cray X1, Cray X1E, Cray XT3, Cray XD1, Cray X-MP, Cray XMS, Cray Y-MP EL, Cray/REELlibrarian, Cray-1, Cray-2, Cray-3, CrayDoc, CrayLink, Cray-MP, CrayPacs, CraySoft, CrayTutor, CRI/TurboKiva, CRInform, CSIM, CVT, Delivering the power..., Dgauss, Docview, EMDS, HEXAR, HSX, IOS, ISP/Superlink, ND Series Network Disk Array, Network Queuing Environment, Network Queuing Tools, OLNET, RapidArray, RQS, SEGLDR, SMARTE, SSD, SUPERLINK, System Maintenance and Remote Testing Environment, Trusted UNICOS, TurboKiva, UNICOS MAX, UNICOS/lc, and UNICOS/mp are trademarks of Cray Inc. Etnus and TotalView are trademarks of Etnus LLC. IBM is a trademark of International Business Machines Corporation. UNIX, the "X device," X Window System, and X/Open are trademarks of The Open Group in the United States and other countries. All other trademarks are the property of their respective owners. The UNICOS, UNICOS/mk, and UNICOS/mp operating systems are derived from UNIX System V. 
These operating systems are also based in part on the Fourth Berkeley Software Distribution (BSD) under license from The Regents of the University of California.

New Features

Migrating Applications to the Cray X1™ Series Systems S–2378–54

Changes made to this manual document issues related to migrating applications from Cray SV1 series and Cray T3E systems to Cray X1 series systems:

Interlanguage communications
Added a note that interlanguage communications using C++ on Cray X1 series systems requires a C++ main routine. See Section 4.1.

Record of Revision

Version   Description

4.3       March 31, 2003
          Original Printing. Draft to support the Programming Environment 4.3, Message Passing Toolkit 2.1, and UNICOS/mp 2.1 releases. Migration information has been moved from the Cray X1 User Environment Differences to this manual. The Cray X1 User Environment Differences has been renamed Cray X1 User Environment Differences.

5.0       June 2003
          Supports the Programming Environment 5.0 and Message Passing Toolkit 2.2 releases.

5.1       October 2003
          Supports the Programming Environment 5.1 and Message Passing Toolkit 2.2 releases.

5.2       April 2004
          Supports the Programming Environment 5.2 and Message Passing Toolkit 2.3 releases.

5.3       November 2004
          Supports the Programming Environment 5.3 and Message Passing Toolkit 2.4 releases.

5.4       March 2005
          Supports the Programming Environment 5.4 and Message Passing Toolkit 2.5 releases.

Contents

Preface
    Accessing Product Documentation
    Conventions
    Reader Comments

Introduction [1]
    Related Manuals

Migrating Fortran Applications [2]
    Reviewing the Source Code
    Data Sizes
    Default Data Sizes
    Explicit Data Sizes
    Changes in the Libraries
    Library Routines and Intrinsic Procedures
    Fortran Interfaces Replaced by PXF Routines
    Unsupported Cray Fortran Routines
    Cray IEEE Interface Routines
    Cray Math Routines
    MPI and SHMEM Routines
    Cray Scientific Library Routines
    Unsupported Fortran Compiler Directives
    Conditional Compilation Macros
    Subprogram Arguments
    POINTER Objects
    Input/Output
    Numeric Data Conversion
    OpenMP
    Tasking
    Vector Dependencies
    Compiling the Code
    Linking the Code
    Testing the Code
    Optimizing the Code

Migrating C and C++ Applications [3]
    Reviewing the Source Code
    Data Sizes
    pragma Directives
    Changes in the Libraries
    Library Routines and Intrinsic Procedures
    Unsupported Library Routines
    MPI and SHMEM Routines
    Cray Scientific Library Routines
    Bit-wise Operations
    Shift Operations
    Example 1: Shift Operation (no data modifier)
    Example 2: Shift Operation (with data modifier)
    Wide Characters and Multiple Locales
    C++ Headers
    Instantiation Files
    AT&T C++ Compatibility Headers
    Vector Dependencies
    Optimizations for std::complex Class Causes Byte Alignment Changes
    Compiling the Code
    Linking the Code
    Testing the Code
    Optimizing the Code

Interlanguage Communications [4]
    Fortran to C or C++ Calls
    Passing or Receiving char Objects
    Example 3: Fortran Program Calling C Function
    Example 4: Called C Routine (Cray SV1 series or Cray T3E system)
    Example 5: Called C Routine (Converted for Cray X1 series system)
    Example 6: Function call with return type of CHARACTER
    Passing Noncharacter Data Types
    Interfacing to System Calls
    C or C++ to Fortran Calls
    Passing or Receiving String Arguments
    Example 7: Called Fortran Subroutine
    Example 8: C Program Calling Fortran Routine (Cray SV1 series or Cray T3E system)
    Example 9: C Program Calling Fortran Routine (Converted for Cray X1 series system)
    Passing or Receiving Noncharacter Arguments
    Fortran, C, and C++ Data Sizes

Glossary

Index

Tables

Table 1. Fortran Default Data Sizes (in Bits)
Table 2. Fortran Data Sizes of Explicit KIND Types (in Bits)
Table 3. Fortran Scaling Factor in Pointer Arithmetic
Table 4. Implied Data Sizes for Basic C and C++ Data Types
Table 5. Fortran, C, and C++ Data Objects with the Same Data Sizes

Preface

The information in this preface is common to Cray documentation provided with this software release.

Accessing Product Documentation

With each software release, Cray provides books and man pages, and in some cases, third-party documentation. These documents are provided in the following ways:

• CrayDoc, the Cray documentation delivery system that allows you to quickly access and search Cray books, man pages, and in some cases, third-party documentation. Access this HTML and PDF documentation via CrayDoc at the following URLs:
    – The local network location defined by your system administrator
    – The CrayDoc public website: docs.cray.com
• Man pages. Access man pages by entering the man command followed by the name of the man page. For more information about man pages, see the man(1) man page by entering:
    % man man
• Third-party documentation not provided through CrayDoc. Access this documentation, if any, according to the information provided with that product.
Conventions

These conventions are used throughout Cray documentation:

Convention    Meaning

command
    This fixed-space font denotes literal items, such as file names, pathnames, man page names, command names, and programming language elements.

variable
    Italic typeface indicates an element that you will replace with a specific value. For instance, you may replace filename with the name datafile in your program. It also denotes a word or concept being defined.

user input
    This bold, fixed-space font denotes literal items that the user enters in interactive sessions. Output is shown in nonbold, fixed-space font.

[ ]
    Brackets enclose optional portions of a syntax representation for a command, library routine, system call, and so on.

...
    Ellipses indicate that a preceding element can be repeated.

name(N)
    Denotes man pages that provide system and programming reference information. Each man page is referred to by its name followed by a section number in parentheses. Enter:
    % man man
    to see the meaning of each section number for your particular system.

Reader Comments

Contact us with any comments that will help us to improve the accuracy and usability of this document. Be sure to include the title and number of the document with your comments. We value your comments and will respond to them promptly. Contact us in any of the following ways:

E-mail: docs@cray.com
Telephone (inside U.S., Canada): 1–800–950–2729 (Cray Customer Support Center)
Telephone (outside U.S., Canada): +1–715–726–4993 (Cray Customer Support Center)
Mail:
    Software Publications
    Cray Inc.
    1340 Mendota Heights Road
    Mendota Heights, MN 55120–1128
    USA

Introduction [1]

The audience for this document is the programmer who is migrating applications from a Cray SV1 series or Cray T3E system to a Cray X1 series system.
This document describes the processes for migrating Fortran, C, and C++ applications and for resolving interlanguage communications differences. You should be familiar with the Programming Environment on Cray X1 series systems before migrating applications. In addition, you should understand the basic features of Cray X1 series systems, which are described in the Cray X1 Series System Overview.

The Programming Environment on a Cray X1 series system is very similar to that on Cray SV1 series and Cray T3E systems. That is, most Cray Fortran, Cray C, and Cray C++ compiler and language features are supported by Cray X1 series systems and perform the same functions as on Cray SV1 series and Cray T3E systems. Therefore, most programs developed for systems running the Cray Programming Environment 3.6 release, and the scripts used to compile them, may not require any changes to compile and run successfully on a Cray X1 series system.

However, a Cray X1 series system does have hardware and software enhancements for which you may need to modify your programs or scripts. For example, some compiler defaults are different, but you can change these settings to achieve the compiler behavior that you expect.

1.1 Related Manuals

The following documents contain information that may be helpful:

• Cray X1 Series System Overview
• Cray Fortran Compiler Commands and Directives Reference Manual
• Fortran Language Reference Manual, Volume 1
• Fortran Language Reference Manual, Volume 2
• Fortran Language Reference Manual, Volume 3
• Cray C and C++ Reference Manual
• Optimizing Applications on Cray X1 Series Systems
• TotalView Release Overview, Installation Guide, and User’s Guide Addendum for Cray X1 Systems

Migrating Fortran Applications [2]

Cray X1 series systems support most Cray SV1 series and Cray T3E systems Fortran compiler and language features.
Therefore, many of the programs developed for the Cray Programming Environment 3.6 release will compile and run successfully on a Cray X1 series system without changes. However, differences in hardware architecture and Programming Environment products may cause problems when you attempt to migrate your applications to a Cray X1 series system.

This chapter describes the process for migrating Fortran applications written for Cray SV1 series or Cray T3E systems to Cray X1 series systems.

Note: See Cray X1 User Environment Differences for the list of differences.

The following sections describe the process for migrating Fortran applications:

• Reviewing the Source Code (Section 2.1)
• Compiling the Code (Section 2.2)
• Linking the Code (Section 2.3)
• Testing the Code (Section 2.4)
• Optimizing the Code (Section 2.5)

2.1 Reviewing the Source Code

Before you attempt to compile programs, examine the source code for problems. You may need to modify the code for the application to execute properly on a Cray X1 series system.

2.1.1 Data Sizes

Many migration problems stem from data size mismatches. Ensure that Cray SV1 series or Cray T3E system data types have the desired size on a Cray X1 series system. Make sure you do not convert objects with larger data sizes on Cray SV1 series or Cray T3E systems to objects with smaller data sizes on a Cray X1 series system.

2.1.1.1 Default Data Sizes

For numerical objects with default data sizes, use the -s default64 or -s default32 compiler command option. The compiler default is -s default32. See Table 1 for a comparison of Cray SV1 series, Cray T3E systems, and Cray X1 series system data sizes.

Table 1.
Fortran Default Data Sizes (in Bits)

Intrinsic Type               Cray SV1        Cray T3E    Cray X1 series system   Cray X1 series system
                             series system   system      (-s default32)          (-s default64)
CHARACTER                    8               8           8                       8
INTEGER                      64              64          32                      64
LOGICAL                      64              64          32                      64
REAL                         64              64          32                      64
DOUBLE PRECISION             128             64          64                      128
COMPLEX                      128             128         64                      128
COMPLEX (double precision)   256             128         128                     256

If you want to specify the default data size of just INTEGERs or REALs, you can use the following compiler command options instead:

• -s integer32
• -s integer64
• -s real32
• -s real64

Note: If you mix different size variables, be careful how you handle storage associated with those variables. See the -s size option in Cray Fortran Compiler Commands and Directives Reference Manual for details.

2.1.1.2 Explicit Data Sizes

For numerical data objects with explicit data sizes, Cray X1 series systems support the same data sizes as Cray T3E systems, so few or no changes are required for Cray T3E Fortran programs that use KIND or star values.

A Cray X1 series system can also support the same data sizes that are used on a Cray SV1 series system for numerical objects that use explicit KIND or star values. If Cray SV1 series and Cray X1 series data sizes are the same, no changes to KIND or star values are needed. In other cases, you must select a different KIND or star value to ensure the data size is at least as large as the corresponding object on a Cray SV1 series system. Table 2 indicates the explicit KIND or star values you should select on a Cray X1 series system to maintain the same data size that the program expects.

Note: The table also shows the corresponding data size when you use the -e h compiler option. Cray discourages enabling this option because it causes poorer performance than when the option is disabled. See Cray Fortran Compiler Commands and Directives Reference Manual for further information.
The -s size options (such as -s default64) do not affect the size of data objects with explicit KIND or star values.

Table 2. Fortran Data Sizes of Explicit KIND Types (in Bits)

Type (KIND=byte_count)   Cray SV1        Cray T3E   Cray X1 series     Cray X1 series
                         series system   system     system (default)   system (-e h)
INTEGER(KIND=1)          64              32         32                 8
INTEGER(KIND=2)          64              32         32                 16
INTEGER(KIND=4)          64              32         32                 32
INTEGER(KIND=8)          64              64         64                 64
LOGICAL(KIND=1)          64              32         32                 8
LOGICAL(KIND=2)          64              32         32                 16
LOGICAL(KIND=4)          64              32         32                 32
LOGICAL(KIND=8)          64              64         64                 64
REAL(KIND=4)             64              32         32                 32
REAL(KIND=8)             64              64         64                 64
REAL(KIND=16)            128             64         128                128
COMPLEX(KIND=4)          128             64         64                 64
COMPLEX(KIND=8)          128             128        128                128
COMPLEX(KIND=16)         256             128        256                256

2.1.2 Changes in the Libraries

Check the code for invalid library calls. Potential problems are calls to:

• Library routines and intrinsic procedures (Section 2.1.2.1)
• Fortran interfaces replaced by PXF routines (Section 2.1.2.2)
• Unsupported Cray Fortran routines (Section 2.1.2.3)
• Cray IEEE interface routines (Section 2.1.2.4)
• Cray math routines (Section 2.1.2.5)
• MPI and SHMEM routines (Section 2.1.2.6)
• Cray scientific library routines (Section 2.1.2.7)

2.1.2.1 Library Routines and Intrinsic Procedures

The Fortran library supports the standard Fortran language and some extensions. On Cray X1 series systems, the data type of the return result for the clock intrinsic function is CHARACTER*8 instead of BOOLEAN as on Cray SV1 series and Cray T3E systems.

2.1.2.2 Fortran Interfaces Replaced by PXF Routines

Check the code for system calls. Many interfaces to system calls (such as CHDIR) have a counterpart PXF POSIX library routine (such as PXFCHDIR).
Code using PXF routines is more portable because PXF routines are supported on many platforms by many vendors. PXF routines provide the Fortran interfaces to the system calls and return the status from a system call through a status argument. Cray recommends that you change the code to use the PXF routines.

For a list of the PXF POSIX library routines, see the intro_pxf(3f) man page. For a list of PXF equivalents for the system interface routines that are not supported on Cray X1 series systems, see Cray X1 User Environment Differences.

If your code contains function calls that are included in the POSIX threads specification, you should not have to modify your code.

2.1.2.3 Unsupported Cray Fortran Routines

Some Cray Fortran routines are not supported on Cray X1 series systems:

• Fortran interfaces to system calls
• Miscellaneous routines
• I/O extensions
• Resource management extensions
• Program analysis routines

For details, see Cray X1 User Environment Differences.

2.1.2.4 Cray IEEE Interface Routines

The implementation of most Cray Fortran IEEE interface routines will change or be replaced in a future Programming Environment release for Cray X1 series systems to conform to the interface defined by Fortran 2003. See Cray Programming Environment Releases Overview and Installation Guide or the appropriate man pages for specific information.

2.1.2.5 Cray Math Routines

Cray SV1 series systems have a different floating-point format than the IEEE format provided on Cray X1 series systems. The 32-bit, 64-bit, and 128-bit IEEE floating-point formats are supported on Cray X1 series systems.

2.1.2.6 MPI and SHMEM Routines

Check calls to Message Passing Interface (MPI) and shared memory access (SHMEM) library routines to ensure that the loader will link the correct library routines. The default libraries have 32-bit integer and real data types. The 64-bit libraries have 64-bit integer and real data types.
They most closely match the MPI and SHMEM libraries on Cray SV1 series and Cray T3E systems.

If the code you are migrating uses 64-bit data types on Cray SV1 series or Cray T3E systems, use the -s default64 compiler command option during the linking and loading phases. See Section 2.3 for details about the loading and linking process.

If the code you are migrating used 32-bit data types on Cray SV1 series or Cray T3E systems, you do not need to include the -s size option because on Cray X1 series systems the default is -s default32.

2.1.2.7 Cray Scientific Library Routines

Check the code for calls to scientific library routines. If the code contains calls to such routines, note the following differences between Cray SV1 series or Cray T3E systems and Cray X1 series systems:

• On Cray X1 series systems, LibSci does not contain exactly the same routines as on Cray SV1 series and Cray T3E systems. Support for certain routines has been dropped and double precision names have been added. See Cray X1 User Environment Differences for details.
• On Cray X1 series systems, the default data sizes are 32 bits, and the default LibSci supports these sizes. The default LibSci library contains both single and double precision routines. Integer and floating-point variables occupy 32 bits and double precision variables 64 bits. For more details on data sizes and single and double precision routines, see the intro_libsci(3s) man page and Cray X1 User Environment Differences.
• On Cray SV1 series and Cray T3E systems, LibSci included only single precision routines. These routines used 64-bit floating point and integer operations. On a Cray X1 series system, to continue to use LibSci in the same manner as before on Cray SV1 series and Cray T3E systems, compile and link the code with the -s default64 compiler option. This will link the code with the 64-bit LibSci instead of the default LibSci.
See Section 2.3 for details about the loading and linking process.
• Calls to FFT and FFT-based convolution LibSci routines may require special attention because the table and work array types and lengths on Cray X1 series systems differ from those on Cray SV1 series and Cray T3E systems. See the intro_fft(3s) man page and individual FFT man pages for details.

2.1.3 Unsupported Fortran Compiler Directives

Cray X1 series systems support most directives that are supported on Cray SV1 series and Cray T3E systems. Some directives are not supported because either the tool is not supported on a Cray X1 series system or similar hardware does not exist. See Cray X1 User Environment Differences for details.

2.1.4 Conditional Compilation Macros

Cray X1 series systems support the _CRAY macro used on Cray SV1 series and Cray T3E systems. Code developed on Cray SV1 series or Cray T3E systems may require attention if it calls code developed by another vendor that uses this macro. If you have code that uses a particular hardware or software feature that is not supported on Cray X1 series systems, replace the _CRAY macro with a combination of macros that indicate Cray SV1 series or Cray T3E systems: _CRAY1 and _CRAYMPP. See Cray X1 User Environment Differences for details.

2.1.5 Subprogram Arguments

Check the data type information of subprogram arguments and choose the proper library entry points when generic interfaces are present. Either add the USE FTN_LIB_DEFINITIONS statement at the beginning of each Fortran program unit or use the -A FTN_LIB_DEFINITIONS option on the compiler command line.

The Fortran module FTN_LIB_DEFINITIONS provides a set of interface blocks. For example, if you call the flush routine with only one argument (or with two 64-bit arguments), the Fortran compiler uses the generic interface block in FTN_LIB_DEFINITIONS to pass correct information to the proper library routine.
Otherwise, the library routine will abort with a memory fault for the missing argument or return the error not connected if the stat argument is present. Library routines such as second and timef cannot have a generic interface since they do not have any arguments, but the compiler can diagnose incorrect typing of these functions if they are not declared EXTERNAL procedures.

Verify that the KIND values of the arguments to a function or subroutine are correct. Using the -s default64 compiler option changes the size of any default integer or logical data type from 32 bits to 64 bits, the size of real data types from 32 bits to 64 bits, and the size of double precision data types from 64 bits to 128 bits. A nonintrinsic library routine will still expect the default sizes unless it specifically calls for the larger size. Consider the following examples:

• Return Elapsed CPU Time

  secnd=second()
  call second(secnd)

  The second routine is REAL(KIND=4). Therefore, if the data size is 32 bits, make sure that secnd is either REAL or REAL(KIND=4). If the data size is 64 bits, make sure that secnd is explicitly declared REAL(KIND=4), because the default REAL is then 64 bits.

• Return Elapsed Wall Clock Time

  timf=timef()
  call timef(timf)

  The timef routine is REAL(KIND=8). Therefore, if the data size is 32 bits, make sure that timf is REAL(KIND=8). If the data size is 64 bits, make sure that timf is either REAL or REAL(KIND=8).

• Return a Command Line Argument

  call pxfgetarg(m,buf,ilen,ierror)

  The m, buf, ilen, and ierror arguments are INTEGER(KIND=4). Therefore, if the data size is 32 bits, make sure that these values are either INTEGER or INTEGER(KIND=4). If the data size is 64 bits, make sure that these values are explicitly declared INTEGER(KIND=4), because the default INTEGER is then 64 bits.

2.1.6 POINTER Objects

Check the code for POINTER objects. On Cray SV1 series and Cray T3E systems, pointer arithmetic involving integers causes Cray pointers to be moved on word boundaries.
The default for Cray X1 series systems is for Cray pointers to be moved on byte boundaries. You can have pointer movements occur on word boundaries on a Cray X1 series system by using compiler options to apply the proper scaling factor to integers used in pointer arithmetic. For example, on a Cray X1 series system the compiler views this statement:

  Cray_ptr = Cray_ptr + integer_value

as

  Cray_ptr = Cray_ptr + (integer_value * scaling_factor)

You can specify the scaling factor by using the -s byte_pointer or -s word_pointer compiler option in conjunction with the -s default32 or -s default64 option, as Table 3 shows.

Table 3. Fortran Scaling Factor in Pointer Arithmetic

Scaling Option                             Default Integer Size   Scaling Factor   Pointer Movement Occurs at
-s byte_pointer                            32 or 64 bits          1                Byte boundaries (default)
-s word_pointer and -s default32 enabled   32 bits                4                32-bit word boundaries
-s word_pointer and -s default64 enabled   64 bits                8                64-bit word boundaries

For details on these compiler options, see the ftn(1) man page and the Cray Fortran Compiler Commands and Directives Reference Manual.

2.1.7 Input/Output

Check the code for the use of cos (or blocked) files. On Cray X1 series systems, the default file format for Fortran sequential unformatted I/O is f77. On Cray SV1 series systems and Cray T3E systems, COS blocking was the default for Fortran sequential unformatted I/O. To read or write cos (or blocked) files on Cray X1 series systems, use the assign command with the -F cos option. See the assign(1) man page for details.

For example, to compile, load, and execute myprog.ftn, enter:

  % ftn -c myprog.ftn
  % ftn -o myprog myprog.o
  % assign -F cos . . .
  % aprun ./myprog

2.1.8 Numeric Data Conversion

Check the code for the use of numeric data conversion to store data externally (such as in a file).
You can use conversion routines that are provided by Cray to convert data from Cray SV1 series Fortran (non-IEEE data types), IBM, and VAX Fortran data types to generic IEEE data types. You can convert data implicitly by using the Cray X1 series system assign -N option or explicitly with calls to subroutines. See the assign(1) man page for details.

For example, the CRI2IEG routine converts Cray IEEE (64-bit) data types to generic IEEE data types. Generic IEEE data types are the default for Cray X1 series systems.

To convert data, move the data to a Cray X1 series system and then use the conversion routines. For more information about moving nontape data from one Cray system to another, see the intro_conversion(3f) man page.

Since storing data on tape is infrequent, flexible-file I/O (FFIO) on a Cray X1 series system does not support the bmx (tape) layer. Therefore, move data on tape to a storage medium that a Cray X1 series system can access, such as disk, before you convert the data.

Data objects stored externally by means other than FFIO have the same data size as corresponding objects in memory. Therefore, you should use the same consideration for migrating these externally stored data objects as you would in migrating data objects that use default data sizes or explicit sizes. See Cray X1 User Environment Differences for details.

For example, if an integer of default size on a Cray T3E system (64 bits) was stored externally not using FFIO, its size in the file is 64 bits. If you use the default integer size on a Cray X1 series system (32 bits) and you read that same integer value from the file and assign it to a variable, you will get only the most significant 32 bits. In this case, you would need to use KIND=8 or use the -s default64 compiler option to change the default data size for integers to 64 bits before reading the data from the file.
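The truncation described in the last example can be illustrated with a short C sketch. It is not Cray code: the function names are hypothetical, and it assumes the integer was written most significant byte first, which matches the statement that a 32-bit read returns the most significant 32 bits.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: store `value` as 8 bytes, most significant byte
 * first, the way this sketch assumes a 64-bit default integer was written. */
void encode_big_endian64(uint64_t value, unsigned char out[8])
{
    for (int i = 7; i >= 0; i--) {
        out[i] = (unsigned char)(value & 0xFFu);
        value >>= 8;
    }
}

/* Hypothetical helper: decode the first `nbytes` bytes (1 to 8) of a
 * record as an unsigned integer, most significant byte first. */
uint64_t decode_big_endian(const unsigned char *buf, size_t nbytes)
{
    uint64_t value = 0;
    for (size_t i = 0; i < nbytes; i++) {
        value = (value << 8) | buf[i];
    }
    return value;
}
```

Under these assumptions, decoding all 8 bytes of a record that holds 123456789 recovers the value, while decoding only the first 4 bytes yields 0, because the small value lives entirely in the low-order half of the 64-bit field. This is why the reading program needs a 64-bit integer (KIND=8 or -s default64).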
2.1.9 OpenMP

OpenMP on Cray X1 series systems differs from OpenMP on Cray SV1 series systems in the following respects:

• Work-sharing constructs are supported on Cray X1 series systems. A work-sharing construct divides the execution of the enclosed code region among the members of the team that encounter it. A work-sharing construct must be enclosed within a parallel region in order for the directive to execute in parallel. See the Cray Fortran Compiler Commands and Directives Reference Manual for details.
• Cray X1 series systems allow you to disable dynamic thread adjustment. The omp_set_dynamic procedure enables or disables the dynamic adjustment of the number of threads available for execution of subsequent parallel regions. On Cray SV1 series systems, dynamic thread adjustment was always enabled and could not be turned off. See the omp_threads man page for details.

2.1.10 Tasking

Cray X1 series systems do not support Autotasking or microtasking. If you are migrating applications from Cray SV1 series systems that use Autotasking compiler command options or microtasking directives, consider replacing them with OpenMP directives. See the Cray Fortran Compiler Commands and Directives Reference Manual and Cray X1 User Environment Differences for details.

2.1.11 Vector Dependencies

Check the code for the use of IVDEP (Ignore Vector Dependencies) directives:

  !DIR$ IVDEP [SAFEVL=vlen | INFINITEVL]

The default safe vector length for the IVDEP directive has changed from the hardware vector length to infinity on a Cray X1 series system. If a loop cannot safely execute with the unbounded vector length, use the SAFEVL clause to set vector length vlen to a safe value or use the -O noinfinitevl compiler option. The -O noinfinitevl option applies to the entire file being compiled.
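Why a loop can be safe at one vector length and unsafe at another can be seen in a small simulation. The following C sketch is not Cray code and does not use the IVDEP directive; the function names and the MAX_VL limit are hypothetical. It models a vector operation as "read all operands of a chunk, then write all results", and applies it to a loop whose iteration i reads the element written dist iterations earlier.

```c
#include <stddef.h>

#define MAX_VL 64  /* hypothetical upper bound on the simulated vector length */

/* Reference behavior: one iteration at a time. */
void recur_scalar(double *a, size_t n, size_t dist)
{
    for (size_t i = dist; i < n; i++) {
        a[i] = a[i - dist] + 1.0;
    }
}

/* Simulated vector behavior: each chunk of up to `vl` iterations reads
 * all of its inputs before writing any of its outputs. */
void recur_chunked(double *a, size_t n, size_t dist, size_t vl)
{
    double tmp[MAX_VL];
    if (vl == 0 || vl > MAX_VL) {
        vl = MAX_VL;
    }
    for (size_t start = dist; start < n; start += vl) {
        size_t len = (n - start < vl) ? (n - start) : vl;
        for (size_t j = 0; j < len; j++) {
            tmp[j] = a[start + j - dist] + 1.0;   /* gather inputs */
        }
        for (size_t j = 0; j < len; j++) {
            a[start + j] = tmp[j];                /* store results */
        }
    }
}
```

With n = 16 and dist = 4 on a zero-filled array, a chunk length of 4 reproduces the scalar result, but a chunk length of 8 reads elements that the same chunk has not yet updated and produces different values. In this model the dependence distance plays the role of the safe vector length that a SAFEVL clause would state.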
You may want to consider changing IVDEP directives to CONCURRENT directives, which enable multistreaming and may improve the performance of vector code. Note that the IVDEP directive is supported by several vendors, although implementation may vary, but the CONCURRENT directive is specific to Cray.

2.2 Compiling the Code

Once the source code appears to be clean, try to compile it. Although most Cray SV1 series and Cray T3E system compiler options are valid on Cray X1 series systems, there are some differences. See Cray X1 User Environment Differences for details.

1. Review the makefile and make changes as necessary. Change all compiler command references from f90 to ftn.

2. The double precision data type option is disabled by default on Cray T3E systems and enabled by default on Cray X1 series systems. If you want to disable the double precision data type option, add -dp to the compiler command line. The -dp option can be used only in conjunction with the -s default64 or -s real64 option.

3. If your code uses Cray pointer arithmetic, add the -s word_pointer option to the compiler command line. This option ensures that the program will have the same pointer arithmetic behavior on Cray X1 series systems as on Cray SV1 series or Cray T3E systems.

4. The -J dir_name option now causes automatic search for modules. On Cray X1 series systems, the compiler automatically searches for modules in directories specified by the -J dir_name option of the current compilation to satisfy USE statements. That is, you do not need to explicitly specify the same directories in the -p module_site option to have the compiler search these directories. The compiler will automatically specify a -p option with the dir_name path and place it at the end of the command line.

5. Inlining is turned on by default. To speed up compiling, use the -O inline0 option to turn off inlining during initial testing.

6.
Use the -c option to instruct the compiler to compile the program without linking. This option is useful when doing initial debugging but should be omitted once the code compiles without error.

7. To use TotalView in debugging your code, use the -g option.

8. Compile the source programs to create relocatable object files using the following guidelines:

   • If numerical objects have default data sizes, use a compiler command in this format:

     % ftn -O inline0 -c -g -s size file.ftn

   • If numerical objects have explicit data sizes, change KIND or star values as required and use a compiler command in this format:

     % ftn -O inline0 -c -g file.ftn

   These commands create a relocatable object file file.o, do not inline code, and do not invoke the loader directly.

9. If compilation produces error messages, use the explain command to further identify and locate the problem in the source code. For example, to get more information about an error that has a group code of ftn (Fortran compiler) and a message number of mmm, enter this command:

     explain ftn-mmm

   For example, the command:

     % explain ftn-101

   produces this message:

     Error : The length of the kind parameter exceeds the maximum length of 31.
     The length of the kind parameter exceeds 31 characters. Shorten the length of the kind parameter.

10. Correct the source code and recompile as needed.

For more information about the ftn command line options and defaults, see the Cray Fortran Compiler Commands and Directives Reference Manual and the ftn(1) man page.

2.3 Linking the Code

After you have successfully compiled the program segments, you are ready to begin linking them. Cray encourages you to use the ftn command, not ld, to invoke the loader because the compiler calls the loader with the appropriate default libraries.

1. Use the ftn compiler command to load and link programs and library routines.
   For example, the following command:

     % ftn -o progabc proga.o progb.o progc.o

   combines object files proga.o, progb.o, and progc.o; loads and links required library routines; and creates the executable file progabc.

2. If the code includes calls to MPI or SHMEM library routines, make sure you load and link the appropriate routines. If you adhere to the following guidelines, you will not have to change library routine names.

   • There are two versions of MPI or SHMEM routines. The 64-bit versions have 64-bit integer and real data types. They most closely match the MPI and SHMEM routines on Cray SV1 series and Cray T3E systems. If the code you are migrating called the 64-bit library routines on Cray SV1 series or Cray T3E systems, use the 64-bit library on a Cray X1 series system by including the -s default64 option on the compiler command. For example, the following command:

     % ftn -s default64 -o trio x1.o x2.o x3.o

     creates the executable file trio. The compiler passes to the loader the path to the 64-bit library, ensuring that the correct routines are loaded.

   • The default version on a Cray X1 series system is the 32-bit version of MPI or SHMEM routines. Those routines have 32-bit integer and real data types and 64-bit double precision data types. They include single and double precision routines.

   For more about the MPI and SHMEM library routines, start with the intro_mpi(3) and intro_shmem(3) man pages.

3. If the code includes calls to Cray scientific library (LibSci) routines, make sure you load and link the appropriate routines. If you adhere to the following guidelines, you will not have to change library routine names:

   • The 64-bit versions of the scientific libraries have 64-bit real and integer data types. They most closely match the scientific libraries on Cray SV1 series and Cray T3E systems.
     If the code you are migrating called the 64-bit library routines on Cray SV1 series or Cray T3E systems, use the 64-bit library on a Cray X1 series system by including the -s default64 option in the compiler command. For example, the following command:

     % ftn -s default64 -o abc a.o b.o c.o

     creates the executable file abc.

   • The default version on a Cray X1 series system is the 32-bit version of scientific library routines. They have 32-bit integer and real data types and 64-bit double precision data types. They include single and double precision routines.

   For more about the scientific library routines, start with the intro_libsci(3s) man page.

4. Analyze error messages. Compilation error messages are displayed on the terminal screen and sent to the standard error file, stderr. You can also use the -r compiler command option to send error messages to the listing file source_file_name.lst.

   If warnings or messages similar to the following examples display, they mean that the code contains calls to routines that the compiler could not resolve:

     ld-400 ld: WARNING Unresolved text symbol ...
     ld-401 ld: ERROR Unresolved text symbol(s). Loading terminated.

   Unresolved external errors generally indicate that the code is making calls to user routines that did not compile successfully, intrinsics that are changed or not supported on Cray X1 series systems, or library routines that are not supported or whose entry points have been changed. Use the explain command with the message ID from the error message to further identify and locate the problem in the source code.

5. If you want to verify that the code is linking to the correct library, use the -v compiler option.
   For example, if the data size is 64 bits, the path to the MPI library should appear in the compiler messages in the following format:

     /opt/ctl/mpt/mpt/lib/default64

   If the data size is 32 bits, the path to the 32-bit MPI library should appear in the compiler messages in the following format:

     /opt/ctl/mpt/mpt/lib

2.4 Testing the Code

After the program compiles and links without errors, begin running, testing, and debugging it, using the following steps as a guide.

1. Run a program on one multistreaming processor (MSP). Use a small data set so the program will be easier to debug. Use auto aprun or the aprun command. For example, to run executable file myprog, enter:

     % ./myprog

   or:

     % aprun ./myprog

   With auto aprun, when you enter only the name of the application program and, where applicable, its program options on the command line, the system automatically calls aprun to run the program. Auto aprun is not used to launch programs compiled as commands (-O command option). Such commands run serially on a single-streaming processor (SSP) within a support node; they execute immediately without assistance from aprun or psched.

   On Cray X1 series systems, the CRAY_AUTO_APRUN_OPTIONS environment variable specifies options for the aprun command when the command is called automatically. For details about the CRAY_AUTO_APRUN_OPTIONS environment variable, see the Cray Fortran Compiler Commands and Directives Reference Manual.

   Note: Cray X1 series systems use the aprun (or mpirun) command for running distributed memory applications. The mpirun command is used only for MPI applications. The mpprun command is specific to Cray T3E systems and not supported on Cray X1 series systems.

2. Verify stack and heap sizes.
   On Cray X1 series systems, the default stack and heap sizes for applications executed with aprun or mpirun are 1 GB, and the default size for all other stacks and heaps is 32 MB. Programs with larger stack requirements may run on Cray SV1 series systems, but if these defaults are exceeded on Cray X1 series systems, the program will abort unless the following environment variables are set to allow for a larger stack:

   • X1_COMMON_STACK_SIZE
   • X1_PRIVATE_STACK_SIZE
   • X1_STACK_SIZE
   • X1_COMMON_HEAP_SIZE
   • X1_PRIVATE_HEAP_SIZE
   • X1_HEAP_SIZE
   • X1_PRIVATE_STACK_GAP

   See the Cray Fortran Compiler Commands and Directives Reference Manual and memory(7) man page for details.

3. If the program aborts or terminates in an abnormal state, you can use the debugger to identify the cause. For information about the debugger, see the TotalView Release Overview, Installation Guide, and User’s Guide Addendum for Cray X1 Systems. Correct the source code, recompile, and rerun as needed.

4. Examine the output and verify that it matches the expected results. If possible, compare the output to the output from the original program, using the same test data.

   Note: When comparing results, minor numerical differences may be present because of floating-point format differences.

5. Once the program produces accurate results while running on a single MSP, begin testing the program with larger test cases. Correct, recompile, and retest as needed, until the program produces verifiably correct results with large test cases.

2.5 Optimizing the Code

Once the program is running correctly on a single MSP with large test cases, begin tuning it for maximum performance. For more information about optimization, see Optimizing Applications on Cray X1 Series Systems.
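Step 4 of Section 2.4 notes that results from the original and migrated programs may differ slightly because of floating-point format differences. One common way to compare such outputs is with a tolerance instead of exact equality. The following C sketch shows the idea; the function names and the tolerance values are illustrative assumptions, not part of any Cray tool, and suitable tolerances depend on the application.

```c
#include <stddef.h>

/* Return 1 if a and b agree within a relative tolerance (scaled by the
 * larger magnitude) or within an absolute tolerance, whichever is looser. */
int nearly_equal(double a, double b, double rel_tol, double abs_tol)
{
    double diff  = (a > b) ? (a - b) : (b - a);
    double mag_a = (a < 0.0) ? -a : a;
    double mag_b = (b < 0.0) ? -b : b;
    double limit = rel_tol * ((mag_a > mag_b) ? mag_a : mag_b);
    if (abs_tol > limit) {
        limit = abs_tol;
    }
    return diff <= limit;
}

/* Count the positions at which the migrated output `got` differs from
 * the reference output `ref` by more than the given tolerances. */
size_t count_mismatches(const double *ref, const double *got, size_t n,
                        double rel_tol, double abs_tol)
{
    size_t bad = 0;
    for (size_t i = 0; i < n; i++) {
        if (!nearly_equal(ref[i], got[i], rel_tol, abs_tol)) {
            bad++;
        }
    }
    return bad;
}
```

A test harness could read the numeric output of both runs into arrays and treat a nonzero return from count_mismatches as a signal to inspect the differing values by hand.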
Migrating C and C++ Applications [3]

Cray X1 series systems support most Cray SV1 series and Cray T3E systems C and C++ compiler and language features. Therefore, many of the programs developed for the Cray Programming Environment 3.6 release will compile and run successfully on a Cray X1 series system without changes. However, differences in hardware architecture and Programming Environment products may cause problems when you attempt to migrate your applications to a Cray X1 series system.

This chapter describes the process for migrating C and C++ applications written for Cray SV1 series or Cray T3E systems to Cray X1 series systems.

Note: See Cray X1 User Environment Differences for the list of differences.

The following sections describe the process for migrating applications:

• Reviewing the Source Code (Section 3.1)
• Compiling the Code (Section 3.2)
• Linking the Code (Section 3.3)
• Testing the Code (Section 3.4)
• Optimizing the Code (Section 3.5)

3.1 Reviewing the Source Code

Before attempting to compile your programs, examine the source code for problems. You may need to modify the code for the application to execute properly on a Cray X1 series system.

3.1.1 Data Sizes

Many migration problems stem from data size mismatches. Check the code for data types that are smaller on a Cray X1 series system and replace them with the appropriate data type.

Basic data types of the C and C++ languages have implied data sizes (Table 4). Most of the basic data types on a Cray X1 series system have the same or smaller data sizes than on a Cray SV1 series or Cray T3E system.

Note: Cray discourages the use of 8-bit chars and 16-bit shorts in contexts other than those described in Cray C and C++ Reference Manual because of performance penalties.

Table 4. Implied Data Sizes for Basic C and C++ Data Types (in bits)

Basic Data Type         Cray SV1 Series System        Cray T3E System               Cray X1 Series System
bool (1)                8                             8                             8
_Bool (2)               8                             8                             8
char                    8                             8                             8
wchar_t                 64                            64                            32
short                   32                            32                            16
int                     64                            64                            32
long                    64                            64                            64
long long (3)           64                            64                            64
float                   64                            32                            32
double                  64                            64                            64
long double             128                           64                            128
float complex           128 (each part is 64 bits)    64 (each part is 32 bits)     64 (each part is 32 bits)
double complex          128 (each part is 64 bits)    128 (each part is 64 bits)    128 (each part is 64 bits)
long double complex     256 (each part is 128 bits)   128 (each part is 64 bits)    256 (each part is 128 bits)
void and char pointers  64                            64                            64
Other pointers          32 (4)                        64                            64

(1) Cray C++ only.
(2) Cray C c99.
(3) Available in extended mode only.
(4) Uses 64-bit storage.

3.1.2 pragma Directives

The #pragma _CRI cache_bypass directive supported on Cray T3E systems is not supported on Cray X1 series systems.

3.1.3 Changes in the Libraries

Check the code for invalid library calls. Potential problems are calls to:

• Library routines and intrinsic procedures (Section 3.1.3.1)
• Unsupported library routines (Section 3.1.3.2)
• MPI and SHMEM routines (Section 3.1.3.3)
• Cray scientific library routines (Section 3.1.3.4)

3.1.3.1 Library Routines and Intrinsic Procedures

The scalb function differs on Cray X1 series systems from Cray T3E or Cray SV1 series systems. The second argument has changed from int to float.

The Cray SV1 series _cmr intrinsic is not supported on Cray X1 series systems.

The leadz, popcnt, and poppar functions return 64-bit longs on Cray X1 series systems. C and C++ programmers can cause the leadz, popcnt, and poppar functions to return a 32-bit int instead of 64-bit long in one of the following ways:

• Define the macro _CRAY_INT32_BIT_INTRINSICS before including
• Declare the intrinsic manually with int as the return type

For details, see the leadz(3i), popcnt(3i), and poppar(3i) man pages.
Details about C and C++ library routines are documented in the Dinkum C++ Library Documentation.

3.1.3.2 Unsupported Library Routines

Check for calls to unsupported library routines. Some library calls valid on Cray SV1 series or Cray T3E systems are not supported on Cray X1 series systems. See Cray X1 User Environment Differences for a list of the UNICOS (Cray SV1 series system) and UNICOS/mk (Cray T3E system) system calls that are not supported on Cray X1 series systems and the equivalent functionality within the UNICOS/mp system, if it exists.

3.1.3.3 MPI and SHMEM Routines

Check calls to Message Passing Interface (MPI) and shared memory access (SHMEM) library routines to ensure that the loader will link the correct library routines. The Cray X1 series default libraries have 32-bit int and float data types. The 64-bit libraries have 64-bit long and double data types; they most closely match the MPI and SHMEM libraries on Cray SV1 series and Cray T3E systems.

3.1.3.4 Cray Scientific Library Routines

Check the code for calls to scientific library routines. If the code contains calls to these routines, note the following differences between Cray SV1 series or Cray T3E systems and Cray X1 series systems:

• On Cray X1 series systems, LibSci does not contain exactly the same routines as on Cray SV1 series and Cray T3E systems. Support for certain routines has been dropped and double precision names have been added. See Cray X1 User Environment Differences for details.

• On Cray X1 series systems, the default data sizes are 32 bits, and the default LibSci supports these sizes. The default LibSci library contains both single and double precision routines. The int and float data types each occupy 32 bits and the double type occupies 64 bits. For more details on data sizes and single and double precision routines, see the intro_libsci(3s) man page and Cray X1 User Environment Differences.
• On Cray SV1 series and Cray T3E systems, LibSci included only single precision routines. These routines used 64-bit floating point and integer operations. On Cray X1 series systems, to continue to use LibSci in the same manner as on Cray SV1 series and Cray T3E systems, use the -lsci64 compiler option to link the code with the 64-bit LibSci instead of the default LibSci. See Section 3.3 for details about the loading and linking process.

• Calls to FFT and FFT-based convolution LibSci routines may require special attention because the table and work array types and lengths differ on Cray X1 series systems. Refer to the intro_fft(3s) man page and individual FFT man pages for details.

3.1.4 Bit-wise Operations

The rules governing bit-fields in C and C++ structure declarations are the same on Cray X1 series systems as they are on Cray SV1 series and Cray T3E systems, with two exceptions:

• The maximum size of a bit-field for a given basic data type on a Cray X1 series system can be smaller than on Cray SV1 series and Cray T3E systems.

• The bit-fields on a Cray X1 series system are signed integers by default. They are unsigned integers by default on Cray SV1 series and Cray T3E systems.

For example, on Cray SV1 series systems you can declare a bit-field of type short as short x:17 because the data size of a short type is 32 bits. On a Cray X1 series system, a short type is 16 bits, so a 17-bit field does not fit. Because a bit-field can never straddle the alignment boundary of its basic data type, you should replace the short data type with an int data type.

3.1.5 Shift Operations

Signed integers can be migrated to a Cray X1 series system with no changes, except for a few cases such as shifting.
Because signed integers on a Cray X1 series system are 32 bits versus 64 bits on Cray SV1 series and Cray T3E systems, shift operations may behave differently when a bit is lost by shifting it more than 31 times or when sign extension is undesirable. This can be resolved by using a data modifier such as L (long) on a signed integer constant to promote its data size to 64 bits. The following example illustrates what happens when the modifier is not used in code migrated to a Cray X1 series system.

Example 1: Shift Operation (no data modifier)

Source code:

    #include <stdio.h>

    int main(void)
    {
        unsigned long a[64];
        int i;

        /* Set the ith bit of array member a[i] */
        for (i = 0; i < 64; i++)
            a[i] = 1 << i;    /* 1 and i are 32-bit signed integers */

        for (i = 0; i < 64; i++)
            printf("a[%d] = %lx\n", i, a[i]);
        return 0;
    }

Test output:

    a[0] = 1
    a[1] = 2
    a[2] = 4
    a[3] = 8
    a[4] = 10
    ...
    a[28] = 10000000
    a[29] = 20000000
    a[30] = 40000000
    a[31] = ffffffff80000000
    a[32] = 0
    a[33] = 0
    ...
    a[63] = 0

Note that all of the array elements with indexes of 32 and above have values of 0. Because the set bit is shifted past the end of the 32-bit integer, the expression 1 << n evaluates to 0 when n is greater than or equal to 32.

Sign extension occurs in array element a[31] because a 1 was shifted into the most significant bit position of the 32-bit result and the result was assigned to a larger unsigned data type. The assignment causes the result to be promoted from a 32-bit signed int to an unsigned long, which in turn causes the sign extension.

You can prevent the loss or misinterpretation of shifted signed integers by placing a data modifier after the signed integer constant: UL (unsigned long), L (long), or LL (long long). The following example adds a data modifier L (long) to the signed integer constant.
Example 2: Shift Operation (with data modifier)

Source code:

    #include <stdio.h>

    int main(void)
    {
        unsigned long a[64];
        int i;

        /* Set the ith bit of array member a[i] */
        for (i = 0; i < 64; i++)
            a[i] = 1L << i;    /* Using data modifier L */

        for (i = 0; i < 64; i++)
            printf("a[%d] = %lx\n", i, a[i]);
        return 0;
    }

Test output:

    a[0] = 1
    a[1] = 2
    a[2] = 4
    a[3] = 8
    a[4] = 10
    ...
    a[28] = 10000000
    a[29] = 20000000
    a[30] = 40000000
    a[31] = 80000000
    a[32] = 100000000
    a[33] = 200000000
    a[34] = 400000000
    a[35] = 800000000
    ...
    a[59] = 800000000000000
    a[60] = 1000000000000000
    a[61] = 2000000000000000
    a[62] = 4000000000000000
    a[63] = 8000000000000000

Note: Both Cray T3E systems and Cray X1 series systems use signed right shift. You may need to use the -h nosignedshift option if you are porting code from Cray SV1 series systems or if you used the -h nosignedshift option when compiling on Cray T3E systems.

3.1.6 Wide Characters and Multiple Locales

Check for the use of wide characters and multiple locales. The Cray X1 series C++ compiler supports the C++ standard except for:

• String classes that use basic string class templates with wide character types or string classes that use the wstring standard template class
• I/O streams using wide character objects
• File-based streams using file streams with wide character types (wfilebuf, wifstream, wofstream, and wfstream)
• Multiple localization libraries; Cray C++ supports only one locale

Note: The C++ standard provides a standard naming convention for library routines. Therefore, classes or routines that use wide characters are named appropriately. For example, the fscanf and sprintf functions do not use wide characters, but the fwscanf and swprintf functions do.

3.1.7 C++ Headers

If you are working with C++ code, change to standard headers. Be aware that these headers use namespace std.
Alternatively, use the stl.h header to include most of the library in the global namespace. For example, change:

    #include

to:

    #include
    using namespace std;

3.1.8 Instantiation Files

The set of instantiations assigned to a given file (for example, a.c) is recorded in an associated file that has a .ii suffix (for example, a.ii). On Cray X1 series systems, template instantiation has changed. If you will be using automatic instantiation instead of the simple template instantiation feature, remove all existing *.ii files, thereby allowing the system to regenerate them. See Cray C and C++ Reference Manual for details on template instantiation.

3.1.9 AT&T C++ Compatibility Headers

Search for code that uses AT&T C++ compatibility headers, such as stream.h. These headers are documented in the Cray C and C++ Reference Manual. Code using these headers will require the use of the CRAYOLDCPPLIB environment variable for correct behavior. Consider updating the code to use the standard headers.

3.1.10 Vector Dependencies

Check the code for #pragma _CRI ivdep directives. On Cray X1 series systems, the syntax for the ivdep directive is:

    #pragma _CRI ivdep safevl=vlen

The default safe vector length has changed from the hardware vector length to infinity. If a loop cannot safely execute with unbounded vector length, use the -h noinfinitevl compiler option. The vlen option specifies a vector length in which no dependency will occur. vlen must be an integer between 1 and 1024 inclusive.

You may want to consider changing ivdep directives to concurrent directives. The concurrent directive enables multistreaming and may improve the performance of vector code. Note that the ivdep directive is supported by several vendors, although implementation may vary, whereas the concurrent directive is specific to Cray. See Cray C and C++ Reference Manual for details.
3.1.11 Optimizations for std::complex Class Cause Byte Alignment Changes

To improve optimization of C++ code using the std::complex class in structures or classes, we changed the byte alignment for members within these structures or classes from 4 bytes to 8 bytes. This change requires you to recompile any code using the std::complex class in structures or classes.

3.2 Compiling the Code

Once the source code appears to be clean, try to compile it. For a summary of compiler option differences between Cray SV1 series and Cray T3E systems and Cray X1 series systems, see Cray X1 User Environment Differences.

1. Some compiler command options have changed. For example, on Cray X1 series systems the -h fp0 option replaces the -h noieeeconform option used on Cray SV1 series systems. Review makefiles and make changes as necessary.

2. To create relocatable object files without calling the loader, include the -c option. For example, to compile but not load program prog1.C, enter:

    % CC -c prog1.C

   This command compiles prog1.C and creates relocatable object file prog1.o.

3. To use TotalView in debugging your code, include the -g or -G option. For details, see TotalView Release Overview and Installation Guide.

4. If compilation produces error messages, use the explain command with the message ID from the error message to further identify and locate the problem in the source code. For example, to get more information about an error that has a group code of cc (C Compiler) and a message number of mmm, enter this command:

    explain cc-mmm

   For example, the command:

    % explain cc-65

   produces this message:

    A semicolon was expected at this point.
    A semicolon was not found where one was expected.

5. If you get overflow errors, you may have invalid data sizes. See Section 3.1.1 for valid data sizes.

6. Correct the source code and recompile as needed.
For more information about the CC and cc command line options and defaults, see Cray C and C++ Reference Manual or the CC(1) man page.

3.3 Linking the Code

After you have successfully compiled the program segments, you are ready to begin linking them. Cray encourages you to use the CC or cc command, not ld, to invoke the loader because the compiler calls the loader with the appropriate default libraries.

1. Use the CC or cc compiler command to link and load programs and libraries. For example, the following command:

    % cc -o app123 proga.o progb.o progc.o

   combines object files proga.o, progb.o, and progc.o, loads and links the required library routines, and creates the executable file app123.

   To use the 64-bit version of LibSci, enter:

    % cc -o app123 -l sci64 proga.o progb.o progc.o

   No compiler command option is required to designate a 64-bit or 32-bit MPT library. If a program uses 64-bit data types (long and double), routines will be loaded from the 64-bit MPT library.

2. Analyze error messages. Compilation error messages are displayed on the terminal screen and sent to the standard error file, stderr.

   If warnings or messages similar to the following examples are displayed, the code contains calls to routines that could not be resolved:

    ld-400 ld: WARNING Unresolved text symbol
    ld-401 ld: ERROR Unresolved text symbol(s). Loading terminated.

   Unresolved external errors generally indicate that the code is making calls to user routines that did not compile successfully, to intrinsic functions that are changed or not supported on Cray X1 series systems, or to library routines that are not supported or whose entry points have been changed. Use the explain command with the message ID from the error message to further identify and locate the problem in the source code.
3.4 Testing the Code

After the program compiles and links without errors, begin running, testing, and debugging it using the following steps as a guide.

1. Run a program on one multistreaming processor (MSP). This is the default for Cray X1 series systems. Use a small data set so the program will be easier to debug. Use auto aprun or the aprun command. For example, to run executable file myprog, enter:

    % ./myprog

   or:

    % aprun ./myprog

   When you enter only the name of the application program and, where applicable, its program options on the command line, the system automatically calls aprun to run the program (auto aprun). Auto aprun is not used to launch programs compiled as commands (-h command option). Such commands run serially on a single-streaming processor (SSP) within a support node; they execute immediately without assistance from aprun or psched.

   On Cray X1 series systems, the CRAY_AUTO_APRUN_OPTIONS environment variable specifies options for the aprun command when the command is called automatically. For details about the CRAY_AUTO_APRUN_OPTIONS environment variable, see the Cray C and C++ Reference Manual.

   Note: Cray X1 series systems use the aprun (or mpirun) command for running distributed memory applications. The mpirun command is used only for MPI applications. The mpprun command is specific to Cray T3E systems and is not supported on Cray X1 series systems.

2. Verify stack and heap sizes. On Cray X1 series systems, the default stack and heap sizes for applications executed with aprun or mpirun are 1 GB, and the default size for all other stacks and heaps is 32 MB.
   Programs with larger stack requirements may run on Cray SV1 series systems, but if these defaults are exceeded on Cray X1 series systems, the program will abort unless the following environment variables are set to allow for a larger stack:

   • X1_COMMON_STACK_SIZE
   • X1_PRIVATE_STACK_SIZE
   • X1_STACK_SIZE
   • X1_LOCAL_HEAP_SIZE
   • X1_SYMMETRIC_HEAP_SIZE
   • X1_HEAP_SIZE
   • X1_PRIVATE_STACK_GAP

   See Cray C and C++ Reference Manual and memory(7) man page for details.

3. If the program aborts or terminates in an abnormal state, you can use the debugger to identify the cause. For information about the debugger, see TotalView Release Overview, Installation Guide, and User's Guide Addendum for Cray X1 Systems. Correct the source code, recompile, and rerun as needed.

4. Examine the output and verify that it matches the expected results. If possible, compare the output to the output from the original program, running the same test data.

5. Once the program produces accurate results while running on multiple processors, begin testing the program with larger test cases. Correct, recompile, and retest as needed, until the program produces verifiably correct results with large test cases.

3.5 Optimizing the Code

Once the program is running correctly on one MSP with large test cases, begin tuning it for maximum performance. See Optimizing Applications on Cray X1 Series Systems for more information about optimization.

Interlanguage Communications [4]

This chapter addresses migration issues related to Fortran programs calling C or C++ routines and C or C++ programs calling Fortran routines. The interlanguage communications rules for accessing C and C++ objects from Fortran and vice versa, as documented in the Cray C and C++ Reference Manual for the Programming Environment 3.6 release, remain largely unchanged on Cray X1 series systems.
Note: The Cray Programming Environment 3.6 release introduced the Fortran 2003 C interoperability feature that Fortran and C or C++ programs could use to share data and procedures. Fortran and C or C++ programs that used this feature instead of the interlanguage communications capability do not require the modifications that are described in this chapter.

4.1 Fortran to C or C++ Calls

Review the source code to determine if you need to make changes to Fortran-to-C or Fortran-to-C++ calls.

Note: C++ on Cray X1 series systems requires a C++ main routine for program entry. Interlanguage communication in which the program entry is a Fortran routine is not supported.

4.1.1 Passing or Receiving char Objects

If the Fortran programs pass C or C++ char objects to C or C++ programs or receive char objects from C or C++ programs, you may need to make minor changes.

Passing character objects between Fortran and C or C++ programs on Cray SV1 series and Cray T3E systems depended on the Fortran Character Descriptor (FCD) to indicate the length of a character object and maintain character pointer information. C and C++ programs used the following macros to retrieve the length value and convert Fortran character pointers to C or C++ pointers and vice versa: _fcdlen, _cptofcd, and _fcdtocp.

The Fortran Programming Environment on a Cray X1 series system does not use the FCD feature; instead, it uses the common industry practice of passing the character length as an additional argument at the end of the argument list. Without FCDs, interlanguage communications programming is simpler but requires some changes to the existing code.

The following examples illustrate the changes to make to correctly pass string arguments. Example 3 shows the Fortran program that calls the C function. Example 4 shows the unconverted C routine. Example 5 shows the converted C routine.
No changes are required to the Fortran program. In the converted C code, we changed the name of the C function to lowercase and appended the underscore character, changed the argument type from _fcd to char *, and appended the lengths of each string to the end of the argument list in the order that the strings appear in the argument list. Note that even though the Fortran code did not pass the string lengths, the Cray Fortran Compiler will pass them to the C function. Also note the replacement of the _fcdlen and _fcdtocp calls with equivalent C mechanisms in the migrated code.

Example 3: Fortran Program Calling C Function

Source code:

    ! Fortran program on the Cray SV1 series,
    ! Cray T3E, or Cray X1 system
    ! No conversion is required
    program tst
      character*(5) ch
      character*(10) ch2
      real xx
      ch = "hello"
      ch2 = "second str"
      xx = 2.3
      call c_routine(ch, xx, ch2)
    end

Compiler command:

    % ftn -c f2cilc1.ftn

Example 4: Called C Routine (Cray SV1 series or Cray T3E system)

Source code:

    /* The unconverted C function on the */
    /* Cray SV1 series or Cray T3E system */
    /* using the FCD feature */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <fortran.h>

    void C_ROUTINE(_fcd fch1, float *x, _fcd fch2)
    {
        char *str1;
        char *str2;
        int len1;
        int len2;

        len1 = _fcdlen(fch1);
        str1 = (char *)malloc(len1 + 1);
        strncpy(str1, _fcdtocp(fch1), len1);
        str1[len1] = '\0';

        len2 = _fcdlen(fch2);
        str2 = (char *)malloc(len2 + 1);
        strncpy(str2, _fcdtocp(fch2), len2);
        str2[len2] = (char)0;

        printf("len1 = %d\n", len1);
        printf("str1 = >%s<\n", str1);
        printf("x = %f\n", *x);
        printf("len2 = %d\n", len2);
        printf("str2 = >%s<\n", str2);
    }

Compiler command:

    % cc -c c2filc1.c

Example 5: Called C Routine (Converted for Cray X1 series system)

Source code:

    /* The converted C function on a Cray X1 series system */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Changed the routine name to lowercase and appended an underscore. */
    /* Changed arguments from _fcd to char * and added string lengths. */
    void c_routine_(char *fch1, float *x, char *fch2, long len1, long len2)
    {
        char *str1;
        char *str2;

        str1 = (char *)malloc(len1 + 1);
        strncpy(str1, fch1, len1);
        str1[len1] = '\0';

        str2 = (char *)malloc(len2 + 1);
        strncpy(str2, fch2, len2);
        str2[len2] = (char)0;

        /* The lengths are now long, so they are printed with %ld. */
        printf("len1 = %ld\n", len1);
        printf("str1 = >%s<\n", str1);
        printf("x = %f\n", *x);
        printf("len2 = %ld\n", len2);
        printf("str2 = >%s<\n", str2);
    }

Compiler command:

    % cc -c c2filc2.c

The following example shows how to write a function in C that has a return type of CHARACTER string type.

Example 6: Function call with return type of CHARACTER

Fortran source code:

    program fs
      character*80 charfunc
      print *,charfunc(1)
    end program fs

C source code:

    #include <stdio.h>
    #include <string.h>

    void charfunc_(char *buf, int *index, int len)
    {
        printf("index=%d, len=%d\n", *index, len);
        memset(buf, ' ', len);
        memcpy(buf, "Message", 7);
    }

Compiler commands:

    % cc -c charfunc.c
    % ftn fs.f90 charfunc.o
    % ./a.out

Program output:

    index=1, len=80
     Message

4.1.2 Passing Noncharacter Data Types

For noncharacter data types, verify that the data types are passed correctly. You may need to change to a different data type to ensure that the correct data sizes are used in both the Fortran and C or C++ code (for example, change a short to an int). For a comparison of Fortran, C, and C++ data types, see Table 5.

4.1.3 Interfacing to System Calls

On Cray X1 series systems, POSIX PXF routines replace many of the Fortran interfaces to system calls supported on Cray SV1 series and Cray T3E systems. The older Fortran interfaces used C to perform system calls. The PXF routines now provide the Fortran interfaces to POSIX system calls and return the status from a system call through a status argument.

For the list of Cray SV1 series and Cray T3E system function calls and their PXF replacements, see Cray X1 User Environment Differences.
If the code makes extensive use of C or C++ calls, consider rewriting it to take advantage of the new C and C++ interoperability features in Fortran 2003. For more information on Fortran and C or C++ interoperability, see Fortran Language Reference Manual, Volume 2.

4.2 C or C++ to Fortran Calls

Review the source code to determine if you need to make changes to calls from C or C++ programs to Fortran routines.

4.2.1 Passing or Receiving String Arguments

If your C or C++ code calls Fortran procedures to pass or receive strings, no changes are required to the Fortran code for string arguments, but some changes are needed to the C or C++ code.

The following examples illustrate the changes to make to correctly pass string arguments. Example 7 shows the called Fortran subroutine. Example 8 shows the unconverted C program that calls the Fortran subroutine. Example 9 shows the converted C program.

No changes to the Fortran subroutine are required. In the converted C code, we changed the name of the Fortran subroutine to lowercase and appended the underscore character to it, replaced the _cptofcd calls with plain C character pointers, and appended the lengths of each string to the end of the argument list in the order that the strings appear in the argument list.

Note: The form of a reference from a C or C++ program to a Fortran common block is unchanged from Cray SV1 series and Cray T3E systems. The name of the common block must be uppercase.

Example 7: Called Fortran Subroutine

Source code:

    ! The called Fortran procedure
    subroutine fortran_routine(ch, xx, ch2)
      character*(*) ch
      character*(*) ch2
      real xx
      print *, "len(ch) = ", len(ch)
      print *, "ch = >",ch,"<"
      print *, "xx = ", xx
      print *, "len(ch2) = ", len(ch2)
      print *, "ch2 = >",ch2,"<"
    end

Compiler command:

    % ftn -c f2cilc2.ftn

Example 8: C Program Calling Fortran Routine (Cray SV1 series or Cray T3E system)

Source code:

    /* Unmigrated C code developed on */
    /* the Cray SV1 series system */
    /* or Cray T3E system using the FCD feature */
    #include <string.h>
    #include <fortran.h>

    void c_routine()
    {
        char str[20];
        char str2[20];
        float x;

        strcpy(str, "hello");
        strcpy(str2, "second str");
        x = 2.3;
        FORTRAN_ROUTINE(_cptofcd(str, 5), &x, _cptofcd(str2, 10));
    }

Compiler command:

    % cc -c c2filc3.c

Example 9: C Program Calling Fortran Routine (Converted for Cray X1 series system)

Source code:

    /* Migrated C code on a Cray X1 series system */
    #include <string.h>

    /* Prototype for the Fortran subroutine; the string lengths follow */
    /* the other arguments */
    void fortran_routine_(char *ch, float *xx, char *ch2, long len1, long len2);

    void c_routine()
    {
        char str[20];
        char str2[20];
        float x;

        strcpy(str, "hello");
        strcpy(str2, "second str");
        x = 2.3;
        fortran_routine_(str, &x, str2, 5L, 10L);
        /* Changed routine name to lowercase and appended an underscore */
        /* Removed references to _cptofcd macros and added lengths to end */
        /* of argument list */
    }

Compiler command:

    % cc -c c2filc4.c

4.2.2 Passing or Receiving Noncharacter Arguments

For noncharacter arguments, you might need to change the data type to ensure that the correct data sizes are used in both the C and Fortran code. For a comparison of Fortran and C data types, see Table 5.

For more information on C and C++ and Fortran interlanguage communications, see Cray C and C++ Reference Manual.

4.3 Fortran, C, and C++ Data Sizes

As you migrate Fortran and C programs, you must ensure that corresponding C and Fortran arguments have the same data sizes.
To ensure that the correct data sizes are used, select from Table 5 the data type to use for calls to C, C++, or Fortran routines. The left side of the table shows the data types you could use in a C or C++ argument list. The remaining columns show the Fortran data types that have the same size as the corresponding C or C++ argument. These columns show the Fortran data types to use when the default data size is either 32 or 64 bits. Some of the explicit KIND types can be used only when support for 8-bit and 16-bit data is enabled, as shown in the table.

Table 5. Fortran, C, and C++ Data Objects with the Same Data Sizes

C and C++ basic data types      Fortran data type         Fortran data type
on Cray X1 series systems       (-s default32)            (-s default64)

bool                            LOGICAL(KIND=1) (1)       LOGICAL(KIND=1) (1)

char                            INTEGER(KIND=1) (1)       INTEGER(KIND=1) (1)

short (2)                       INTEGER(KIND=2) (1)       INTEGER(KIND=2) (1)

int                             INTEGER(KIND=1) (3)       INTEGER(KIND=1) (3)
                                INTEGER(KIND=2) (3)       INTEGER(KIND=2) (3)
                                INTEGER(KIND=4)           INTEGER(KIND=4)
                                INTEGER*4
                                INTEGER

long                            INTEGER(KIND=8)           INTEGER(KIND=8)
                                                          INTEGER

long long                       INTEGER(KIND=8)           INTEGER(KIND=8)
                                                          INTEGER

float                           REAL(KIND=4)              REAL(KIND=4)
                                REAL

double                          REAL(KIND=8)              REAL(KIND=8)
                                DOUBLE PRECISION          REAL

long double                     REAL(KIND=16)             REAL(KIND=16)
                                                          DOUBLE PRECISION

float complex                   COMPLEX(KIND=4)           COMPLEX(KIND=4)
                                COMPLEX

double complex                  COMPLEX(KIND=8)           COMPLEX(KIND=8)
                                DOUBLE COMPLEX            COMPLEX

long double complex             COMPLEX(KIND=16)          COMPLEX(KIND=16)
                                                          DOUBLE COMPLEX

(1) Only when the -e h option is used. The use of this option on Cray X1 series systems is not recommended because of performance penalties.
(2) Not recommended on Cray X1 series systems because of performance penalties.
(3) Only when the -d h option is used, which is the default.

Glossary

application node

For UNICOS/mp systems, a node that is used to run user applications.
Application nodes are best suited for executing parallel applications and are managed by the strong application placement scheduling and gang scheduling mechanism Psched. See also node; node flavor.

blocking
An optimization that involves changing the iteration order of loops that access large arrays so that groups of array elements are processed as many times as possible while they reside in cache.

C interoperability
A Fortran 2003 feature that allows Fortran programs to call C functions and access C global objects and also allows C programs to call Fortran procedures and access Fortran global objects.

common block
An area of memory, or block, that can be referenced by any program unit. In Fortran, a named common block has a name specified in a Fortran COMMON or TASKCOMMON statement, along with specified names of variables or arrays stored in the block. A blank common block, sometimes referred to as blank common, is declared in the same way but without a name.

compute module
For a Cray X1 series mainframe, the physical, configurable, scalable building block. Each compute module contains either one node with 4 MCMs/4 MSPs (Cray X1 modules) or two nodes with 4 MCMs/8 MSPs (Cray X1E modules). Sometimes referred to as a node module. See also node.

construct
A sequence of statements in Fortran that starts with a SELECT CASE, DO, IF, or WHERE statement and ends with the corresponding terminal statement.

Cray Fortran Compiler
The compiler that translates Fortran programs into Cray object files. The Cray Fortran Compiler fully supports the Fortran language through the Fortran 95 Standard, ISO/IEC 1539-1:1997. Selected features from the proposed Fortran 2003 Standard are also supported.

Cray pointer
A variable whose value is the address of another entity, which is called a pointee. The Cray pointer type statement declares both the pointer and its pointee.
The Cray pointee does not have an address until the value of the Cray pointer is defined; the pointee is stored starting at the location specified by the pointer.

distributed memory
The kind of memory in a parallel processor where each processor has fast access to its own local memory and must send a message via the interprocessor network to access another processor’s memory.

dynamic thread adjustment
In OpenMP, the automatic adjustment of the number of threads between parallel regions. Also known as dynamic threads or the dynamic thread mechanism.

environment variable
A variable that stores a string of characters for use by your shell and the processes that execute under the shell. Some environment variables are predefined by the shell, and others are defined by an application or user. Shell-level environment variables let you specify the search path that the shell uses to locate executable files, the shell prompt, and many other characteristics of the operation of your shell. Most environment variables are described in the ENVIRONMENT VARIABLES section of the man page for the affected command.

kind
Data representation (for example, single precision, double precision). The kind of a type is referred to as a kind parameter or kind type parameter of the type. The kind type parameter KIND indicates the decimal range for the integer type, the decimal precision and exponent range for the real and complex types, and the machine representation method for the character and logical types.

locale
For UNICOS/mp systems, a collection of culture-dependent information used by an application to interact with a user.

Message Passing Interface (MPI)
A widely accepted standard for communication among nodes that run a parallel program on a distributed-memory system. MPI is a library of routines that can be called from Fortran, C, and C++ programs.
Modules
A package on the UNICOS/mp system that allows you to dynamically modify your user environment by using module files. (This term is not related to the module statement of the Fortran language; it is related to setting up the UNICOS/mp system environment.) The user interface to this package is the module command, which provides a number of capabilities to the user, including loading a module file, unloading a module file, listing which module files are loaded, determining which module files are available, and others.

MSP mode (multistreaming mode)
One of two types of application modes for UNICOS/mp systems. Programs are compiled either as MSP-mode applications (default) or SSP-mode applications. MSP-mode applications run on one or more MSPs. For MSP-mode applications, each MSP coordinates the interactions of its associated four SSPs. See also SSP mode.

multichip module (MCM)
For Cray X1 series systems, the physical packaging that contains processor chips and cache chips. The chips implement either one multistreaming processor (Cray X1 MCM) or two multistreaming processors (Cray X1E MCM). See also MSP.

multistreaming processor (MSP)
For UNICOS/mp systems, a basic programmable computational unit. Each MSP is analogous to a traditional processor and is composed of four single-streaming processors (SSPs) and E-cache that is shared by the SSPs. See also node; SSP; MSP mode; SSP mode.

node
For UNICOS/mp systems, the logical group of four multistreaming processors (MSPs), cache-coherent shared local memory, high-speed interconnections, and system I/O ports. A Cray X1 system has one node with 4 MSPs per compute module. A Cray X1E system has two nodes of 4 MSPs per node, providing a total of 8 MSPs on its compute module. Software controls how a node is used: as an OS node, application node, or support node. See also compute module; MCM; MSP; node flavor; SSP.
node flavor
For UNICOS/mp systems, software controls how a node is used. A node’s software-assigned flavor dictates the kind of processes and threads that can use its resources. The three assignable node flavors are application, OS, and support. See also application node; OS node; support node; system node.

OpenMP
An industry-standard, portable model for shared memory parallel programming.

OS node
For UNICOS/mp systems, the node that provides kernel-level services, such as system calls, to all support nodes and application nodes. See also node; node flavor.

parallel region
See serial region.

pointer
A data item that consists of the address of a desired item.

Psched
The UNICOS/mp application placement scheduling tool. The psched command can provide job placement, load balancing, and gang scheduling for all applications placed on application nodes.

serial region
An area within a program in which only the master task is executing. Its opposite is a parallel region.

SHMEM
A library of optimized functions and subroutines that take advantage of shared memory to move data between the memories of processors. The routines can be used either by themselves or in conjunction with another programming style such as Message Passing Interface. SHMEM routines can be called from Fortran, C, and C++ programs.

single-streaming processor (SSP)
For UNICOS/mp systems, a basic programmable computational unit. See also node; MSP; MSP mode; SSP mode.

SSP mode (single-streaming mode)
One of two types of application modes for UNICOS/mp systems. Programs are compiled either as MSP-mode applications (default) or SSP-mode applications. SSP-mode applications run on one or more SSPs. Each SSP runs independently of the others, executing its own stream of instructions. In contrast, compiler options enable the programmer to develop command-mode programs that run on an SSP on the support node. See also MSP mode.
support node
For UNICOS/mp systems, the node that is used to run serial commands, such as shells, editors, and other user commands (ls, for example). See also node; node flavor.

system node
For UNICOS/mp systems, the node that is designated as both an OS node and a support node; this node is often called a system node; however, there is no node flavor of "system." See also node; node flavor.

thread
The active entity of execution. A sequence of instructions together with machine context (processor registers) and a stack. On a parallel system, multiple threads can be executing parts of a program at the same time.

type
A means for categorizing data. Each intrinsic and user-defined data type has four characteristics: a name, a set of values, a set of operators, and a means to represent constant values of the type in a program.

unformatted I/O
Transfer of binary data without editing between the current record and the entities specified by the I/O list. Exactly one record is read or written. The unit must be an external unit.

UNICOS/mp
The operating system for Cray X1 series (Cray X1 and Cray X1E) systems.

vector
A series of values on which instructions operate; this can be an array or any subset of an array such as row, column, or diagonal. Applying arithmetic, logical, or memory operations to vectors is called vector processing. See also vector processing.

vector length
The number of elements in a vector.

vector processing
A form of instruction-level parallelism in which the vector registers are used to perform iterative operations in parallel on the elements of an array, with each iteration producing up to 64 simultaneous results. See also vector.
Last changed: 08-16-2007

intro_biolib(3)

NAME
    intro_biolib -- Introduction to the Cray Bioinformatics Library routines

IMPLEMENTATION
    Cray SV1 series and Cray X1 series systems
    Cray Fortran, C, and C++, except where noted

DESCRIPTION
    The Cray Bioinformatics Library (BioLib) routines perform low-level bit manipulation and searching operations useful in the analysis of nucleotide and amino acid sequence data. Except as noted otherwise, you can reference the routines from Fortran, C, or C++. Routines marked as "C and C++ only" provide functionality that is unnecessary for Fortran users. Unless noted, the routines are supported for both Cray SV1ex and Cray X1 series systems.

    The routines are organized in the following subsections according to functionality.

Data Compression Routines

    The data compression routines convert ASCII data to coded forms that use less memory and allow the BioLib routines to perform their tasks faster.

    cb_compress(3)
        optimizes nucleotide, amino acid, or hex data by using fewer bits to convey the same information.
    cb_uncompress(3)
        decompresses data compressed by the cb_compress routine.

Nucleotide Sequence Characterization or Transformation Routines

    The following routines characterize or transform nucleotide data:

    cb_cghistn(3)
        creates a histogram of combined cytosine and guanine (C and G) density found in a nucleotide string.
    cb_countn_ascii(3)
        counts the number of A, C, T, G, and N characters within a nucleotide string.
    cb_nmer(3)
        creates a list of nmers.
    cb_revcompl(3)
        reverses the order of a nucleotide string and replaces each nucleotide with its complement.
    cb_amino_translate_ascii(3)
        translates a nucleotide string into three amino acid strings.

Search and Sort Routines

    cb_searchn(3)
        performs a gap-free search in a nucleotide string for approximate matches to a specified nucleotide string.
    cb_repeatn(3)
        finds regions within a nucleotide string that contain a short pattern of nucleotides that repeat consecutively.
    cb_sort(3)
        provides a multi-pass sort mechanism that allows you to sort large blocks of data within available memory.
    cb_isort(3)
        performs a radix sort on an array of integer data and a parallel index array.
    cb_isort_p(3)
        performs a radix sort on an array of integer data and a parallel index array. This routine is similar to the cb_isort routine, but operates on data that is distributed across multiple processors or images. (Fortran only)
    cb_isort1
        performs a radix sort on an array of integer data. This routine is similar to the cb_isort routine, but does not use parallel index arrays. The cb_isort1 routine is appropriate to use if the old location of the data is not needed. For more information, see the cb_isort(3) man page.
    cb_isort1_p
        performs a radix sort on an array of integer data. This routine is similar to the cb_isort1 routine, but operates on data that is distributed across multiple processors or images. For more information, see the cb_isort_p(3) man page. (Fortran only)
    cb_usort
        performs a radix sort on an array of unsigned integer data and a parallel index array. For more information, see the cb_isort(3) man page. (C and C++ only)
    cb_usort1
        performs a radix sort on an array of unsigned integer data. This routine is similar to cb_usort, but it does not use a parallel index array. The cb_usort1 routine is appropriate to use if the old location of the data is not needed. For more information, see the cb_isort(3) man page. (C and C++ only)
    cb_unihist(3)
        counts the number of unique entries found in an ordered list.
    cb_unique(3)
        returns the unique entries found in an ordered list.
    cb_ordered_lookup(3)
        returns the location(s) of a value found in an ordered list.

Smith-Waterman Routines

    The following groups of routines help to find the Smith-Waterman alignment between two sequences:

    Smith-Waterman initialization routines
        Initialize the data structures needed to perform the Smith-Waterman alignment with a scoring matrix for ASCII, 2-bit, and 4-bit encoded sequences.
    Smith-Waterman scoring routines
        Calculate the Smith-Waterman scores and traceback information for ASCII, 2-bit, or 4-bit encoded sequences.
    Smith-Waterman alignment routines
        Use the output from a Smith-Waterman scoring routine to find the Smith-Waterman alignment in ASCII, 2-bit, and 4-bit encoded sequences.
    Smith-Waterman wrapper routines
        Call the Smith-Waterman initialization, scoring, and alignment routines to find the Smith-Waterman alignment of two nucleotide sequences using ASCII, 2-bit, or 4-bit encoded data. That is, you use only one routine instead of three to find the alignment.

    The Smith-Waterman routines use scoring matrices whose elements are either full words (64-bit) or half words (32-bit). Therefore, all routines have a full word and a half word version. The full word routines are available on Cray SV1 series systems. The full and half word routines are available on Cray X1 systems.

    Note: The name of each routine contains abbreviations for phrases including "Smith-Waterman," "full word," or "half word." For example, a Smith-Waterman routine that works with ASCII data has the name cb_swa_fw, where the a after sw indicates ASCII.

    In addition, the Smith-Waterman alignment routines and wrapper routines have another class of routines that have the same functionality and names as the previously mentioned alignment and wrapper routines. This other class of routines is faster than the previously mentioned routines, especially for programs that perform many alignments.
    The previously mentioned routines cause significant overhead when called repeatedly because they allocate memory for the alignment array arguments each time they are called. The other class of routines minimizes this overhead by having the user allocate the memory, which allows the memory to be allocated only once. The new routines have the same names as the older routines but end with _a.

    For more information about the Smith-Waterman routines that allocate the memory for the alignment arrays, see these man pages:

    cb_swa(3)
        for full and half word routines that work on ASCII encoded sequences. The names of the routines in this group begin with cb_swa.
    cb_swn(3)
        for full and half word routines that work on 2-bit encoded nucleotide sequences. The names of the routines in this group begin with cb_swn.
    cb_swn4(3)
        for full and half word routines that work on 4-bit encoded nucleotide sequences. The names of the routines in this group begin with cb_swn4.
    cb_swn_multi64_nt(3B)
        computes Smith-Waterman alignment scores for multiple fixed-length nucleotide string pairs. This is a low-level routine designed to screen large numbers of potential matches as part of a large-scale genomic comparison algorithm.

    For more information about the Smith-Waterman routines where you allocate the memory for the alignment arrays, see these man pages:

    cb_swa_a(3)
        for full and half word routines that work on ASCII encoded sequences. The names of the routines in this group begin with cb_swa and end with _a.
    cb_swn_a(3)
        for full and half word routines that work on 2-bit encoded nucleotide sequences. The names of the routines in this group begin with cb_swn and end with _a.
    cb_swn4_a(3)
        for full and half word routines that work on 4-bit encoded nucleotide sequences. The names of the routines in this group begin with cb_swn4 and end with _a.
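For readers unfamiliar with the scoring step that these routines perform, the sketch below computes the best Smith-Waterman local alignment score for two ASCII strings. It is a generic textbook formulation, not the BioLib implementation and not its API: the function name `sw_best_score`, the simple match/mismatch scores (in place of a scoring matrix), and the linear gap penalty are all assumptions made for illustration, and no traceback information is kept.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Returns the best local alignment score between strings a and b,
 * or -1 if memory cannot be allocated.
 * match is added for equal characters; mismatch and gap are normally
 * negative and are added for unequal characters and for gaps.
 * (Hypothetical helper for this sketch; not a BioLib routine.) */
int sw_best_score(const char *a, const char *b,
                  int match, int mismatch, int gap)
{
    assert(a != NULL && b != NULL);

    size_t la = strlen(a);
    size_t lb = strlen(b);
    int best = 0;

    /* Only two rows of the score table are needed to find the best score. */
    int *prev = calloc(lb + 1, sizeof *prev);
    int *curr = calloc(lb + 1, sizeof *curr);
    if (prev == NULL || curr == NULL) {
        free(prev);
        free(curr);
        return -1;
    }

    for (size_t i = 1; i <= la; i++) {
        curr[0] = 0;
        for (size_t j = 1; j <= lb; j++) {
            int diag = prev[j - 1] + (a[i - 1] == b[j - 1] ? match : mismatch);
            int up   = prev[j] + gap;       /* gap in b */
            int left = curr[j - 1] + gap;   /* gap in a */

            /* Local alignment: a cell never drops below zero. */
            int h = 0;
            if (diag > h) h = diag;
            if (up > h)   h = up;
            if (left > h) h = left;

            curr[j] = h;
            if (h > best) best = h;
        }
        /* The row just filled becomes the previous row. */
        int *tmp = prev;
        prev = curr;
        curr = tmp;
    }

    free(prev);
    free(curr);
    return best;
}
```

For example, `sw_best_score("TTACGTT", "GGACGGG", 2, -1, -2)` scores the shared `ACG` run. The sketch allocates its working rows on every call, which is the kind of repeated allocation that the `_a` routines described above let the caller avoid.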
SSD Solid-state Storage Device Data Transfer Routines

    The following routines help you use the storage space on the SSD solid-state storage device on Cray SV1ex systems for your bioinformatics data:

    cb_ssd_init
        prepares the SSD environment for use by the other SSD data transfer routines.
    cb_copy_to_ssd
        transfers data stored in memory to a location in the SSD.
    cb_copy_from_ssd
        transfers data from the SSD to a memory location.
    cb_ssd_free
        releases memory on the SSD for reuse.
    cb_ssd_errno
        prints a message that corresponds to the specified error number.
    cb_largest_ssdid
        returns the largest currently used ssdid value.

    For more information about the routines, see the cb_ssd(3) man page.

FASTA Style Data Routines

    The following routines read or reorganize data from a FASTA style file:

    cb_read_fasta(3)
        loads data from a FASTA file into memory arrays.
    cb_fasta_convert(3)
        reorganizes the memory image of a FASTA file.

Miscellaneous Routines

    cb_copy_bits(3)
        copies a contiguous sequence of memory bits from one memory region to another.
    cb_block_zero(3)
        initializes a block of memory to zero.
    cb_irand(3)
        generates a list of 64-bit words containing random bit patterns.
    cb_irand_mt
        generates a list of 64-bit words containing random bit patterns using the Mersenne Twister MT19937 algorithm. For more information, see the cb_irand(3) man page.
    cb_irand_mt_init
        initializes the algorithm used in cb_irand_mt using a single seed value. For more information, see the cb_irand(3) man page.
    cb_irand_mt_inita
        initializes the algorithm used in cb_irand_mt using an array of seed values. For more information, see the cb_irand(3) man page.
    cb_malloc(3)
        allocates a block-aligned memory region (C and C++ only).
    cb_free(3)
        frees memory allocated by cb_malloc (C and C++ only).
    cb_version(3)
        returns the version number of the Cray Bioinformatics Library.
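The routine summaries above refer repeatedly to 2-bit encoded nucleotide data. As a rough illustration of why such an encoding saves memory, the sketch below packs up to 32 bases into one 64-bit word. It is not the BioLib format: the mapping A=0, C=1, G=2, T=3, the low-bits-first layout, and the names `base_to_code`, `pack_2bit`, and `unpack_base` are all assumptions for this example; the actual encoding is defined in the cb_compress(3) man page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Maps a base to a 2-bit code, or returns -1 for any other character.
 * The mapping is an assumption made for this sketch. */
int base_to_code(char base)
{
    switch (base) {
    case 'A': return 0;
    case 'C': return 1;
    case 'G': return 2;
    case 'T': return 3;
    default:  return -1;
    }
}

/* Packs the first n bases of seq into *out, first base in the lowest
 * two bits. Returns 0 on success, or -1 if n exceeds 32 or seq holds a
 * character other than A, C, G, or T (*out is left unchanged). */
int pack_2bit(const char *seq, size_t n, uint64_t *out)
{
    uint64_t word = 0;

    assert(seq != NULL && out != NULL);
    if (n > 32)
        return -1;

    for (size_t i = 0; i < n; i++) {
        int code = base_to_code(seq[i]);
        if (code < 0)
            return -1;
        word |= (uint64_t)code << (2 * i);
    }
    *out = word;
    return 0;
}

/* Returns the base stored at position i (0 to 31) of a packed word. */
char unpack_base(uint64_t word, size_t i)
{
    assert(i < 32);
    return "ACGT"[(word >> (2 * i)) & 3u];
}
```

With this layout, 32 bases occupy 8 bytes instead of the 32 bytes needed for ASCII, and several bases can be compared or shifted with a single word operation.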
NOTES
    Several routines in the Cray Bioinformatics Library use the Bit Matrix Multiply (BMM) hardware on the Cray system to perform bit manipulations within words. These routines do not restore the contents of the hardware BMM register when they exit. Consequently, if your code expects the BMM register to be the same after execution of the Cray Bioinformatics Library routines, you must save the contents of the BMM register and reload it following calls to these routines. The use of the BMM register is noted in each applicable BioLib man page.

    On Cray SV1ex systems, Fortran programs calling the BioLib routines require Cray Programming Environment release 3.6 or later, CrayLibs 3.6.0.1 or later, and UNICOS 10.0.2.1 or later. On Cray X1 series systems, Fortran programs require the Cray Programming Environment 5.5 release.

Compiling with the BioLib Routines

    No special compiler options are required to use the BioLib routines. If your site's Programming Environment modulefile (PrgEnv) does not define BioLib, you must load the biolib modulefile before compiling your programs, as the following example shows:

        module load biolib

SEE ALSO
    bmm(3i)
    bte_move(3i) (Cray SV1ex systems only)

Getting Started on Cray X2™ Systems

S–2471–60

© 2007 Cray Inc. All Rights Reserved. This manual or parts thereof may not be reproduced in any form unless permitted by contract or by written permission of Cray Inc.

U.S. GOVERNMENT RESTRICTED RIGHTS NOTICE

The Computer Software is delivered as "Commercial Computer Software" as defined in DFARS 48 CFR 252.227-7014. All Computer Software and Computer Software Documentation acquired by or for the U.S. Government is provided with Restricted Rights. Use, duplication or disclosure by the U.S. Government is subject to the restrictions described in FAR 48 CFR 52.227-14 or DFARS 48 CFR 252.227-7014, as applicable. Technical Data acquired by or for the U.S. Government, if any, is provided with Limited Rights.
Use, duplication or disclosure by the U.S. Government is subject to the restrictions described in FAR 48 CFR 52.227-14 or DFARS 48 CFR 252.227-7013, as applicable.

Cray, LibSci, UNICOS and UNICOS/mk are federally registered trademarks and Active Manager, Cray Apprentice2, Cray C++ Compiling System, Cray Fortran Compiler, Cray SeaStar, Cray SeaStar2, Cray SHMEM, Cray Threadstorm, Cray X1, Cray X1E, Cray X2, Cray XD1, Cray XMT, Cray XT, Cray XT3, Cray XT4, CrayDoc, CRInform, Libsci, RapidArray, UNICOS/lc, and UNICOS/mp are trademarks of Cray Inc.

Linux is a trademark of Linus Torvalds. Lustre was developed and is maintained by Cluster File Systems, Inc. under the GNU General Public License. OpenMP C and C++ Application Program Interface, Version 2.0, March 2002, Copyright © 1997-2002, OpenMP Architecture Review Board. PBS Pro is a trademark of Altair Grid Technologies. SUSE is a trademark of SUSE LINUX Products GmbH, a Novell business. TotalView is a trademark of TotalView Technologies LLC. UNIX, the “X device,” X Window System, and X/Open are trademarks of The Open Group in the United States and other countries. All other trademarks are the property of their respective owners.

Record of Revision

    Version   Description
    6.0       September 2007
              Supports the Cray X2 Programming Environment 6.0 release.

Contents

Preface
    Accessing Product Documentation
    Conventions
    Reader Comments
    Cray User Group
Introduction [1]
    System Environment
    Development Environment
    Documentation Included with This Release
Setting up Your User Environment [2]
    Setting Up a Secure Shell
    RSA Authentication with a Passphrase
    RSA Authentication without a Passphrase
    Additional Information
    Using Modules
    Modifying Environment Variables
Using Libraries [3]
    MPICH2
    SHMEM
    CAF
    UPC
    OpenMP
    LibSci
    CrayLibs
    BioLib
    glibc and Linux System Calls
Programming Considerations [4]
    I/O Support
    Cray MPICH2 Programming Considerations
    Lustre File System
    Timing Functions
    Signal Support
    Little-endian Support
Using Cray Compilers [5]
Running Applications [6]
    Using the aprun Application Launcher
    Using PBS Pro
Debugging an Application [7]
    Using the TotalView Debugger
    TotalView GUI
    TotalView CLI
    Command Shortcuts
Analyzing Performance [8]
    Using PAPI
    Using the High-level PAPI Interface
    Using the Low-level PAPI Interface
    Using CrayPat
    Using Cray Apprentice2
Example Programs [9]
    Example 1: Basics of running a Cray X2 application
    Example 2: Using the Cray SHMEM put() function
    Example 3: Using the Cray SHMEM get() function
Appendix A glibc Functions
Glossary
Index

Tables
    Table 1. Cray X2 Publications
    Table 2. Cray Compiler Commands
    Table 3. aprun Versus qsub Options
    Table 4. Shortcuts for TotalView Commands
    Table 5. Supported glibc Functions

Preface

The information in this preface is common to Cray documentation provided with this software release.

Accessing Product Documentation

With each software release, Cray provides books and man pages, and in some cases, third-party documentation. These documents are provided in the following ways:

CrayDoc
    The Cray documentation delivery system that allows you to quickly access and search Cray books, man pages, and in some cases, third-party documentation. Access this HTML and PDF documentation via CrayDoc at the following locations:
    • The local network location defined by your system administrator
    • The CrayDoc public website: docs.cray.com

Man pages
    Access man pages by entering the man command followed by the name of the man page.
    For more information about man pages, see the man(1) man page by entering:

        % man man

Third-party documentation
    Access third-party documentation not provided through CrayDoc according to the information provided with the product.

Conventions

These conventions are used throughout Cray documentation:

    Convention    Meaning

    command       This fixed-space font denotes literal items, such as file names, pathnames, man page names, command names, and programming language elements.

    variable      Italic typeface indicates an element that you will replace with a specific value. For instance, you may replace filename with the name datafile in your program. It also denotes a word or concept being defined.

    user input    This bold, fixed-space font denotes literal items that the user enters in interactive sessions. Output is shown in nonbold, fixed-space font.

    [ ]           Brackets enclose optional portions of a syntax representation for a command, library routine, system call, and so on.

    ...           Ellipses indicate that a preceding element can be repeated.

    name(N)       Denotes man pages that provide system and programming reference information. Each man page is referred to by its name followed by a section number in parentheses. To see the meaning of each section number for your particular system, enter:

                      % man man

Reader Comments

Contact us with any comments that will help us to improve the accuracy and usability of this document. Be sure to include the title and number of the document with your comments. We value your comments and will respond to them promptly. Contact us in any of the following ways:

E-mail:
    docs@cray.com
Telephone (inside U.S., Canada):
    1–800–950–2729 (Cray Customer Support Center)
Telephone (outside U.S., Canada):
    +1–715–726–4993 (Cray Customer Support Center)
Mail:
    Customer Documentation
    Cray Inc.
    1340 Mendota Heights Road
    Mendota Heights, MN 55120–1128
    USA

Cray User Group

The Cray User Group (CUG) is an independent, volunteer-organized international corporation of member organizations that own or use Cray Inc. computer systems. CUG facilitates information exchange among users of Cray systems through technical papers, platform-specific e-mail lists, workshops, and conferences. CUG memberships are by site and include a significant percentage of Cray computer installations worldwide. For more information, contact your Cray site analyst or visit the CUG website at www.cug.org.

Introduction [1]

This is a guide for users who develop applications for Cray X2 systems. It explains how to log on to the system, set up a user environment, and develop applications. The intended audience is application programmers and end users. Prerequisite knowledge is a familiarity with the topics in the Cray XT Series System Overview and Cray X2 System Overview.

1.1 System Environment

The system on which you run your Cray X2 applications is an integrated set of Cray X2 and Cray XT series components. You log in to a Cray XT series login node or a cross-compiler machine and use the Cray X2 Programming Environment and related products to create your executables. You run your executables on Cray X2 compute nodes.

Note: For details about installing and configuring a cross-compiler system, see the Cray Programming Environment Releases Overview and Installation Guide.

The Cray X2 operating system, UNICOS/lc, has two components: CNL and SUSE LINUX. CNL is the Cray X2 compute node operating system. It is a low-overhead combination of a Linux kernel and other software that initiates programs, manages node memory, and terminates programs. The Cray XT series login node and other service nodes (I/O, network, and boot nodes) run a full-featured SUSE LINUX operating system.

1.2 Development Environment

The Cray X2 development environment consists of the Cray X2 Programming Environment and related products and services:

• Cray Fortran, C, and C++ compilers (see Chapter 5)
• Parallel programming models:
  – Cray MPICH2 message passing interface (see Section 3.1)
  – Cray SHMEM shared memory access interface (see Section 3.2)
  – Co-array Fortran (CAF) (see Section 3.3)
  – Unified Parallel C (UPC) (see Section 3.4)
  – OpenMP (see Section 3.5)
• LibSci scientific library functions (see Chapter 3)
• CrayLibs functions (see Chapter 3)
• C language run time library (glibc) functions (see Section 3.9)
• Lustre file system (see Section 4.3)
• Application launch and compute node status commands, PBS Pro batch processing (see Section 6.1)
• TotalView debugger (see Chapter 7)
• Performance analysis tools (see Section 8.1)
• Optimization tools and techniques (see Optimizing Applications on Cray X2 Systems)

1.3 Documentation Included with This Release

The Cray X2 system provides a combination of proprietary, open source, and third-party documents. All Cray manuals are provided as PDF files, and many are available as HTML files. You can view the manuals and man pages through the CrayDoc interface or move the files to another location, such as your desktop.

Note: You can use the Cray X2 System Documentation Site Map on CrayDoc to link to the manuals and man pages included with this release.

Table 1.
Cray X2 Publications

Getting Started on Cray X2 Systems (this manual)
Cray XT Series System Overview
Cray Programming Environment Releases Overview and Installation Guide
Cray Fortran Reference Manual
Fortran directives (read intro_directives(1) man page first)
Cray C and C++ Reference Manual
C and C++ pragmas (read intro_pragmas(1) man page first)
Cray MPICH2 man pages (read intro_mpi(3) first)
Cray SHMEM man pages (read intro_shmem(3) first)
UPC man pages (read intro_upc(3c) first)
OpenMP man pages (omp_lock(3), omp_nested(3), omp_threads(3), omp_timing(3))
LibSci man pages (read intro_libsci(3s) first)
Cray FFT man pages (read intro_fft(3s) first)
Cray Bioinformatics man pages (read intro_biolib(3) first)
Programming Environment man pages (ftn(1), cc(1), CC(1))
Application launch man pages (aprun(1), cnselect(1))
TotalView Debugger Users Guide
TotalView Release Overview and Installation Guide for Cray X2 Systems
TotalView man pages (totalview(1), totalviewcli(1))
PBS Professional 8.0 User Guide
PBS Pro man pages (qsub(1B), qdel(1B), qstat(1B))
PAPI User's Guide
PAPI man pages (read intro_papi(3) first)
Using Cray Performance Analysis Tools
CrayPat and Cray Apprentice2 man pages (pat_build(1), pat_report(1), app2(1))

Setting up Your User Environment [2]

Configuring your user environment on a Cray X2 system is similar to configuring a typical Linux workstation. However, there are steps that you must take before you begin developing applications.

Note: This chapter describes the user environment on a Cray XT series login node. Cray X2 users have the option of creating executables on a cross-compiler machine. For information about installing and configuring a cross-compiler machine, see the Cray Programming Environment Releases Overview and Installation Guide.

The following descriptions assume that your environment is set up in the default configuration. Your site may differ.
Contact your site administrator for site-specific information.

2.1 Setting Up a Secure Shell

Cray X2 systems use ssh and ssh-enabled applications such as scp for secure, password-free remote access to the Cray XT login nodes. Before you can use the ssh commands, you must generate an RSA authentication key if your system administrator has not already done so.

You can generate the key with or without a passphrase. Although both methods are described here, you must use the latter method (no passphrase) to access the compute nodes through a script.

2.1.1 RSA Authentication with a Passphrase

To enable ssh with a passphrase, complete the following steps.

1. Generate the RSA keys by entering the following command:

   % ssh-keygen -t rsa

   and follow the prompts. You will be asked to supply a passphrase.

2. Create a $HOME/.ssh directory and set permissions so that only the directory's owner can access it:

   % mkdir $HOME/.ssh
   % chmod 700 $HOME/.ssh

3. The public key is stored in the $HOME/.ssh directory. Enter the following command to copy the key to your home directory on the remote host(s):

   % scp $HOME/.ssh/key_filename.pub \
   username@system_name:.ssh/authorized_keys

4. Connect to the remote host by typing the following commands. If you are using a C shell, enter:

   % eval `ssh-agent`
   % ssh-add

   If you are using a Bourne shell, enter:

   $ eval `ssh-agent -s`
   $ ssh-add

   Enter your passphrase when prompted, followed by:

   % ssh remote_host_name

2.1.2 RSA Authentication without a Passphrase

To enable ssh without a passphrase, complete the following steps.

1. Generate the RSA keys by typing the following command:

   % ssh-keygen -t rsa -N ""

   and follow the prompts.

2. Create a $HOME/.ssh directory and set permissions so that only the directory's owner can access it:

   % mkdir $HOME/.ssh
   % chmod 700 $HOME/.ssh

3. The public key is stored in the $HOME/.ssh directory.
   Type the following command to copy the key to your home directory on the remote host(s):

   % scp $HOME/.ssh/key_filename.pub \
   username@system_name:.ssh/authorized_keys

   Note: This step is not required if your home directory is shared.

4. Connect to the remote host by typing the following command:

   % ssh remote_host_name

2.1.3 Additional Information

For more information about setting up and using a secure shell, see the ssh(1), ssh-keygen(1), ssh-agent(1), ssh-add(1), and scp(1) man pages.

2.2 Using Modules

The Cray X2 system includes the Modules software package, which is used to create integrated software packages and support multiple versions of software. As new versions of the supported software and associated man pages become available, they are added automatically to the Programming Environment, while earlier versions are retained to support legacy applications. By specifying the module to load, you can choose the default version of an application or another version. Modules also provide a simple mechanism for dynamically modifying your user environment.

Before working on your application, make sure the appropriate modules have been loaded. Enter:

% module list

When you log on to a Cray XT login node, the Cray XT PrgEnv-pgi may be loaded; use the module list command to verify. If PrgEnv-pgi is loaded, switch to the Cray X2 Programming Environment by entering:

% module swap PrgEnv-pgi PrgEnv-x2

The PrgEnv-x2 module loads the product modules that define the system paths and environment variables needed for a default Cray X2 environment. As you become more familiar with the Programming Environment, you can choose to add or remove individual modules, but as a rule, the easiest way to avoid many common problems is to start by loading the complete PrgEnv-x2 module. You will need to load additional modules for some products, as noted throughout this guide.
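
The check-then-swap step described above can be sketched as a small shell function. This is a minimal sketch and not a Cray-supplied tool: the function name prgenv_swap_cmd is hypothetical, and it assumes that the module list output contains the module names as plain text (possibly followed by a version suffix). It reads the listing from standard input and prints the command to run instead of running it.

```shell
# Hypothetical helper (not a Cray tool): read `module list` output on stdin
# and print the command needed to reach the Cray X2 Programming Environment.
prgenv_swap_cmd() {
    listing=$(cat)
    case "$listing" in
        *PrgEnv-x2*)  echo "# PrgEnv-x2 already loaded" ;;
        *PrgEnv-pgi*) echo "module swap PrgEnv-pgi PrgEnv-x2" ;;
        *)            echo "module load PrgEnv-x2" ;;
    esac
}

# Example with a made-up listing (the version suffix is illustrative only).
printf '%s\n' 'Currently Loaded Modulefiles:' '  1) PrgEnv-pgi/1.0' | prgenv_swap_cmd
```

On a login node, the same function could be fed real output, for example with module list 2>&1 | prgenv_swap_cmd, assuming your Modules installation writes the listing to standard error.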

To get a list of all available modules, enter:

% module avail

If you need to swap out one module and replace it with another, enter:

% module swap swap_out_module swap_in_module

Note: Man pages are packaged in the modules with the software they document. The man pages do not become available until after you have loaded the appropriate module.

For further information about the module utility, see the module(1) and modulefile(4) man pages.

2.3 Modifying Environment Variables

You can use the Modules package to modify environment variables, such as PATH, MANPATH, and LD_LIBRARY_PATH. For example, to add directories to be searched for commands and programs, modify the PATH variable by adding paths; do not reinitialize the system-defined PATH.

The following example shows how to modify the PATH variable to add $HOME/bin to the path. If you are using csh, enter:

% set path = ($path $HOME/bin)

If you are using bash, enter:

$ export PATH=$PATH:$HOME/bin

Using Libraries [3]

This chapter describes the libraries and parallel programming model constructs that are available to application developers.

3.1 MPICH2

The MPI library for Cray X2 systems is Cray MPICH2. Cray MPICH2 implements the MPI-2 standard, except for support of spawn functions. It also implements the MPI 1.2 standard, as documented by the MPI Forum in the spring 1997 release of MPI: A Message Passing Interface Standard.

For more information about Cray MPICH2 functions and environment variables, refer to the MPI man pages, starting with intro_mpi(3).

Cray MPICH2 includes ROMIO, a high-performance, portable MPI-IO implementation developed by Argonne National Laboratory. For more information about using ROMIO, including optimization tips, refer to the ROMIO man pages and the ROMIO website at http://www-unix.mcs.anl.gov/romio/.

Note: Cray MPICH2 is part of the Message Passing Toolkit.
You must have the x2-mpt module loaded in order to compile and get viable results. If the x2-mpt module is not loaded, the code may appear to compile without error, but it is highly unlikely that the executable will run without error.

To use MPICH2 functions on Cray X2 systems, you should be aware of the following limitations:

• There is a name conflict between stdio.h and the MPI C++ binding in relation to the names SEEK_SET, SEEK_CUR, and SEEK_END. If your application does not reference these names, you can work around this conflict by using the compiler flag -DMPICH_IGNORE_CXX_SEEK. If your application does require these names, as defined by MPI, undefine the names (#undef SEEK_SET, for example) prior to including the mpi.h declaration. Alternatively, if the application requires the stdio.h naming, your application should include the mpi.h declaration before the stdio.h or the iostream declaration.
• The following process-creation functions are not supported and, if used, generate aborts at run time:
  – MPI_Close_port() and MPI_Open_port()
  – MPI_Comm_accept()
  – MPI_Comm_connect() and MPI_Comm_disconnect()
  – MPI_Comm_spawn() and MPI_Comm_spawn_multiple()
  – MPI_Comm_get_attr() with attribute MPI_UNIVERSE_SIZE
  – MPI_Comm_get_parent()
  – MPI_Lookup_name()
  – MPI_Publish_name() and MPI_Unpublish_name()
• The MPI_LONG_DOUBLE data type is not supported.
• The behavior of the MPICH2 function MPI_Dims_create() is not consistent with the MPI standard. Therefore, Cray added a special mpi_dims_create algorithm to the MPI library. This added function is enabled by default.

For an example showing how to compile and launch an MPI application, see Example 1.

3.2 SHMEM

Cray SHMEM is a one-sided data passing model in which memory is private to each task.
SHMEM functions can be used in programs that perform computations in separate address spaces and that explicitly pass data by means of shmem_put() and shmem_get() functions to and from different processing elements. SHMEM functions can be called from Fortran, C, and C++ programs and used either by themselves or with MPI functions. SHMEM requires symmetric data objects and explicit barriers and synchronization calls; SHMEM has no implicit barriers.

Note: SHMEM is part of the Message Passing Toolkit. You must have the x2-mpt module loaded in order to compile and get viable results. If the x2-mpt module is not loaded, the code may appear to compile without error, but it is highly unlikely that the executable will run without error.

For more information about Cray SHMEM functions, refer to the SHMEM man pages, starting with intro_shmem(3).

To build, compile, and run SHMEM applications, you need to use start_pes(int npes) or shmem_init() as the first SHMEM call and shmem_finalize() as the last SHMEM call.

For an example showing how to compile and launch SHMEM applications, see Example 2 and Example 3.

3.3 CAF

Co-array Fortran (CAF) is a parallel programming model that uses zero-sided data transfers. Programs with co-arrays use the Single Program, Multiple Data (SPMD) execution model; the program and all its data are replicated and executed asynchronously. Each replication of the program is an image that is executed as a processing element. Any image can reference any co-array and, through pointer components, remote data with the TARGET attribute.

CAF uses a notation for accessing objects across images. A CAF specification consists of the local object specification and the co-dimensions specification. Square brackets specify image indices. For example, the statement A(:) = B(:)[2] assigns all elements of co-array B on image 2 to array A on the local image.

The CAF user interface includes Fortran CAF syntax, CAF intrinsic and synchronization procedures, and the ftn -Z compiler command option. Co-arrays can interoperate with MPI and SHMEM.

For further information, see the Cray Fortran Reference Manual and the ftn(1) man page.

3.4 UPC

Unified Parallel C (UPC) is an extension of ANSI C for developing distributed-memory parallel applications. Work is distributed among threads, and memory is divided into private and shared spaces. Threads communicate through one-sided data exchanges. Each thread has its own private space, in addition to a portion of the shared space.

The user interface includes UPC keywords, statements, pragmas, and cc compiler command options. To share an array among threads, use the shared keyword in the array declaration.

UPC has no implicit synchronization calls. Explicit synchronization and locking functions manage thread behavior. UPC applications can span all application instances, and UPC code is compatible with MPI and SHMEM. UPC is supported for C programs but not C++.

Cray supports the UPC Language Specification 1.2 and also supports some Cray-specific functions as noted in the Cray C and C++ Reference Manual and the CC(1) and UPC man pages (read intro_upc(3c) first).

3.5 OpenMP

The Cray X2 system supports version 2.0 of the OpenMP Application Program Interface (API) standard. OpenMP is a shared-memory parallel programming model that application developers can use to create and distribute work using threads. OpenMP applications can be used in hybrid OpenMP/MPI applications but may not cross node boundaries.

The OpenMP user interface includes compiler command options, library functions, Fortran directives, C and C++ pragmas, and environment variables. The number of processing elements hosting OpenMP threads at any given time is fixed at program startup and specified by the aprun -d depth option.

For further information, see the OpenMP man pages, the aprun(1) man page, and the Cray Fortran Reference Manual or the Cray C and C++ Reference Manual.

3.6 LibSci

LibSci, the Cray X2 optimized scientific library, includes:

• Fast Fourier transform (FFT), filter, and convolution functions
• Basic Linear Algebra Subprograms (BLAS)
• Linear Algebra Package (LAPACK) functions
• Scalable LAPACK (ScaLAPACK) distributed memory parallel functions
• Basic Linear Algebra Communication Subprograms (BLACS)

LibSci supports 32- and 64-bit data types and provides Fortran interfaces for all functions. The default 32-bit version of LibSci provides both single- and double-precision functions. The 64-bit version of LibSci provides single-precision functions.

For additional information, see the intro_libsci(3s) man page.

3.7 CrayLibs

The Cray X2 system supports the following CrayLibs functions:

• Conversion functions, such as CRI2IEG, which converts Fortran data types between Cray IEEE and generic IEEE data types. For further information, see the conversion functions man pages (read intro_conversion(3f) first).
• FFIO functions, such as ASSIGN, which provides a subroutine call interface to assign processing. For further information, see the FFIO functions man pages (read intro_ffio(3f) first).
• I/O functions, such as ASNQFILE, which returns attributes for a file. For further information, see the I/O functions man pages (read intro_io(3f) first).
• Timing functions, which return the elapsed CPU time and wall-clock time. For further information, see the timing function man pages (read intro_timing(7) first).
• PXF POSIX library functions, such as PXFGETPPID, which gets the parent process ID. For further information, see the PXF POSIX functions man pages (read intro_pxf(3f) first).
• Mathematical functions, such as sin(), which returns the sine of an angle. For further information, see the intro_libm(3f) man page.

3.8 BioLib

The Cray Bioinformatics Library (BioLib) functions perform operations useful in the analysis of nucleotide and amino acid sequence data. Library functions include:

• Data compression functions, which convert ASCII data to coded forms that use less memory and allow the BioLib functions to perform their tasks faster
• Nucleotide sequence characterization or transformation functions
• Search and sort functions
• Smith-Waterman functions, which help to find the Smith-Waterman alignment between two sequences
• FASTA-style data functions, which read and/or reorganize data from a FASTA-style file
• Miscellaneous functions

You can reference the functions from Fortran, C, or C++, unless noted otherwise in the man pages (read intro_biolib(3) first).

3.9 glibc and Linux System Calls

Because the Cray X2 compute node operating system, CNL, is designed to support resource-intensive, high-speed computational applications, its functionality is limited in certain areas where the service nodes are expected to take over. CNL does not support the following C language run time library (glibc) functions or Linux system calls:

• Pipes, sockets, remote procedure calls, or other TCP/IP communication. The parallel programming model interfaces are the only node-to-node communication mechanisms.
• Dynamic loading of executable code.
• The /proc files such as cpuinfo and meminfo. (These files contain information about your login node.)
• Any functions that require a daemon.
• Any functions that require a database, such as ndb(). For example, there is no support for the uid() and gid() family of queries that are based on ndb().
• The fork() function. Cray has modified the system() and popen() functions to use vfork() instead.
• The mprotect() function for HUGETLB pages.
• Name Service Switch (NSS) functions, such as getpwnam() and gethostbyname().

Appendix A lists the glibc functions that Cray X2 supports.

Programming Considerations [4]

The manuals and man pages for third-party and open source products provide platform-independent descriptions of product features. This chapter provides Cray X2 specific information you should consider when using those products to develop Cray X2 applications.

4.1 I/O Support

I/O support for compute node applications is limited. The only operations allowed are Fortran, C, and C++ I/O calls; Cray MPICH2 and UPC I/O functions; and CNL I/O functions. Application programmers should keep in mind the following behaviors:

• I/O is offloaded to the service I/O nodes. The Lustre file system handles all file operations for parallel programs. The aprun application launcher handles stdin, stderr, and stdout.
• Calling an I/O function such as open() with a bad address causes the application to fail with a page fault. On the service nodes, a bad address causes the function to set errno = EFAULT and return -1.

4.2 Cray MPICH2 Programming Considerations

In using MPICH2 functions, you should be aware of the following issues:

• There is a name conflict between stdio.h and the MPI C++ binding in relation to the names SEEK_SET, SEEK_CUR, and SEEK_END. If your application does not reference these names, you can work around this conflict by using the compiler flag -DMPICH_IGNORE_CXX_SEEK. If your application does require these names, as defined by MPI, undefine the names (#undef SEEK_SET, for example) prior to the #include "mpi.h" statement. Alternatively, if the application requires the stdio.h naming, your application should include the #include "mpi.h" statement before the #include <stdio.h> or #include <iostream> statement.
• The following process-creation functions are not supported and, if used, generate aborts at run time:
  – MPI_Close_port() and MPI_Open_port()
  – MPI_Comm_accept()
  – MPI_Comm_connect() and MPI_Comm_disconnect()
  – MPI_Comm_spawn() and MPI_Comm_spawn_multiple()
  – MPI_Comm_get_attr() with attribute MPI_UNIVERSE_SIZE
  – MPI_Comm_get_parent()
  – MPI_Lookup_name()
  – MPI_Publish_name() and MPI_Unpublish_name()
• The MPI_LONG_DOUBLE data type is not supported.
• The behavior of the MPICH2 function MPI_Dims_create() is not consistent with the MPI standard. Therefore, Cray added a special mpi_dims_create algorithm to the MPI library. This added function is enabled by default.

4.3 Lustre File System

You use the Lustre file system for parallel I/O. To use Lustre, your application must direct file operations to paths within a Lustre mount point. To determine the Lustre mount points as seen by Lustre applications, ask your system administrator or search the /etc/sysio_init file for the string llite. For example, enter:

% grep llite /etc/sysio_init

Your output will be similar to this:

{creat, ft=file,nm="/lus/nid00007/.mount",pm=0644,str="llite:7:/nid00007-mds/client"}
{creat, ft=file,nm="/lus/nid00135/.mount",pm=0644,str="llite:135:/nid00135_mds/client"}
{creat, ft=file,nm="/lus/nid00012/.mount",pm=0644,str="llite:12:/nid00012_mds/client"}

In this example, the mount points are:

/lus/nid00007
/lus/nid00135
/lus/nid00012

4.4 Timing Functions

CNL supports library routines that retrieve elapsed CPU time (such as cpu_time()) and wall clock time (such as rtc()). These routines use the real-time clock as well as information from the kernel. The real-time clock begins with zero. If the value is an integer, a full 64 bits should be used to represent the value.

For details, see the intro_timing(7) man page.
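
Returning to the mount-point lookup in Section 4.3: reading the mount points out of the grep output can be scripted. The following is a minimal sketch, not a Cray utility. The function name lustre_mounts is hypothetical, and the sketch assumes that every llite entry carries its mount point in an nm="mount_point/.mount" field, as in the sample output shown in Section 4.3.

```shell
# Hypothetical helper: print the Lustre mount points found in
# sysio_init-style text read from stdin.
# Assumption: each llite entry has the form nm="<mount_point>/.mount".
lustre_mounts() {
    grep llite | sed -n 's|.*nm="\([^"]*\)/\.mount".*|\1|p'
}

# Example using the three sample lines from Section 4.3.
lustre_mounts <<'EOF'
{creat, ft=file,nm="/lus/nid00007/.mount",pm=0644,str="llite:7:/nid00007-mds/client"}
{creat, ft=file,nm="/lus/nid00135/.mount",pm=0644,str="llite:135:/nid00135_mds/client"}
{creat, ft=file,nm="/lus/nid00012/.mount",pm=0644,str="llite:12:/nid00012_mds/client"}
EOF
```

On a login node, the function could be applied to the real file with lustre_mounts < /etc/sysio_init, provided your site's file follows the same layout.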

4.5 Signal Support

Signal handlers installed through sigaction() have the prototype:

void (*handler) (int, siginfo_t *, void *)

which allows a signal handler to optionally request two extra parameters. On compute nodes, these extra parameters are provided in a limited fashion when requested. The siginfo_t pointer points to a valid structure of the correct size but contains no data. The void * parameter points to a ucontext_t structure. The uc_mcontext field within that structure is a platform-specific data structure that, on compute nodes, is defined as a sigcontext_t structure. Within that structure, the general purpose and floating point registers are provided to the user. You should rely on no other data.

4.6 Little-endian Support

The Cray X2 system supports little-endian byte ordering. The least significant byte of any value consisting of multiple bytes is stored in the byte with the lowest memory address.

Using Cray Compilers [5]

The Cray X2 Programming Environment includes Cray Fortran, C, and C++ compilers. You access the compilers through Cray X2 compiler drivers. The compiler drivers perform the necessary initializations and load operations, such as linking in the header files and system libraries (libc.a and libmpich.a, for example) before invoking the compilers.

The syntax for the compiler driver commands is:

compiler_driver_command [Cray_compiler_options] filename,...

For example, to use the Cray Fortran compiler to compile prog1.ftn and create executable prog1, enter:

% ftn -o prog1 prog1.ftn

The Cray compilers provide the following features:

• The C and C++ compilers support ANSI C99.
• The Cray Fortran compiler supports the majority of the Fortran 2003 standard. Deferred features are specified in the Cray Fortran Reference Manual.
• Enhanced vectorization for C and Fortran, which includes:
  – Further tuning of the vectorizer to support alternate code generation
  – Additional idiom recognition
  – Vectorization of additional loops with references to transcendental functions
  – Processor-specific instruction selection
  – Support for additional vectorization directives and pragmas

The commands for invoking the Cray compilers and the source file extensions are:

Table 2. Cray Compiler Commands

Compiler                 Command    Source File
Cray C compiler          cc         filename.c
Cray C++ compiler        CC         filename.C, filename.c, filename.C++, filename.c++, filename.Cxx, filename.cxx
Cray Fortran compiler    ftn        filename.f90, filename.F90, filename.ftn, filename.FTN (free source)
                                    filename.f, filename.F (fixed source)

For examples of compiler command usage, see Chapter 9.

For more information about the compiler commands, see the CC(1) and ftn(1) man pages.

To verify that you are using the correct version of a compiler, enter the -V option to the cc, CC, or ftn command. For example:

% CC -V
Cray C++ : Version 6.0.0.0.48 Tue Aug 21, 2007 14:36:38

The explain utility retrieves and outputs a message explanation from an online explanation catalog. Its format is:

% explain msgid

where msgid is the message ID string associated with a compiler error message. This string consists of the product group code and the message number. For example, the explain message for the C++ compiler message number 4 is:

% explain CC-4
Not enough memory is available.
The compiler is unable to allocate enough memory space needed to complete the compilation.

For details, see the explain(1) man page.

Running Applications [6]

There are two methods of running applications: interactively and as batch jobs. To run an application interactively:

• Use the aprun command, or
• Use the PBS Pro qsub -I command to initiate an interactive session; then use the aprun command to run your application interactively.

The basic process for creating and running batch jobs is to create a PBS Pro job script that includes one or more aprun commands and then use the PBS Pro qsub command to run the script.

6.1 Using the aprun Application Launcher

You use the aprun command to specify the resources your application requires, request application placement, and initiate application launch. The basic format of the aprun command is:

aprun [-n pes] [-N pes_per_node] [-d depth] [other arguments] executable_name

where:

aprun option       Description
-n pes             The number of processing elements (PEs) needed for the application. A PE is an instance of an executable.
-N pes_per_node    The number of PEs per node (pes_per_node can be 1, 2, 3, or 4).
-d depth           The number of OpenMP threads per PE (depth can be 1, 2, 3, or 4). The default is 1.

Note: You need to be in a Lustre-mounted directory to use the aprun command. For information about locating Lustre-mounted directories, see Section 4.3.

For more information about aprun options, see the aprun(1) man page.

6.2 Using PBS Pro

Your Cray X2 Programming Environment may include the optional PBS Pro batch scheduling software package from Altair Grid Technologies. This section describes the basic process for using PBS Pro.

To submit a job to the batch scheduler, use the following command:

% qsub [-l resource_type=specification] jobscript

where jobscript is the name of a PBS Pro job script that includes one or more aprun commands. A job script may consist of PBS Pro directives, comments, and executable statements.

The aprun resource options and their qsub resource_type=specification counterparts are defined as follows:

Table 3. aprun Versus qsub Options

aprun option    qsub -l option     Description
-n 64           -l mppwidth=64     Width (number of PEs) = 64
-N 3            -l mppnppn=3       Number of PEs per node = 3
-d 4            -l mppdepth=4      Depth (number of threads per PE) = 4
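
The correspondence in Table 3 can be sketched as a shell function that turns aprun-style resource options into the matching qsub -l options. This is an illustrative sketch only: the function name aprun_to_qsub is hypothetical, it handles just the three options listed in Table 3, and it prints the qsub options rather than submitting a job.

```shell
# Hypothetical helper: translate the aprun options -n, -N, and -d (Table 3)
# into the corresponding qsub -l resource options and print them.
aprun_to_qsub() {
    out=""
    while [ $# -ge 2 ]; do
        case "$1" in
            -n) out="$out -l mppwidth=$2" ;;
            -N) out="$out -l mppnppn=$2" ;;
            -d) out="$out -l mppdepth=$2" ;;
            *)  echo "unsupported option: $1" >&2; return 1 ;;
        esac
        shift 2
    done
    # A leftover argument means an option was given without a value.
    if [ $# -ne 0 ]; then
        echo "missing value for: $1" >&2
        return 1
    fi
    # Strip the leading space before printing.
    echo "${out# }"
}

# Example using the values from Table 3.
aprun_to_qsub -n 64 -N 3 -d 4
# prints: -l mppwidth=64 -l mppnppn=3 -l mppdepth=4
```

The printed options could then be placed on a qsub command line ahead of the job script name, following the qsub syntax shown above.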

A PBS Pro directive provides a way to specify job attributes apart from the aprun command-line options:

#PBS -N job_name
#PBS -l mppwidth=pes
# ...
cd /lus/nidnnnnn
aprun -n pes

The qstat command displays the following information about all jobs currently running under PBS Pro:

• The job identifier (Job id) assigned by PBS Pro
• The job name (Name) given by the submitter
• The job owner (User)
• CPU time used (Time Use)
• The job state (S): whether the job is exiting (E), held (H), in the queue (Q), running (R), suspended (S), being moved to a new location (T), or waiting for its execution time (W)
• The queue (Queue) in which the job resides

The qdel command removes a PBS Pro batch job from the queue. As a user, you can remove any batch job for which you are the owner. Jobs are removed from the queue in the order they are presented to qdel.

For details, see the qsub(1B), qstat(1B), and qdel(1B) man pages and the PBS Professional 8.0 User Guide.

Debugging an Application [7]

This chapter gives an overview of the processes and options available when using the TotalView debugging software package from TotalView Technologies.

7.1 Using the TotalView Debugger

Cray X2 systems support a special implementation of the TotalView version 7.0 debugger. The TotalView debugger provides source-level debugging of Cray Fortran, C, and C++ programs.

TotalView provides the following functions:

• A command-line interface (CLI) with command-line help and a graphical user interface (GUI)
• Support of programs written in mixed languages
• Debugging up to 1024 compute node processes

Note: Cray X2 systems do not support TotalView memory debugging functions or MPI message queue displays.

TotalView typically is run interactively. If your site has not designated any compute nodes for interactive processing, use the PBS Pro qsub -I interactive mode described in Chapter 6.

TotalView Technologies provides general documentation for version 7.x TotalView (see http://www.totalviewtech.com/Documentation/rel7.html). For information about Cray-specific commands, differences, and limitations, see the TotalView Release Overview and Installation Guide for Cray X2 Systems. For details about aprun, see Section 6.1 and the aprun(1) man page.

You can run TotalView from a graphical user interface (GUI) or a command-line interface (CLI).

7.1.1 TotalView GUI

To start the TotalView graphical user interface, enter:

% totalview aprun -a -b [other_aprun_args] application_name

where:

-a    A TotalView argument declaring that all of the remaining arguments apply to aprun.
-b    The aprun bypass-binary-transfer argument. Because TotalView requires this argument on Cray X2 systems, application_name must be on a file system common to both the login node and the compute nodes (such as a Lustre directory or /scratch).

A process window acknowledges that you have a parallel program. You then start aprun by selecting Go. The application starts, and TotalView asks you if you want to let it run or stop it so you can set breakpoints.

To debug a core file using the TotalView GUI, enter:

% totalview -e 'dcore application_name core_file_1 [core_file_2,...]'

where:

-e    Directs TotalView to run the dcore command immediately.

When the processes are attached and listed in the main window, select one and dive on it (View > Dive) to bring up a process window.

For a list of TotalView command shortcuts, see Section 7.1.3.

7.1.2 TotalView CLI

To start the TotalView command-line interface, enter:

% totalviewcli aprun
d1.<> drun -b [other_aprun_args] application_name < /dev/null

where:

-b    The aprun bypass-binary-transfer argument.
Because TotalView requires -b on Cray X2 systems, the application executable must be on a file system common to both the login node and the compute nodes (such as a Lustre directory or /scratch).

< /dev/null    Redirects stdin so aprun does not "steal" typed characters of TotalView commands.

When aprun is loaded, you start aprun with a dgo command. Once the application has been launched and attached, use the dfocus command to focus on the various processes (process 1 will be aprun).

To debug a core file using the TotalView CLI, use these commands:

% totalviewcli
d1.<> dcore ./application_name core_file_1 [core_file_2,...]

The dcore command attaches to each process in the specified core file(s). Use the dfocus command to switch processes.

7.1.3 Command Shortcuts

Here are some command shortcuts you can enter from the command line or the GUI command-line window. The lowercase commands (g, n, and so on) are applied only to the process in focus. If you want to apply commands to all processes, use G, N, and so on.

Table 4. Shortcuts for TotalView Commands

Shortcut    Alias for    Definition
b           dbreak       Set a breakpoint.
g           dgo          Direct the process to run.
f           dfocus       Focus on another process.
st          dstatus      Output the status of the process.
w           dwhere       Output the stack trace of the process.
f a cmd                  Focus on all processes and run TotalView command cmd.
G           f a g        Direct all processes to run.
N           f a n        Apply the next command to all processes.
S           f a s        Apply the step command to all processes.
ST          f a st       Output the status of all processes.
W           f a w        Output the stack trace of all processes.
help cmd                 Display help text for cmd.
q                        Quit TotalView.
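The case distinction in Table 4 (lowercase for the process in focus, uppercase for all processes) can be sketched as a small case-sensitive lookup in C. This is an illustrative sketch only and is not part of TotalView: the names shortcut_entry, shortcut_table, and shortcut_alias are hypothetical, and the table simply transcribes the rows of Table 4 that list an alias.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical record pairing a Table 4 shortcut with the command it aliases. */
struct shortcut_entry {
    const char *shortcut;
    const char *alias;
};

/* Rows of Table 4 that have an "Alias for" entry, copied as listed. */
static const struct shortcut_entry shortcut_table[] = {
    { "b",  "dbreak"  },
    { "g",  "dgo"     },
    { "f",  "dfocus"  },
    { "st", "dstatus" },
    { "w",  "dwhere"  },
    { "G",  "f a g"   },
    { "N",  "f a n"   },
    { "S",  "f a s"   },
    { "ST", "f a st"  },
    { "W",  "f a w"   },
};

/* Return the alias for a shortcut, or NULL if the shortcut is not in the table.
   The comparison is case-sensitive, so "g" and "G" resolve differently. */
const char *shortcut_alias(const char *shortcut)
{
    size_t i;
    size_t count = sizeof shortcut_table / sizeof shortcut_table[0];

    for (i = 0; i < count; i++) {
        if (strcmp(shortcut_table[i].shortcut, shortcut) == 0) {
            return shortcut_table[i].alias;
        }
    }
    return NULL;
}
```

For example, shortcut_alias("g") yields "dgo", while shortcut_alias("G") yields "f a g", which in turn expands through the f and g rows to a dfocus on all processes followed by dgo.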
Analyzing Performance [8]

This chapter describes the Cray X2 performance analysis tools:

• Performance API (PAPI)
• CrayPat performance analyzer
• Cray Apprentice2 performance data visualization tool

8.1 Using PAPI

PAPI is a standard API for accessing processor registers that count events or occurrences of specific signals related to the processor's function. By monitoring these events, you can determine the extent to which your code efficiently maps to the underlying architecture.

PAPI provides two interfaces to the counter hardware:

• A high-level interface for basic measurements
• A fully programmable, low-level interface for users with more sophisticated needs

To use PAPI, you must load the PAPI module:

% module load papi

For more information about PAPI, see http://icl.cs.utk.edu/papi/.

8.1.1 Using the High-level PAPI Interface

The high-level interface provides the ability to start, stop, and read specific events, one at a time. You include calls to PAPI high-level functions such as PAPI_start_counters() and PAPI_stop_counters() in your program to collect information about events such as total cycles (PAPI_TOT_CYC) and total instructions (PAPI_TOT_INS). For further information about the high-level interface, see the PAPI User Guide.

8.1.2 Using the Low-level PAPI Interface

The low-level PAPI interface deals with hardware events in groups called event sets. An event set maps the hardware counters available on the system to a set of predefined events called presets. The event set reflects how the counters are most frequently used, such as taking simultaneous measurements of different hardware events and relating them to one another. For example, relating cycles to memory references or flops to level-1 cache misses can reveal poor locality and memory management.
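As a rough illustration of relating counters to one another, the following C sketch computes the two ratios mentioned above from raw counter readings. It does not call PAPI: the counter_sample structure, its field names, and both functions are hypothetical, and the readings are assumed to have been collected already by whatever event set is in use.

```c
#include <stdio.h>

/* Hypothetical container for raw readings taken from one event set. */
struct counter_sample {
    long long cycles;      /* total cycles */
    long long mem_refs;    /* memory references */
    long long flops;       /* floating-point operations */
    long long l1_misses;   /* level-1 cache misses */
};

/* Divide one counter by another. Returns 0.0 when the denominator is zero
   so that a counter that never fired does not cause a division by zero. */
double counter_ratio(long long numerator, long long denominator)
{
    if (denominator == 0) {
        return 0.0;
    }
    return (double)numerator / (double)denominator;
}

/* Print the two derived metrics for one sample. */
void report_sample(const struct counter_sample *s)
{
    printf("cycles per memory reference: %.3f\n",
           counter_ratio(s->cycles, s->mem_refs));
    printf("flops per L1 cache miss:     %.3f\n",
           counter_ratio(s->flops, s->l1_misses));
}
```

A high cycles-per-memory-reference figure, or a low flops-per-miss figure, is the kind of signal that would prompt a closer look at data locality; the thresholds that matter depend on the application and are not specified here.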
Event sets are fully programmable and have features such as guaranteed thread safety, writing of counter values, multiplexing, and notification on threshold crossing, as well as processor-specific features.

For the list of predefined event sets, see the hwpc(3) man page. For information about constructing an event set, see the PAPI User Guide and the PAPI Programmer's Reference manual.

8.2 Using CrayPat

CrayPat helps you analyze the performance of programs running on Cray X2 systems. Here is an overview of how to use it:

1. Load the craypat module:

   % module load craypat

   Note: You must load the craypat module before building even the uninstrumented version of the application.

2. Compile and link your application.

3. Use the pat_build command to create an instrumented version of the application, specifying the functions to be traced through options such as -u and -g mpi.

4. Set any relevant environment variables, such as:

   • setenv PAT_RT_HWPC 0, which specifies the first of the ten predefined sets of hardware counter events.
   • setenv PAT_RT_SUMMARY 0, which specifies a full-trace data file rather than a summary. Such a file can be very large but is needed to view behavior over time with Cray Apprentice2.
   • setenv PAT_BUILD_ASYNC 1, which enables you to instrument a program for a sampling experiment.
   • setenv PAT_RT_EXPFILE_DIR dir, which specifies a directory into which the experiment data files will be written, instead of the current working directory.

   If a single data file is written, its default root name is the name of the instrumented program followed by the plus sign (+), the process ID, and one or more key letters indicating the type of the experiment (such as program1+pat+2511td). If there is a data file from each process, they are written to a subdirectory with that name.
   For a large number of processes, you may need to set PAT_RT_EXPFILE_MAX to 0 or to the number of processes, and to set PAT_RT_EXPFILE_DIR to a directory in a Lustre file system (if the instrumented program is not invoked in such a directory). The default for a multi-PE program is to write a single data file.

5. Execute the instrumented program.

6. Use pat_report on the resulting data file to generate a report. The default report is a sample by function, but alternative views can be specified through options such as:

   • -O calltree
   • -O callers
   • -O load_balance

   The -s pe=... option overrides the way that per-PE data is shown in default tables and in tables specified using the -O option. For details, see the pat_report(1) man page.

These steps are illustrated in the following examples. For more information, see the man pages and run the interactive pat_help utility.

CrayPat supports two types of experiments: tracing and sampling.

Tracing counts an event, such as the number of times an MPI call is executed. In a tracing experiment, selected function entry points are traced, and each traced function that is executed produces a data record in the run time experiment data file. The following categories of function entry points can be traced:

• System calls
• I/O (formatted and buffered or system calls)
• Math (see math.h)
• MPI
• SHMEM
• OpenMP
• Dynamic heap memory
• BLAS
• LAPACK
• Pthreads

Note: Only true function calls can be traced. Function calls that are inlined by the compiler cannot be traced.

Sampling experiments capture values from the call stack or the program counter at specified intervals or when a specified counter overflows. (Sampling experiments are also referred to as asynchronous experiments.) Supported sampling experiments are:

• samp_pc_prof, which provides the total user time and system time consumed by a program and its functions.
• samp_pc_time, which samples the program counter at a given time interval. This returns the total program time and the absolute and relative times each program counter was recorded.
• samp_pc_ovfl, which samples the program counter at a given overflow of a hardware performance counter.
• samp_cs_time, which samples the call stack at a given time interval and returns the total program time and the absolute and relative times each call stack counter was recorded (otherwise identical to the samp_pc_time experiment).
• samp_cs_ovfl, which samples the call stack at a given overflow of a hardware performance counter (otherwise identical to the samp_pc_ovfl experiment).
• samp_ru_time, which samples system resources at a given time interval (otherwise identical to the samp_pc_time experiment).
• samp_ru_ovfl, which samples system resources at a given overflow of a hardware performance counter (otherwise identical to the samp_pc_ovfl experiment).
• samp_heap_time, which samples dynamic heap memory management statistics at a given time interval (otherwise identical to the samp_pc_time experiment).
• samp_heap_ovfl, which samples dynamic heap memory management statistics at a given overflow of a hardware performance counter (otherwise identical to the samp_pc_ovfl experiment).

Note: Hardware counter information can be collected only when tracing or when sampling the program counter. Recommended practice is to use sampling to obtain a profile and then trace the functions of interest to obtain hardware counter information for them.

For more information about using CrayPat, see the Using Cray Performance Analysis Tools guide and the craypat(1) man page and run the pat_help utility.

8.3 Using Cray Apprentice2

Cray Apprentice2 is a performance data visualization tool.
After you have used pat_build to instrument a program for a performance analysis experiment, executed the instrumented program, and used pat_report to convert the resulting data file to a Cray Apprentice2 data format, you can use Cray Apprentice2 to explore the experiment data file and generate a variety of interactive graphical reports.

To run Cray Apprentice2, load the Cray Apprentice2 module, run pat_report, then enter the app2 command to launch Cray Apprentice2:

% module load apprentice2
% pat_report options
% app2 [--limit tag_count | --limit_per_pe tag_count] [data_files]

Use the pat_report -f ap2 option to specify the data file type.

For more information about using Cray Apprentice2, see the Cray Apprentice2 online help system, the Using Cray Performance Analysis Tools guide, and the app2(1) and pat_report(1) man pages.

Example Programs [9]

This chapter gives examples showing how to compile, link, and run applications. Verify that your work area is in a Lustre-mounted directory. Then use the module list command to verify that the correct modules are loaded. Each example lists the modules that must be loaded.

Example 1: Basics of running a Cray X2 application

This example shows how to use the Cray C compiler to compile an MPI program and aprun to launch the executable.
Modules required:

PrgEnv-x2
x2-mpt

Source code of simple.c:

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rank;
    int numprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    printf("hello from pe %d of %d\n", rank, numprocs);
    MPI_Finalize();
    return 0;
}

Compile simple.c:

% cc -o simple simple.c

Run program simple:

% aprun -n 6 ./simple
hello from pe 0 of 6
hello from pe 2 of 6
hello from pe 4 of 6
hello from pe 5 of 6
hello from pe 3 of 6
hello from pe 1 of 6
Application 145631 resources: utime 0, stime 3

Example 2: Using the Cray SHMEM put() function

Modules required:

PrgEnv-x2
x2-mpt

Source code of shmem1.c:

/*
 * simple put test
 */
#include
#include
#include

/* Dimension of source and target of put operations */
#define DIM 1000000

long target[DIM];
long local[DIM];

main(int argc,char **argv)
{
    register int i;
    int my_partner, my_pe;

    /* Prepare resources required for correct functionality
       of SHMEM on XT3. Alternatively, shmem_init() could
       be called. */
    start_pes(0);
    for (i=0; i

[...]

smw> mount /dev/cdrom

3. Make a temporary directory for the installation files.

smw> mkdir -p /tmp/rpms/pe

4. Copy the distribution rpm files and the script x2-installme.60 from the distribution media or download file location to the temporary directory.

smw> scp -r /dev/cdrom/source_dir /tmp/rpms/pe

5. Read and customize the x2-installme.60 script as needed. For example, to install the Programming Environment 6.0 releases as default, uncomment the line export CRAY_INSTALL_DEFAULT=1 in the script. This causes all rpm files to be installed as default. If you choose not to make the new installation the default at this time, you can do so later.

6. As root, log on to the boot node.

smw> ssh root@boot

7. Create a target directory and copy the rpm files and the x2-installme.60 script from your temporary directory to the shared root.
boot001:~# mkdir -p /rr/current/software/rpms
boot001:~# scp -r smw:/tmp/rpms/pe /rr/current/software/rpms

8. Open an xtopview(8) session using the default view.

boot001:~# xtopview

9. Change to the location of the rpm files.

default/:/# cd /software/rpms/pe

10. Execute the x2-installme.60 script to install the new rpm files.

default/:/software/rpms # ./x2-installme.60

After the installation script has finished installing the rpm files, no further post-processing is required. Exit from the xtopview session, log out of the boot node, and then log out of the SMW.

For further information about managing your Cray X2 Programming Environment, including changing the default versions of programming environment components, see the Cray X2 Installation, Configuration, and Management Supplement.

8.3 Installing on Standalone Linux Systems

Under separate license, the Cray X2 Programming Environment can be installed on 64-bit x86-based Linux systems. The advantage in doing this is that the Linux system can be used as a development and compilation system when the Cray X2 system is unavailable. Code compiled and linked on the Linux system can then be moved to the Cray X2 system for execution.

Cray X2 binaries cannot be debugged on a standalone Linux system. They must be debugged on the Cray X2 system.

8.3.1 System Requirements

Before installing the Cray X2 Programming Environment on a Linux system, verify that the following requirements have been met.

• You have root permissions on the Linux system.
• The Linux system must use at least one 64-bit, x86-based processor (AMD Opteron, Intel Pentium 4, or equivalent).

  Note: The Cray X2 Programming Environment requires a 64-bit processor. You cannot produce usable binaries on a 32-bit processor.

• The Linux system must have at least 1 GB of RAM.
• The Linux system must have at least 70 GB of total disk space, and at least 3 GB of free disk space. More is preferable.
• The Linux system must run SUSE Linux 9 or later, or the equivalent.
• Modules 3.1.6 or later must be installed.
• The /opt file system must exist and be mounted in the root of the file system.
• The /tmp file system must have sufficient space to hold the temporary files created during installation.
• Root must have write permissions into /opt.

8.3.2 Installation Procedure

Follow this procedure to install the Cray X2 Programming Environment on a standalone Linux system.

Procedure 10: Installing the PE on Linux systems

1. As root, create a temporary directory for the installation files.

$ mkdir -p /tmp/rpms/pe

2. If necessary, load and mount the installation media.

$ mount /dev/cdrom

3. Copy the installation rpm files and the x2-installme.60 installation script from the download location or installation media to your temporary directory.

$ scp -r /dev/cdrom/source_dir /tmp/rpms/pe

4. Change to your temporary directory.

$ cd /tmp/rpms/pe

5. Read and customize the x2-installme.60 script as needed. For example, to install the Programming Environment 6.0 releases as default, uncomment the line export CRAY_INSTALL_DEFAULT=1 in the script. This causes all rpm files to be installed as default. If you choose not to make the new installation the default at this time, you can do so later.

6. Execute the x2-installme.60 script to install the new rpm files.

$ ./x2-installme.60

After the installation script has finished installing the rpm files, no further post-processing should be required.

8.3.3 Installing Modules

The Cray X2 Programming Environment for Linux distribution does not include a copy of Modules. If you do not already have the Modules software installed on your Linux system, this section describes how to download and install Modules 3.1.6.
Note: Modules 3.1.6 requires that the development packages for TCL are installed. If you are using a Red Hat Enterprise Linux distribution, use up2date tcl-devel. If you are using another Linux distribution, refer to its documentation about downloading development packages for TCL.

To build Modules 3.1.6 for your system:

1. Download modules-3.1.6.tar.gz from this URL:

http://prdownloads.sourceforge.net/modules/modules-3.1.6.tar.gz

2. To build Modules, untar the distribution in an empty directory:

linux% tar zxfv modules-3.1.6.tar.gz

3. Go to the modules-3.1.6 directory (cd modules-3.1.6) and edit the default RKOConfigure file with these changes:

#!/bin/sh
CFLAGS=-g
LDFLAGS=-g
CC=gcc
export CFLAGS LDFLAGS CC
#TCLTKROOT=/usr/local/tcltk/8.0.5
TCLTKROOT=/usr
export TCLTKROOT
./configure \
    --prefix=/opt/modules/@VERSION@ \
    --without-x \
    --with-module-path=/opt/modules/modulefiles \
    --with-version-path=/opt/modules/versions \
    --with-etc-path=/etc \
    --with-skel-path=/etc/skel \
    --with-split-size=960 \
    --x-includes=/usr/include/X11 \
    --x-libraries=/usr/X11R6/lib \
    --with-tcl-include=$TCLTKROOT/include \
    --with-tcl-libraries=$TCLTKROOT/lib
exit

4. Execute RKOConfigure, then execute make:

linux% ./RKOConfigure
linux% make

5. Add this line to the init/.modulespath file in the Modules source:

/opt/ctl/modulefiles

6. Execute make install:

linux% make install

This procedure will install Modules at /opt/modules and define a module use path.
Cray® Fortran Reference Manual
S–3901–60

© 1995, 1997-2007 Cray Inc. All Rights Reserved. This manual or parts thereof may not be reproduced in any form unless permitted by contract or by written permission of Cray Inc. The CF90 compiler includes United States software patents 5,257,696, 5,257,372, and 5,361,354.

U.S.
GOVERNMENT RESTRICTED RIGHTS NOTICE The Computer Software is delivered as "Commercial Computer Software" as defined in DFARS 48 CFR 252.227-7014. All Computer Software and Computer Software Documentation acquired by or for the U.S. Government is provided with Restricted Rights. Use, duplication or disclosure by the U.S. Government is subject to the restrictions described in FAR 48 CFR 52.227-14 or DFARS 48 CFR 252.227-7014, as applicable. Technical Data acquired by or for the U.S. Government, if any, is provided with Limited Rights. Use, duplication or disclosure by the U.S. Government is subject to the restrictions described in FAR 48 CFR 52.227-14 or DFARS 48 CFR 252.227-7013, as applicable. Cray, LibSci, UNICOS and UNICOS/mk are federally registered trademarks and Active Manager, Cray Apprentice2, Cray C++ Compiling System, Cray Fortran Compiler, Cray SeaStar, Cray SeaStar2, Cray SHMEM, Cray Threadstorm, Cray X1, Cray X1E, Cray X2, Cray XD1, Cray XMT, Cray XT, Cray XT3, Cray XT4, CrayDoc, CRInform, Libsci, RapidArray, UNICOS/lc, and UNICOS/mp are trademarks of Cray Inc. AMD and AMD Opteron and Opteron are trademarks of Advanced Micro Devices, Inc. IRIX is a trademark of Silicon Graphics, Inc. MIPSpro is a trademark of MIPS Technologies, Inc. SPARC is a trademark of SPARC International, Inc. Proper use is allowed under licensing agreement. Products bearing SPARC trademarks are based on an architecture developed by Sun Microsystems, Inc. UNIX, the “X device,” X Window System, and X/Open are trademarks of The Open Group in the United States and other countries. All other trademarks are the property of their respective owners. The UNICOS, UNICOS/mk, and UNICOS/mp operating systems are derived from UNIX System V. 
These operating systems are also based in part on the Fourth Berkeley Software Distribution (BSD) under license from The Regents of the University of California.

New Features

Cray® Fortran Reference Manual    S–3901–60

This document is a consolidation of the Cray Fortran Reference Manual; Fortran Language Reference Manual, Volume 1; Fortran Language Reference Manual, Volume 2; Fortran Language Reference Manual, Volume 3; and Fortran Application Programmer's I/O Reference Manual. It documents the Cray Fortran compiler command options and directives and describes how the Cray Fortran compiler differs from the Fortran 2003 standard.

The organization of Chapter 10, page 179, Cray Fortran Language Extensions, parallels the organization of the official manual of the Fortran 2003 Standard, ISO/IEC 1539-1:2004.

The Cray Fortran compiler includes the following new features:

• Abstract type.
• Support for the Cray X2 series system.
• Finalization for non-polymorphic objects. See Section 12.2, page 257.

The Cray Fortran compiler supports the following proposed Fortran 2008 features:

• Submodules.
• Separate module procedures.
• CONTAINS followed by an END statement with no internal or module procedure. See Section 10.10.2.2, page 204.

The following new directive has been documented in this release:

• !PGO$ loop_info, a special form of the !DIR$ loop_info directive. See Section 3.19.28, page 60.

Record of Revision

Version    Description
5.6        March 2007. Supports the Cray Fortran compiler 5.6 release running on Cray X1 series systems.
6.0        September 2007. Supports the Cray Fortran compiler 6.0 release running on Cray X1 series and Cray X2 systems.

Contents

Preface
  Accessing Product Documentation
  Conventions
  Reader Comments
  Cray User Group
Introduction [1]
  X1-specific and X2-specific Content in this Document
  The Cray Fortran Programming Environment
    Cross-compiler Platforms
  Cray Fortran Compiler Messages
  Document-specific Conventions
  Fortran Standard Compatibility
    Fortran 95 Compatibility
    Fortran 90 Compatibility
    FORTRAN 77 Compatibility
  Related Cray Publications
  Related Fortran Publications

Part I: Cray Fortran Commands and Directives

The Trigger Environment (X1 Only) [2]
  Preparing the Trigger Environment
  Working in the Programming Environment

Invoking the Cray Fortran Compiler [3]
  -A module_name [, module_name] ...
  -b bin_obj_file
  -c
  -C cifopts
  -d disable and -e enable
  -D identifier [=value]
  -f source_form
  -F
  -g
  -G debug_lvl
  -h arg
    -h command
    -h cpu=target_system
    -h gen_private_callee (X1 only)
    -h ieee_nonstop
    -h keepfiles
    -h mpmd, -h nompmd
    -h msp (X1 only)
    -h ssp (X1 only)
  -I incldir
  -J dir_name
  -l libname
  -L ldir
  -m msg_lvl
  -M msgs
  -N col
  -O opt [,opt] ...
    -O n
    -O aggress, -O noaggress
    -O cachen
    -O command
    -O fpn
    -O fusionn
    -O gcpn
    -O gen_private_callee (X1 only)
    -O infinitevl, -O noinfinitevl
    -O ipan and -O ipafrom=source[:source] ...
      Automatic Inlining
      Explicit Inlining
      Combined Inlining
    -O inlinelib
    -O modinline, -O nomodinline
    -O msgs, -O nomsgs
    -O msp (X1 only)
    -O negmsgs, -O nonegmsgs
    -O nointerchange
    -O overindex, -O nooverindex
    -O pattern, -O nopattern
    -O scalarn
    -O shortcircuitn
    -O ssp (X1 only)
    -O streamn (X1 only)
    -O task0, -O task1
    -O unrolln
    -O vectorn
    -O zeroinc, -O nozeroinc
    -O -h profile_generate
    -O -h profile_data=pgo_opt
  -o out_file
  -p module_site
  -Q path
  -r list_opt
  -R runchk
  -s size
    Different Default Data Size Options on the Command Line
    Pointer Scaling Factor
  -S asm_file
  -T
  -U identifier [,identifier] ...
  -v
  -V
  -Wa"assembler_opt"
  -Wl"loader_opt"
  -Wr"lister_opt"
  -x dirlist
  -X npes
  -Yphase,dirname
  -Z
  --
  sourcefile[sourcefile.suffix ...]

Environment Variables [4]
  Compiler and Library Environment Variables
    CRAY_FTN_OPTIONS Environment Variable
    CRAY_PE_TARGET Environment Variable
    FORMAT_TYPE_CHECKING Environment Variable
    FORTRAN_MODULE_PATH Environment Variable
    LISTIO_PRECISION Environment Variable
    NLSPATH Environment Variable
    NPROC Environment Variable
    TMPDIR Environment Variable
    ZERO_WIDTH_PRECISION Environment Variable
  OpenMP Environment Variable
  Run Time Environment Variables

Cray Fortran Directives [5]
  Using Directives
    Directive Lines
    Range and Placement of Directives
    Interaction of Directives with the -x Command Line Option
    Command Line Options and Directives
  Vectorization Directives
    Use Cache-exclusive Instructions for Vector Loads: CACHE_EXCLUSIVE
    Use Cache-shared Instructions for Vector Loads: CACHE_SHARED
    Avoid Placing Object into Cache: NO_CACHE_ALLOC
    Copy Arrays to Temporary Storage: COPY_ASSUMED_SHAPE
    Limit Optimizations: HAND_TUNED
    Ignore Vector Dependencies: IVDEP
    Specify Scalar Processing: NEXTSCALAR
    Request Pattern Matching: PATTERN and NOPATTERN
    Declare an Array with No Repeated Values: PERMUTATION
    Designate Loop Nest for Vectorization: PREFERVECTOR
    Conditional Density: PROBABILITY
    Allow Speculative Execution of Memory References Within Loops: SAFE_ADDRESS
    Allow Speculative Execution of Memory References and Arithmetic Operations: SAFE_CONDITIONAL
    Designate Loops with Low Trip Counts: SHORTLOOP, SHORTLOOP128
    Provide More Information for Loops: LOOP_INFO
    Unroll Loops: UNROLL and NOUNROLL
      Example 1: Unrolling outer loops
      Example 2: Illegal unrolling of outer loops
      Example 3: Unrolling nearest neighbor pattern
    Enable and Disable Vectorization: VECTOR and NOVECTOR
    Enable or Disable, Temporarily, Soft Vector-pipelining: PIPELINE and NOPIPELINE
    Specify a Vectorizable Function: VFUNCTION
  Multistreaming Processor (MSP) Directives (X1 only)
    Specify Loop to be Optimized for MSP: PREFERSTREAM
    Optimize Loops Containing Procedural Calls: SSP_PRIVATE
    Enable MSP Optimization: STREAM and NOSTREAM
  Inlining Directives
    Disable or Enable Cloning for a Block of Code: CLONE and NOCLONE
    Disable or Enable Inlining for a Block of Code: INLINE, NOINLINE, and RESETINLINE
    Specify Inlining for a Procedure: INLINEALWAYS and INLINENEVER
    Create Inlinable Templates for Module Procedures: MODINLINE and NOMODINLINE
  Scalar Optimization Directives
    Control Loop Interchange: INTERCHANGE and NOINTERCHANGE
    Control Loop Collapse: COLLAPSE and NOCOLLAPSE
    Determine Register Storage: NOSIDEEFFECTS
    Suppress Scalar Optimization: SUPPRESS
  Local Use of Compiler Features
    Check Array Bounds: BOUNDS and NOBOUNDS
    Specify Source Form: FREE and FIXED
  Storage Directives
    Permit Cache Blocking: BLOCKABLE Directive
    Declare Cache Blocking: BLOCKINGSIZE and NOBLOCKING Directives
    Request Stack Storage: STACK
  Miscellaneous Directives
    Specify Array Dependencies: CONCURRENT
    Fuse Loops: FUSION and NOFUSION
    Create Identification String: ID
    Disregard Dummy Argument Type, Kind, and Rank: IGNORE_TKR
    External Name Mapping: NAME
    Preprocess Include File: PREPROCESS
    Specify Weak Procedure Reference: WEAK
141 viii S–3901–60Contents Page Cray Streaming Directives (CSDs) (X1 only) [6] 143 CSD Parallel Regions . . . . . . . . . . . . . . . . . . . . . . . 144 Start and End Multistreaming: PARALLEL and END PARALLEL . . . . . . . . . . 144 Do Loops: DO and END DO . . . . . . . . . . . . . . . . . . . . . 146 Parallel Do Loops: PARALLEL DO and END PARALLEL DO . . . . . . . . . . . . 149 Synchronize SSPs: SYNC . . . . . . . . . . . . . . . . . . . . . . 150 Specify Critical Regions: CRITICAL and END CRITICAL . . . . . . . . . . . . 150 Define Order of SSP Execution: ORDERED and END ORDERED . . . . . . . . . . . 151 Suppress CSDs: [NO]CSD . . . . . . . . . . . . . . . . . . . . . . 152 Nested CSDs within Cray Parallel Programming Models . . . . . . . . . . . . 153 CSD Placement . . . . . . . . . . . . . . . . . . . . . . . . . 153 Protection of Shared Data . . . . . . . . . . . . . . . . . . . . . . 154 Dynamic Memory Allocation for CSD Parallel Regions . . . . . . . . . . . . . 155 Compiler Options Affecting CSDs . . . . . . . . . . . . . . . . . . . 155 Source Preprocessing [7] 157 General Rules . . . . . . . . . . . . . . . . . . . . . . . . . 157 Directives . . . . . . . . . . . . . . . . . . . . . . . . . . . 158 #include Directive . . . . . . . . . . . . . . . . . . . . . . . 158 #define Directive . . . . . . . . . . . . . . . . . . . . . . . 159 #undef Directive . . . . . . . . . . . . . . . . . . . . . . . 161 # (Null) Directive . . . . . . . . . . . . . . . . . . . . . . . 161 Conditional Directives . . . . . . . . . . . . . . . . . . . . . . 161 #if Directive . . . . . . . . . . . . . . . . . . . . . . . . 162 #ifdef Directive . . . . . . . . . . . . . . . . . . . . . . . 163 #ifndef Directive . . . . . . . . . . . . . . . . . . . . . . 163 #elif Directive . . . . . . . . . . . . . . . . . . . . . . . 163 #else Directive . . . . . . . . . . . . . . . . . . . . . . . 163 #endif Directive . . . . . . . . . . . . . . . . . . . . . . . 164 Predefined Macros . . . . . 
. . . . . . . . . . . . . . . . . . . 164 Command Line Options . . . . . . . . . . . . . . . . . . . . . . 166 S–3901–60 ixCray® Fortran Reference Manual Page OpenMP Fortran API [8] 167 Cray Implementation Differences . . . . . . . . . . . . . . . . . . . . 167 OMP_THREAD_STACK_SIZE Environment Variable . . . . . . . . . . . . . . 169 OpenMP Optimizations . . . . . . . . . . . . . . . . . . . . . . 170 Compiler Options that Affect OpenMP . . . . . . . . . . . . . . . . . . 172 OpenMP Program Execution . . . . . . . . . . . . . . . . . . . . . 172 Cray Fortran Defined Externals [9] 173 Conformance Checks . . . . . . . . . . . . . . . . . . . . . . . 173 Part II: Cray Fortran and Fortran 2003 Differences Cray Fortran Language Extensions [10] 179 Characters, Lexical Tokens, and Source Form . . . . . . . . . . . . . . . . 179 Low-level Syntax . . . . . . . . . . . . . . . . . . . . . . . . 179 Characters Allowed in Names . . . . . . . . . . . . . . . . . . . 179 Switching Source Forms . . . . . . . . . . . . . . . . . . . . . 180 Continuation Line Limit . . . . . . . . . . . . . . . . . . . . . 180 D Lines in Fixed Source Form . . . . . . . . . . . . . . . . . . . 180 Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180 The Concept of Type . . . . . . . . . . . . . . . . . . . . . . . 180 Alternate Form of LOGICAL Constants . . . . . . . . . . . . . . . . 181 Cray Pointer Type . . . . . . . . . . . . . . . . . . . . . . 181 Cray Character Pointer Type . . . . . . . . . . . . . . . . . . . 186 Boolean Type . . . . . . . . . . . . . . . . . . . . . . . . 187 Alternate Form of ENUM Statement . . . . . . . . . . . . . . . . . . 187 TYPEALIAS Statement . . . . . . . . . . . . . . . . . . . . . 187 Data Object Declarations and Specifications . . . . . . . . . . . . . . . . 188 Attribute Specification Statements . . . . . . . . . . . . . . . . . . 188 BOZ Constants in DATA Statements . . . . . . . . . . . . . . . . . 188 Attribute Respecification . . . . . . . . 
. . . . . . . . . . . . 189 x S–3901–60Contents Page AUTOMATIC Attribute and Statement . . . . . . . . . . . . . . . . . 189 IMPLICIT Statement . . . . . . . . . . . . . . . . . . . . . . . 191 IMPLICIT Extensions . . . . . . . . . . . . . . . . . . . . . 191 Storage Association of Data Objects . . . . . . . . . . . . . . . . . . 191 EQUIVALENCE Statement Extensions . . . . . . . . . . . . . . . . . 191 COMMON Statement Extensions . . . . . . . . . . . . . . . . . . . 191 Expressions and Assignment . . . . . . . . . . . . . . . . . . . . . 191 Expressions . . . . . . . . . . . . . . . . . . . . . . . . . 191 Rules for Forming Expressions . . . . . . . . . . . . . . . . . . . 192 Intrinsic and Defined Operations . . . . . . . . . . . . . . . . . . 192 Intrinsic Operations . . . . . . . . . . . . . . . . . . . . . . 193 Bitwise Logical Expressions . . . . . . . . . . . . . . . . . . . . 194 Assignment . . . . . . . . . . . . . . . . . . . . . . . . . 196 Assignment . . . . . . . . . . . . . . . . . . . . . . . . 196 Execution Control . . . . . . . . . . . . . . . . . . . . . . . . 196 STOP Code Extension . . . . . . . . . . . . . . . . . . . . . . 196 Input/Output Statements . . . . . . . . . . . . . . . . . . . . . . 197 File Connection . . . . . . . . . . . . . . . . . . . . . . . . 197 OPEN Statement . . . . . . . . . . . . . . . . . . . . . . . 197 Error, End-of-record, and End-of-file Conditions . . . . . . . . . . . . . . . 198 End-of-file Condition and the END-specifier . . . . . . . . . . . . . . . 198 Multiple End-of-file Records . . . . . . . . . . . . . . . . . . . 198 Input/Output Editing . . . . . . . . . . . . . . . . . . . . . . . 198 Data Edit Descriptors . . . . . . . . . . . . . . . . . . . . . . 198 Integer Editing . . . . . . . . . . . . . . . . . . . . . . . 198 Real Editing . . . . . . . . . . . . . . . . . . . . . . . . 198 Logical Editing . . . . . . . . . . . . . . . . . . . . . . . 199 Character Editing . . . . . . . . . . . . . . . . . . . 
. . . . 199 Control Edit Descriptors . . . . . . . . . . . . . . . . . . . . . 199 Q Editing . . . . . . . . . . . . . . . . . . . . . . . . . 199 S–3901–60 xiCray® Fortran Reference Manual Page List-directed Formatting . . . . . . . . . . . . . . . . . . . . . 200 List-directed Input . . . . . . . . . . . . . . . . . . . . . . 200 Namelist Formatting . . . . . . . . . . . . . . . . . . . . . . 201 Namelist Extensions . . . . . . . . . . . . . . . . . . . . . . 201 I/O Editing . . . . . . . . . . . . . . . . . . . . . . . . . 201 Program Units . . . . . . . . . . . . . . . . . . . . . . . . . 204 Main Program . . . . . . . . . . . . . . . . . . . . . . . . 204 Program Statement Extension . . . . . . . . . . . . . . . . . . . 204 Block Data Program Units . . . . . . . . . . . . . . . . . . . . . 204 Block Data Program Unit Extension . . . . . . . . . . . . . . . . . 204 Procedures . . . . . . . . . . . . . . . . . . . . . . . . . . 204 Procedure Interface . . . . . . . . . . . . . . . . . . . . . . . 204 Interface Duplication . . . . . . . . . . . . . . . . . . . . . . 204 Procedure Definition . . . . . . . . . . . . . . . . . . . . . . 204 Recursive Function Extension . . . . . . . . . . . . . . . . . . . 204 Empty CONTAINS Sections . . . . . . . . . . . . . . . . . . . . 204 Intrinsic Procedures and Modules . . . . . . . . . . . . . . . . . . . 205 Standard Generic Intrinsic Procedures . . . . . . . . . . . . . . . . . 205 Intrinsic Procedures . . . . . . . . . . . . . . . . . . . . . . 205 Exceptions and IEEE Arithmetic . . . . . . . . . . . . . . . . . . . . 208 The Exceptions . . . . . . . . . . . . . . . . . . . . . . . . 208 IEEE Intrinsic Module Extensions . . . . . . . . . . . . . . . . . . 208 Interoperability With C . . . . . . . . . . . . . . . . . . . . . . . 210 Interoperability Between Fortran and C Entities . . . . . . . . . . . . . . 210 BIND(C) Syntax . . . . . . . . . . . . . . . . . . . . . . . 210 Co-arrays . . . . . . . . . . . . . . . . . . . 
. . . . . . . . 210 Execution Model and Images . . . . . . . . . . . . . . . . . . . . 212 Specifying Co-arrays . . . . . . . . . . . . . . . . . . . . . . 212 Referencing Co-arrays . . . . . . . . . . . . . . . . . . . . . . 214 Initializing Co-arrays . . . . . . . . . . . . . . . . . . . . . . 216 xii S–3901–60Contents Page Using Co-arrays with Procedure Calls . . . . . . . . . . . . . . . . . 216 Specifying Co-arrays in COMMON and EQUIVALENCE Statements . . . . . . . . . 217 Allocatable Co-arrays . . . . . . . . . . . . . . . . . . . . . . 218 Pointer Components in Derived Type Co-arrays . . . . . . . . . . . . . . 218 Allocatable Components in Derived Type Co-arrays . . . . . . . . . . . . . 219 Intrinsic Procedures . . . . . . . . . . . . . . . . . . . . . . . 219 Program Synchronization . . . . . . . . . . . . . . . . . . . . . 220 SYNC_ALL . . . . . . . . . . . . . . . . . . . . . . . . . 220 SYNC_TEAM . . . . . . . . . . . . . . . . . . . . . . . . 221 SYNC_MEMORY . . . . . . . . . . . . . . . . . . . . . . . 222 START_CRITICAL and END_CRITICAL . . . . . . . . . . . . . . . . 222 Example 4: Using START CRITICAL and END CRITICAL . . . . . . . . . 223 SYNC_FILE . . . . . . . . . . . . . . . . . . . . . . . . 224 I/O with Co-arrays . . . . . . . . . . . . . . . . . . . . . . . 224 Compiling and Executing Programs Containing Co-arrays . . . . . . . . . . . . 225 ftn and aprun Options Affecting Co-arrays . . . . . . . . . . . . . . . 225 Using the CrayTools Tool Set with Co-array Programs . . . . . . . . . . . . 226 Debugging Programs Containing Co-arrays (Deferred implementation) . . . . . . 226 Analyzing Co-array Program Performance . . . . . . . . . . . . . . . 226 Interoperating with Other Message Passing and Data Passing Models . . . . . . . . 226 Optimizing Programs with Co-arrays . . . . . . . . . . . . . . . . . 227 Obsolete Features [11] 229 IMPLICIT UNDEFINED . . . . . . . . . . . . . . . . . . . . . . 230 Type statement with *n . . . . . . . . . . . . 
. . . . . . . . . . 230 BYTE Data Type . . . . . . . . . . . . . . . . . . . . . . . . . 230 DOUBLE COMPLEX Statement . . . . . . . . . . . . . . . . . . . . . 231 STATIC Attribute and Statement . . . . . . . . . . . . . . . . . . . . 231 Slash Data Initialization . . . . . . . . . . . . . . . . . . . . . . 233 DATA Statement Features . . . . . . . . . . . . . . . . . . . . . . 234 Hollerith Data . . . . . . . . . . . . . . . . . . . . . . . . . 234 S–3901–60 xiiiCray® Fortran Reference Manual Page Hollerith Constants . . . . . . . . . . . . . . . . . . . . . . . 235 Hollerith Values . . . . . . . . . . . . . . . . . . . . . . . . 236 Hollerith Relational Expressions . . . . . . . . . . . . . . . . . . . 237 PAUSE Statement . . . . . . . . . . . . . . . . . . . . . . . . 237 ASSIGN, Assigned GO TO Statements, and Assigned Format Specifiers . . . . . . . . 238 Form of the ASSIGN and Assigned GO TO Statements . . . . . . . . . . . . . 238 Assigned Format Specifiers . . . . . . . . . . . . . . . . . . . . . 240 Two-branch IF Statements . . . . . . . . . . . . . . . . . . . . . . 240 Two-branch Arithmetic IF . . . . . . . . . . . . . . . . . . . . . 240 Indirect Logical IF . . . . . . . . . . . . . . . . . . . . . . . 241 Real and Double Precision DO Variables . . . . . . . . . . . . . . . . . . 241 Nested Loop Termination . . . . . . . . . . . . . . . . . . . . . . 241 Branching into a Block . . . . . . . . . . . . . . . . . . . . . . . 241 ENCODE and DECODE Statements . . . . . . . . . . . . . . . . . . . . 242 ENCODE Statement . . . . . . . . . . . . . . . . . . . . . . . 242 DECODE Statement . . . . . . . . . . . . . . . . . . . . . . . 243 BUFFER IN and BUFFER OUT Statements . . . . . . . . . . . . . . . . . 244 Asterisk Delimiters . . . . . . . . . . . . . . . . . . . . . . . . 247 Negative-valued X Descriptor . . . . . . . . . . . . . . . . . . . . . 248 A and R Descriptors for Noncharacter Types . . . . . . . . . . . . . . . . 248 H Edit Descriptor . . . . . . 
. . . . . . . . . . . . . . . . . . 249 Obsolete Intrinsic Procedures . . . . . . . . . . . . . . . . . . . . . 250 Cray Fortran Deferred Implementation and Optional Features [12] 257 ISO_10646 Character Set . . . . . . . . . . . . . . . . . . . . . . 257 Finalizers . . . . . . . . . . . . . . . . . . . . . . . . . . . 257 Restrictions on Unlimited Polymorphic Variables . . . . . . . . . . . . . . . 257 Enhanced Expressions in Initializations and Specifications . . . . . . . . . . . . 257 User-defined, Derived Type I/O . . . . . . . . . . . . . . . . . . . . 258 ENCODING= in I/O Statements . . . . . . . . . . . . . . . . . . . . 258 Allocatable Assignment (Optionally Enabled) . . . . . . . . . . . . . . . . 258 xiv S–3901–60Contents Page Cray Fortran Implementation Specifics [13] 259 Companion Processor . . . . . . . . . . . . . . . . . . . . . . . 259 INCLUDE Line . . . . . . . . . . . . . . . . . . . . . . . . . 259 INTEGER Kinds and Values . . . . . . . . . . . . . . . . . . . . . 259 REAL Kinds and Values . . . . . . . . . . . . . . . . . . . . . . 260 DOUBLE PRECISION Kinds and Values . . . . . . . . . . . . . . . . . . 260 LOGICAL Kinds and Values . . . . . . . . . . . . . . . . . . . . . 260 CHARACTER Kinds and Values . . . . . . . . . . . . . . . . . . . . 260 Cray Pointers . . . . . . . . . . . . . . . . . . . . . . . . . 260 ENUM Kind . . . . . . . . . . . . . . . . . . . . . . . . . . 261 Storage Issues . . . . . . . . . . . . . . . . . . . . . . . . . 261 Storage Units and Sequences . . . . . . . . . . . . . . . . . . . . 261 Static and Stack Storage . . . . . . . . . . . . . . . . . . . . . . 262 Dynamic Memory Allocation . . . . . . . . . . . . . . . . . . . . 263 Finalization . . . . . . . . . . . . . . . . . . . . . . . . . . 263 ALLOCATE Error Status . . . . . . . . . . . . . . . . . . . . . . . 264 DEALLOCATE Error Status . . . . . . . . . . . . . . . . . . . . . . 264 ALLOCATABLE Module Variable Status . . . . . . . . . . . . . . . . . . 
264 Kind of a Logical Expression . . . . . . . . . . . . . . . . . . . . . 264 STOP Code Availability . . . . . . . . . . . . . . . . . . . . . . . 264 Stream File Record Structure and Position . . . . . . . . . . . . . . . . . 264 File Unit Numbers . . . . . . . . . . . . . . . . . . . . . . . . 265 OPEN Specifiers . . . . . . . . . . . . . . . . . . . . . . . . . 265 FLUSH Statement . . . . . . . . . . . . . . . . . . . . . . . . 266 Asynchronous I/O . . . . . . . . . . . . . . . . . . . . . . . . 266 REAL I/O of an IEEE NaN . . . . . . . . . . . . . . . . . . . . . 266 Input of an IEEE NaN . . . . . . . . . . . . . . . . . . . . . . 266 Output of an IEEE NaN . . . . . . . . . . . . . . . . . . . . . . 267 List-directed and NAMELIST Output Default Formats . . . . . . . . . . . . . 267 Random Number Generator . . . . . . . . . . . . . . . . . . . . . 268 S–3901–60 xvCray® Fortran Reference Manual Page Timing Intrinsics . . . . . . . . . . . . . . . . . . . . . . . . 268 IEEE Intrinsic Modules . . . . . . . . . . . . . . . . . . . . . . . 268 Part III: Cray Fortran Application Programmer's I/O Reference Using the Assign Environment [14] 271 assign Basics . . . . . . . . . . . . . . . . . . . . . . . . . 272 Assign Objects and Open Processing . . . . . . . . . . . . . . . . . . 272 The assign Command . . . . . . . . . . . . . . . . . . . . . . 273 Assign Library Routines . . . . . . . . . . . . . . . . . . . . . 276 assign and Fortran I/O . . . . . . . . . . . . . . . . . . . . . . 277 Alternative File Names . . . . . . . . . . . . . . . . . . . . . . 278 File Structure Selection . . . . . . . . . . . . . . . . . . . . . . 279 Unblocked File Structure . . . . . . . . . . . . . . . . . . . . 281 assign -s sbin File Processing (not recommended) . . . . . . . . . . . 282 assign -s bin File Processing . . . . . . . . . . . . . . . . . . 283 assign -s u File Processing . . . . . . . . . . . . . . . . . . . 283 text File Structure . . . . . . . . . . . . . . . . . . . . . 
. 283 cos or blocked File Structure . . . . . . . . . . . . . . . . . . . 284 Buffer Specifications . . . . . . . . . . . . . . . . . . . . . . . 286 Default Buffer Sizes . . . . . . . . . . . . . . . . . . . . . . 287 Library Buffering . . . . . . . . . . . . . . . . . . . . . . . 288 System Cache . . . . . . . . . . . . . . . . . . . . . . . . 289 Unbuffered I/O . . . . . . . . . . . . . . . . . . . . . . . 289 Foreign File Format Specification . . . . . . . . . . . . . . . . . . . 290 Memory Resident Files . . . . . . . . . . . . . . . . . . . . . . 290 Fortran File Truncation . . . . . . . . . . . . . . . . . . . . . . 290 The Assign Environment File . . . . . . . . . . . . . . . . . . . . . 292 Local Assign Mode . . . . . . . . . . . . . . . . . . . . . . . . 292 Example 5: Local assign mode . . . . . . . . . . . . . . . . . . . 292 xvi S–3901–60Contents Page Using FFIO [15] 295 Introduction to FFIO . . . . . . . . . . . . . . . . . . . . . . . 295 Using Layered I/O . . . . . . . . . . . . . . . . . . . . . . . . 298 I/O Layers . . . . . . . . . . . . . . . . . . . . . . . . . 299 Layered I/O Options . . . . . . . . . . . . . . . . . . . . . . 300 FFIO and Common Formats . . . . . . . . . . . . . . . . . . . . . 301 Reading and Writing Text Files . . . . . . . . . . . . . . . . . . . 301 Reading and Writing Unblocked Files . . . . . . . . . . . . . . . . . 302 Reading and Writing Fixed-length Records . . . . . . . . . . . . . . . . 303 Reading and Writing Blocked Files . . . . . . . . . . . . . . . . . . 303 Enhancing Performance . . . . . . . . . . . . . . . . . . . . . . 303 Buffer Size Considerations . . . . . . . . . . . . . . . . . . . . . 303 Removing Blocking . . . . . . . . . . . . . . . . . . . . . . . 304 The syscall Layer . . . . . . . . . . . . . . . . . . . . . . 304 The bufa and cachea Layers . . . . . . . . . . . . . . . . . . . 304 The mr Layer . . . . . . . . . . . . . . . . . . . . . . . . 305 The global Layer (Deferred Implementation) . . . 
. . . . . . . . . . . 305 The cache Layer . . . . . . . . . . . . . . . . . . . . . . . 305 Sample Programs . . . . . . . . . . . . . . . . . . . . . . . . 307 Example 6: Unformatted direct mr with unblocked file . . . . . . . . . . . . 307 Example 7: Unformatted sequential mr with blocked file . . . . . . . . . . . 308 FFIO Layer Reference [16] 311 Characteristics of Layers . . . . . . . . . . . . . . . . . . . . . . 312 The bufa Layer . . . . . . . . . . . . . . . . . . . . . . . . . 313 The cache Layer . . . . . . . . . . . . . . . . . . . . . . . . 315 The cachea Layer . . . . . . . . . . . . . . . . . . . . . . . . 316 The cos Blocked Layer . . . . . . . . . . . . . . . . . . . . . . . 318 The event Layer . . . . . . . . . . . . . . . . . . . . . . . . 319 The f77 Layer . . . . . . . . . . . . . . . . . . . . . . . . . 321 The fd Layer . . . . . . . . . . . . . . . . . . . . . . . . . . 323 S–3901–60 xviiCray® Fortran Reference Manual Page The global Layer (Deferred Implementation) . . . . . . . . . . . . . . . . 323 The ibm Layer . . . . . . . . . . . . . . . . . . . . . . . . . 324 The mr Layer . . . . . . . . . . . . . . . . . . . . . . . . . . 327 The null Layer . . . . . . . . . . . . . . . . . . . . . . . . . 330 The syscall Layer . . . . . . . . . . . . . . . . . . . . . . . . 331 The system Layer . . . . . . . . . . . . . . . . . . . . . . . . 332 The text Layer . . . . . . . . . . . . . . . . . . . . . . . . . 332 The user and site Layers . . . . . . . . . . . . . . . . . . . . . 334 The vms Layer . . . . . . . . . . . . . . . . . . . . . . . . . 334 Creating a user Layer [17] 337 Internal Functions . . . . . . . . . . . . . . . . . . . . . . . . 337 The Operations Structure . . . . . . . . . . . . . . . . . . . . . 338 FFIO and the stat Structure . . . . . . . . . . . . . . . . . . . . 340 user Layer Example . . . . . . . . . . . . . . . . . . . . . . . 341 Numeric File Conversion Routines [18] 363 Conversion Overview . . . . . . . . . . . . . . . . 
. . . . . . . 363 Transferring Data . . . . . . . . . . . . . . . . . . . . . . . . 364 Using fdcp to Transfer Files . . . . . . . . . . . . . . . . . . . . 364 Using ftp to Move Data between Systems . . . . . . . . . . . . . . . . 364 Data Item Conversion . . . . . . . . . . . . . . . . . . . . . . . 364 Explicit Data Item Conversion . . . . . . . . . . . . . . . . . . . . 365 Implicit Data Item Conversion . . . . . . . . . . . . . . . . . . . . 365 Choosing a Conversion Method . . . . . . . . . . . . . . . . . . . 369 Explicit Conversion . . . . . . . . . . . . . . . . . . . . . . 369 Implicit Conversion . . . . . . . . . . . . . . . . . . . . . . 369 Disabling Conversion Types . . . . . . . . . . . . . . . . . . . . 369 Foreign Conversion Techniques . . . . . . . . . . . . . . . . . . . . 370 UNICOS/mp and UNICOS/lc Conversions . . . . . . . . . . . . . . . . 370 IBM Overview . . . . . . . . . . . . . . . . . . . . . . . . 371 xviii S–3901–60Contents Page IEEE Conversion . . . . . . . . . . . . . . . . . . . . . . . . 372 VAX/VMS Conversion . . . . . . . . . . . . . . . . . . . . . . 374 Named Pipe Support [19] 377 Piped I/O Example without End-of-file Detection . . . . . . . . . . . . . . . 378 Example 8: No EOF Detection: program writerd . . . . . . . . . . . . . 379 Example 9: No EOF Detection: program readwt . . . . . . . . . . . . . 379 Detecting End-of-file on a Named Pipe . . . . . . . . . . . . . . . . . . 380 Piped I/O Example with End-of-file Detection . . . . . . . . . . . . . . . . 380 Example 10: EOF Detection: program writerd . . . . . . . . . . . . . . 381 Example 11: EOF Detection: program readwt . . . . . . . . . . . . . . 381 Glossary 383 Index 399 Figures Figure 1. ftn Command Example . . . . . . . . . . . . . . . . . . . 4 Figure 2. Optimization Values . . . . . . . . . . . . . . . . . . . . 36 Figure 3. Memory Use . . . . . . . . . . . . . . . . . . . . . . 263 Figure 4. Access Methods and Default Buffer Sizes . . . . . . . . . . . . . . 
291 Figure 5. Typical Data Flow . . . . . . . . . . . . . . . . . . . . . 295 Tables Table 1. Compiling Options . . . . . . . . . . . . . . . . . . . . . 18 Table 2. Floating-point Optimization Levels . . . . . . . . . . . . . . . . 42 Table 3. Automatic Inlining Specifications . . . . . . . . . . . . . . . . 47 Table 4. File Types . . . . . . . . . . . . . . . . . . . . . . . 48 Table 5. Scaling Factor in Pointer Arithmetic . . . . . . . . . . . . . . . . 75 Table 6. -Yphase Definitions . . . . . . . . . . . . . . . . . . . . 79 Table 7. Directives . . . . . . . . . . . . . . . . . . . . . . . 87 Table 8. Explanation of Ignored TKRs . . . . . . . . . . . . . . . . . . 140 S–3901–60 xixCray® Fortran Reference Manual Page Table 9. Compiler-calculated Chunk Size . . . . . . . . . . . . . . . . . 147 Table 10. Operand Types and Results for Intrinsic Operations . . . . . . . . . . 193 Table 11. Cray Fortran Intrinsic Bitwise Operators and the Allowed Types of their Operands . 194 Table 12. Data Types in Bitwise Logical Operations . . . . . . . . . . . . . . 195 Table 13. Values for Keyword Specifier Variables in an OPEN Statement . . . . . . . . 197 Table 14. Default Fractional and Exponent Digits . . . . . . . . . . . . . . 199 Table 15. Summary of Control Edit Descriptors . . . . . . . . . . . . . . . 202 Table 16. Summary of Data Edit Descriptors . . . . . . . . . . . . . . . . 202 Table 17. Default Compatibility Between I/O List Data Types and Data Edit Descriptors . . 202 Table 18. RELAXED Compatibility Between Data Types and Data Edit Descriptors . . . . 203 Table 19. STRICT77 Compatibility Between Data Types and Data Edit Descriptors . . . . 203 Table 20. STRICT90 and STRICT95 Compatibility Between Data Types and Data Edit Descriptors . . . . . . . . . . . . . . . . . . . . . . . . . . 203 Table 21. Cray Fortran IEEE Intrinsic Module Extensions . . . . . . . . . . . . 209 Table 22. Obsolete Features and Preferred Alternatives . . . . . . . . . . . . 
229 Table 23. Summary of String Edit Descriptors . . . . . . . . . . . . . . . 250 Table 24. Obsolete Procedures and Alternatives . . . . . . . . . . . . . . . 250 Table 25. Fortran access methods and options . . . . . . . . . . . . . . . 281 Table 26. Default Buffer Sizes for Fortran I/O Library Routines . . . . . . . . . . 288 Table 27. FFIO Layers . . . . . . . . . . . . . . . . . . . . . . 299 Table 28. Data Manipulation: bufa Layer . . . . . . . . . . . . . . . . 314 Table 29. Supported Operations: bufa Layer . . . . . . . . . . . . . . . 314 Table 30. Data Manipulation: cache Layer . . . . . . . . . . . . . . . . 315 Table 31. Supported Operations: cache Layer . . . . . . . . . . . . . . . 316 Table 32. Data Manipulation: cachea Layer . . . . . . . . . . . . . . . . 317 Table 33. Supported Operations: cachea Layer . . . . . . . . . . . . . . . 317 Table 34. Data Manipulation: cos Layer . . . . . . . . . . . . . . . . . 318 Table 35. Supported Operations: cos Layer . . . . . . . . . . . . . . . . 319 Table 36. Data Manipulation: f77 Layer . . . . . . . . . . . . . . . . . 322 Table 37. Supported Operations: f77 Layer . . . . . . . . . . . . . . . . 322 Table 38. Data Manipulation: global Layer . . . . . . . . . . . . . . . . 324 xx S–3901–60Contents Page Table 39. Supported Operations: global Layer . . . . . . . . . . . . . . . 324 Table 40. Values for Maximum Record Size on ibm Layer . . . . . . . . . . . . 326 Table 41. Values for Maximum Block Size in ibm Layer . . . . . . . . . . . . 326 Table 42. Data Manipulation: ibm Layer . . . . . . . . . . . . . . . . . 326 Table 43. Supported Operations: ibm Layer . . . . . . . . . . . . . . . . 327 Table 44. Data Manipulation: mr Layer . . . . . . . . . . . . . . . . . 330 Table 45. Supported Operations: mr Layer . . . . . . . . . . . . . . . . 330 Table 46. Data Manipulation: syscall Layer . . . . . . . . . . . . . . . 331 Table 47. Supported Operations: syscall Layer . . . . . . . . . . . . . . 332 Table 48. 
Data Manipulation: text Layer . . . . . . . . . . . . . . . . 333 Table 49. Supported Operations: text Layer . . . . . . . . . . . . . . . 333 Table 50. Values for Record Size: vms Layer . . . . . . . . . . . . . . . . 335 Table 51. Values for Maximum Block Size: vms Layer . . . . . . . . . . . . . 335 Table 52. Data Manipulation: vms Layer . . . . . . . . . . . . . . . . . 336 Table 53. Supported Operations: vms Layer . . . . . . . . . . . . . . . . 336 Table 54. C Program Entry Points . . . . . . . . . . . . . . . . . . . 339 Table 55. Explicit Data Conversion Routines . . . . . . . . . . . . . . . . 365 Table 56. Implicit Data Conversion Types . . . . . . . . . . . . . . . . . 367 S–3901–60 xxiPreface The information in this preface is common to Cray documentation provided with this software release. Accessing Product Documentation With each software release, Cray provides books and man pages, and in some cases, third-party documentation. These documents are provided in the following ways: CrayDoc The Cray documentation delivery system that allows you to quickly access and search Cray books, man pages, and in some cases, third-party documentation. Access this HTML and PDF documentation via CrayDoc at the following locations: • The local network location defined by your system administrator • The CrayDoc public website: docs.cray.com Man pages Access man pages by entering the man command followed by the name of the man page. For more information about man pages, see the man(1) man page by entering: % man man Third-party documentation Access third-party documentation not provided through CrayDoc according to the information provided with the product. S–3901–60 xxiiiCray® Fortran Reference Manual Conventions These conventions are used throughout Cray documentation: Convention Meaning command This fixed-space font denotes literal items, such as file names, pathnames, man page names, command names, and programming language elements. 

variable
Italic typeface indicates an element that you will replace with a specific value. For instance, you may replace filename with the name datafile in your program. It also denotes a word or concept being defined.

user input
This bold, fixed-space font denotes literal items that the user enters in interactive sessions. Output is shown in nonbold, fixed-space font.

[ ]
Brackets enclose optional portions of a syntax representation for a command, library routine, system call, and so on.

...
Ellipses indicate that a preceding element can be repeated.

name(N)
Denotes man pages that provide system and programming reference information. Each man page is referred to by its name followed by a section number in parentheses. Enter:
% man man
to see the meaning of each section number for your particular system.

Reader Comments

Contact us with any comments that will help us to improve the accuracy and usability of this document. Be sure to include the title and number of the document with your comments. We value your comments and will respond to them promptly. Contact us in any of the following ways:

E-mail: docs@cray.com
Telephone (inside U.S., Canada): 1–800–950–2729 (Cray Customer Support Center)
Telephone (outside U.S., Canada): +1–715–726–4993 (Cray Customer Support Center)
Mail: Customer Documentation, Cray Inc., 1340 Mendota Heights Road, Mendota Heights, MN 55120–1128, USA

Cray User Group

The Cray User Group (CUG) is an independent, volunteer-organized international corporation of member organizations that own or use Cray Inc. computer systems. CUG facilitates information exchange among users of Cray systems through technical papers, platform-specific e-mail lists, workshops, and conferences. CUG memberships are by site and include a significant percentage of Cray computer installations worldwide. For more information, contact your Cray site analyst or visit the CUG website at www.cug.org.

Introduction [1]

This manual describes the differences between the language specified by the Fortran 2003 Standard and the language implemented by the Cray Fortran compiler for the Programming Environment 6.0 Release. The Cray Fortran compiler version 6.0 targets both the Cray X1 series systems and the Cray X2 systems using the UNICOS/mp (3.1 release or later) and UNICOS/lc operating systems.

The Cray Fortran compiler was developed to support the Fortran 2003 standard adopted by the International Organization for Standardization (ISO). This standard, commonly referred to in this manual as the Fortran standard, is ISO/IEC 1539-1:2004.

Note: The standards organizations continue to interpret the Fortran standard for Cray and other vendors. To maintain conformance to the Fortran standard, Cray may need to change the behavior of certain Cray Fortran compiler features in future releases based on the outcomes of interpretations to the standard.

Because the Fortran 2003 standard is a superset of previous standards, the Cray Fortran compiler compiles code written in accordance with previous Fortran standards.

Note: The ftn(1) man page may get updated more often than this document. Where the information differs, the information in the man page supersedes the information contained in this manual.

1.1 X1-specific and X2-specific Content in this Document

Unless explicitly indicated by the notations defined below, the contents of this manual apply to both the Cray X1 and the Cray X2 systems.

Convention    Meaning

(X1 only)
    This notation indicates that the feature applies only to the Cray X1 series system. Depending on context, the notation occurs either before the text (for example, the second paragraph in section 4.2) or after the text (for example, the chapter title for Chapter 2, The Trigger Environment).
(X2 only)
    This notation indicates that the feature applies only to the Cray X2 system.
    Depending on context, the notation occurs either before the text (for example, the fourth paragraph in section 4.2) or after the text (for example, the third bullet item in section 3.19.3).

1.2 The Cray Fortran Programming Environment

The Cray Fortran Programming Environment consists of the tools and libraries that you use to develop Fortran applications. To use these tools and libraries effectively, you must understand the development environment as discussed in two documents: Cray X1 Series System Overview and Cray X2 System Overview.

The Cray Fortran Programming Environment provides the following tools and libraries:

• The ftn command, which invokes the Cray Fortran compiler. For more information about ftn, see Chapter 3, page 15 or the ftn(1) man page.
• The CrayLibs libraries, which provide library routines, intrinsic procedures, I/O routines, and data conversion routines.
• The LibSci libraries, which provide scientific library routines.
• The ftnlx command, which generates listings and checks for possible errors in Fortran programs. See the ftnlx(1) man page for more information.
• The ld command, which invokes the Cray loader. See the ld(1) man page for more information.
  Note: Cray recommends that you use the ftn compiler command to invoke the loader, because the compiler calls the loader with the appropriate default libraries. The appropriate default libraries may change from release to release.
• The CrayPat performance analyzer tool, which can help you analyze program performance. See the pat(1) man page for more information.
• The Cray Apprentice2 report visualization tool, which can help you further analyze performance data captured by CrayPat. See the app2(1) man page for more information.
• The Etnus TotalView debugger, which can help you debug your program. It includes standard debugging capabilities, such as stepping through code and setting breakpoints.
  The -g and -G debug options to the ftn command line generate symbol tables, which can be used by the debugger. For more debugger information, see the totalview and totalviewcli man pages.

In the most basic case, the Cray Fortran compiler products are used as follows. The ftn command invokes the Cray Fortran compiler, processes the input files named on the command line, and generates a binary file. The compiler then invokes the loader, which loads the binary file(s) and generates an executable output file (the default output file is a.out). The ftnlx command generates a program listing file, if requested.

In the following simple example, the ftn command invokes the Fortran compiler. Option -r s is specified to generate a source listing. File pgm.f is your source code input file. You run the program by entering the output file name as a command; in this example, the default output file name, a.out, is used. Figure 1 illustrates this example.

    % ftn -r s pgm.f
    % ./a.out

[Figure 1. ftn Command Example. Diagram labels: Command (ftn -r s generates a standard listing); Cray Fortran Compiler; pgm.f (source code); pgm.tmp; Lister (ftnlx); pgm.lst (listing); pgm.o; Loader (ld); a.out (executable program); Input data (stdin); Output data (stdout).]

By default, the Cray Fortran compiler creates files during processing. It attaches various extensions to the base file name and places them into your working directory:

• The compiled code is sent to object file file.o in the current directory.
• The executable file is a.out by default. You can use the -o option to specify the name of the executable file.
• If specified, assembly language output is sent to file.s. Source file names ending with .s are assembled, and the assembled code is written to the corresponding file.o.

You can use the options on the ftn command line to modify the default actions; for example, you can change the size of the default data types.
For more information about ftn command line options, see Chapter 3, page 15.

1.2.1 Cross-compiler Platforms

The Cray X1 Series Programming Environment and the Cray X2 Programming Environment also run on cross-compiler platforms. You can use a cross-compiler platform to compile programs and create binaries for subsequent execution on a Cray X1 series system or a Cray X2 system. If your site has the proper licensing in place, you might choose to use one of these other platforms. In the case of the Cray X1 series system, a cross-compiler platform affords faster compile times and gives you access to the Cray Programming Environment when the X1 system is not available. Supported platforms are listed in the Cray Programming Environment Releases Overview and Installation Guide.

1.3 Cray Fortran Compiler Messages

You can obtain Cray Fortran compiler message explanations by using the explain command. For more information, see the explain(1) man page.

1.4 Document-specific Conventions

The following conventions are specific to this document:

Convention    Meaning

Rnnn
    The Rnnn notation indicates that the feature is in the Fortran standard and can be located in the standard via the Rnnn syntax rule number.
Cray pointer
    The term Cray pointer refers to the Cray pointer data type extension.

1.5 Fortran Standard Compatibility

In the Fortran standard, the term processor means the combination of a Fortran compiler and the computing system that executes the code. A processor conforms to the standard if it compiles and executes programs that conform to the standard, provided that the Fortran program is not too large or complex for the computer system in question.

You can direct the compiler to flag and generate messages when nonstandard usage of Fortran is encountered. For more information about this command line option (ftn -en), see Section 3.5, page 18 or the ftn(1) man page.
When the option is in effect, the compiler prints messages for extensions to the standard that are used in the program. As required by the standard, the compiler also flags the following items and provides the reason that the item is being flagged:

• Obsolescent features
• Deleted features
• Kind type parameters not supported
• Violations of any syntax rules and the accompanying constraints
• Characters not permitted by the processor
• Illegal source form
• Violations of the scope rules for names, labels, operators, and assignment symbols

The Cray Fortran compiler includes extensions to the Fortran standard. Because the compiler processes programs according to the standard, it is considered to be a standard-conforming processor. When the option to note deviations from the Fortran standard is in effect (-en), extensions to the standard are flagged with ANSI messages when detected at compile time.

1.5.1 Fortran 95 Compatibility

No known issues.

1.5.2 Fortran 90 Compatibility

No known issues.

1.5.3 FORTRAN 77 Compatibility

The format of a floating-point zero written with a G edit descriptor is different in Fortran 95. The floating-point zero was written with an Ew.d edit descriptor in FORTRAN 77, but is written with an Fw.d edit descriptor in the Cray Fortran compiler. FORTRAN 77 output cannot be changed. Therefore, different compare files must be retained for FORTRAN 77 and Fortran 95 programs that use the G edit descriptor for floating-point output.

1.6 Related Cray Publications

The following documentation can aid in the development of your Fortran programs:

• ftn(1) man page
• ftnlx(1) man page
• Cray X1 Series System Overview
• Cray X2 Series System Overview
• Optimizing Applications on Cray X1 Series Systems
• Loader man page, ld(1)

1.7 Related Fortran Publications

For more information about the Fortran language and its history, consult the following commercially available reference books.
• Fortran 2003 Standard can be downloaded from http://www.nag.co.uk/sc22wg5/ or http://j3-fortran.org/
• Chapman, S. Fortran 95/2003 for Scientists & Engineers. McGraw Hill, 2007. ISBN 0073191574.
• Metcalf, M., J. Reid, and M. Cohen. Fortran 95/2003 Explained. Oxford University Press, 2004. ISBN 0-19-852693-8.

Part I: Cray Fortran Commands and Directives

Part I describes the various elements that make up the Cray Fortran programming language. It includes the following chapters:

• The Trigger Environment (Chapter 2)
• Invoking the Cray Fortran Compiler (Chapter 3)
• Environment Variables (Chapter 4)
• Cray Fortran Directives (Chapter 5)
• Cray Streaming Directives (Chapter 6)
• Source Preprocessing (Chapter 7)
• OpenMP Fortran API (Chapter 8)
• Cray Fortran Defined Externals (Chapter 9)

The Trigger Environment (X1 Only) [2]

Users of Cray X1 series systems interact with the system as if all elements of the Programming Environment are hosted on the Cray X1 series mainframe, including Programming Environment commands hosted on the Cray Programming Environment Server (CPES). CPES-hosted commands have corresponding commands on the Cray X1 series mainframe that have the same names. These commands are called triggers. Triggers (such as the ftn command) are required only for the Programming Environment.

In the event that a programming or debugging tool does not work as expected, understanding the trigger environment aids administrators and end users in identifying the part of the system in which the problem has occurred.

When a user enters the name of a CPES-hosted command on the command line of the Cray X1 series mainframe, the corresponding trigger executes, which sets up an environment for the CPES-hosted command. This environment duplicates the portion of the current working environment on the Cray X1 series mainframe that relates to the Programming Environment.
This enables the CPES-hosted commands to function properly.

To replicate the current working environment, the trigger captures the current working environment on the Cray X1 series system and copies the standard I/O and error as follows:

• Copies the standard input of the current working environment to the standard input of the CPES-hosted command.
• Copies the standard output of the CPES-hosted command to the standard output of the current working environment.
• Copies the standard error of the CPES-hosted command to the standard error of the current working environment.

All catchable interrupt, quit, and terminate signals propagate through the trigger to reach the CPES-hosted command. Upon termination of the CPES-hosted command, the trigger terminates and returns with the CPES-hosted command's return code.

Uncatchable signals have a short processing delay before the signal is passed to the CPES-hosted command. If you execute its trigger again before the CPES-hosted command has had time to process the signal, undefined behavior may occur.

Because the trigger has the same name, inputs, and outputs as the CPES-hosted command, user scripts, makefiles, and batch files can function without modification. That is, running a command in the trigger environment is very similar to running the command hosted on the Cray X1 series system.

The commands that have triggers include:

• app2
• ar
• as
• c++filt
• c89
• c99
• cc
• ccp
• CC
• ftn
• ftnlx
• ftnsplit
• ld
• nm
• pat_build
• pat_help
• pat_report
• pat_run
• remps

Note: Because of trigger environment and X11 forwarding issues, the Cray Apprentice2 data visualization tool does not work in high-security environments where the CPES is not accessible through the customer network. This limitation is expected to be removed in a future Cray Programming Environments update package.

2.1 Preparing the Trigger Environment

To prepare the trigger environment for use, you must initialize your shell, load the Modules application, and then use the module command to load the Programming Environment module. To do so, follow these steps:

1. After you log in to a Cray X1 series system, begin your work session by initializing your shell. Cray provides initialization files for most common shells; by default, these are stored in /opt/modules/modules/init. For example, to initialize a C shell, enter this command:

       % source /opt/modules/modules/init/csh

2. The Modules application enables you to dynamically modify your user environment by using modulefiles. Each modulefile contains all the information needed to configure the shell for an application. While it is possible to use Cray X1 series systems without using the Modules application, doing so introduces unnecessary complexity and increases the opportunity for operator error. Initialize the Modules application by using this command:

       % module use /opt/PE/modulefiles

3. After the Modules application is initialized, use the module command to load the complete and current Programming Environment module:

       % module load PrgEnv

The Programming Environment module contains your compilers, libraries, development tools, man pages, and various other component modules, and sets up the environment variables necessary to find the include files, libraries, and product paths on the CPES and the Cray X1 series system. As you become more familiar with the Programming Environment, you can choose to add or subtract individual modules, but as a rule, the easiest way to avoid many common problems is to start by loading the complete PrgEnv module.

Note: Cray man pages are packaged in the modules with the software they document. The man pages do not become available until after you have loaded the appropriate module.
To see the list of products loaded by the PrgEnv module, enter the following on the command line:

    % module list

If you have questions about setting up the Programming Environment, contact your system support staff.

2.2 Working in the Programming Environment

To use the Programming Environment, you must work on a file system that is cross-mounted to the CPES. If you attempt to use the Programming Environment from a directory that is not cross-mounted to the CPES, you will receive this message:

    trigrcv: trigger command cannot access current directory.
    [directory] is not properly cross-mounted on host [CPES]

The default files used by the Programming Environment are installed in the /opt/ctl file system. The default include file directory is /opt/ctl/include. All Programming Environment products are found in the /opt/ctl file system.

Invoking the Cray Fortran Compiler [3]

This chapter describes the ftn command, which invokes the Cray Fortran compiler. The ftn(1) man page contains information from this chapter in an abbreviated form.

Note: If the information contained in this manual differs from the ftn(1) man page, the information in the man page overrides the information in this manual.

The following files are produced by or accepted by the Cray Fortran compiler:

File    Type

a.out
    Default name of the executable output file. See the -o out_file option for information about specifying a different name for the executable file.
file.a
    Library files to be searched for external references or modules.
file.cg and file.opt
    Files containing decompilation of the intermediate representation of the compiler. These listings resemble the format of the source code. These files are generated when the -rd option is specified.
file.f or file.F
    Input Fortran source file in fixed source form. If file ends in .F, the source preprocessor is invoked.
file.f90, file.F90, file.ftn, file.FTN
    Input Fortran source file in free source form.
    If file ends in .F90 or .FTN, the source preprocessor is invoked.
file.i
    File containing output from the source preprocessor.
file.lst
    Listing file.
file.o
    Relocatable object file.
file.s
    Assembly language file.
file.L
    File containing binary code and generated assembly language output.
file.T
    CIF output file.
modulename.mod
    If the -em option is specified, the compiler writes a modulename.mod file for each module; modulename is created by taking the name of the module and, if necessary, converting it to uppercase. This file contains module information, including any contained module procedures.

The syntax of the ftn command is as follows:

    ftn [-A module_name[, module_name] ...] [-b bin_obj_file] [-c] [-C cifopts]
        [-d disable] [-D identifier[=value]] [-e enable] [-f source_form] [-F]
        [-g] [-G debug_lvl] [-h arg] [-I incldir] [-J dir_name] [-l lib_file]
        [-L ldir] [-m msg_lvl] [-M msgs] [-N col] [-o out_file] [-O opt[,opt] ...]
        [-p module_site] [-Q path] [-r list_opt] [-R runchk] [-s size]
        [-S asm_file] [-T] [-U identifier[, identifier] ...] [-v] [-V]
        [-Wphase,"opt..."] [-x dirlist] [-X npes] [-Yphase,dirname] [-Z] [--]
        sourcefile [sourcefile ...]

Note: Some default values shown for ftn command options may have been changed by your site. See your system support staff for details.

3.1 -A module_name [, module_name] ...

The -A module_name [, module_name] ... option directs the compiler to behave as if you entered a USE module_name statement for each module_name into your Fortran source code. The USE statements are entered in every program unit and interface body in the source file being compiled.

3.2 -b bin_obj_file

The -b bin_obj_file option disables the load step and saves the binary object file version of your program in bin_obj_file.

Only one input file is allowed when the -b bin_obj_file option is specified.
If you have more than one input file, use the -c option to disable the load step and save the binary files to their default file names. If only one input file is processed and neither the -b nor the -c option is specified, the binary version of your program is not saved after the load is completed.

If both the -b bin_obj_file and -c options are specified on the ftn command line, the load step is disabled and the binary object file is written to the name specified as the argument to the -b bin_obj_file option. For more information about the -c option, see Section 3.3, page 17.

By default, the binary file is saved in file.o, where file is the name of the source file and .o is the suffix used.

3.3 -c

The -c option disables the load step and saves the binary object file version of your program in file.o, where file is the name of the source file and .o is the suffix used. If there is more than one input file, a file.o is created for each input file specified. By default, this option is off.

If only one input file is processed and neither the -b bin_obj_file nor the -c options are specified, the binary version of your program is not saved after the load is completed.

If both the -b bin_obj_file and -c options are specified on the ftn command line, the load step is disabled and the binary object file is written to the name specified as the argument to the -b bin_obj_file option. For more information about the -b bin_obj_file option, see Section 3.2, page 16.

If both the -o out_file and the -c option are specified on the ftn command line, the load step is disabled and the binary file is written to the out_file specified as an argument to -o. For more information about the -o out_file option, see Section 3.20, page 60.

3.4 -C cifopts

The -C cifopts option creates one compiler information file (CIF) for each source file. You can specify "a" for the cifopts argument, which writes all possible CIF information.
The compiler places each CIF in file.T, where file is the name of the source file and .T is the CIF suffix. The -r option overrides the -C option, if both are used.

By default, the ftn command does not create a CIF. You must enable the -C option to create a CIF. The CIF can be used as input to the ftnlx command.

3.5 -d disable and -e enable

The -d disable and -e enable options disable or enable compiling options. To specify more than one compiling option, enter the options without separators between them; for example, -eaf. Table 1 shows the arguments to use for disable or enable.

Table 1. Compiling Options

args    Action, if enabled

0
    Initializes all undefined local numeric stack variables to 0. If a user variable is of type character, it is initialized to NUL. If a user variable is type logical, it is initialized to false. The variables are initialized upon each execution of each procedure. Enabling this option can help identify problems caused by using uninitialized numeric and logical variables.
    Default: disabled
a
    Aborts compilation after encountering the first error.
    Default: disabled
B
    Generates binary output. If disabled, inhibits all optimization and allows only syntactic and semantic checking.
    Default: enabled
c
    Interface checking: use Cray's system modules to check library calls in a compilation. If you have a procedure with the same name as one in the library, you will get errors as the compiler does not skip user-specified procedures when performing the checks.
    Default: disabled
d
    Controls a column-oriented debugging feature when using fixed source form. When the option is enabled, the compiler replaces the D or d characters appearing in column 1 of your source with a blank and treats the entire line as a valid source line. This feature can be useful, for example, during debugging if you want to insert PRINT statements.
    When disabled, a D or d character in column 1 is treated as a comment character.
    Default: disabled
D
    Turns on all debugging information. This option is equivalent to specifying these options: -O0, -g, -m2, -R aCEbcdspi, and -rl. See also -ed.
    Default: disabled
E
    The -eE option allows existing declarations to duplicate the declarations contained in a used module. Therefore, you do not have to modify the older code by removing the existing declarations. Because the declarations are not removed, the use associated objects will duplicate declarations already in the code, which is not standard conforming. However, this option allows the compiler to accept these statements as long as the declarations match the declarations in the module.
    Existing declarations of a procedure must match the interface definitions in the module; otherwise an error is generated. Only existing declarations that declare the function name or generic name in an EXTERNAL or type statement are allowable under this option.
    This example illustrates some of the acceptable types of existing declarations. Program older contains the older code, while module m contains the interfaces to check.

        module m
          interface
            subroutine one(r)
              real :: r
            end subroutine
            function two()
              integer :: two
            end function
          end interface
        end module

        program older
          use m           !Or use -Am on the compiler command line
          external one    !Use associated objects
          integer :: two  !in declarative statements
          call one(r)
          j = two()
        end program

    Default: disabled
g
    Allows branching into the code block for a DO or DO WHILE construct. Historically, codes used branches out of and into DO constructs. Fortran standards prohibit branching into a DO construct from outside of that construct. By default, the Cray Fortran compiler will issue an error for this situation. Cray does not recommend branching into a DO construct, but if you specify -eg, the code will compile.
    Default: disabled
h
    Enables support for 8-bit and 16-bit INTEGER and LOGICAL types that use explicit kind or star values. By default (-dh), data objects declared as INTEGER(kind=1), INTEGER(kind=2), LOGICAL(kind=1), or LOGICAL(kind=2) are 32 bits long. When this option is enabled (-eh), data objects declared as INTEGER(kind=1) or LOGICAL(kind=1) types are 8 bits long, and objects declared as INTEGER(kind=2) and LOGICAL(kind=2) are 16 bits long. These objects are fully vectorizable depending on the operations performed, but Cray discourages their use because their resultant performance is less than the performance of their 32-bit counterparts.
    8- and 16-bit objects are fully vectorizable when they are used in one of the following operations within a vector context:
    • Reads of 8- and 16-bit variables
    • Writes to 8- and 16-bit variables, except arrays
    • Use of 8- and 16-bit variables as targets in a reduction loop. For example, c is an 8-bit object in this program fragment:

          integer :: i
          integer(kind=1) :: a(100), c
          c = 0
          do i=1,100
            c = c + a(i) ! This will vectorize
          end do

    Default: disabled
I
    Treats all variables as if an IMPLICIT NONE statement had been specified. Does not override any IMPLICIT statements or explicit type statements. All variables must be typed.
    Default: disabled
j
    Executes DO loops at least once.
    Default: disabled
L
    Allows zero-trip shortloops (that is, shortloops that do not execute) and allows the use of the !DIR$ SHORTLOOP directive on loops that may have a zero-trip count. For more information, see Section 5.2.14, page 107.
    Default: disabled
m
    Causes the compiler to create and search .mod files when compiling modules and satisfying module references.
    Note: The compiler creates modules through the MODULE statement. A module is referenced with the USE statement.
    When the option is disabled, the compiler creates and searches .o files when compiling modules and satisfying module references.
    The .mod files are named modulename.mod, where modulename is the name of the module specified in the MODULE statement or the USE statement.
    You cannot mix the .mod files with .o files in the same directory or specify both on the same ftn command line; however, system modules will work with either the -e m or -d m option.
    By default, module files are written to the directory from which the ftn command is entered. You can use the -J dir_name option to specify an alternate output directory. For more information about the -J dir_name option, see Section 3.13, page 32.
    Default: disabled
n
    Generates messages to note all nonstandard Fortran usage.
    Default: disabled
o
    Displays to stderr the optimization options used by the compiler for this compilation.
    Default: disabled
p
    Enables double precision arithmetic. The -dp option can only be used when the default data size is 64 bits (that is, the -s default64 or -sreal64 option is used). When this option is disabled, variables declared on a DOUBLE PRECISION statement and constants specified with the D exponent are implicitly converted to default real type. This causes arithmetic operations and intrinsics involving these variables to have a default real type rather than a double-precision real type. Similarly, variables declared on a DOUBLE COMPLEX statement and complex constants specified with the D exponent are implicitly mapped to the complex type in which each part has a default real type. Specific double precision and double complex intrinsic procedure names are mapped to their single precision equivalents.
    Default: enabled
P
    Performs source preprocessing on Fortran source files, but does not compile (see Section 3.39, page 80 for valid file extensions).
    When specified, source code is included by #include directives but not by Fortran INCLUDE lines. Generates file.i, which contains the source code after the preprocessing has been performed and the effects applied to the source program. For more information about source preprocessing, see Chapter 7, page 157.
    Default: disabled
q
    Aborts compilation if 100 or more errors are generated.
    Default: enabled
Q
    Controls whether or not the compiler accepts variable names that begin with a leading underscore (_) character. For example, when Q is enabled, the compiler accepts _ANT as a variable name. Enabling this option can cause collisions with system name space (for example, library entry point names).
    Default: disabled
R
    Compiles all functions and subroutines as if they had been defined with the RECURSIVE attribute.
    Default: disabled
s
    Scales the values of all KIND=4 count and count_rate arguments for the SYSTEM_CLOCK intrinsic function. Since the value of a 32-bit count argument can quickly wrap around to zero, the value of count is scaled down by a factor of 100. KIND=4 count_rate is scaled in the same way. The Fortran Standard allows using different kind arguments to count and count_rate, so this scaling can be disabled. Care should be taken to make sure count and count_rate are the same kind if this scaling is enabled.
    Default: enabled
S
    Generates assembly language output and saves it in file.s. When the -eS option is specified on the command line with the -S asm_file option, the -S asm_file option overrides the -eS option.
    Default: disabled
v
    Allocates variables to static storage. These variables are treated as if they had appeared in a SAVE statement.
    The following types of variables are not allocated to static storage: automatic variables (explicitly or implicitly stated), variables declared with the AUTOMATIC attribute, variables allocated in an ALLOCATE statement, and local variables in explicit recursive procedures. Variables with the ALLOCATABLE attribute remain allocated upon procedure exit, unless explicitly deallocated, but they are not allocated in static memory. Variables in explicit recursive procedures consist of those in functions, in subroutines, and in internal procedures within functions and subroutines that have been defined with the RECURSIVE attribute. The STACK compiler directive overrides -ev; for more information about this compiler directive, see Section 5.7.3, page 135.
    Default: disabled
w
    Enables support for automatic memory allocation for allocatable variables and arrays that are on the left hand side of intrinsic assignment statements. The option can potentially decrease run-time performance, even when automatic memory allocation is not needed. It will affect optimizations for a code region containing an assignment to allocatable variables or arrays. For example, it could easily prevent loop fusion for multiple array syntax assignment statements with the same shape.
    Default: disabled
y
    Adds information into the binary files that allows the loader to find the modules when used in subsequent compiles. The -dy option disables this information.
    Enabling this option is useful if the binary files for the Fortran modules are not moved prior to the load step. The loader can then find these binaries without the user adding them to the load line. If the module binary files will be moved before the load step, this option should be disabled and the module binary files must be explicitly specified on the load line. Often this is the case when module binaries are added to a library archive file.
Default: enabled

Z    Performs source preprocessing and compilation on Fortran source files (see Section 3.39, page 80 for valid file extensions). When specified, source code is included by #include directives and by Fortran INCLUDE lines. Generates file.i, which contains the source code after the preprocessing has been performed and the effects applied to the source program. For more information about source preprocessing, see Chapter 7, page 157. Default: disabled

3.6 -D identifier [=value]

The -D identifier[=value] option defines variables used for source preprocessing as if they had been defined by a #define source preprocessing directive. If a value is specified, there can be no spaces on either side of the equal sign (=). If no value is specified, the default value of 1 is used.

The -U option undefines variables used for source preprocessing. If both -D and -U are used for the same identifier, in any order, the identifier is undefined. For more information about the -U option, see Section 3.28, page 76.

This option is ignored unless one of the following conditions is true:

• The Fortran input source file is specified as either file.F, file.F90, or file.FTN.
• The -eP or -eZ options have been specified.

For more information about source preprocessing, see Chapter 7, page 157.

3.7 -f source_form

The -f source_form option specifies whether the Fortran source file is written in fixed source form or free source form. For source_form, enter free or fixed. The source_form specified here overrides any source form implied by the source file suffix. A FIXED or FREE directive specified in the source code overrides this option (see Section 5.6.2, page 132).

The default source form is fixed for input files that have the .f or .F suffix. The default source form is free for input files that have the .f90, .F90, .ftn, or .FTN suffix. Note that the Fortran standard has declared fixed source form to be obsolescent.
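To make the source form and preprocessing rules above concrete, the following is a minimal sketch of a free-form source file that depends on a preprocessing variable. The file name demo.F90 and the identifier DEBUG are hypothetical, chosen only for illustration. Assuming the file keeps the uppercase .F90 suffix, it would be preprocessed and treated as free source form by default, and compiling it with -D DEBUG would select the first branch. The identifier is tested only on directive lines, because macro expansion elsewhere in the source is not performed unless the -F option is specified.

```fortran
! demo.F90 (hypothetical file name): free source form, preprocessed
! because of the uppercase .F90 suffix.
program demo
  implicit none
#ifdef DEBUG
  ! Compiled only when DEBUG is defined, for example with -D DEBUG.
  print *, 'debug build'
#else
  ! Compiled when DEBUG is not defined (or was removed with -U DEBUG).
  print *, 'regular build'
#endif
end program demo
```

The same source renamed to demo.f90 would not be preprocessed unless the -eP or -eZ option were also specified.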
If the file has a .F, .F90, or .FTN suffix, the source preprocessor is invoked. For information about preprocessing, see Chapter 7, page 157.

3.8 -F

The -F option enables macro expansion throughout the source file. Typically, macro expansion occurs only on source preprocessing directive lines. By default, this option is off.

This option is ignored unless one of the following conditions is true:

• The Fortran input source file is specified as either file.F, file.F90, or file.FTN.
• The -eP or -eZ option was specified.

For more information about source preprocessing, see Chapter 7, page 157.

3.9 -g

The -g option provides debugging support identical to specifying the -G0 option. By default, this option is off.

3.10 -G debug_lvl

The -G debug_lvl option generates a debug symbol table and establishes a debugging level. The debugging level determines the points at which breakpoints can be set. The frequency and position of breakpoints can curtail optimization partially or totally. At higher debugging levels, fewer breakpoints can be set, but optimization is increased. By default, this option is off.

Enter one of the following arguments for debug_lvl:

debug_lvl    Support

0    Breakpoints can be set at each line. This level of debugging is supported when optimization is disabled (when -O0, -O ipa0, -O scalar0, -O stream0, -O task0, and -O vector0 are in effect). If -G0 has been specified on the command line along with an optimization level other than -O0, -O ipa0, -O scalar0, -O stream0, -O task0, or -O vector0, the compiler issues a message and disables most optimization. Array syntax statements vectorize at this level. This level can also be obtained by specifying the -g option.

1    Allows block-by-block debugging, with the exception of innermost loops. Streaming is disabled (equivalent to -O stream0) (X1 only).
You can place breakpoints at statement labels on executable statements and at the beginning and end of block constructs (such as IF/THEN/ELSE blocks, DO/END DO blocks, and SELECT CASE/END SELECT blocks). This level of debugging can be specified when -O 0 or -O 1 is specified. Disables some scalar optimization and all loop nest restructuring. This debug_lvl allows vectorization of some inner loops and most array syntax statements. Vectorization is equal to that performed when -O vector1 is in effect.

2    Allows post-mortem debugging. No breakpoints can be set. Local information, such as the value of a loop index variable, is not necessarily reliable at this level because such information often is carried in registers in optimized code.

3.11 -h arg

The -h arg option allows you to access various compiler functionality. For more information about what to specify for arg, see the following subsections.

3.11.1 -h command

The -h command option provides another way to access the functionality of the -O command compiler option. For more information about -O command, see Section 3.19.4, page 39. The -h command option is offered as a convenience to those who mix Fortran and C and/or C++ code because the Cray C and Cray C++ compilers have the same option.

3.11.2 -h cpu=target_system

The -h cpu=target_system option specifies the Cray X1 or X2 systems on which the absolute binary file is to be executed, where target_system can be one of cray-x1, cray-x1e, or cray-x2.

Default: cray-x1 on X1 systems; cray-x2 on X2 systems

The target system may also be specified using the CRAY_PE_TARGET environment variable. For more information, see Section 4.1.2, page 82.

Note: There are no differences between the code produced for the cray-x1 and cray-x1e targets.

3.11.3 -h gen_private_callee (X1 only)

The -h gen_private_callee option provides another way to access the functionality of the -O gen_private_callee compiler option.
For more information about -O gen_private_callee, see Section 3.19.8, page 44. The -h gen_private_callee option is offered as a convenience to those who mix Fortran and C and/or C++ code because the Cray C and Cray C++ compilers have the same option.

3.11.4 -h ieee_nonstop

The -h ieee_nonstop option specifies that the IEEE-754 "nonstop" floating-point environment is used. This environment disables all traps (interrupts) on floating-point exceptions, enables recording of all floating-point exceptions in the floating-point status register, and rounds floating-point operations to nearest.

When this option is omitted, invalid, overflow, and divide by zero exceptions will trap and be recorded; underflow and inexact exceptions will neither trap nor be recorded; and floating-point operations round to nearest.

For UNICOS/mp, this option requires release 2.5 or later.

3.11.5 -h keepfiles

The -h keepfiles option prevents the removal of the object (.o) files after an executable is created. Normally, the compiler automatically removes these files after linking them to create an executable. Because the original object files are required in order to instrument a program for performance analysis, use this option to preserve the object files if you plan to use CrayPat to conduct performance analysis experiments.

3.11.6 -h mpmd, -h nompmd

The -h mpmd option allows program units containing Cray Fortran Co-array (CAF) code to be used with multiple program, multiple data (MPMD) applications. Only components of interrelated applications containing Cray Fortran Co-array (CAF) code must be compiled with the -h mpmd compiler option. The -h nompmd option does not add MPMD capability to CAF code. The default is -h nompmd.

You can launch multiple interrelated applications with a single aprun or mpirun command.
The applications must have the following characteristics:

• The applications can use MPI, SHMEM, or CAF to perform application-to-application communications. Using UPC for application-to-application communication is not supported.
• Within each application, the supported programming models are MPI, SHMEM, CAF, and OpenMP.
• (X1 only) All applications must be of the same mode; that is, they must all be MSP-mode applications or all SSP-mode applications.
• If one or more of the applications in an MPMD job use a shared memory model (OpenMP or pthreads) and need a depth greater than the default of 1, then all of the applications will have the depth specified by the aprun or mpirun -d option, whether they need it or not.

To launch multiple applications with one command, you use the -h mpmd compiler command option and launch them using aprun or mpirun. For example, suppose you have created three MPI applications which contain CAF code as follows:

    ftn -o multiabc -h mpmd a.o b.o c.o
    ftn -o multijkl -h mpmd j.o k.o l.o
    ftn -o multixyz -h mpmd x.o y.o z.o

Note: On Cray X1 series systems, users can launch an executable either by invoking the aprun command explicitly:

    aprun /myapp

or implicitly (called auto aprun):

    /myaprun

The auto aprun feature is not supported on Cray X2 systems.

The number of processing elements required is 128 for multiabc, 16 for multijkl, and 4 for multixyz. To launch all three applications simultaneously, you would enter:

    mpirun -np 128 multiabc : -np 16 multijkl : -np 4 multixyz

3.11.7 -h msp (X1 only)

The -h msp option provides another way to access the functionality of the -O msp compiler option. For more information about -O msp, see Section 3.19.14, page 50. The -h msp option is offered as a convenience to those who mix Fortran and C and/or C++ code because the Cray C and Cray C++ compilers have the same option.
3.11.8 -h ssp (X1 only)

The -h ssp option provides another way to access the functionality of the -O ssp compiler option. For more information about -O ssp, see Section 3.19.21, page 55. The -h ssp option is offered as a convenience to those who mix Fortran and C and/or C++ code because the Cray C and Cray C++ compilers have the same option.

3.12 -I incldir

The -I incldir option specifies a directory to be searched for files named in INCLUDE lines in the Fortran source file and for files named in #include source preprocessing directives. You must specify an -I option for each directory you want searched. Directories can be specified in incldir as full path names or as path names relative to the working directory. By default, only the system directories are searched.

The following example causes the compiler to search for files included within earth.f in the directories /usr/local/sun and ../moon:

    % ftn -I /usr/local/sun -I ../moon earth.f

If the INCLUDE line or #include directive in the source file specifies an absolute name (that is, one that begins with a slash (/)), that name is used, and no other directory is searched. If a relative name is used (that is, one that does not begin with a slash (/)), the compiler searches for the file in the directory of the source file containing the INCLUDE line or #include directive. If this directory contains no file of that name, the compiler then searches the directories named by the -I options, as specified on the command line, from left to right.

3.13 -J dir_name

The -J dir_name option specifies the directory to which file.mod files are written when the -e m option is specified on the command line. By default, the module files are written to the directory from which the ftn command was entered.

The compiler will automatically search the dir_name directory for modules to satisfy USE statements by giving the dir_name path to the -p module_site option.
You do not need to explicitly use the -p option for the compiler to do this. The compiler places this -p module_site option on the end of the command line.

An error is issued if the -em option is not specified when the -J dir_name option is used.

3.14 -l libname

The -l libname option directs the loader to search for the specified object library file when loading an executable file. To request more than one library file, specify multiple -l options.

The loader searches for libraries by prepending ldir/lib on the front of libname and appending .a on the end of it, for each ldir that has been specified by using the -L option. It uses the first file it finds. See also the -L option. For more information about library search rules, see Section 3.15, page 32.

3.15 -L ldir

The -L ldir option directs the loader to look for library files in directory ldir. To request more than one library directory, specify multiple -L options.

The loader searches for library files in directory ldir before searching the default directories: /opt/ctl/libs and /lib.

For example, if -L ../mylib, -L /loclib, and -l m are specified, the loader searches for the following files and uses the first one found:

    ../mylib/libm.a
    /loclib/libm.a
    /opt/ctl/libs/libm.a
    /lib/libm.a

See the ld(1) man page for more information about library searches. For information about specifying module locations, see Section 3.21, page 60.

3.16 -m msg_lvl

The -m msg_lvl option specifies the minimum compiler message levels to enable. The following list shows the integers to specify in order to enable each type of message and which messages are generated by default.

msg_lvl    Message types enabled
0          Error, warning, caution, note, and comment
1          Error, warning, caution, and note
2          Error, warning, and caution
3          Error and warning (default)
4          Error

Caution and warning messages denote, respectively, possible and probable user errors.
By default, messages are sent to the standard error file, stderr, and are displayed on your terminal. If the -r option is specified, messages are also sent to the listing file.

To see more detailed explanations of messages, use the explain command. This command retrieves message explanations and displays them online. For example, to obtain documentation on message 500, enter the following command:

    % explain ftn-500

The default msg_lvl is 3, which suppresses messages at the comment, note, and caution level. It is not possible to suppress messages at the error level. To suppress specific comment, note, caution, and warning messages, see Section 3.17, page 34.

To obtain messages regarding nonstandard Fortran usage, specify -e n. For more information about this option, see Section 3.5, page 18.

3.17 -M msgs

The -M msgs option suppresses specific messages at the warning, caution, note, and comment levels and can change the default message severity to an error or a warning level. You cannot suppress or alter the severity of error-level messages with this option.

To suppress messages, specify one or more integer numbers that correspond to the Cray Fortran compiler messages you want to suppress. To specify more than one message number, specify a comma (but no spaces) between the message numbers. For example, -M 110,300 suppresses messages 110 and 300.

To change a message's severity to an error level or a warning level, specify an E (for error) or a W (for warning) and then the number of the message. For example, consider the following option: -M 300,E600,W400. This specification results in the following messages:

• Message 300 is disabled and is not issued, provided that it is not an error-level message by default. Error-level messages cannot be suppressed and cannot have their severity downgraded.
• Message 600 is issued as an error-level message, regardless of its default severity.
• Message 400 is issued as a warning-level message, provided that it is not an error-level message by default.

3.18 -N col

The -N col option specifies the line width for fixed-form and free-form source lines. The value used for col specifies the maximum number of columns per line. For free-form sources, col can be set to 132 or 255. For fixed-form sources, col can be set to 72, 80, 132, or 255.

Characters in columns beyond the col specification are ignored.

By default, lines are 72 characters wide for fixed-form sources and 132 characters wide for free-form sources.

3.19 -O opt [,opt] ...

The -O opt option specifies optimization features. You can specify more than one -O option, with accompanying arguments, on the command line. If specifying more than one argument to -O, separate the individual arguments with commas and do not include intervening spaces.

Note: The -eo option or the ftnlx command displays all the optimization options the compiler uses at compile time.

The -O 0, -O 1, -O 2, and -O 3 options allow you to specify a general level of optimization that includes vectorization, scalar optimization, inlining, and streaming (X1 only). Generally, as the optimization level increases, compilation time increases and execution time decreases.

The -O 1, -O 2, and -O 3 specifications do not directly correspond to the numeric optimization levels for scalar optimization, vectorization, inlining, and streaming (X1 only). For example, specifying -O 3 does not necessarily enable vector3. Cray reserves the right to alter the specific optimizations performed at these levels from release to release.

The other optimization options, such as -O aggress and -O cachen, control pattern matching, cache management, zero incrementing, and several other optimization features. Some of these features can also be controlled through compiler directives.
For more information about directives, see Optimizing Applications on Cray X1 Series Systems and Optimizing Applications on Cray X2 Systems.

Figure 2, page 36 shows the relationships between some of the -O opt values.

[Figure 2. Optimization Values. Columns: scalar0, scalar1, scalar2, scalar3, vector0, vector1, vector2, vector3, task0, task1, stream0, stream1, stream2, stream3. Rows: Low compile cost; Moderate compile cost; Potentially high compile cost; No numerical differences from serial execution (no vector/stream reductions); Potential numerical differences from serial execution (vector/stream reductions); Potential numerical differences from unoptimized execution (operator reassociation); No optimizations that may create exceptions; Optimizations that may create exceptions; Implies at least scalar1; Implies at least scalar2; Loop nest restructuring; Vectorize array syntax statements; Vectorize/stream only inner loops; OpenMP disabled.]

Note: The four stream columns in Figure 2 (stream0, stream1, stream2, and stream3) apply only to the X1 series systems.

3.19.1 -O n

The -On option performs general optimization at these levels: 0 (none), 1 (conservative), 2 (moderate, default), and 3 (aggressive).

• The -O 0 option inhibits optimization including inlining. This option's characteristics include low compile time, small compile size, and no global scalar optimization. Most array syntax statements are vectorized, but all other vectorizations are disabled.
• The -O 1 option specifies conservative optimization. This option's characteristics include moderate compile time and size, global scalar optimizations, and loop nest restructuring. Results may differ from the results obtained when -O 0 is specified because of operator reassociation.
No optimizations will be performed that might create false exceptions. Only array syntax statements and inner loops are vectorized and the system does not perform some vector reductions. User tasking is enabled, so !$OMP directives are recognized. The -O 1 option enables automatic multistreaming of array syntax and entire loop nests (X1 only).
• The -O 2 option specifies moderate optimization. This option's characteristics include moderate compile time and size, global scalar optimizations, pattern matching, and loop nest restructuring. Results may differ from results obtained when -O 1 is specified because of vector reductions. The -O 2 option enables automatic vectorization of array syntax and entire loop nests. This is the default level of optimization.
• The -O 3 option specifies aggressive optimization. This option's characteristics include a potentially larger compile size, longer compile time, global scalar optimizations, possible loop nest restructuring, and pattern matching. The optimizations performed might create false exceptions in rare instances. Results may differ from results obtained when -O 1 is specified because of vector or multistreaming (X1 only) reductions.

3.19.2 -O aggress, -O noaggress

The -O aggress option causes the compiler to treat a program unit (for example, a subroutine or a function) as a single optimization region. Doing so can improve the optimization of large program units by raising the limits for internal tables, which increases opportunities for optimization. This option increases compile time and size. The default is -O noaggress.

3.19.3 -O cachen

The -O cachen option specifies the following levels of automatic cache management. The default on Cray X1 series systems is -O cache0. The default on Cray X2 systems is -O cache2.

• -O cache0 specifies no automatic cache management; all memory references are allocated to cache in an exclusive state. Cache directives are still honored.
Characteristics include low compile time. The -O cache0 option is compatible with all scalar, vector, and (X1 only) stream optimization levels.
• -O cache1 specifies conservative automatic cache management. Characteristics include moderate compile time. Data are placed in the cache when the possibility of cache reuse exists and the predicted cache footprint of the datum in isolation is small enough to experience the reuse. The -O cache1 option requires at least -O vector1.
• -O cache2 specifies moderately aggressive automatic cache management. Characteristics include moderate compile time. Data are placed in the cache when the possibility of cache reuse exists and the predicted state of the cache model is such that the datum will experience the reuse. The -O cache2 option requires at least -O vector1.
• -O cache3 specifies aggressive automatic cache management. Characteristics include potentially high compile time. Data are placed in the cache when the possibility of cache reuse exists and the allocation of the datum to the cache is predicted to increase the number of cache hits. The -O cache3 option requires at least -O vector1.

3.19.4 -O command

The X1 and X2 implementations of this option are described below in separate sections.

(The following section applies to the X1 series only.)

The command mode option (-O command) allows you to create commands for Cray X1 series systems to supplement commands developed by Cray. Command mode is not suitable for user applications or use with the aprun command.

The commands created with the command mode option cannot multistream, but will run serially on a single-streaming processor (SSP) within a support node. These commands will execute immediately without assistance from psched.

To disable vectorization, add the -O vector0 option to the compiler command line. The compiled commands will have less debugging information, unless you specify a debugging option.
The debugging information does not slow execution time, but it does result in a larger executable that may take longer to load.

For simplicity, use the Fortran compiler to load your programs built with the command mode option, because the required options and libraries are automatically specified and loaded for you.

To load the libraries manually, you must use the loader command (ld) and specify on its command line the -O command and -O ssp options and the -L option with the path to the command mode libraries. The command mode libraries are found in the cmdlibs directory under the path defined by the CRAYLIBS_SV2 environment variable. These must also be linked:

• Start0.o
• libf library
• libm library
• libu library

Programs linked with the -O ssp option and -O command must have been previously compiled with the -O command option. That is, do not link object files built with the command mode option with object files that did not use the option.

The following sample command line illustrates compiling the code for a command named fierce:

    % ftn -O command -O vector0 -o fierce fierce.ftn

Note: The -h command option is another name for this option.

(The following section applies to the X2 only.)

The command mode option (-O command) allows you to create commands for Cray X2 systems to supplement commands developed by Cray.

Commands can be run on application nodes using option -n1 to specify a single process. Executing commands on multiple processes is not supported.

For simplicity, use the Fortran compiler to load your programs built with the command mode option, because the required options and libraries are automatically specified and loaded for you.

The following sample command line illustrates compiling the code for a command named fierce:

    % ftn -O command -o fierce fierce.ftn

Note: The -h command option is another name for this option.
3.19.5 -O fpn

The -O fpn option allows you to control the level of floating-point optimizations. The n argument controls the level of allowable optimization; 0 gives the compiler minimum freedom to optimize floating-point operations, while 3 gives it maximum freedom. The higher the level, the less the floating-point operations conform to the IEEE standard.

This option is useful for code that uses unstable algorithms, but which is optimizable. It is also useful for applications that want aggressive floating-point optimizations that go beyond what the Fortran standard allows.

Generally, this is the behavior and usage for each -O fp level:

• -O fp0 causes your program's executable code to conform more closely to the IEEE floating-point standard than the default mode (-O fp2). When this level is specified, many identity optimizations are disabled, executable code is slower than at higher floating-point optimization levels, floating-point reductions are disabled, and a scaled complex divide mechanism is enabled that increases the range of complex values that can be handled without producing an underflow. The -O fp0 option should only be used when your code pushes the limits of IEEE accuracy or requires strong IEEE standard conformance.
• -O fp1 performs various, generally safe, IEEE non-conforming optimizations, such as folding a == a to true, where a is a floating-point object. At this level, floating-point reassociation (see footnote 1) is greatly limited, which may affect the performance of your code. The -O fp1 option should only be used when your code pushes the limits of IEEE accuracy, or requires substantial IEEE standard conformance.
• -O fp2 includes optimizations of -O fp1. This is the default.
• -O fp3 includes optimizations of -O fp1 and -O fp2. The -O fp3 option should be used when performance is more critical than the level of IEEE standard conformance provided by -O fp2.
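The reason reassociation matters for the -O fp levels above can be seen in a small, hand-written sketch. It is not taken from the manual: the program name and the values are chosen only for illustration, and it assumes that double precision is IEEE 64-bit. Explicit parentheses force each grouping, so the program shows what a compiler that regroups an unparenthesized a+b+c could change.

```fortran
program reassoc_demo
  implicit none
  ! Double precision kind; assumed to map to IEEE 64-bit on the target.
  integer, parameter :: dp = selected_real_kind(15, 307)
  real(dp) :: a, b, c
  real(dp) :: left_first, right_first

  a = 1.0e20_dp
  b = -1.0e20_dp
  c = 1.0_dp

  ! a and b cancel exactly first, so c survives in the result.
  left_first = (a + b) + c

  ! c is far below the rounding granularity of b, so it is lost
  ! before a is added.
  right_first = a + (b + c)

  print *, '(a + b) + c =', left_first
  print *, 'a + (b + c) =', right_first
end program reassoc_demo
```

Under strict IEEE evaluation the first grouping yields 1.0 and the second yields 0.0. Code that is sensitive to differences of this kind is the sort of code for which the lower -O fp levels are intended.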
Footnote 1: An example of reassociation is when a+b+c is rearranged to b+a+c, where a, b, and c are floating-point variables.

Table 2 compares the various optimization levels of the -O fp option (levels 2 and 3 are usually the same). The table lists some of the optimizations performed; the compiler may perform other optimizations not listed.

If multiple -O fp options are used, the compiler will use only the rightmost option and will issue a message indicating such.

Table 2. Floating-point Optimization Levels

Optimization Type | 0 | 1 | 2 (default) | 3
Inline selected mathematical library functions | N/A | N/A | N/A | Accuracy is slightly reduced
Complex divisions | Accurate and slower | Accurate and slower | Less accurate (less precision) and faster | Less accurate (less precision) and faster
Exponentiation rewrite | None | None | Maximum performance (note 2) | Maximum performance (notes 2 and 3)
Strength reduction | Fast | Fast | Aggressive | Aggressive
Rewrite division as reciprocal equivalent (note 4) | None | None | Yes | Yes
Floating-point reductions | Slow | Fast | Fast | Fast
Safety | Maximum | Moderate | Moderate | Low

Note 2: Rewriting values raised to a constant power into an algebraically equivalent series of multiplications and/or square roots.
Note 3: Rewriting exponentiations (a raised to the power b) not previously optimized into the algebraically equivalent form exp(b * ln(a)).
Note 4: For example, x/y is transformed to x * 1.0/y.

3.19.6 -O fusionn

The -O fusionn option globally controls loop fusion and changes the assertiveness of the FUSION directive. Loop fusion can improve the performance of loops, though in rare cases it may degrade performance.

The n argument allows you to turn loop fusion on or off and determine where fusion should occur. It also affects the assertiveness of the FUSION directive. Use one of these values for n:

0    No fusion (ignore all FUSION directives and do not attempt to fuse other loops)

1    Attempt to fuse loops that are marked by the FUSION directive.
2 (default)    Attempt to fuse all loops (includes array syntax implied loops), except those marked with the NOFUSION directive.

For more information about loop fusion, see Optimizing Applications on Cray X1 Series Systems and Optimizing Applications on Cray X2 Systems.

3.19.7 -Ogcpn

The -Ogcpn option enables or disables global constant propagation, where the value of n toggles the optimization on (1) or off (0). This optimization is off by default.

Global constant propagation is an interprocedural optimization that replaces statically initialized variables with constants. For this optimization to work, the entire executable program must be presented to the compiler at once, which requires a large amount of memory and can significantly increase compile time. If the entire executable is not presented at once, the optimization fails. Messages are issued that indicate dead ends in the call graph.

This option can be used in conjunction with the -Oipafrom= option. For example:

    % ftn -Oipafrom=ipa.f -Ogcp1 t.f

When using the -Oipafrom= command line option as shown above, the compiler will only look in ipa.f for routine definitions to use during interprocedural analysis. To also consider t.f for interprocedural analysis, enter the following command:

    % ftn -Oipafrom=t.f:ipa.f -Ogcp1 t.f

Note: Only routines in t.f will actually get linked into the executable. For a routine to be linked into an executable, it must be input to the compile step.

Warning: Duplicate definitions of a routine in the input to the compiler and in the input to -Oipafrom= must be identical or the behavior of the generated code is unpredictable.

3.19.8 -O gen_private_callee (X1 only)

The -O gen_private_callee option is used when compiling source files containing subprograms which will be called from streamed regions, whether those streamed regions are created by Cray streaming directives (CSDs), or by the use of the SSP_PRIVATE directive to cause autostreaming.
For information about CSDs, see Chapter 6, page 143. For information about the SSP_PRIVATE directive, see Section 5.3.2, page 118.

Note: The -h gen_private_callee option is another name for this option.

3.19.9 -O infinitevl, -O noinfinitevl

The -O infinitevl option assumes that the safe vector length is infinite for IVDEP directives without the SAFEVL clause. The -O noinfinitevl option assumes the safe vector length is the maximum vector length supported by the target for IVDEP directives without the SAFEVL or INFINITEVL clause. See Section 5.2.6, page 100 for more information about the INFINITEVL and SAFEVL clauses.

The default is -O infinitevl.

3.19.10 -O ipan and -O ipafrom=source[:source] ...

Inlining is the process of replacing a user procedure call with the procedure definition itself. This saves subprogram call overhead and may allow better optimization of the inlined code. If all calls within a loop are inlined, the loop becomes a candidate for parallelization.

The -O ipan option specifies automatic inlining. Automatic inlining allows the compiler to automatically select, depending on the inlining level n, which functions to inline. Each n specifies a different set of heuristics. When -O ipan is used alone, the candidates for expansion are all those functions that are present in the input file to the compile step. If -O ipan is used in conjunction with -O ipafrom=, the candidates for expansion are those functions present in source. For an explanation of each inlining level, see Table 3, page 47.

The compiler supports the following inlining modes through the indicated options:

• Automatic inlining allows the compiler to automatically select, depending on the selected inlining level, which procedures to inline.
• Explicit inlining allows you to explicitly indicate which procedures the compiler should attempt to inline.
• Combined inlining allows you to specify potential targets for inline expansion, while applying the selected level of inlining heuristics.

Cloning is the attempt to duplicate a procedure under certain conditions and replace dummy arguments with associated constant actual arguments throughout the cloned procedure. The compiler attempts to clone a procedure when a call site contains actual arguments that are scalar integer and/or scalar logical constants. When the constants are exposed to the optimizer, it can generate more efficient code.

Automatic cloning is enabled at -Oipa4 and higher. The compiler first attempts to inline a call site. If inlining the call site fails, the compiler attempts to clone the procedure for the specific call site.

When a clone is made, dummy arguments are replaced with associated constant values throughout the routine. The following example shows cloning in action:

    PROGRAM TEST
    CALL SAM(3, .TRUE.)       ! Call site with constants
    END

    SUBROUTINE SAM(I, L)
    INTEGER I
    LOGICAL L
    IF (L) THEN
       PRINT *, I
    ENDIF
    END

When the previous program is compiled with the -O ipa4 option, the compiler produces the following program:

    PROGRAM TEST
    CALL SAM@1(3, .TRUE.)     ! This is a call to a clone of SAM.
    END

    ! Original Subroutine
    SUBROUTINE SAM(I, L)
    INTEGER I
    LOGICAL L
    IF (L) THEN
       PRINT *, I
    ENDIF
    END

    ! Cloned subroutine
    SUBROUTINE SAM@1(I, L)
    INTEGER I
    LOGICAL L
    IF (.TRUE.) THEN          ! The optimizer will eliminate this IF test
       PRINT *, 3
    ENDIF
    END

3.19.10.1 Automatic Inlining

The -O ipan option allows the compiler to automatically decide which procedures to consider for inlining. Procedures that are potential targets for inline expansion include all the procedures within the input file to the compilation. Table 3 explains what is inlined at each level.

Table 3. Automatic Inlining Specifications

Inlining level   Description

0   All inlining is disabled. All inlining compiler directives are ignored.

1   Directive inlining. Inlining is attempted for call sites and routines that are under the control of an inlining compiler directive. See Chapter 5, page 87 for more information about inlining directives.

2   Call nest inlining. Inline a call nest to an arbitrary depth as long as the nest does not exceed some compiler-determined threshold. A call nest can be a leaf routine. The expansion of the call nest must yield straight-line code (code containing no external calls) for any expansion to occur.

3   Constant actual argument inlining. This includes levels 1 and 2, plus any call site that contains a constant actual argument. This is the default inlining level.

4   Tiny routine inlining. This includes levels 1, 2, and 3, plus the inlining of very small routines regardless of where those routines fall in the call graph. The lower limit threshold is an internal compiler parameter. Routine cloning is attempted if inlining fails at a given call site.

5   Aggressive inlining. Inlining is attempted for every call site encountered. Cray does not recommend using this level. Routine cloning is attempted if inlining fails at a given call site.

3.19.10.2 Explicit Inlining

The -O ipafrom=source[:source] ... option allows you to explicitly indicate the procedures to consider for inline expansion. The source arguments identify each file or directory that contains the routines to consider for inlining. Whenever a call is encountered in the input program that matches a routine in source, inlining is attempted for that call site.

Note: Blanks are not allowed on either side of the equal sign.

All inlining directives are recognized with explicit inlining. For information about inlining directives, see Chapter 5, page 87.

Note that the routines in source are not actually loaded with the final program. They are simply templates for the inliner.
To have a routine contained in source loaded with the program, you must include it in an input file to the compilation.

Use one or more of the objects described in Table 4 in the source argument.

Table 4. File Types

Fortran source files
    The routines in Fortran source files are candidates for inline expansion and must contain error-free code. Source files that are acceptable for inlining are files that have one of the following extensions:
    • .f
    • .F
    • .f90
    • .F90
    • .ftn
    • .FTN

Module files
    When compiling with -em and -Omodinline in effect, the precompiled module information is written to modulename.mod. The compiler writes a modulename.mod file for each module; modulename is created by taking the name of the module and, if necessary, converting it to uppercase. You cannot use the Fortran source of a module procedure as input to the -O ipafrom= option.

dir
    A directory that contains any of the file types described in this table.

3.19.10.3 Combined Inlining

Combined inlining is invoked by specifying the -O ipan and -O ipafrom= options on the command line. This inlining mode looks only in source for potential targets for expansion, while applying the level of inlining heuristics selected by the -O ipan option.

3.19.11 -O inlinelib

The -O inlinelib option causes the compiler to attempt inlining of those Cray scientific library routines that are available for inlining. At present this is a limited subset of the LibSci routines; more inlinable library routines will be added in future releases. For a report of what was or was not inlined, see the -O msgs and -O negmsgs options. This option is off by default.

3.19.12 -O modinline, -O nomodinline

The -O modinline option prepares module procedures so they can be inlined by directing the compiler to create templates for module procedures encountered in a module. These templates are attached to file.o or modulename.mod.
The files that contain these inlinable templates can be saved and used later to inline call sites within a program being compiled.

When -e m is in effect, module information is stored in modulename.mod. The compiler writes a modulename.mod file for each module; modulename is created by taking the name of the module and, if necessary, converting it to uppercase.

The process of inlining module procedures requires only that file.o or modulename.mod be available during compilation through the typical module processing mechanism. The USE statement makes the templates available to the inliner.

When -O modinline is specified, the MODINLINE and NOMODINLINE directives are recognized. Using the -O modinline option increases the size of file.o.

To ensure that file.o is not removed, specify this option in conjunction with the -c option. For information about the -c option, see Section 3.3, page 17.

The default is -O modinline.

3.19.13 -O msgs, -O nomsgs

The -O msgs option causes the compiler to write optimization messages to stderr. These messages include VECTOR, SCALAR, INLINE, IPA, and STREAM (X1 only) messages.

When the -O msgs option is in effect, you may request that a listing be produced so that you can see the optimization messages in the listing. For information about obtaining listings, see Section 3.23, page 64.

The default is -O nomsgs.

3.19.14 -O msp (X1 only)

The -O msp option causes the compiler to generate code and to select the appropriate libraries to create an executable that runs on one or more multistreaming processors (MSPs). This is called MSP mode. Any code, including code that uses Cray distributed memory models, can use MSP mode.

Executables compiled for MSP mode can contain object files compiled in SSP or MSP mode. That is, SSP and MSP object files can be specified during the load step as follows:

    ftn -O msp -c ...             !Produce MSP object files
    ftn -O ssp -c ...             !Produce SSP object files
    ftn sspA.o sspB.o msp.o ...   !Link MSP and SSP object files
                                  !to create an executable to run on MSPs

Note: Code explicitly compiled with the -O stream0 option can be linked with object files compiled with SSP or MSP mode. You can use this option to create a universal library that can be used in SSP or MSP mode.

For more information about SSP and MSP mode, see Section 3.19.21, page 55 and Optimizing Applications on Cray X1 Series Systems.

This option is on by default.

Note: The -h msp option is another name for this option.

3.19.15 -O negmsgs, -O nonegmsgs

The -O negmsgs option causes the compiler to generate messages to stderr that indicate why optimizations such as vectorization, streaming (X1 only), or inlining did not occur in a given instance.

The -O negmsgs option enables the -O msgs option. The -rm option enables the -O negmsgs option.

The default is -O nonegmsgs.

3.19.16 -O nointerchange

The -O nointerchange option inhibits the compiler's attempts to interchange loops. Interchanging loops by having the compiler replace an inner loop with an outer loop can increase performance. The compiler performs this optimization by default.

Specifying the -O nointerchange option is equivalent to specifying a NOINTERCHANGE directive prior to every loop. To disable loop interchange on individual loops, use the NOINTERCHANGE directive. For more information about the NOINTERCHANGE directive, see Section 5.5.1, page 125.

3.19.17 -O overindex, -O nooverindex

The -O nooverindex option declares that there are no array subscripts that index a dimension of an array outside the declared bounds of that dimension. Short loop code generation occurs when the extent does not exceed the maximum vector length of the machine.

Specifying -O overindex declares that the program contains code that makes array references with subscripts that exceed the defined extents.
This prevents the compiler from performing the short loop optimizations described in the preceding paragraph.

Overindexing is nonstandard, but it compiles correctly as long as data dependencies are not hidden from the compiler. This technique collapses loops; that is, it replaces a loop nest with a single loop. An example of this practice is as follows:

    DIMENSION A(20, 20)
    DO I = 1, N
       A(I, 1) = 0.0
    END DO

Assuming that N equals 400 in the previous example, the compiler might generate more efficient code than for a doubly nested loop. However, incorrect results can occur in this case if -O nooverindex is in effect.

You do not need to specify -O overindex if the overindexed array is a Cray pointee, has been equivalenced, or if the extent of the overindexed dimension is declared to be 1 or *. In addition, the -O overindex option is enabled automatically for the following extension code, where the number of subscripts in an array reference is less than the declared number:

    DIMENSION A(20, 20)
    DO I = 1, N
       A(I) = 0.0         ! 1-dimension reference;
                          ! 2-dimension array
    END DO

Note: The -O overindex option is used by the compiler for detection of short loops and subsequent code scheduling. This allows manual overindexing as described in this section, but it may have a negative performance effect because of fewer recognized short loops and more restrictive code scheduling. In addition, the compiler continues to assume, by default, a standard-conforming user program that does not overindex when doing dependency analysis for other loop nest optimizations.

The default is -O nooverindex.

3.19.18 -O pattern, -O nopattern

The -O pattern option enables pattern matching for library substitution. The pattern matching feature searches your code for specific code patterns and replaces them with calls to highly optimized routines.
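As an illustration of the kind of code that pattern matching targets, the following sketch shows a hand-coded matrix multiply loop nest. That this particular nest is among the recognized patterns is an assumption made for illustration; this section does not list the supported patterns. The program name and array names are hypothetical.

```fortran
! Illustrative sketch only: a hand-written matrix multiply loop nest.
! Whether the compiler substitutes a library call for this nest is an
! assumption; the recognized patterns are not listed in this section.
program pattern_demo
  implicit none
  integer, parameter :: n = 4
  real :: a(n,n), b(n,n), c(n,n)
  integer :: i, j, k

  a = 1.0
  b = 2.0
  c = 0.0

  ! Triple loop nest computing c = c + a*b
  do j = 1, n
    do k = 1, n
      do i = 1, n
        c(i,j) = c(i,j) + a(i,k) * b(k,j)
      end do
    end do
  end do

  ! Each element of c is the sum of n products of 1.0 and 2.0
  print *, 'c(1,1) =', c(1,1)
end program pattern_demo
```

The result of the program does not depend on whether the substitution takes place; only the generated code differs.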
The -O pattern option is enabled only for optimization levels -O 2, -O vector2, or higher; there is no way to force pattern matching for lower levels.

Specifying -O nopattern disables pattern matching and causes the compiler to ignore the PATTERN and NOPATTERN directives. For information about the PATTERN and NOPATTERN directives, see Section 5.2.8, page 102.

The default is -O pattern.

3.19.19 -O scalarn

The -O scalarn option specifies these levels of scalar optimization:

• scalar0 disables scalar optimization. Characteristics include low compile time and size.
  The -O scalar0 option is compatible with -O task0 or -O task1 and with -O vector0.

• scalar1 specifies conservative scalar optimization. Characteristics include moderate compile time and size. Results can differ from the results obtained when -O scalar0 is specified because of operator reassociation. No optimizations are performed that could create false exceptions.
  The -O scalar1 option is compatible with -O vector0 or -O vector1, with -O task0 or -O task1, and with -O stream0 (X1 only) or -O stream1 (X1 only).

• scalar2 specifies moderate scalar optimization. Characteristics include moderate compile time and size. Results can differ slightly from the results obtained when -O scalar1 is specified because of possible changes in loop nest restructuring. Generally, no optimizations are done that could create false exceptions.
  The -O scalar2 option is compatible with all vectorization, multistreaming, and tasking levels.
  This is the default scalar optimization level.

• scalar3 specifies aggressive scalar optimization. Characteristics include potentially greater compile time and size. Results can differ from the results obtained when -O scalar1 is specified because of possible changes in loop nest restructuring. The optimization techniques used can create false exceptions in rare instances. Analysis that determines whether a variable is used before it is defined is enabled at this level.
  The -O scalar3 option is compatible with all tasking and vectorization levels.

3.19.20 -O shortcircuitn

The -O shortcircuitn option specifies various levels of short circuit evaluation. Short circuit evaluation is an optimization in which the compiler analyzes all or part of a logical expression based on the results of a preliminary analysis. When short circuiting is enabled, the compiler attempts short circuit evaluation of logical expressions that are used in IF statement scalar logical expressions. This evaluation is performed on the .AND. operator and the .OR. operator.

Example 1: Assume the following logical expression:

    operand1 .AND. operand2

The operand2 need not be evaluated if operand1 is false because in that case, the entire expression evaluates to false. Likewise, if operand2 is false, operand1 need not be evaluated.

Example 2: Assume the following logical expression:

    operand1 .OR. operand2

The operand2 need not be evaluated if operand1 is true because in that case, the entire expression evaluates to true. Likewise, if operand2 is true, operand1 need not be evaluated.

The compiler performs short circuit evaluation in a variety of ways, based on the following command line options:

• -O shortcircuit0 disables short circuiting of IF and ELSEIF statement logical conditions.

• -O shortcircuit1 specifies short circuiting of IF and ELSEIF logical conditions only when a PRESENT, ALLOCATED, or ASSOCIATED intrinsic procedure is in the condition.
  The short circuiting is performed left to right. In other words, the left operand is evaluated first, and if it determines the value of the operation, the right operand is not evaluated. The following code segment shows how this option could be used:

    SUBROUTINE SUB(A)
    INTEGER,OPTIONAL::A
    IF (PRESENT(A) .AND. A==0) THEN
    ...

  The expression A==0 must not be evaluated if A is not PRESENT. The short circuiting performed when -O shortcircuit1 is in effect causes PRESENT(A) to be evaluated first. If that is false, A==0 is not evaluated. If -O shortcircuit1 is in effect, the preceding example is equivalent to the following example:

    SUBROUTINE SUB(A)
    INTEGER,OPTIONAL::A
    IF (PRESENT(A)) THEN
       IF (A==0) THEN
    ...

• -O shortcircuit2 specifies short circuiting of IF and ELSEIF logical conditions, and it is done left to right. All .AND. and .OR. operators in these expressions are evaluated in this way. The left operand is evaluated, and if it determines the result of the operation, the right operand is not evaluated.

• -O shortcircuit3 specifies short circuiting of IF and ELSEIF logical conditions. It is an attempt to avoid making function calls. When this option is in effect, the left and right operands to .AND. and .OR. operators are examined to determine if one or the other contains function calls. If either operand has function calls, short circuit evaluation is performed. The operand that has fewer calls is evaluated first, and if it determines the result of the operation, the remaining operand is not evaluated. If neither operand has calls, no short circuiting is done.
  For the following example, the right operand of .OR. is evaluated first. If A==0, then ifunc() is not called:

    IF (ifunc() == 0 .OR. A==0) THEN
    ...

  -O shortcircuit3 is the default.

3.19.21 -O ssp (X1 only)

The -O ssp option causes the compiler to compile the source code and select the appropriate libraries to create an executable that runs on one single-streaming processor (SSP mode). Any code, including code that uses Cray distributed memory models, can use SSP mode. The executable is scheduled by psched and runs on one SSP on an application node.

Executables compiled for SSP mode can contain only object files compiled in SSP mode.
When loading object files separately from the compile step, the SSP mode must be specified during the load step, as this example shows:

    ftn -O ssp -c ...              !Produce SSP object files
    ftn -O ssp sspA.o sspB.o ...   !Link SSP object files
                                   !to create an executable to run on a single SSP

Since SSP mode does not use streaming, the compiler automatically specifies the -O stream0 option. This option also causes the compiler to ignore CSDs.

Note: Code explicitly compiled with the -O stream0 option can be linked with object files compiled with SSP or MSP mode. You can use this option to create a universal library that can be used in SSP or MSP mode.

For more information about SSP and MSP mode, see Section 3.19.14, page 50 and Optimizing Applications on Cray X1 Series Systems.

This option is off by default.

Note: The -h ssp option is another name for this option.

3.19.22 -O streamn (X1 only)

The -O streamn option controls the level of multistreaming optimization when multistreaming is enabled. The levels range from no multistreaming optimization, at -O stream0, to aggressive multistreaming optimization, at -O stream3. Generally, vectorized applications that execute on a one-processor system can expect to execute up to four times faster on a processor with multistreaming enabled.

At the default streaming level, -O stream2, the four processors SSP0, SSP1, SSP2, and SSP3 may be used by the code generated by the Fortran compiler. Automatic streaming can be turned off by using the -O stream0 option. This does not mean that SSP1, SSP2, and SSP3 are not used during execution. These processors can still be used at times by the library routines called by the generated code. At times, the library routines may park (suspend) the SSP1, SSP2, and SSP3 processors. These SSPs are not available for other executables while code compiled with the stream0 option enabled is executing.
The MSP optimization levels assume that certain scalar and vectorization optimization levels are also specified. If incompatible optimization levels are specified, the compiler adjusts the optimization levels used and issues a message.

The various MSP optimization levels and their compatibilities with other optimizations are as follows:

• -O stream0 inhibits automatic MSP optimizations. No MSP directives are recognized. The -O stream0 option is compatible with all vectorization and scalar optimization levels.

• -O stream1 is the same as -O stream2, except that stream consolidation is not done. Stream consolidation is a compiler optimization that attempts to minimize the synchronization cost of streaming.

• -O stream2 specifies safe MSP optimization. The compiler recognizes MSP directives. The compiler automatically performs MSP optimizations on loop nests and appropriate BMM operations. The -O stream2 option is compatible with -O scalar2, -O scalar3, -O vector2, and -O vector3. This is the default streaming level.

• -O stream3 specifies aggressive MSP optimization on all code including appropriate BMM operations. The compiler recognizes MSP directives. The -O stream3 option is compatible with -O scalar2, -O scalar3, -O vector2, and -O vector3.

For information about MSP directives, see Section 5.3, page 117. For information about optimizing with MSP, see Optimizing Applications on Cray X1 Series Systems. For more information about the effects the streaming option has on BMM operators, refer to the bmm man page.

3.19.23 -O task0, -O task1

The -O task0 option causes the compiler to ignore OpenMP directives. Characteristics of this option include reduced compile time and size.

The -O task0 option is compatible with all vectorization and scalar optimization levels.

The -O task1 option causes the compiler to recognize OpenMP directives.
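The difference between the two tasking levels can be illustrated with a small loop carrying an OpenMP directive. This is a minimal sketch; the program and variable names are hypothetical. OpenMP directives in free-form source are written as comment lines beginning with !$OMP, so when they are ignored the loop is compiled as ordinary serial code.

```fortran
! Illustrative sketch: a reduction loop marked with an OpenMP directive.
! With -O task0 the !$OMP lines are treated as comments and the loop
! runs serially; with -O task1 the directives are recognized.
program task_demo
  implicit none
  integer, parameter :: n = 1000
  real :: x(n), total
  integer :: i

  x = 1.0
  total = 0.0

!$OMP PARALLEL DO REDUCTION(+:total)
  do i = 1, n
    total = total + x(i)
  end do
!$OMP END PARALLEL DO

  ! The sum of n ones is the same in the serial and parallel cases
  print *, 'total =', total
end program task_demo
```

Because the directive only changes how the loop is executed, the program computes the same sum at either tasking level.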
The -O task1 option is compatible with all vectorization and scalar optimization levels.

The default is -O task1.

3.19.24 -O unrolln

The -O unrolln option globally controls loop unrolling and changes the assertiveness of the UNROLL directive. By default, the compiler attempts to unroll all loops, unless the NOUNROLL directive is specified for a loop. Generally, unrolling loops increases single processor performance at the cost of increased compile time and code size.

The n argument allows you to turn loop unrolling on or off and determine where unrolling should occur. It also affects the assertiveness of the UNROLL directive. Use one of these values for n:

0   No unrolling (ignore all UNROLL directives and do not attempt to unroll other loops).

1   Attempt to unroll loops that are marked by the UNROLL directive. That is, the compiler will unroll the loop if there is proof that the loop will benefit by unrolling.

2 (default)   Attempt to unroll all loops (includes array syntax implied loops), except those marked with the NOUNROLL directive.

For more information about unrolling loops, see Optimizing Applications on Cray X1 Series Systems.

3.19.25 -O vectorn

The -O vectorn option specifies these levels of vectorization:

• -O vector0 specifies very conservative vectorization. Characteristics include low compile time and small compile size.
  The -O vector0 option is compatible with all scalar optimization levels and with task0 or task1.
  Vector code is generated for most array syntax statements but not for user-coded loops.

• -O vector1 specifies conservative vectorization. Characteristics include moderate compile time and size. Loop nests are restructured if scalar level > 0. Only inner loops are vectorized. No vectorizations that might create false exceptions are performed.
  The -O vector1 option is compatible with -O task0 or -O task1 and with -O scalar1, -O scalar2, -O scalar3, or -O stream1 (X1 only).

• -O vector2 specifies moderate vectorization. Characteristics include moderate compile time and size. Loop nests are restructured.
  The -O vector2 option is compatible with -O scalar2 or -O scalar3 and with -O task0, -O task1, -O stream0 (X1 only), -O stream1 (X1 only), and -O stream2 (X1 only).
  This is the default vectorization level.

• -O vector3 specifies aggressive vectorization. Characteristics include potentially high compile time and size. Loop nests are restructured. Vectorizations that might create false exceptions in rare cases may be performed.
  The -O vector3 option is compatible with -O scalar2, -O scalar3, -O stream2 (X1 only), and -O stream3 (X1 only) and with all tasking levels.

3.19.26 -O zeroinc, -O nozeroinc

The -O zeroinc option causes the compiler to assume that a constant increment variable (CIV) can be incremented by zero. A CIV is a variable that is incremented only by a loop invariant value. For example, if a loop contains the statement J = J + K, where K is loop invariant and can be equal to zero, J is a CIV. -O zeroinc can cause less strength reduction to occur in loops that have variable increments.

The default is -O nozeroinc, which means that you must prevent zero incrementing.

3.19.27 -O -h profile_generate

The profile_generate option lets you request that the source code be instrumented for profile information gathering. The compiler inserts calls and data gathering instructions to allow CPAT to gather information about the loops in a compilation unit. To actually get data out of this feature, CPAT must be run on the resulting executable to link in the CPAT data gathering routines. If the executable is not run through CPAT, the inserted code still executes, but the gathered data is not recorded.
See the CPAT manuals for how to extract useful information for this feature.

3.19.28 -O -h profile_data=pgo_opt

The profile_data option instructs the compiler how to treat !PGO$ directives. There are two pgo_opt levels: sample and absolute. The default value is sample.

Sample tells the compiler to treat the !PGO$ directive as information gathered from a sample program. This keeps the compiler from performing unsafe optimizations with the data.

Absolute tells the compiler to treat the !PGO$ directive as representing the only data set that the program will ever see; this is intended for program units that are always called with the same arguments, or for cases where it is known that the data set will not change from the experimental runs.

The new directive !PGO$ loop_info is a special form of the directive !DIR$ loop_info; it tags the information as having come from profiling.

3.20 -o out_file

The -o out_file option overrides the default executable file name, a.out, with the name specified by the out_file argument.

If the -o out_file option is specified on the command line along with the -c option, the load step is disabled and the binary file is written to the out_file specified as an argument to -o. For more information about the -c option, see Section 3.3, page 17.

3.21 -p module_site

The -p module_site option tells the compiler where to look for Fortran modules to satisfy USE statements.

Note: The compiler automatically searches for modules you stored in the directories specified by the -J dir_name option of the current compilation. You do not need to explicitly use the -p option to have the compiler do this. The compiler specifies a -p option with the dir_name path and places it at the end of the command line.

The module_site argument specifies the name of a binary file or directory to search for modules. The module_site specified can be an archive file, build file (bld file), or binary file (.o).
When searching files, the compiler searches files suffixed with .o (file.o) or library files suffixed with .a (lib.a) containing one or more modules. When searching a directory, the compiler searches files in the named directory that are suffixed with .o or .a, or, if the -e m option is specified, the compiler searches .mod files. After searching the directory named in module_site, the compiler searches for modules in the current directory.

File name substitution (such as *.o) is not allowed. If the path name begins with a slash (/), the name is assumed to be an absolute path name. Otherwise, it is assumed to be a path name relative to the working directory. If you need to specify multiple binary files, library files, or directories, you must specify a -p option for each module_site. There is no limit on the number of -p options that you can specify. The compiler searches the binary files, library files, and directories in the order specified.

Cray provides some modules as part of the Cray Fortran Compiler Programming Environment. These are referred to as system modules. The system files that contain these modules are searched last.

Example 1: Consider the following command line:

    % ftn -p steve.o -p mike.o joe.f

Assume that steve.o contains a module called Rock and mike.o contains a module called Stone. A reference to Rock in joe.f causes the compiler to use Rock from steve.o. A reference to Stone in joe.f causes the compiler to use Stone from mike.o.
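For concreteness, the sources behind Example 1 might look like the following sketch. The file names steve.f and mike.f, and the contents of all three files, are hypothetical; the example above states only that steve.o contains module Rock, that mike.o contains module Stone, and that joe.f refers to them.

```fortran
! --- hypothetical steve.f, compiled beforehand to produce steve.o ---
module Rock
  implicit none
  integer, parameter :: rock_count = 3
end module Rock

! --- hypothetical mike.f, compiled beforehand to produce mike.o ---
module Stone
  implicit none
  real, parameter :: stone_weight = 2.5
end module Stone

! --- hypothetical joe.f: the USE statements are satisfied from the
! --- files named by the -p options, searched in the order given ---
program joe
  use Rock
  use Stone
  implicit none
  print *, 'rock_count   =', rock_count
  print *, 'stone_weight =', stone_weight
end program joe
```

The three program units are shown together for readability; in the scenario of Example 1 each would live in its own file.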
Example 2: The following example specifies binary file murphy.o and library file molly.a:

    % ftn -p murphy.o -p molly.a prog.f

Example 3: In this example, assume that the following directory structure exists in your home directory:

            programs
           /    |    \
       tests  one.f  two.f
         |
      use_it.f

The following module is in file programs/one.f, and the compiled version of it is in programs/one.o:

    MODULE one
    INTEGER i
    END MODULE

The next module is in file programs/two.f, and the compiled version of it is in programs/two.o:

    MODULE two
    INTEGER j
    END MODULE

The following program is in file programs/tests/use_it.f:

    PROGRAM demo
    USE one
    USE two
    . . .
    END PROGRAM

To compile use_it.f, enter the following command from your home directory, which contains the subdirectory programs:

    % ftn -p programs programs/tests/use_it.f

Example 4: In the next set of program units, a module is contained within the first program unit and accessed by more than one program unit. The first file, progone.f, contains the following code:

    MODULE split
    INTEGER k
    REAL a
    END MODULE

    PROGRAM demopr
    USE split
    INTEGER j
    j = 3
    k = 1
    a = 2.0
    CALL suba(j)
    PRINT *, 'j=', j
    PRINT *, 'k=', k
    PRINT *, 'a=', a
    END

The second file, progtwo.f, contains the following code:

    SUBROUTINE suba(l)
    USE split
    INTEGER l
    l = 4
    k = 5
    CALL subb(l)
    RETURN
    END

    SUBROUTINE subb(m)
    USE split
    INTEGER m
    m = 6
    a = 7.0
    RETURN
    END

Use the following command line to compile the two files with one ftn command and a relative pathname:

    % ftn -p progone.o progone.f progtwo.f

When the -e m option is in effect, you can use the -p module_site option to specify one or more directories that contain module files rather than specifying every individual module file name.

3.22 -Q path

The -Q option specifies the directory that will contain all saved nontemporary files from this compilation (for example, all .o and .mod files).
Specific file types (like .o files) are saved to a different directory if the -b, -J, -o, or -S option is specified. The following examples use this directory structure:

               current_dir
      --------------------------
      |            |           |
      |            |           |
      |            |           |
   bin_out      mod_out     all_out

The following example saves all nontemporary files (x.o and any .mod files) in the current directory:

    % ftn -b x.o -em x.f90

The following example saves all nontemporary files in the all_out directory and x.o in the current directory:

    % ftn -Q all_out -em -b x.o x.f90

The following example saves the x.o file to the bin_out directory and all .mod files to the all_out directory:

    % ftn -Q all_out -b bin_out/x.o -em x.f90

The following example saves the a.out file to the all_out directory and all .mod files to the mod_out directory:

    % ftn -Q all_out -J mod_out x.f90

3.23 -r list_opt

The -r list_opt option generates a listing. The list_opt argument produces listings with commonly needed information.

If one or more input files are specified on the compiler command line, the listing is placed in file.lst.

If the -C option is specified with the -r list_opt option, the -C option is overridden and a warning message is generated.

The arguments for list_opt are shown below.

Note: Options a, c, l, m, o, s, and x invoke the ftnlx command. Option d provides a decompiled listing and is not CIF based. Option T retains the CIF. Options b, e, p, and w change the appearance of the listing produced by ftnlx.

list_opt   Listing type

-r a   Includes all reports in the listing (including source, cross references, lint, loopmarks, common block, and options used during compilation). For more information about loopmarks, see Optimizing Applications on Cray X1 Series Systems.

-r b   Adds page breaks and headers to the listing report.

-r c   Listing includes a report of all COMMON blocks and all members of each common block. It also shows the program units that use the COMMON blocks.
-r d    Decompiles (translates) the intermediate representation of the compiler into listings that resemble the format of the source code. This is performed twice, at different points during the optimization process, resulting in two output files. You can use these files to examine the restructuring and optimization changes made by the compiler, which can lead to insights about changes you can make to your Fortran source to improve its performance.

The compiler produces two decompilation listing files per specified source file, with the extensions .opt and .cg.

The compiler generates the .opt file after applying most high level loop nest transformations to the code. The code structure of this listing most resembles your Fortran code and is readable by most users. In some cases, because of optimizations, the structure of the loops and conditionals will be significantly different from the structure in your source file.

The .cg file contains a much lower level of decompilation. It is still displayed in a Fortran-like format, but is quite close to what will be produced as assembly output. This version displays the intermediate text after all multistreaming translation (X1 only), vector translation, and other optimizations have been performed. An intimate knowledge of the hardware architecture of the system is helpful to understanding this listing.

The .opt and .cg files are intended as a tool for performance analysis, and are not valid Fortran source code. The format and contents of the files can be expected to change from release to release.

Note: The column of numbers on the left-hand side of the .opt and .cg files refers to the line number in the Fortran source file.

The following examples (for the X2) show the listings generated when -rd is applied to this example:
      !Source code, in file example.f:
      subroutine example( a, b, c )
      real a(*), b(*), c(*)
      do i = 1,100
         a(i) = b(i) * c(i)
      enddo
      end

Enter the following command:

    % ftn -c -rd example.f

This is the listing of the example.opt file after loop optimizations are performed:

    1.  subroutine example( a, b, c )
    3.  $Induc01_N4 = 0
    3.  !dir$ ivdep
    3.  do
    4.     A(1 + $Induc01_N4) = C(1 + $Induc01_N4) * B(1 +
    4.  .  $Induc01_N4)
    5.     $Induc01_N4 = 1 + $Induc01_N4
    3.     if ( $Induc01_N4 >= 100 ) exit
    3.  enddo
    6.  return
    6.  end

This is the listing of the example.cg file after other optimizations are performed:

    1.  subroutine example( a, b, c )
    3.  ! === Begin Short Vector Loop ===
    4.  0[loc( A ):100:1] = 0[loc( B ):100:1] * 0[loc( C ):100:1]
    3.  ! === End Short Vector Loop ===
    6.  return
    6.  end

Note: The entire subroutine is multistreamed.

-r e    Expands included files in the source listing. This option is off by default.

-r l    Lists source code and includes lint style checking. The listing includes the COMMON block report (see the -r c option for more information about the COMMON block report).

-r m    Produces a source listing with loopmark information. To provide a more complete report, this option automatically enables the -O negmsg option to show why loops were not optimized. If you do not require this information, use the -O nonegmsg option on the same command line. Loopmark information will not be displayed if the -d B option has been specified.

-r o    Shows in the list file all options used by the compiler at compile time.

-r s    Lists source code and messages. Error and warning messages are interspersed with the source lines. Optimization messages appear after each program unit. Produces 80-column output by default.

-r T    Retains file.T after processing rather than deleting it. This option may be specified in addition to any of the other options. For more information about file.T, see the -C option.
-r w    Produces 132-column output, which, when specified in conjunction with -r s or -r x, overrides the 80-column output that those options produce by default. You can specify -r w in conjunction with either the -r s option or the -r x option. Specifying -r w in conjunction with any other -r listing option generates a warning message.

-r x    Generates a cross-reference listing. Produces 80-column output by default.

3.24 -R runchk

The -R runchk option lets you specify any of a group of run-time checks for your program. To specify more than one type of checking, specify consecutive runchk arguments, such as: -R ab.

Note: Performance is degraded when run-time checking is enabled. This capability, though useful for debugging, is not recommended for production runs.

The run-time checks available are as follows:

runchk    Checking performed

a    Compares the number and types of arguments passed to a procedure with the number and types expected.

     Note: When -R a is specified, some pattern matching may be lost because some of the library calls typically found in the generated code may not be present. This occurs when -R a is specified in conjunction with one of the following other options: -O 2 (the default optimization level), -O 3, -O ipa2, -O ipa3, -O ipa4, or -O ipa5.

b    Enables checking of array bounds. If a problem is detected at run time, a message is issued but execution continues. The NOBOUNDS directive overrides this option. For more information about NOBOUNDS, see Section 5.6.1, page 130.

     Note: Bounds checking behavior differs with the optimization level. At the default optimization level, -O 2, some run-time checking is inhibited. Complete checking is guaranteed only when optimization is turned off by specifying -O 0 on the ftn command line.

c    Enables conformance checking of array operands in array expressions.
     Even without the -R option, such checking is performed during compilation when the dimensions of array operands can be determined.

C    Passes a descriptor for the actual arguments as an extra argument to the called routine and sets a flag to signal the called routine that this descriptor is included.

d    Enables directive checking at run time. Errors detected at compile time are reported during compilation and so are not reported at run time. The following directives are checked: collapse, shortloop, shortloop128, and the loop_info clauses min_trips and max_trips. Violation of a run-time check results in an immediate fatal error diagnostic.

E    Creates a descriptor for the dummy arguments at each entry point and tests the flag from the caller to see if argument checking should be performed. If the flag is set, the argument checking is done.

M msgnum[,msgnum]...
     Suppresses one or more specific run-time argument checking messages. This suboption cannot be specified along with any other -R options. For example, if you want to specify -Ra and -RM, you must specify them as two separate options to the ftn command, as follows:

         ftn -RM1640 -Ra otter.f

     You can use a comma to separate multiple message numbers. In the following example, run-time argument checking is enabled, but messages 1953 and 1946 are suppressed:

         ftn -Ra -RM1953,1946 raccoon.f

n    Compares the number of arguments passed to a procedure with the number expected. Does not make comparisons with regard to argument data type (see -R a).

p    Generates run-time code to check the association or allocation status of referenced POINTER variables, ALLOCATABLE arrays, or assumed-shape arrays. A warning message is issued at run time for references to disassociated pointers, unallocated allocatable arrays, or assumed-shape dummy arguments that are associated with a pointer or allocatable actual argument when the actual argument is not associated or allocated.
s    Enables checking of character substring bounds. This option behaves similarly to option -R b.

     Note: Bounds checking behavior differs with the optimization level. At the default optimization level, -O 2, some run-time checking is inhibited. Complete checking is guaranteed only when optimization is turned off by specifying -O 0 on the ftn command line.

If argument checking is to be done for a particular call, the calling routine must have been compiled with either -R a or -R C and the called routine must have been compiled with either -R a or -R E. -R a is equivalent to -R CE.

The separation of -R a into -R C and -R E allows some control over which calls are checked. Libraries can be compiled with -R E. If the program that is calling the libraries is compiled with either -R a or -R C, library calls are checked. If the calling routines are not compiled with -R a or -R C, no checking occurs.

Slight overhead is added to each entry sequence compiled with -R E or -R a and to each call site compiled with -R C or -R a. If a call site passes the extra information to an entry that is compiled to perform checking, the checking itself costs a few thousand clock periods per call. This cost depends on the number of arguments at the call.

Some nonstandard code behaves differently when argument checking is used. Different behavior can include run-time aborts or changed results. The following example illustrates this:

      CALL SUB1(10,15)
      CALL SUB1(10)
      END
      SUBROUTINE SUB1(I,K)
      PRINT *,I,K
      END

Without argument checking, if the two calls in this example share the same stack space for arguments, subroutine SUB1 prints the values 10 and 15 for both calls. However, with argument checking enabled, an extra argument is added to the argument list, overwriting any previous information that was there. In this case, the second call to SUB1 prints 10, followed by an incorrect value.
If full argument checking is enabled by -R a, a message reporting the mismatch in the number of arguments is issued. This problem occurs only with nonstandard code in which the numbers of actual and dummy arguments do not match.

3.25 -s size

The -s size option allows you to modify the sizes of variables, literal constants, and intrinsic function results declared as type REAL, INTEGER, LOGICAL, COMPLEX, DOUBLE COMPLEX, or DOUBLE PRECISION. Use one of these for size:

size    Action

byte_pointer
     (Default) Applies a byte scaling factor to integers used in pointer arithmetic involving Cray pointers. That is, Cray pointers are moved on byte instead of word boundaries. Pointer arithmetic scaling is explained in Section 3.25.2, page 74.

default32
     (Default) Adjusts the data size of default types as follows:
     • 32 bits: REAL, INTEGER, LOGICAL
     • 64 bits: COMPLEX, DOUBLE PRECISION
     • 128 bits: DOUBLE COMPLEX

     Note: The data sizes of integers and logicals that use explicit kind and star values are not affected by this option. However, they are affected by the -e h option. See Section 3.5, page 18.

default64
     Adjusts the data size of default types as follows:
     • 64 bits: REAL, INTEGER, LOGICAL
     • 128 bits: COMPLEX, DOUBLE PRECISION
     • 256 bits: DOUBLE COMPLEX

     If you used the -s default64 option at compile time, you must also specify this option when invoking the ftn command to call the loader.

     Note: The data sizes of integers and logicals that use explicit kind and star values are not affected by this option. However, they are affected by the -e h option. See Section 3.5, page 18.

integer32
     (Default) Adjusts the default data size of default integers and logicals to 32 bits.

integer64
     Adjusts the default data size of default integers and logicals to 64 bits.
real32
     (Default) Adjusts the default data size of default real types as follows:
     • 32 bits: REAL
     • 64 bits: COMPLEX and DOUBLE PRECISION
     • 128 bits: DOUBLE COMPLEX

real64
     Adjusts the default data size of default real types as follows:
     • 64 bits: REAL
     • 128 bits: COMPLEX and DOUBLE PRECISION
     • 256 bits: DOUBLE COMPLEX

word_pointer
     Applies a word scaling factor to integers used in pointer arithmetic involving Cray pointers. That is, Cray pointers are moved on word instead of byte boundaries. Pointer arithmetic scaling is explained later in Section 3.25.2, page 74.

The default data size options (for example, -s default64) do not affect the size of data whose declaration explicitly specifies a size (for example, REAL(KIND=4) R).

3.25.1 Different Default Data Size Options on the Command Line

You must be careful when mixing different default data size options on the same command line because equivalencing data of one default size with data of another default size can cause unexpected results. For example, assume that the following command line is used for a program:

    % ftn -s default64 -s integer32 ...

The mixture of these default size options causes the program below to equivalence 32-bit integer data with 64-bit real data and to incompletely clear the real array.

    Program test
    IMPLICIT NONE
    real r
    integer i
    common /blk/ r(10), i(10)
    integer overlay(10)
    equivalence (overlay, r)
    call clear(overlay)
    call clear(i)
    contains
    subroutine clear(i)
    integer, dimension (10) :: i
    i = 0
    end subroutine
    end program test

The above program sets only the first 10 32-bit words of array r to zero. It should instead set 10 64-bit words to zero.

3.25.2 Pointer Scaling Factor

You can specify that the compiler apply a scaling factor to integers used in pointer arithmetic involving Cray pointers so that the pointer is moved to the proper word or byte boundary.
For example, the compiler views this code statement:

    Cray_ptr = Cray_ptr + integer_value

as

    Cray_ptr = Cray_ptr + (integer_value * scaling_factor)

The scaling factor is dependent on the size of the default integer and which scaling option (-s byte_pointer or -s word_pointer) is enabled.

Table 5. Scaling Factor in Pointer Arithmetic

Scaling Option                              Default Integer Size    Scaling Factor
-s byte_pointer                             32 or 64 bits           1
-s word_pointer and -s default32 enabled    32 bits                 4
-s word_pointer and -s default64 enabled    64 bits                 8

Therefore, when the -s byte_pointer option is enabled, this example increments ptr by i bytes:

    pointer (ptr, ptee) !Cray pointer
    ptr = ptr + i

When the -s word_pointer and -s default32 options are enabled, the same example is viewed by the compiler as:

    ptr = ptr + (4*i)

When the -s word_pointer and -s default64 options are enabled, the same example is viewed by the compiler as:

    ptr = ptr + (8*i)

3.26 -S asm_file

The -S asm_file option specifies the assembly language output file name. When -S asm_file is specified on the command line with either the -e S or -b bin_obj_file options, the -e S and -b bin_obj_file options are overridden.

3.27 -T

The -T option disables the compiler but displays all options currently in effect. The Cray Fortran compiler generates information identical to that generated when the -v option is specified on the command line; when -T is specified, however, no processing is performed. When this option is specified, output is written to the standard error file (stderr).

3.28 -U identifier [,identifier] ...

The -U identifier [,identifier] ... option undefines variables used for source preprocessing. This option removes the initial definition of a predefined macro or sets a user predefined macro to an undefined state.

The -D identifier [=value] option defines variables used for source preprocessing.
If both -D and -U are used for the same identifier, in any order, the identifier is undefined. For more information about the -D option, see Section 3.6, page 26.

This option is ignored unless one of the following conditions is true:

• The Fortran input source file is specified as file.F, file.F90, or file.FTN.
• The -e P or -e Z options have been specified.

For more information about source preprocessing, see Chapter 7, page 157.

3.29 -v

The -v option sends compilation information to the standard error file (stderr). The information generated indicates the compilation phases as they occur and all options and arguments being passed to each processing phase.

3.30 -V

The -V option displays to the standard error file (stderr) the release version of the ftn command. Unlike all other command-line options, you can specify this option without specifying an input file name; that is, specifying ftn -V is valid.

3.31 -Wa"assembler_opt"

The -Wa"assembler_opt" option passes assembler_opt directly to the assembler. For example, -Wa"-h" passes the -h option directly to the as command, directing it to enable all pseudos, regardless of location field name. This option is meaningful to the system only when file.s is specified as an input file on the command line. For more information about assembler options, see the as(1) man page.

3.32 -Wl"loader_opt"

The -Wl"loader_opt" option passes loader_opt directly to the loader. For example, specifying -Wl"-m" passes the argument -m directly to the loader's -m option. For more information about loader options, see the ld(1) man page.

Note: Cray recommends that you use the compiler to invoke the loader, because the compiler calls the loader with the appropriate default libraries. The appropriate default libraries may change from release to release.

3.33 -Wr"lister_opt"

The -Wr"lister_opt" option passes lister_opt directly to the ftnlx command.
For example, specifying -Wr"-o cfile.o" passes the argument cfile.o directly to the ftnlx command's -o option; this directs ftnlx to override the default output listing and put the output file in cfile.o. If you specify the -Wr"lister_opt" option, you must specify the -r list_opt option. For more information about options, see the ftnlx man page.

3.34 -x dirlist

The -x dirlist option disables specified directives or specified classes of directives. If specifying a multiword directive, either enclose the directive name in quotation marks or remove the spaces between the words in the directive's name.

For dirlist, enter one of the following arguments:

dirlist    Item disabled

all    All compiler directives, OpenMP Fortran directives, and CSDs. For information about the OpenMP directives or CSDs see Chapter 8, page 167 or Chapter 6, page 143 respectively.

csd    All CSDs. See Chapter 6, page 143.

dir    All compiler directives.

directive
       One or more compiler directives or OpenMP Fortran directives. If specifying more than one, separate them with commas; for example: -x INLINEALWAYS,"NO SIDE EFFECTS",BOUNDS.

omp    All OpenMP Fortran directives.

conditional_omp
       All C$ and !$ conditional compilation lines.

3.35 -X npes

The -X npes option specifies the number of processing elements (PEs) to use during execution. The value for npes ranges from 1 through 4096 inclusive.

Note: (X1 only) Programs compiled with the -X option can be executed without using the aprun command. If this command is used for these programs, you must specify to this command the same number of processors (npes) specified at compile time.

N$PES is a special symbol whose value is equal to the number of PEs available to your program. When the -X npes option is specified at compile time, the N$PES constant is replaced by integer value npes.
The N$PES constant can be used only in either of these situations:

• The -X npes option is specified on the command line, or
• The value of the expression containing the N$PES constant is not known until run time (that is, it can only be used in run-time expressions)

One of the many uses for the N$PES symbol is illustrated in the following example, which declares the size of an array within a subroutine to be dependent upon the number of processors:

      SUBROUTINE WORK
      DIMENSION A(N$PES)

Using the N$PES symbol in conjunction with the -X npes option allows the programmer to program the number of PEs into a program in places that do not accept run-time values. Specifying the number of PEs at compile time can also enhance compiler optimization.

3.36 -Yphase,dirname

The -Yphase,dirname option specifies a new directory (dirname) from which the designated phase should be executed. phase can be one or more of the values shown in Table 6.

Table 6. -Yphase Definitions

phase    System phase    Command
0        Compiler        ftn
a        Assembler       as
l        Loader          ld

3.37 -Z

The -Z option enables the compiler to recognize co-array syntax. Co-arrays are a syntactic extension to the Fortran language that offers a method for performing data passing. (Co-arrays are discussed in detail in Chapter 10.)

Data passing is an effective method for programming single-program-multiple-data (SPMD) parallel computations. Its chief advantages over message passing are lower latency and higher bandwidth for data transfers, both of which lead to improved scalability for parallel applications. Compared to MPI and SHMEM, co-arrays provide enhanced readability and, thus, increased programmer productivity. As a language extension, the code can also be conditionally analyzed and optimized by the compiler.

3.38 --

The -- symbol signifies the end of options. After this symbol, you can specify files to be processed. This symbol is optional.
It may be useful if your input file names begin with one or more dash (-) characters.

3.39 sourcefile[sourcefile.suffix ...]

The sourcefile[sourcefile.suffix ...] option names the file or files to be processed. The file suffixes indicate the content of each file and determine whether the preprocessor, compiler, assembler, or loader will be invoked.

Preprocessor
     Files having the .F, .F90, or .FTN suffix invoke the preprocessor.

Compiler
     Fortran source files having the following suffixes invoke the compiler:
     • .f or .F indicates a fixed source form file.
     • .f90, .F90, .ftn, or .FTN indicates a free source form file.

     Note: The source form specified on the -f source_form option overrides the source form implied by the file suffixes.

Loader
     Files with a .o extension (object files) invoke the loader. If only one source file is specified on the command line, the .o file is created and deleted. To retain the .o file, use the -c option to disable the loader.

You can specify object files produced by the Cray Fortran, C, C++, or assembler compilers. Object files are passed to the loader in the order in which they appear on the ftn command line. If the loader is disabled by the -b or -c option, no files are passed to the loader.

The loader allows other file types. See the -e m option in the ld man page for more information about these files.

Environment Variables [4]

Environment variables are predefined shell variables, taken from the execution environment, that determine some of your shell characteristics. Several environment variables pertain to the Cray Fortran compiler. The Cray Fortran compiler recognizes general and multiprocessing environment variables.

The multiprocessing variables in the following sections affect the way your program will perform on multiple processors. Using environment variables lets you tune the system for parallel processing without rebuilding libraries or other system software.
The variables allow you to control parallel processing at compile time and at run time.

Compile time environment variables apply to all compilations in a session. The following examples show how to set an environment variable:

• With the standard shell, enter:

      CRAY_FTN_OPTIONS=options
      export CRAY_FTN_OPTIONS

• With the C shell, enter:

      setenv CRAY_FTN_OPTIONS options

The following sections describe the environment variables recognized by the Cray Fortran compiler.

Note: Many of the environment variables described in this chapter refer to the default system locations of Programming Environment components. If the Cray Fortran Compiler Programming Environment has been installed in a nondefault location, see your system support staff for path information.

4.1 Compiler and Library Environment Variables

The variables described in the following subsections allow you to control parallel processing at compile time.

4.1.1 CRAY_FTN_OPTIONS Environment Variable

The CRAY_FTN_OPTIONS environment variable specifies additional options to attach to the command line. These options follow the options specified directly on the command line; they are inserted at the right-most portion of the command line, before the input files and binary files are listed. The variable cannot contain file names. This allows you to set the environment variable once and have the specified set of options used in all compilations. This is especially useful for adding options to compilations done with build tools.

For example, assume that this environment variable was set as follows:

    setenv CRAY_FTN_OPTIONS -G0

With the variable set, the following two command line specifications are equivalent:

    % ftn -c t.f
    % ftn -c -G0 t.f

4.1.2 CRAY_PE_TARGET Environment Variable

The CRAY_PE_TARGET environment variable specifies the target_system for compilation. The command line option -h cpu=target_system takes precedence over the CRAY_PE_TARGET setting.
The acceptable values for CRAY_PE_TARGET currently are cray-x1, cray-x1e, and cray-x2.

Note: Currently, there are no differences in the code produced for the cray-x1 and cray-x1e targets. This option was created to allow Cray to support future changes in optimization and code generation based on experience with the Cray X1E and future hardware platforms. It is possible that compilations with the -h cpu=cray-x1e option will not be compatible with Cray X1 machines in future releases.

4.1.3 FORMAT_TYPE_CHECKING Environment Variable

The FORMAT_TYPE_CHECKING environment variable specifies various levels of conformance between the data type of each I/O list item and the formatted data edit descriptor.

When set to RELAXED, the run-time I/O library enforces limited conformance between the data type of each I/O list item and the formatted data edit descriptor.

When set to STRICT77, the run-time I/O library enforces strict FORTRAN 77 conformance between the data type of each I/O list item and the formatted data edit descriptor.

When set to STRICT90 or STRICT95, the run-time I/O library enforces strict Fortran 90/95 conformance between the data type of each I/O list item and the formatted data edit descriptor.

See the following tables: Table 17, page 202, Table 18, page 203, Table 19, page 203, and Table 20, page 203.

4.1.4 FORTRAN_MODULE_PATH Environment Variable

Like the Cray Fortran compiler -p module_site command line option, this environment variable allows you to specify the files or the directory to search for the modules to use. The files can be archive files, build files (bld file), or binary files. The compiler appends the paths specified by the FORTRAN_MODULE_PATH environment variable to the path specified by the -p module_site command line option.
Since the FORTRAN_MODULE_PATH environment variable can specify multiple files and directories, a colon separates each path as shown in the following example:

    % set FORTRAN_MODULE_PATH='path1 : path2 : path3'

4.1.5 LISTIO_PRECISION Environment Variable

The LISTIO_PRECISION environment variable controls the number of digits of precision printed by list-directed output. The LISTIO_PRECISION environment variable can be set to FULL or PRECISION.

• FULL prints full precision (default).
• PRECISION prints x or x + 1 decimal digits, where x is the value of the PRECISION intrinsic function for a given real value. This is a smaller number of digits, which usually ensures that the last decimal digit is accurate to within 1 unit. This number of digits is usually insufficient to assure that subsequent input will restore a bit-identical floating-point value.

4.1.6 NLSPATH Environment Variable

The NLSPATH environment variable specifies the message system library catalog path. This environment variable affects compiler interactions with the message system. For more information about this environment variable, see catopen(3).

4.1.7 NPROC Environment Variable

The NPROC environment variable specifies the maximum number of processes to be run. Setting NPROC to a number other than 1 can speed up a compilation if machine resources permit.

The effect of NPROC is seen at compilation time, not at execution time. NPROC requests a number of compilations to be done in parallel. It affects all the compilers and also make.

For example, assume that NPROC is set as follows:

    setenv NPROC 2

The following command is entered:

    ftn -o t main.f sub.f

In this example, the compilations from .f files to .o files for main.f and sub.f happen in parallel, and when both are done, the load step is performed. If NPROC is unset, or set to 1, main.f is compiled to main.o, sub.f is compiled to sub.o, and then the link step is performed.
You can set NPROC to any value, but large values can overload the system. For debugging purposes, NPROC should be set to 1. By default, NPROC is 1.

4.1.8 TMPDIR Environment Variable

The TMPDIR environment variable specifies the directory containing the compiler temporary files. The location of the directory is defined by your administrator and cannot be changed.

4.1.9 ZERO_WIDTH_PRECISION Environment Variable

The ZERO_WIDTH_PRECISION environment variable controls the field width when field width w of Fw.d is zero on output. The ZERO_WIDTH_PRECISION environment variable can be set to PRECISION or HALF.

• PRECISION specifies that full precision will be written. This is the default.
• HALF specifies that half of the full precision will be written.

4.2 OpenMP Environment Variable

OMP_THREAD_STACK_SIZE is a Cray specific OpenMP environment variable that affects programs at run time. It changes the size of the thread stack from the default size of 16 MB to the specified size. The size of the thread stack should be increased when private variables may utilize more than 16 MB of memory.

(X1 only) The requested thread stack space is allocated from the local heap when the threads are created. The amount of space used by each thread for thread stacks depends on whether you are using MSP or SSP mode. In MSP mode, the memory used is 5 times the specified thread stack size because each SSP is assigned one thread stack and one thread stack is used as the MSP common stack. For SSP mode, the memory used is one times the specified thread stack size.

(X1 only) Since memory is allocated from the local heap, you may want to consider how increasing the size of the thread stacks will affect available space in the local heap. To adjust the size of the local heap, see the X1_HEAP_SIZE and X1_LOCAL_HEAP_SIZE environment variables in the memory(7) man page.
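As a rough illustration of the X1 memory figures above, the following standard-shell sketch works out the per-thread stack memory for an assumed request of 32 MB. The 32 MB figure and the variable names stack_bytes, msp_bytes, and ssp_bytes are hypothetical and chosen only for this example. The sketch also assumes the value of OMP_THREAD_STACK_SIZE is given in bytes.

```shell
#!/bin/sh
# Hypothetical request: a 32 MB thread stack (illustrative, not a recommendation).
stack_bytes=$((32 * 1024 * 1024))

# (X1 only) MSP mode uses 5 times the requested size per thread:
# one stack for each SSP plus one MSP common stack.
msp_bytes=$((5 * stack_bytes))

# (X1 only) SSP mode uses one times the requested size per thread.
ssp_bytes=$((1 * stack_bytes))

echo "requested thread stack: $stack_bytes bytes"
echo "MSP mode, per thread:   $msp_bytes bytes"
echo "SSP mode, per thread:   $ssp_bytes bytes"

# Standard-shell form for setting the variable before running the program.
OMP_THREAD_STACK_SIZE=$stack_bytes
export OMP_THREAD_STACK_SIZE
```

Comparing the MSP figure with the space available in the local heap gives a quick estimate of whether the heap also needs to grow before the stack size is raised.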
(X2 only) The heaps on X2 do not have to be sized statically as they have to be on the X1 series systems; their sizes are adjusted as needed.

This is the format for the OMP_THREAD_STACK_SIZE environment variable:

    OMP_THREAD_STACK_SIZE n

where n is a hex, octal, or decimal integer specifying the amount of memory, in bytes, to allocate for a thread's stack.

For more information about OpenMP API, see Chapter 8, page 167.

4.3 Run Time Environment Variables

Run time environment variables allow you to adjust the following elements of your run time environment:

• Stack and heap sizes, see the memory(7) man page for more information.
• Default options for automatic aprun, see the CRAY_AUTO_APRUN_OPTIONS environment variable in the aprun(1) man page.
• (X1 only) Dynamic COMMON block, see the X1_DYNAMIC_COMMON_SIZE environment variable in the ld(1) man page.
• The field width w of Fw.d when w is zero on output, refer to the ZERO_WIDTH_PRECISION environment variable in Section 4.1.9, page 85.

Cray Fortran Directives [5]

Directives are lines inserted into source code that specify actions to be performed by the compiler. They are not Fortran statements.

This chapter describes the Cray Fortran compiler directives. If you specify a directive while running on a system that does not support that particular directive, the compiler generates a message and continues with the compilation.

Note: The Cray Fortran compiler also supports the OpenMP Fortran API directives. See Chapter 8, page 167 for more information.

Section 5.1, page 90 describes how to use the directives and the effects they have on programs. Table 7 categorizes the Cray Fortran compiler directives according to purpose and directs you to the pages containing more details.

For more information about optimization, see Optimizing Applications on Cray X1 Series Systems.

Table 7.
Directives Purpose and Name Description Vectorization and tasking: COPY_ASSUMED_SHAPE Copy arrays to temporary storage. For more information, see Section 5.2.4, page 98. HAND_TUNED Assert that the loop has been hand-tuned for maximum performance and restrict automatic compiler optimizations. For more information, see Section 5.2.5, page 100. IVDEP Ignore loop vector-dependencies that a loop might have. For more information, see Section 5.2.6, page 100. NEXTSCALAR Disable loop vectorization. For more information, see Section 5.2.7, page 101. PATTERN, NOPATTERN Replace or do not replace recognized code patterns with optimized library routines. For more information, see Section 5.2.8, page 102. PERMUTATION Declare that an integer array has no repeating values. For more information, see Section 5.2.9, page 102. S–3901–60 87Cray® Fortran Reference Manual Purpose and Name Description PIPELINE Attempt to force or inhibit software-based vector pipelining. For more information, see Section 5.2.18, page 115. PREFERVECTOR Vectorize nested loops. For more information, see Section 5.2.10, page 103. PROBABILITY Suggest the probability of a branch being executed. For more information, see Section 5.2.11, page 104. SAFE_ADDRESS Speculatively execute memory references within a loop. For more information, see Section 5.2.12, page 105. SAFE_CONDITIONAL Speculatively execute memory references and arithmetic operations within a loop. For more information, see Section 5.2.13, page 106. SHORTLOOP, SHORTLOOP128 Eliminate testing of conditional statements that terminate a loop for short loops. For more information, see Section 5.2.14, page 107. LOOP_INFO Provide loop count and cache allocation information to the optimizer to produce faster code sequences. This directive can be used to replace SHORTLOOP, SHORTLOOP128, NO_CACHE_ALLOC, or CACHE_SHARED. For more information, see Section 5.2.15, page 108. UNROLL, NOUNROLL Unroll or do not unroll loops to improve performance. 
For more information, see Section 5.2.16, page 112. VECTOR, NOVECTOR Vectorize or do not vectorize loops and array statements. For more information, see Section 5.2.17, page 115. VFUNCTION Declare the existence of a vectorized external function. For more information, see Section 5.2.19, page 116. Multistreaming Processor (MSP) optimization (X1 only): PREFERSTREAM Optimize the loop following the PREFERSTREAM directive, for cases where the compiler could perform MSP optimizations on more than one loop in a loop nest. For more information, see Section 5.3.1, page 118. SSP_PRIVATE Optimize loops containing procedural calls. See Section 5.3.2, page 118. STREAM, NOSTREAM Optimize or do not optimize loops and arrays. For more information, see Section 5.3.3, page 120. 88 S–3901–60Cray Fortran Directives [5] Purpose and Name Description Inlining: CLONE, NOCLONE Attempt cloning or do not attempt cloning at call sites. For more information, see Section 5.4.1, page 121. INLINE, NOINLINE Attempt to inline or do not attempt to inline call sites. For more information, see Section 5.4.2, page 122. INLINENEVER, INLINEALWAYS Never or always inline the specified procedures. For more information, see Section 5.4.3, page 122. MODINLINE, NOMODINLINE Enable or disable inlineable templates for the designated procedures. For more information, see Section 5.4.4, page 123. Scalar optimization: INTERCHANGE, NOINTERCHANGE Interchange or do not interchange the order of the loops. For more information, see Section 5.5.1, page 125. NOSIDEEFFECTS Tell the compiler that the data in the registers will not change when calling the specified subprogram. For more information, see Section 5.5.3, page 128. SUPPRESS Suppress scalar optimization of specified variables. For more information, see Section 5.5.4, page 129. Local use of compiler features: BOUNDS, NOBOUNDS Check or do not check the bounds of array references. For more information, see Section 5.6.1, page 130. 
FREE, FIXED Specify that the source uses a free or fixed format. For more information, see Section 5.6.2, page 132. Storage: BLOCKABLE Specify that it is legal to cache block subsequent loops. For more information, see Section 5.7.1, page 133. BLOCKINGSIZE, NOBLOCKING Assert that the loop following the directive is or is not involved in cache blocking. For more information, see Section 5.7.2, page 133. STACK Allocate variables on the stack. For more information, see Section 5.7.3, page 135. Miscellaneous: CONCURRENT Convey user-known array dependencies to the compiler. For more information, see Section 5.8.1, page 136. S–3901–60 89Cray® Fortran Reference Manual Purpose and Name Description FUSION, NOFUSION Allow you to fine-tune the selection of which DO loops the compiler should attempt to fuse. For more information, see Section 5.8.2, page 137. ID Insert an identifier string into the .o file. For more information, see Section 5.8.3, page 137. IGNORE_TKR Ignore the type, kind, and rank (TKR) of specified dummy arguments of a procedure interface. For more information, see Section 5.8.4, page 139. NAME Define a name that uses characters that are outside of the Fortran character set. See Section 5.8.5, page 140. CACHE_EXCLUSIVE Asserts that all vector loads with the specified symbols as the base are to be made using cache-exclusive instructions. See Section 5.2.1, page 97. NO_CACHE_ALLOC Suggest data objects that should not be placed into the cache. See Section 5.2.3, page 98. CACHE_SHARED Asserts that all vector loads with the specified symbols as the base are to be made using cache-shared instructions. For more information, see Section 5.2.2, page 97. WEAK Define a procedure reference as weak. See Section 5.8.7, page 141. 5.1 Using Directives This section describes how to use the directives and the effects they have on programs. 90 S–3901–60Cray Fortran Directives [5] 5.1.1 Directive Lines A directive line begins with the characters CDIR$ or !DIR$. 
How you specify directives depends on the source form you are using, as follows:

• If you are using fixed source form, indicate a directive line by placing the characters CDIR$ or !DIR$ in columns 1 through 5. If the compiler encounters a nonblank character in column 6, the line is assumed to be a directive continuation line. Columns 7 and beyond can contain one or more directives. Characters in directives entered in columns beyond the default column width are ignored.
• If you are using free source form, indicate a directive by the characters !DIR$, followed by a space, and then one or more directives. If the position following the !DIR$ contains a character other than a blank, tab, or newline character, the line is assumed to be a continuation line. The !DIR$ need not start in column 1, but it must be the first text on a line.

In the following example, an asterisk (*) appears in column 6 to indicate that the second line is a continuation of the preceding line:

    !DIR$ Nosideeffects
    !DIR$*ab

The FIXED and FREE directives must appear alone on a directive line and cannot be continued.

If you want to specify more than one directive on a line, separate each directive with a comma. Some directives require that you specify one or more arguments; when specifying a directive of this type, no other directive can appear on the line.

Spaces can precede, follow, or be embedded within a directive, regardless of source form.

Code portability is maintained despite the use of directives. In the following example, the ! symbol in column 1 causes other compilers to treat the Cray Fortran compiler directive as a comment:

          A=10.
    !DIR$ NOVECTOR
          DO 10,I=1,10...

Do not use source preprocessor (#) directives within multiline compiler directives (CDIR$ or !DIR$).

5.1.2 Range and Placement of Directives

The range and placement of directives are as follows:

• The FIXED and FREE directives can appear anywhere in your source code. All other directives must appear within a program unit.
• These directives must reside in the declarative portion of a program unit and apply only to that program unit:
    – CACHE_SHARED
    – CACHE_EXCLUSIVE
    – COPY_ASSUMED_SHAPE
    – COERCE_KIND
    – IGNORE_RANK
    – IGNORE_TKR
    – INLINEALWAYS, INLINENEVER
    – NAME
    – NO_CACHE_ALLOC
    – NOSIDEEFFECTS
    – STACK
    – SSP_PRIVATE (X1 only)
    – SYMMETRIC
    – SYSTEM_MODULE
    – VFUNCTION
    – WEAK
• The following directives toggle a compiler feature on or off at the point at which the directive appears in the code. These directives are in effect until the opposite directive appears, until the directive is reset, or until the end of the program unit, at which time the command line settings become the default for the remainder of the compilation.
    – BOUNDS, NOBOUNDS
    – CLONE, NOCLONE
    – INLINE, NOINLINE
    – INTERCHANGE, NOINTERCHANGE
    – PATTERN, NOPATTERN
    – STREAM, NOSTREAM
    – VECTOR, NOVECTOR
• The SUPPRESS directive applies at the point at which it appears.
• The ID directive does not apply to any particular range of code. It adds information to the file.o generated from the input program.
• The following directives apply only to the next loop or block of code encountered lexically:
    – BLOCKABLE
    – BLOCKINGSIZE, NOBLOCKING
    – CONCURRENT
    – HAND_TUNED
    – INTERCHANGE, NOINTERCHANGE
    – IVDEP
    – NEXTSCALAR
    – PERMUTATION
    – PIPELINE, NOPIPELINE
    – PREFERSTREAM
    – PREFERVECTOR
    – PROBABILITY
    – SAFE_ADDRESS
    – SAFE_CONDITIONAL
    – SHORTLOOP, SHORTLOOP128
    – LOOP_INFO
    – UNROLL, NOUNROLL
• The MODINLINE and NOMODINLINE directives are in effect for the scope of the program unit in which they are specified, including all contained procedures. If one of these directives is specified in a contained procedure, the contained procedure's directive overrides the containing procedure's directive.
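As a rough illustration of the placement rules in this section, the following free source form sketch puts one directive of each kind where the rules say it belongs: NO_CACHE_ALLOC in the declarative portion, NOPATTERN and PATTERN as a toggle pair, and IVDEP immediately before the one loop it applies to. The program name, array names, and sizes are made up for the example and do not come from this manual. Because every directive line begins with !, a compiler that does not recognize the directives is expected to treat them as comments.

```fortran
program directive_placement
  implicit none
  integer, parameter :: n = 16      ! hypothetical problem size
  real :: a(n), b(n)
  integer :: i
  ! Declarative-portion directive: placed before any executable
  ! statement; it applies only to this program unit.
  !DIR$ NO_CACHE_ALLOC b

  b = 1.0
  a = 0.0

  ! Toggle directive: in effect until the opposite directive appears
  ! or the program unit ends.
  !DIR$ NOPATTERN
  do i = 1, n
     a(i) = b(i) + 1.0
  end do
  !DIR$ PATTERN

  ! Next-loop directive: applies only to the loop that follows.
  ! a(i) is updated from b(i-1) only, so the loop carries no
  ! dependency between iterations.
  !DIR$ IVDEP
  do i = 2, n
     a(i) = a(i) + b(i - 1)
  end do

  ! Self-check so the sketch can be verified with any Fortran compiler.
  if (a(1) /= 2.0 .or. any(a(2:n) /= 3.0)) error stop "unexpected result"
  print *, "placement sketch ok"
end program directive_placement
```

The directives here only advise the compiler; the computed values are the same whether or not they are honored, which is why the final check does not depend on them.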
5.1.3 Interaction of Directives with the -x Command Line Option

The -x option on the ftn command accepts one or more directives as arguments. When your input is compiled, the compiler ignores directives named as arguments to the -x option. If you specify -x all, all directives are ignored. If you specify -x dir, all directives preceded by !DIR$ or CDIR$ are ignored.

For more information about the -x option, see Section 3.34, page 77.

5.1.4 Command Line Options and Directives

Some features activated by directives can also be specified on the ftn command line. A directive applies to parts of programs in which it appears, but a command line option applies to the entire compilation.

Vectorization, scalar optimization, streaming (X1 only), and tasking can be controlled through both command line options and directives. If a compiler optimization feature is disabled by default or is disabled by an argument to the -O option to the ftn command, the associated !prefix$ directives are ignored. The following list shows Cray Fortran compiler optimization features, related command line options, and related directives:

• Specifying the -O 0 option on the command line disables all optimization. All scalar optimization, vectorization, multistreaming (X1 only), and tasking directives are ignored.
• Specifying the -O ipa0 option on the command line disables inlining and causes the compiler to ignore all inlining directives.
• Specifying the -O scalar0 option disables scalar optimization and causes the compiler to ignore all scalar optimization and all vectorization directives.
• Specifying the -O stream0 option disables MSP optimization and causes the compiler to ignore all MSP directives (X1 only).
• Specifying the -O task0 option disables tasking and causes the compiler to ignore tasking directives.
• Specifying the -O vector0 option causes the compiler to ignore all vectorization directives.
Specifying the NOVECTOR directive in a program unit causes the compiler to ignore subsequent directives in that program unit that may specify vectorization.

The following sections describe directive syntax and the effects of directives on Cray Fortran compiler programs.

5.2 Vectorization Directives

This section describes the following directives used to control vectorization and tasking:

• CACHE_EXCLUSIVE
• CACHE_SHARED
• NO_CACHE_ALLOC
• COPY_ASSUMED_SHAPE
• HAND_TUNED
• IVDEP
• NEXTSCALAR
• PATTERN, NOPATTERN
• PERMUTATION
• PREFERVECTOR
• PROBABILITY
• SAFE_ADDRESS
• SAFE_CONDITIONAL
• SHORTLOOP, SHORTLOOP128
• LOOP_INFO
• UNROLL, NOUNROLL
• VECTOR, NOVECTOR
• PIPELINE, NOPIPELINE
• VFUNCTION

The -O 0, -O scalar0, -O task0, and -O vector0 options on the ftn command override these directives.

5.2.1 Use Cache-exclusive Instructions for Vector Loads: CACHE_EXCLUSIVE

The CACHE_EXCLUSIVE directive asserts that all vector loads with the specified symbols as the base are to be made using cache-exclusive instructions. This is an advisory directive; if the compiler honors it, vector load misses cause the cache line to be allocated in an exclusive state in anticipation of a subsequent store. This directive is ignored for stores. Scalar loads and stores are also unaffected.

The primary use of this directive is to override automatic cache management decisions (see Section 3.19.3, page 38).

To use the directive, place it only in the specification part, before any executable statement.

The syntax of the CACHE_EXCLUSIVE directive is:

    !DIR$ CACHE_EXCLUSIVE symbol [, symbol]

symbol
    A base symbol (an array or scalar structure, but not a member reference or array element). Examples of valid CACHE_EXCLUSIVE symbols are A, B, C. Symbols such as A%B or C(10) cannot be used as CACHE_EXCLUSIVE symbols.
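The following is a minimal sketch of where a CACHE_EXCLUSIVE directive might sit, assuming a hypothetical subroutine scale_add whose array a is loaded and then stored in the same loop, which is the load-then-store case the directive anticipates. The module, subroutine, and variable names are invented for illustration. The directive is advisory, so the results are the same whether or not the compiler honors it.

```fortran
module cache_exclusive_sketch
  implicit none
contains
  ! Hypothetical routine: each element of a is loaded, then stored.
  subroutine scale_add(a, b, n)
    integer, intent(in)    :: n
    real,    intent(inout) :: a(n)
    real,    intent(in)    :: b(n)
    integer :: i
    ! Specification part: the directive names the base symbol a,
    ! not an element such as a(10), and precedes executable code.
    !DIR$ CACHE_EXCLUSIVE a
    do i = 1, n
       a(i) = a(i) + 2.0*b(i)
    end do
  end subroutine scale_add
end module cache_exclusive_sketch

program cache_exclusive_demo
  use cache_exclusive_sketch
  implicit none
  integer, parameter :: n = 8
  real :: a(n), b(n)
  a = 1.0
  b = 3.0
  call scale_add(a, b, n)
  ! 1.0 + 2.0*3.0 is exactly 7.0 in floating point.
  if (any(a /= 7.0)) error stop "unexpected result"
  print *, "cache_exclusive sketch ok"
end program cache_exclusive_demo
```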
5.2.2 Use Cache-shared Instructions for Vector Loads: CACHE_SHARED

The CACHE_SHARED directive asserts that all vector loads with the specified symbols as the base are to be made using cache-shared instructions. This is an advisory directive; if the compiler honors it, vector load misses cause the cache line to be allocated in a shared state, in anticipation of a subsequent load by a different MSP (X1 only). This directive is not meaningful and will be ignored for stores. Scalar loads and stores are also unaffected. The compiler may override the directive when it determines the directive is not beneficial.

The syntax of the CACHE_SHARED directive is:

    !DIR$ CACHE_SHARED symbol [, symbol ...]

symbol
    A base symbol (an array or scalar structure, but not a member reference or array element). Examples of valid CACHE_SHARED symbols are A, B, C. Symbols such as A%B or C(10) cannot be used as CACHE_SHARED symbols.

5.2.3 Avoid Placing Object into Cache: NO_CACHE_ALLOC

The NO_CACHE_ALLOC directive is an advisory directive that specifies objects that should not be placed into the cache. Advisory directives are directives the compiler will honor if conditions permit it to. When this directive is honored, the performance of your code may be improved because the cache is not occupied by objects that have a lower cache hit rate. Theoretically, this makes room for objects that have a higher cache hit rate.

Here are some guidelines that will help you determine when to use this directive. This directive works only on objects that are vectorized. That is, other objects with low cache hit rates can still be placed into the cache. Also, you should use this directive for objects you do not want placed into the cache.

To use the directive, you must place it only in the specification part, before any executable statement.

This is the form of the directive:

    !DIR$ NO_CACHE_ALLOC BASE_NAME [, BASE_NAME] ...
BASE_NAME specifies the base name of the object that should not be placed into the cache. This can be the base name of any object such as an array, scalar structure, etc., without member references like C(10). If you specify a pointer in the list, only the references, not the pointer itself, have the no cache allocate property.

5.2.4 Copy Arrays to Temporary Storage: COPY_ASSUMED_SHAPE

The COPY_ASSUMED_SHAPE directive copies assumed-shape dummy array arguments into contiguous local temporary storage upon entry to the procedure in which the directive appears. During execution, it is the temporary storage that is used when the assumed-shape dummy array argument is referenced or defined. The format of this directive is as follows:

    !DIR$ COPY_ASSUMED_SHAPE [ array [, array ] ...]

array
    The name of an array to be copied to temporary storage. If no array names are specified, all assumed-shape dummy arrays are copied to temporary contiguous storage upon entry to the procedure. When the procedure is exited, the arrays in temporary storage are copied back to the dummy argument arrays. If one or more arrays are specified, only those arrays specified are copied. The arrays specified must not have the TARGET attribute.

All arrays specified, or all assumed-shape dummy arrays (if specified without array arguments), on a single COPY_ASSUMED_SHAPE directive must be shape conformant with each other. Incorrect code may be generated if the arrays are not. You can use the -R c command line option to verify whether the arrays are shape conformant.

The COPY_ASSUMED_SHAPE directive applies only to the program unit in which it appears.

Assumed-shape dummy array arguments cannot be assumed to be stored in contiguous storage. In the case of multidimensional arrays, the elements cannot be assumed to be stored with uniform stride between each element of the array.
These conditions can arise, for example, when an actual array argument associated with an assumed-shape dummy array is a non-unit strided array slice or section.

If the compiler cannot determine whether an assumed-shape dummy array is stored contiguously or with a uniform stride between each element, some optimizations are inhibited in order to ensure that correct code is generated. If an assumed-shape dummy array is passed to a procedure and becomes associated with an explicit-shape dummy array argument, additional copy-in and copy-out operations may occur at the call site. For multidimensional assumed-shape arrays, some classes of loop optimizations cannot be performed when an assumed-shape dummy array is referenced or defined in a loop or an array assignment statement. The lost optimizations and the additional copy operations performed can significantly reduce the performance of a procedure that uses assumed-shape dummy arrays when compared to an equivalent procedure that uses explicit-shape array dummy arguments.

The COPY_ASSUMED_SHAPE directive causes a single copy to occur upon entry and again on exit. The compiler generates a test at run time to determine whether the array is contiguous. If the array is contiguous, the array is not copied. This directive allows the compiler to perform all the optimizations it would otherwise perform if explicit-shape dummy arrays were used. If there is sufficient work in the procedure using assumed-shape dummy arrays, the performance improvements gained by the compiler outweigh the cost of the copy operations upon entry and exit of the procedure.

5.2.5 Limit Optimizations: HAND_TUNED

This directive asserts that the code in the loop that follows the directive has been arranged by hand for maximum performance and the compiler should restrict some of the more aggressive automatic expression rewrites.
The compiler will still fully optimize, vectorize, and multistream the loop within the constraints of the directive.

The syntax of this directive is as follows:

    !DIR$ HAND_TUNED

Warning: Exercise caution when using this directive and evaluate code performance before and after using it. The use of this directive may severely impair performance.

5.2.6 Ignore Vector Dependencies: IVDEP

When the IVDEP directive appears before a loop, the compiler ignores vector dependencies, including explicit dependencies, in any attempt to vectorize the loop. IVDEP applies to the first DO loop or DO WHILE loop that follows the directive. The directive applies to only the first loop that appears after the directive within the same program unit.

For array operations, Fortran requires that the complete right-hand side (RHS) expression be evaluated before the assignment to the array or array section on the left-hand side (LHS). If possible dependencies exist between the RHS expression and the LHS assignment target, the compiler creates temporary storage to hold the RHS expression result. If an IVDEP directive appears before an array syntax statement, the compiler ignores potential dependencies and suppresses the creation and use of array temporaries for that statement.

Using array syntax statements allows you to reference arrays in a compact manner. Array syntax allows you to use either the array name, or the array name with a section subscript, to specify actions on all the elements of an array, or array section, without using DO loops.

Whether or not IVDEP is used, conditions other than vector dependencies can inhibit vectorization. The format of this directive is as follows:

    !DIR$ IVDEP [ SAFEVL=vlen | INFINITEVL]

vlen
    Specifies a vector length in which no dependency will occur. vlen must be an integer between 1 and 1024 inclusive.

INFINITEVL
    Specifies an infinite safe vector length.
    That is, no dependency will occur at any vector length.

If no vector length is specified on the Cray X1 series or X2 systems, the vector length used is infinity.

If a loop with an IVDEP directive is enclosed within another loop with an IVDEP directive, the IVDEP directive on the outer loop is ignored.

When the Cray Fortran compiler vectorizes a loop, it may reorder the statements in the source code to remove vector dependencies. When IVDEP is specified, the statements in the loop or array syntax statement are assumed to contain no dependencies as written, and the Cray Fortran compiler does not reorder loop statements.

For information about vector dependencies, see Optimizing Applications on Cray X1 Series Systems.

5.2.7 Specify Scalar Processing: NEXTSCALAR

The NEXTSCALAR directive disables vectorization for the first DO loop or DO WHILE loop that follows the directive. The directive applies to only one loop, the first loop that appears after the directive within the same program unit. NEXTSCALAR is ignored if vectorization has been disabled. The format of this directive is as follows:

    !DIR$ NEXTSCALAR

If the NEXTSCALAR directive appears prior to any array syntax statement, it disables vectorization for the array syntax statement.

Note: The NEXTSCALAR directive does not affect multistreaming. (X1 only)

5.2.8 Request Pattern Matching: PATTERN and NOPATTERN

By default, the compiler detects coding patterns in source code sequences and replaces these sequences with calls to optimized library routines. In most cases, this replacement improves performance. There are cases, however, in which this substitution degrades performance. This can occur, for example, in loops with very low trip counts. In such a case, you can use the NOPATTERN directive to disable pattern matching and cause the compiler to generate inline code.
The formats of these directives are as follows:

    !DIR$ PATTERN
    !DIR$ NOPATTERN

When !DIR$ NOPATTERN has been encountered, pattern matching is suspended for the remainder of the program unit or until a !DIR$ PATTERN directive is encountered. When the -O nopattern command line option (default) is in effect, the PATTERN and NOPATTERN compiler directives are ignored. For more information about -O nopattern, see Section 3.19.18, page 52.

The PATTERN and NOPATTERN directives should be specified before the beginning of a pattern.

Example: By default, the compiler would detect that the following loop is a matrix multiply and replace it with a call to a matrix multiply library routine. By preceding the loop with a !DIR$ NOPATTERN directive, however, pattern matching is inhibited and no replacement is done.

    !DIR$ NOPATTERN
    DO k= 1,n
      DO i= 1,n
        DO j= 1,m
          A(i,j) = A(i,j) + B(i,k) * C(k,j)
        END DO
      END DO
    END DO

5.2.9 Declare an Array with No Repeated Values: PERMUTATION

The !DIR$ PERMUTATION directive declares that an integer array has no repeated values. This directive is useful when the integer array is used as a subscript for another array (vector-valued subscript). When this directive precedes a loop to be vectorized, it may cause more efficient code to be generated.

The format for this directive is as follows:

    !DIR$ PERMUTATION (ia [, ia ] ...)

ia
    Integer array that has no repeated values for the entire routine.

When an array with a vector-valued subscript appears on the left side of the equal sign in a loop, many-to-one assignment is possible. Many-to-one assignment occurs if any repeated elements exist in the subscripting array. If it is known that the integer array is used merely to permute the elements of the subscripted array, it can often be determined that many-to-one assignment does not exist with that array reference.
Sometimes a vector-valued subscript is used as a means of indirect addressing because the elements of interest in an array are sparsely distributed; in this case, an integer array is used to select only the desired elements, and no repeated elements exist in the integer array, as in the following example:

    !DIR$ PERMUTATION(IPNT) ! IPNT has no repeated values
    ...
    DO I = 1, N
      A(IPNT(I)) = B(I) + C(I)
    END DO

5.2.10 Designate Loop Nest for Vectorization: PREFERVECTOR

For cases in which the compiler could vectorize more than one loop, the PREFERVECTOR directive indicates that the loop following the directive should be vectorized.

This directive can be used if there is more than one loop in the nest that could be vectorized. The format of this directive is as follows:

    !DIR$ PREFERVECTOR

In the following example, both loops can be vectorized, but the compiler generates vector code for the outer DO I loop. Note that the DO I loop is vectorized even though the inner DO J loop was specified with an IVDEP directive:

    !DIR$ PREFERVECTOR
    DO I = 1, N
    !DIR$ IVDEP
      DO J = 1, M
        A(I) = A(I) + B(J,I)
      END DO
    END DO

5.2.11 Conditional Density: PROBABILITY

This directive is used to guide inlining decisions, branch elimination optimizations, branch hint marking, and the choice of the optimal algorithmic approach to the vectorization of conditional code. The information specified by this directive is used by interprocedural analysis and the optimizer to produce faster code sequences.

This directive can appear anywhere executable code is legal, and the syntax of this directive takes one of three forms.

    !DIR$ PROBABILITY const
    !DIR$ PROBABILITY_ALMOST_ALWAYS
    !DIR$ PROBABILITY_ALMOST_NEVER

Where const is an expression between 0.0 (never) and 1.0 (always) that evaluates to a floating point constant at compilation time.

The specified probability is a hint, rather than a statement of fact. The directive applies to the block of code where it appears.
It is important to realize that the directive should not be applied to a conditional test directly; rather, it should be used to indicate the relative probability of a THEN or ELSE branch being executed. For example:

    IF ( A(I) > B(I) ) THEN
    !DIR$ PROBABILITY 0.3
      A(I) = B(I)
    ENDIF

This example states that the probability of entering the block of code with the assignment statement is 0.3, or 30%. In turn, this means that a(i) is expected to be greater than b(i) 30% of the time as well.

For vector IF code, a very low probability (< 0.1) or probability_almost_never causes the compiler to use the vector gather/scatter methods used for sparse IF vector code instead of the vector merge methods used for denser IF code. For example:

    do i = 1,n
      if ( a(i) > 0.0 ) then
    !dir$ probability_almost_never
        b(i) = b(i)/a(i) + a(i)/b(i)   ! Evaluate using sparse methods
      endif
    enddo

Note that the PROBABILITY directive appears within the conditional, rather than before the condition. This removes some of the ambiguity of tying the directive directly to the conditional test.

5.2.12 Allow Speculative Execution of Memory References Within Loops: SAFE_ADDRESS

The SAFE_ADDRESS directive allows you to tell the compiler that it is safe to speculatively execute memory references within all conditional branches of a loop. In other words, you know that these memory references can be safely executed in each iteration of the loop.

For most code, the SAFE_ADDRESS directive can improve performance significantly by preloading vector expressions. However, most loops do not require this directive to have preloading performed. The directive is only required when the safety of the operation cannot be determined or index expressions are very complicated.

The SAFE_ADDRESS directive is an advisory directive. That is, the compiler may override the directive if it determines the directive is not beneficial.
If you do not use the directive on a loop and the compiler determines that it would benefit from the directive, it issues a message indicating such. The message is similar to this:

    do i = 1,n
    ftn-6375 ftn_driver.exe: VECTOR X7, File = 10928.f, Line = 110
      A loop starting at line 110 would benefit from "!dir$ safe_address".

If you use the directive on a loop and the compiler determines that it does not benefit from the directive, it issues a message that states the directive is superfluous and can be removed.

To see the messages you must use the -O msgs option.

Incorrect use of the directive can result in segmentation faults, bus errors, or excessive page faulting. However, it should not result in incorrect answers. Incorrect usage can result in very severe performance degradations or program aborts.

This is the syntax of the SAFE_ADDRESS directive:

    !DIR$ SAFE_ADDRESS

In the example below, the compiler will not preload vector expressions, because the value of j is unknown. However, if you know that references to b(i,j) are safe to evaluate for all iterations of the loop, regardless of the condition, you can use the SAFE_ADDRESS directive for this loop as shown below:

    subroutine x3( a, b, n, m, j )
    real a(n), b(n,m)
    !dir$ safe_address
    do i = 1,64                ! Vectorized loop
      if ( a(i).ne.0.0 ) then
        b(i,j) = 0.0           ! Value of 'j' is unknown
      endif
    enddo
    end

With the directive, the compiler can load b(i,j) with a full vector mask, merge 0.0 where the condition is true, and store the resulting vector using a full mask.

5.2.13 Allow Speculative Execution of Memory References and Arithmetic Operations: SAFE_CONDITIONAL

The SAFE_CONDITIONAL directive expands upon the SAFE_ADDRESS directive. It implies SAFE_ADDRESS and further specifies that arithmetic operations are safe, as well as memory operations.

This directive applies to scalar, vector, and multistreamed loop nests.
It can improve performance by allowing the hoisting of invariant expressions from conditional code and allowing prefetching of memory references.

The SAFE_CONDITIONAL directive is an advisory directive. The compiler may override the directive if it determines that the directive is not beneficial.

Caution: Incorrect use of the directive may result in segmentation faults, bus errors, excessive page faulting, or arithmetic aborts. However, it should not result in incorrect answers. Incorrect usage may result in severe performance degradation or program aborts.

The syntax of this directive is as follows:

    !DIR$ SAFE_CONDITIONAL

In the example below, the compiler cannot precompute the invariant expression s1*s2 because these values are unknown and may cause an arithmetic trap if executed unconditionally. However, if you know that the condition is true at least once, then it is safe to use the SAFE_CONDITIONAL directive and execute s1*s2 speculatively.

    subroutine safe_cond( a, n, s1, s2 )
    real a(n), s1, s2
    !dir$ safe_conditional
    do i = 1,n
      if ( a(i) /= 0.0 ) then
        a(i) = a(i) + s1*s2
      endif
    enddo
    end

With the directive, the compiler evaluates s1*s2 outside of the loop, rather than under control of the conditional code. In addition, all control flow is removed from the body of the vector loop as s1*s2 no longer poses a safety risk.

5.2.14 Designate Loops with Low Trip Counts: SHORTLOOP, SHORTLOOP128

The SHORTLOOP directive, used before a DO or DO WHILE loop with a low trip count, allows the compiler to generate code that improves program performance by eliminating run-time tests for determining whether a vectorized DO loop has been completed. The compiler will diagnose misuse at compile time (when able) or under option -Rd at run time.
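As a small sketch of the usage just described, the loop below carries a SHORTLOOP directive and has a fixed trip count of 48, chosen to sit within the limit of 64 that this section gives for SHORTLOOP. The program and variable names are invented for the example. A compiler that does not recognize the directive is expected to treat the line as a comment, so the result does not depend on it.

```fortran
program shortloop_sketch
  implicit none
  integer, parameter :: n = 48   ! hypothetical trip count, within 1..64
  real :: x(n), y(n)
  integer :: i

  x = 2.0
  y = 1.0

  ! The directive goes immediately before the short loop it describes.
  !DIR$ SHORTLOOP
  do i = 1, n
     y(i) = y(i) + 3.0*x(i)
  end do

  ! 1.0 + 3.0*2.0 is exactly 7.0 in floating point.
  if (any(y /= 7.0)) error stop "unexpected result"
  print *, "shortloop sketch ok"
end program shortloop_sketch
```

If the trip count could ever fall outside the permitted range, the directive should be left off, since the section states that results are then unpredictable.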
The formats of these directives are as follows:

    !DIR$ SHORTLOOP
    !DIR$ SHORTLOOP128

You can specify either of the preceding formats, as follows:

• If you specify !DIR$ SHORTLOOP, the loop trip count must be in the range 1 <= trip_count <= 64. If trip_count equals 0 or exceeds 64, results are unpredictable.
• If you specify !DIR$ SHORTLOOP128, the loop trip count must be in the range 1 <= trip_count <= 128. If trip_count equals 0 or exceeds 128, results are unpredictable.

SHORTLOOP is ignored in the following cases:

• If vectorization is disabled.
• If the code in question is an array syntax assignment statement.
• If the compiler can determine that the directive is invalid. If so, a diagnostic message is issued.

The meaning of SHORTLOOP and SHORTLOOP128 can be modified by using the -eL command line option. If enabled, this option changes the lower bound to allow zero-trip loops. For more information, see Section 3.5, page 18.

5.2.15 Provide More Information for Loops: LOOP_INFO

The LOOP_INFO directive allows additional information to be specified about the behavior of a loop. This currently includes information about the run-time trip count and hints on cache allocation strategy. The compiler will diagnose misuse at compile time (when able) or under option -Rd at run time.

With respect to the trip count information, the LOOP_INFO directive is similar to the SHORTLOOP or SHORTLOOP128 directive, but provides more information to the optimizer and can produce faster code sequences. LOOP_INFO is used before a DO or WHILE loop with a low or known trip count. For cache allocation hints, the LOOP_INFO directive can be used to override default settings or to supersede earlier NO_CACHE_ALLOC, CACHE_EXCLUSIVE, or CACHE_SHARED directives.

The syntax of the LOOP_INFO directive is as follows:

    !DIR$ LOOP_INFO [min_trips(c)] [est_trips(c)] [max_trips(c)]
                    [cache_ex( symbol [, symbol ...] )]
                    [cache_sh( symbol [, symbol ...] )]
                    [cache_na( symbol [, symbol ...] )]
                    [prefer_amo] [prefer_noamo]
                    [prefetch] [noprefetch]

Where min_trips is the guaranteed minimum number of trips, est_trips is the estimated or average number of trips, and max_trips is the guaranteed maximum number of trips.

The SHORTLOOP and SHORTLOOP128 directives are equivalent, respectively, to:

    !dir$ loop_info min_trips(1) max_trips(64)
    !dir$ loop_info min_trips(1) max_trips(128)

The cache_ex, cache_sh, and cache_na options specify symbols that are to receive the exclusive, shared, and non-allocating cache hints, respectively. If no hints are specified and no NO_CACHE_ALLOC or CACHE_SHARED directives are present, the default is exclusive. The cache hints are local and apply only to the specified loop nest. For more information about cache_na behavior, see Section 5.2.3, page 98. For more information about cache_sh behavior, see Section 5.2.2, page 97. The cache_ex hint can be used to override locally any earlier NO_CACHE_ALLOC or CACHE_SHARED directive.

The prefer_amo clause of the loop_info directive only has meaning on architectures that have vector atomic memory operation capability in hardware, including the Cray X2. On architectures that lack this hardware, such as the Cray X1 and Cray X1E, the clause is accepted but has no effect.

The prefer_amo clause instructs, but does not require, the compiler to use vector atomic memory operations as aggressively as possible, including in those cases that the compiler would normally avoid because it expects the performance to be poor. For example:

    subroutine p_amo( ia, ib, n )
    integer (kind=8) ia(n), ib(n)
    ! The compiler avoids vector AMOs in this case for most access patterns
    do i = 1,n
      ia(i) = ia(i) + 1
    enddo
    ! Direct the compiler to use vector AMOs when possible
    !dir$ loop_info prefer_amo
    do i = 1,n
      ib(i) = ib(i) + 1
    enddo
    end

For sample test case p_amo, the compiler does not use a vector atomic memory operation for the first loop, but it does use one for the second loop because of the prefer_amo clause of the loop_info directive. A message similar to the following lines is issued when messages are enabled:

    ib(i) = ib(i) + 1
    ftn-6385 ftn: VECTOR P_AMO, File = amo.f, Line = 10
      A vector atomic memory operation was used for this statement.

The prefer_noamo clause instructs, but does not require, the compiler to avoid all uses of vector atomic memory operations. The compiler may, at its discretion, continue to use vector atomic memory operations if there is no alternative solution to vectorizing the loop. The compiler automatically uses vector atomic memory operations if its assessment shows that the performance will improve. For example:

    subroutine a_amo( a, b, c, ia, ib, n )
    integer (kind=8) ia(n), ib(n)
    integer (kind=8) a(n), b(n), c(n)
    ! Compiler automatically uses a vector AMO
    do i = 1,n
      a(ia(i)) = a(ia(i)) + c(i)
    enddo
    ! Instruct the compiler to avoid using a vector AMO
    !dir$ loop_info prefer_noamo
    do i = 1,n
      b(ib(i)) = b(ib(i)) + c(i)
    enddo
    end

For sample test case a_amo, the compiler uses a vector atomic memory operation for the 'update' construct in the first loop. In the second loop, the prefer_noamo clause of the loop_info directive instructs the compiler to avoid using vector atomic memory operations. Messages similar to the following lines, demonstrating the effects of these directives, are issued for the two loop bodies:

    a(ia(i)) = a(ia(i)) + c(i)
    ftn-6385 ftn: VECTOR A_AMO, File = a_amo.f, Line = 6
      A vector atomic memory operation was used for this statement.
    do i = 1,n
    ftn-6371 ftn: VECTOR A_AMO, File = a_amo.f, Line = 10
      A vectorized loop contains potential conflicts due to indirect addressing at line 11, causing less efficient code to be generated.

The hardware vector atomic memory operations for the Cray X2 include 64-bit integer bitwise and, bitwise or, bitwise exclusive or, and integer addition. The compiler recognizes these and other operations that can efficiently map onto the set of instructions.

The prefetch clause (X2 only) instructs the compiler to preload scalar data into the first-level cache to improve the frequency of cache hits and lower latency. Prefetch instructions are generated in situations where the compiler expects them to improve performance. Strategic use of prefetch instructions can hide latency for scalar loads feeding vector instructions or scalar loads in purely scalar loops. Prefetch instructions are generated at default and higher levels of optimization. Thus, they are turned off at -O0 or -O1. Prefetch can be turned off at the loop level via the following directive:

    !dir$ loop_info noprefetch
    do i = 1, n

5.2.16 Unroll Loops: UNROLL and NOUNROLL

Loop unrolling can improve program performance by revealing cross-iteration memory optimization opportunities such as read-after-write and read-after-read. The effects of loop unrolling also include:

• Improved loop scheduling by increasing basic block size
• Reduced loop overhead
• Improved chances for cache hits

The formats of these directives are as follows:

    !DIR$ UNROLL [ n ]
    !DIR$ NOUNROLL

n    Specifies the total number of loop body copies to be generated. n is an integer value from 0 through 1024.

If you specify a value for n, the compiler unrolls the loop by that amount. If you do not specify n, the compiler determines if it is appropriate to unroll the loop, and if so, the unroll amount.

The subsequent DO loop is not unrolled if you specify UNROLL0, UNROLL1, or NOUNROLL. These directives are equivalent.
The UNROLL directive should be placed immediately before the DO statement of the loop that should be unrolled.

Note: The compiler cannot always safely unroll non-innermost loops due to data dependencies. In these cases, the directive is ignored (see Example 1).

The UNROLL directive can be used only on loops whose iteration counts can be calculated before entering the loop. If UNROLL is specified on a loop that is not the innermost loop in a loop nest, the inner loops must be nested perfectly. That is, at each nest level, there is only one loop and only the innermost loop contains work.

The NOUNROLL directive inhibits loop unrolling.

Note: Loop unrolling occurs for both vector and scalar loops automatically. It is usually not necessary to use the unrolling directives. The UNROLL directive should be limited to non-inner loops such as Example 1 in which unroll-and-jam conditions can occur. Such loop unrolling is associated with compiler message 6005. Using the UNROLL directive for inner loops may be detrimental to performance and is not recommended. Typically, loop unrolling occurs in both vector and scalar loops without need of the UNROLL directive.
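As a rough illustration of the unroll-and-jam transformation mentioned above, the following standard-Fortran sketch performs the transformation by hand on an outer loop and compares the result with the original nest. The program name, array names, and sizes are hypothetical. The remainder loop shows one way an odd outer trip count could be handled; it is an assumption for illustration, not a description of the code the compiler actually generates.

```fortran
program unroll_jam_check
  implicit none
  integer, parameter :: n = 9, m = 100   ! n is odd on purpose
  real :: b(m,n), a_ref(m,n), a_uj(m,n)
  integer :: i, j

  call random_number(b)

  ! Reference: the original, perfectly nested loops.
  do i = 1, n
    do j = 1, m
      a_ref(j,i) = b(j,i) + 1.0
    end do
  end do

  ! Hand-written unroll-and-jam by two over the outer loop:
  ! two copies of the inner body are fused into one inner loop.
  do i = 1, n - 1, 2
    do j = 1, m
      a_uj(j,i)   = b(j,i)   + 1.0
      a_uj(j,i+1) = b(j,i+1) + 1.0
    end do
  end do

  ! Remainder iteration, needed only when n is odd.
  if (mod(n, 2) == 1) then
    do j = 1, m
      a_uj(j,n) = b(j,n) + 1.0
    end do
  end if

  if (all(a_ref == a_uj)) then
    print *, 'unroll-and-jam matches reference'
  else
    print *, 'MISMATCH'
  end if
end program unroll_jam_check
```

Because each element is computed by the same single addition in both versions, the two arrays can be compared for exact equality. The check is only meaningful for nests without cross-iteration dependencies on the outer loop.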
Example 1: Unrolling outer loops

Assume that the outer loop of the following nest will be unrolled by two:

    !DIR$ UNROLL 2
    DO I = 1, 10
      DO J = 1,100
        A(J,I) = B(J,I) + 1
      END DO
    END DO

With outer loop unrolling, the compiler produces the following nest, in which the two bodies of the inner loop are adjacent to each other:

    DO I = 1, 10, 2
      DO J = 1,100
        A(J,I) = B(J,I) + 1
      END DO
      DO J = 1,100
        A(J,I+1) = B(J,I+1) + 1
      END DO
    END DO

The compiler jams, or fuses, the inner two loop bodies together, producing the following nest:

    DO I = 1, 10, 2
      DO J = 1,100
        A(J,I) = B(J,I) + 1
        A(J,I+1) = B(J,I+1) + 1
      END DO
    END DO

Example 2: Illegal unrolling of outer loops

Outer loop unrolling is not always legal because the transformation can change the semantics of the original program. For example, unrolling the following loop nest on the outer loop would change the program semantics because of the dependency between A(...,I) and A(...,I+1):

    !DIR$ UNROLL 2
    DO I = 1, 10
      DO J = 1,100
        A(J,I) = A(J-1,I+1) + 1
      END DO
    END DO

Example 3: Unrolling nearest neighbor pattern

The following example shows unrolling with nearest neighbor pattern. This allows register reuse and reduces memory references from 2 per trip to 1.5 per trip.

    !DIR$ UNROLL 2
    DO J = 1,N
      DO I = 1,N            ! VECTORIZE
        A(I,J) = B(I,J) + B(I,J+1)
      ENDDO
    ENDDO

The preceding code fragment is converted to the following code:

    DO J = 1,N,2            ! UNROLLED FOR REUSE OF B(I,J+1)
      DO I = 1,N            ! VECTORIZED
        A(I,J) = B(I,J) + B(I,J+1)
        A(I,J+1) = B(I,J+1) + B(I,J+2)
      END DO
    END DO

5.2.17 Enable and Disable Vectorization: VECTOR and NOVECTOR

The NOVECTOR directive suppresses compiler attempts to vectorize loops and array syntax statements. NOVECTOR takes effect at the beginning of the next loop and applies to the rest of the program unit unless it is superseded by a VECTOR directive. These directives are ignored if vectorization or scalar optimization have been disabled.
The formats of these directives are as follows:

    !DIR$ VECTOR
    !DIR$ NOVECTOR

When !DIR$ NOVECTOR has been used within the same program unit, !DIR$ VECTOR causes the compiler to resume its attempts to vectorize loops and array syntax statements. After a VECTOR directive is specified, automatic vectorization is enabled for all loop nests.

The VECTOR directive affects subsequent loops. The NOVECTOR directive also affects subsequent loops, but if it is specified within the body of a loop, it affects the loop in which it is contained and all subsequent loops.

5.2.18 Enable or Disable, Temporarily, Soft Vector-pipelining: PIPELINE and NOPIPELINE

Software-based vector pipelining (software vector pipelining) provides additional optimization beyond the normal hardware-based vector pipelining. In software vector pipelining, the compiler analyzes all vector loops and will automatically attempt to pipeline a loop if doing so can be expected to produce a significant performance gain. This optimization also performs any necessary loop unrolling.

In some cases the compiler will either not pipeline a loop that could be pipelined, or pipeline a loop without producing performance gains. In these cases, you can use the PIPELINE or NOPIPELINE directives to advise the compiler to pipeline or not pipeline the loop immediately following the directive.

The format of the pipelining directives is as follows:

    !DIR$ PIPELINE
    !DIR$ NOPIPELINE

Software vector pipelining is valid only for the innermost loop of a loop nest.

The PIPELINE and NOPIPELINE directives are advisory only. While you can use the NOPIPELINE directive to inhibit automatic pipelining, and you can use the PIPELINE directive to attempt to override the compiler's decision not to pipeline a loop, you cannot force the compiler to pipeline a loop that cannot be pipelined.
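The placement rule above (each directive applies to the loop immediately following it, and only innermost loops qualify) can be sketched as follows. The program name, array names, and loop bodies are hypothetical, and whether either loop is actually pipelined remains the compiler's decision. Compilers that do not recognize the directives treat them as ordinary comments.

```fortran
program pipeline_demo
  implicit none
  integer, parameter :: n = 1000
  real :: a(n), b(n), c(n)
  integer :: i

  a = 1.0
  b = 2.0
  c = 3.0

  ! Advise the compiler to attempt software vector pipelining
  ! for the innermost loop that immediately follows.
!dir$ pipeline
  do i = 1, n
    a(i) = a(i) + b(i) * c(i)
  end do

  ! Advise the compiler not to pipeline this loop, for example
  ! because pipelining was observed to bring no gain here.
!dir$ nopipeline
  do i = 1, n
    a(i) = a(i) - b(i)
  end do

  print *, sum(a)
end program pipeline_demo
```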
Vector loops that have been pipelined generate compile-time messages to that effect, if optimization messaging is enabled (-O msgs). For more information about the messages issued, see the Optimizing Applications on Cray X1 Series Systems.

5.2.19 Specify a Vectorizable Function: VFUNCTION

The VFUNCTION directive declares that a vector version of an external function exists. The VFUNCTION directive must precede any statement function definitions or executable statements in a program. VFUNCTION cannot be specified for internal or module procedures. VFUNCTION cannot be specified for functions within interface blocks.

This is the format of the VFUNCTION directive:

    !DIR$ VFUNCTION function_name [,f ] ...

f    Symbolic name of a vector external function. The maximum length is 29 characters because the % character is added at the beginning and end of the name as part of the calling sequence. For example, if the function is named FUNC, the CAL vector version is spelled %FUNC%. (The scalar version is FUNC%.)

The following rules and recommendations apply to any function f named as an argument in a VFUNCTION directive:

• f cannot be declared in an EXTERNAL statement, have its interface specified in an interface body, or be specified in a PROCEDURE declaration statement.
• f must be written in CAL and must use the call-by-register sequence.
• Arguments to f must be either vectorizable expressions or scalar expressions; array syntax and array expressions are not allowed.
• A call to f can pass a maximum of seven single-word items or one four-word item (complex (KIND=KIND(0.0D0))). No structures or character arguments can be passed. These can be mixed in any order with a maximum of seven words total.
• f should not change the value of its arguments or variables in common blocks or modules. Any changed value should be for variables that are distinct from the arguments.
• f should not reference variables in common blocks or modules that are also used by a program unit in the calling chain.
• A call to f cannot occur within a WHERE statement or WHERE block.
• f must not have side effects or perform I/O.

Arguments to f are sent to the V registers that have numbers that match the arguments' ordinal numbers in the argument list: X=VFUNC(v1,v2,v3,v4). (The scalar version uses the same convention with the S registers.)

If the argument list for f contains both scalar and vector arguments in a vector loop, the scalar arguments are broadcast into the appropriate vector registers. If all arguments are scalar or the function reference is not in a vector loop, f is called with all arguments passed in S registers.

5.3 Multistreaming Processor (MSP) Directives (X1 only)

The MSP directives work with the -O streamn command line option to determine whether parts of your program are optimized for the MSP. Therefore, one of the following options must be specified on the ftn command line in order for these directives to be recognized: -O stream1 or -O stream3. The default streaming option, -O stream2, also causes recognition of the directives. For more information about the -O streamn command line option, see Section 3.19.22, page 56.

The MSP directives are as follows:

• PREFERSTREAM
• SSP_PRIVATE
• STREAM, NOSTREAM

The following subsections describe the MSP optimization directives.

5.3.1 Specify Loop to be Optimized for MSP: PREFERSTREAM

For cases in which the compiler could perform MSP optimizations on more than one loop in a loop nest, the PREFERSTREAM directive indicates that the loop following the directive is the one to be optimized.

The format of this directive is as follows:

    !DIR$ PREFERSTREAM

This directive is ignored if -O stream0 is in effect.

5.3.2 Optimize Loops Containing Procedural Calls: SSP_PRIVATE

The SSP_PRIVATE directive allows the compiler to stream loops that contain procedural calls.
By default, the compiler does not stream procedural calls contained in a loop, because the call may have side effects that interfere with correct parallel execution. The SSP_PRIVATE directive asserts that the specified procedure is free of side effects that inhibit parallelism and that the specified procedure, and all procedures it calls, will run on one SSP.

An implied condition for streaming loops containing a call to a procedure specified with the SSP_PRIVATE directive is that the loop body must not contain any problems that prevent parallelism. The compiler can disregard an SSP_PRIVATE directive if it detects possible loop-carried dependencies that are not directly related to a call inside the loop.

Note: The SSP_PRIVATE directive only affects whether or not loops are automatically streamed. It has no effect on loops within Cray streaming directive (CSD) parallel regions.

When using the SSP_PRIVATE directive, you must ensure that the procedure called within the body of the loop follows these criteria:

• The procedure does not modify data in one iteration and reference this same data in another iteration of the streamed loop. This rule applies equally to arguments, common variables, and data declared by using a SAVE statement.
• The procedure does not reference data in one iteration that is defined in another iteration.
• If the procedure modifies an argument, common variable, or data declared in a SAVE statement, the iterations cannot modify data at the same storage location unless these variables are scoped as PRIVATE. Following the streamed loop, the contents of private variables are undefined. The SSP_PRIVATE directive does not force the master thread to execute the last iteration of the task loop.
• If the procedure uses shared data (for example, global data, actual arguments) that can be written to and read, you must protect it with a guard (such as the CSD CRITICAL directive or the lock command) or have the SSPs access the data disjointedly (where access does not overlap).
• The procedure calls only other procedures that are capable of being called privately.
• The procedure uses the appropriate synchronization mechanism when calling I/O.

Note: The preceding list assumes that you have a working knowledge of race conditions.

The SSP_PRIVATE directive can only be used in the specification part, before any executable statements. The SSP_PRIVATE directive may be used multiple times within a procedure.

This is the form of the SSP_PRIVATE directive:

    !DIR$ SSP_PRIVATE PROC_NAME[, PROC_NAME] ...

PROC_NAME specifies one or more procedure names called from within the loops that are candidates for streaming. Procedures specified in the procedure name list retain the SSP_PRIVATE attribute throughout the entire program unit. These procedures must be compiled with the -O gen_private_callee option.

The following example demonstrates use of the SSP_PRIVATE directive:

    ! Code in file1.ftn
    subroutine example(X, Y, P, N, M)
    dimension X(N), Y(N), P(0:M)
    !dir$ ssp_private poly_eval
    do I = 1, N
      call poly_eval( Y(I), X(I), P, M )
    enddo
    end

    ! Code in file2.ftn.
    subroutine poly_eval( Y, X, P, M )
    dimension P(0:M)
    Y = P(M)
    do J = M-1, 0, -1
      Y = X*Y + P(J)
    enddo
    end

The following commands compile the code:

    % ftn -c -O gen_private_callee file2.ftn
    % ftn file1.ftn file2.o

The following command runs the code:

    % aprun a.out

SSP private procedures are appropriate for user-specified math support functions. Built-in math functions, such as COS, are effectively SSP private routines.

5.3.3 Enable MSP Optimization: STREAM and NOSTREAM

The STREAM and NOSTREAM directives specify whether the compiler should perform MSP optimizations over a range of code.
These optimizations are applied to loops and array syntax statements.

The formats of these directives are as follows:

    !DIR$ STREAM
    !DIR$ NOSTREAM

One of these directives remains in effect until the opposite directive is encountered or until the end of the program unit. These directives are ignored if -O stream0 is in effect.

5.4 Inlining Directives

The inlining directives allow you to specify whether the compiler should attempt to inline certain subprograms or procedures. These are the inlining directives:

• clone, noclone
• inline, noinline, resetinline
• inlinealways, inlinenever
• modinline, nomodinline

These directives work in conjunction with the following command line options:

• -O ipan and -O ipafrom, described in Section 3.19.10, page 44.
• -O modinline and -O nomodinline, described in Section 3.19.12, page 49.

The following subsections describe the inlining directives.

5.4.1 Disable or Enable Cloning for a Block of Code: CLONE and NOCLONE

The clone and noclone directives control whether cloning is attempted over a range of code. If !dir$ clone is in effect, cloning is attempted at call sites. If !dir$ noclone is in effect, cloning is not attempted at call sites.

The formats of these directives are as follows:

    !dir$ clone
    !dir$ noclone

One of these directives remains in effect until the opposite directive is encountered or until the end of the program unit. These directives are recognized when cloning is enabled on the command line (-O clone1). These directives are ignored if the -O ipa0 option is in effect.

5.4.2 Disable or Enable Inlining for a Block of Code: INLINE, NOINLINE, and RESETINLINE

The inline, noinline, and resetinline directives control whether inlining is attempted over a range of code. If !dir$ inline is in effect, inlining is attempted at call sites. If !dir$ noinline is in effect, inlining is not attempted at call sites.
After either directive is used, !dir$ resetinline can be used to return inlining to the default state.

These are the formats of these directives:

    !dir$ inline
    !dir$ noinline
    !dir$ resetinline

The inline and noinline directives remain in effect until the opposite directive is encountered, until the resetinline directive is encountered, or until the end of the program unit. These directives are ignored if -O ipa0 is in effect.

5.4.3 Specify Inlining for a Procedure: INLINEALWAYS and INLINENEVER

The inlinealways directive forces attempted inlining of specified procedures. The inlinenever directive suppresses inlining of specified procedures.

The formats of these directives are as follows:

    !dir$ inlinealways name [, name ] ...
    !dir$ inlinenever name [, name ] ...

where name is the name of a procedure.

The following rules determine the scope of these directives:

• A !dir$ inlinenever directive suppresses inlining for name. That is, if !dir$ inlinenever b appears in routine b, no call to b, within the entire program, is inlined. If !dir$ inlinenever b appears in a routine other than b, no call to b from within that routine is inlined.
• A !dir$ inlinealways directive specifies that inlining should always be attempted for name. That is, if !dir$ inlinealways c appears in routine c, inlining is attempted for all calls to c, throughout the entire program. If !dir$ inlinealways c appears in a routine other than c, inlining is attempted for all calls to c from within that routine.

An error message is issued if inlinenever and inlinealways are specified for the same procedure in the same program unit.

Example: The following file is compiled with -O ipa1:

    subroutine s()
    !dir$ inlinealways s     ! This says attempt
                             ! inlining of s at all calls.
    ...
    end subroutine

    subroutine t
    !dir$ inlinenever s      ! Do not inline any calls to s
                             ! in subroutine t.
    call s()
    ...
    end subroutine

    subroutine v
    !dir$ noinline           ! Has higher precedence than inlinealways.
    call s()                 ! Do not inline this call to s.
    !dir$ inline
    call s()                 ! Attempt inlining of this call to s.
    ...
    end subroutine

    subroutine w
    call s()                 ! Attempt inlining of this call to s.
    ...
    end subroutine

5.4.4 Create Inlinable Templates for Module Procedures: MODINLINE and NOMODINLINE

The MODINLINE and NOMODINLINE directives enable and disable the creation of inlinable templates for specific module procedures.

The formats of these directives are as follows:

    !DIR$ MODINLINE
    !DIR$ NOMODINLINE

Note: The MODINLINE and NOMODINLINE directives are ignored if -O nomodinline is specified on the ftn command line.

These directives are in effect for the scope of the program unit in which they are specified, including all contained procedures. If one of these directives is specified in a contained procedure, the contained procedure's directive overrides the containing procedure's directive.

The compiler generates a message if these directives are specified outside of a module and ignores the directive.

To inline module procedures, the module being use associated must have been compiled with -O modinline.

Example:

    MODULE BEGIN
    ...
    CONTAINS
      SUBROUTINE S()              ! Uses SUBROUTINE S's !DIR$
      !DIR$ NOMODINLINE
      ...
      CONTAINS
        SUBROUTINE INSIDE_S()     ! Uses SUBROUTINE S's !DIR$
        ...
        END SUBROUTINE INSIDE_S
      END SUBROUTINE S
      SUBROUTINE T()              ! Uses MODULE BEGIN's !DIR$
      ...
      CONTAINS
        SUBROUTINE INSIDE_T()     ! Uses MODULE BEGIN's !DIR$
        ...
        END SUBROUTINE INSIDE_T
        SUBROUTINE MORE_INSIDE_T
        !DIR$ NOMODINLINE
        ...
        END SUBROUTINE MORE_INSIDE_T
      END SUBROUTINE T
    END MODULE BEGIN

In the preceding example, the subroutines are affected as follows:

• Inlining templates are not produced for S, INSIDE_S, or MORE_INSIDE_T.
• Inlining templates are produced for T and INSIDE_T.
5.5 Scalar Optimization Directives

The following directives control aspects of scalar optimization:

• INTERCHANGE and NOINTERCHANGE
• NOSIDEEFFECTS
• SUPPRESS

The following subsections describe these directives.

5.5.1 Control Loop Interchange: INTERCHANGE and NOINTERCHANGE

The loop interchange control directives specify whether or not the order of the following two or more loops should be interchanged. These directives apply to the loops that they immediately precede.

The formats of these directives are as follows:

    !DIR$ INTERCHANGE (do_variable1,do_variable2 [,do_variable3]...)
    !DIR$ NOINTERCHANGE

do_variable    Specifies two or more do_variable names. The do_variable names can be specified in any order, and the compiler reorders the loops. The loops must be perfectly nested. If the loops are not perfectly nested, you may receive unexpected results.

The compiler reorders the loops such that the loop with do_variable1 is outermost, then loop do_variable2, then loop do_variable3.

The NOINTERCHANGE directive inhibits loop interchange on the loop that immediately follows the directive.

Example: The following code has an INTERCHANGE directive:

    !DIR$ INTERCHANGE (I,J,K)
    DO K = 1,NSIZE1
      DO J = 1,NSIZE1
        DO I = 1,NSIZE1
          X(I,J) = X(I,J) + Y(I,K) * Z(K,J)
        ENDDO
      ENDDO
    ENDDO

The following code results when the INTERCHANGE directive is used on the preceding code:

    DO I = 1,NSIZE1
      DO J = 1,NSIZE1
        DO K = 1,NSIZE1
          X(I,J) = X(I,J) + Y(I,K) * Z(K,J)
        ENDDO
      ENDDO
    ENDDO

5.5.2 Control Loop Collapse: COLLAPSE and NOCOLLAPSE

The loop collapse directives control collapse of the immediately following loop nest or elemental array syntax statement. When the COLLAPSE directive is applied to a DO-loop nest, the loop control variables of the participating loops must be listed in order of increasing access stride.
NOCOLLAPSE disqualifies the immediately following DO-loop from collapsing with any other loop; before an elemental array syntax statement, it inhibits all collapse in said statement.

    subroutine S(A, n, n1, n2)
    real A(n, *)
    !dir$ collapse (i, j)
    do i = 1, n1
      do j = 1, n2
        A(i,j) = A(i,j) + 42.0
      enddo
    enddo
    end

The above yields code equivalent to the following, which should not be coded directly because, as program source, it violates the Fortran language standard.

    subroutine S(A, n, n1, n2)
    real A(n, *)
    do ij = 1, n1*n2
      A(ij, 1) = A(ij, 1) + 42.0
    enddo
    end

With array syntax, the collapse directive appears as follows:

    subroutine S( A, B )
    real, dimension(:,:) :: A, B
    !dir$ collapse
    A = B          ! user promises uniform access stride.
    end

In each of the above examples, the directive enables the compiler to assume appropriate conformity between trip counts and array extents. The compiler will diagnose misuse at compile time (when able) or, under option -Rd, at run time.

NOCOLLAPSE prevents the compiler from collapsing a given loop with others or from performing any loop collapse within a specified array syntax statement. Collapse is almost always desirable, so this directive should be used sparingly.

    subroutine S(A, n)
    dimension A(n,n)
    !dir$ nocollapse
    do i = 1, n          ! disallow collapse involving i-loop.
      do j = 1, n
        A(i,j) = 1.2
      enddo
    enddo
    end

Loop collapse is a special form of loop coalesce. Any perfect loop nest may be coalesced into a single loop, with explicit rediscovery of the intermediate values of original loop control variables. The rediscovery cost, which generally involves integer division, is quite high. Hence, coalesce is rarely suitable for vectorization. It may be beneficial for multithreading.

By definition, loop collapse occurs when loop coalesce may be done without the rediscovery overhead. To meet this requirement, all memory accesses must have uniform stride.
This typically occurs when a computation can flow from one column of a multidimensional array into the next, viewing the array as a flat sequence. Hence, array sections such as A(:,3:7) are generally suitable for collapse, while a section like A(1:n-1,:) lacks the needed access uniformity. Care must be taken when applying the collapse directive to assumed shape dummy arguments and Fortran pointers because the underlying storage need not be contiguous.

5.5.3 Determine Register Storage: NOSIDEEFFECTS

The NOSIDEEFFECTS directive allows the compiler to keep information in registers across a single call to a subprogram without reloading the information from memory after returning from the subprogram. The directive is not needed for intrinsic functions and VFUNCTIONs.

NOSIDEEFFECTS declares that a called subprogram does not redefine any variables that meet the following conditions:

• Local to the calling program
• Passed as arguments to the subprogram
• Accessible to the calling subprogram through host association
• Declared in a common block or module
• Accessible through USE association

The format of this directive is as follows:

    !DIR$ NOSIDEEFFECTS f [, f ] ...

f    Symbolic name of a subprogram that the user is sure has no side effects. f must not be the name of a dummy procedure, module procedure, or internal procedure.

A procedure declared NOSIDEEFFECTS should not define variables in a common block or module shared by a program unit in the calling chain. All arguments should have the INTENT(IN) attribute; that is, the procedure must not modify its arguments. If these conditions are not met, results are unpredictable.

The NOSIDEEFFECTS directive must appear in the specification part of a program unit and must appear before the first executable statement.

The compiler may move invocations of a NOSIDEEFFECTS subprogram from the body of a DO loop to the loop preamble if the arguments to that function are invariant in the loop.
This may affect the results of the program, particularly if the NOSIDEEFFECTS subprogram calls functions such as the random number generator or the real-time clock.

The effects of the NOSIDEEFFECTS directive are similar to those that can be obtained by specifying the PURE prefix on a function or a subroutine declaration. For more information about the PURE prefix, refer to the Fortran Standard.

5.5.4 Suppress Scalar Optimization: SUPPRESS

The SUPPRESS directive suppresses scalar optimization for all variables or only for those specified at the point where the directive appears. This often prevents or adversely affects vectorization of any loop that contains SUPPRESS.

The format of this directive is as follows:

    !DIR$ SUPPRESS [ var [, var ] ... ]

var    Variable that is to be stored to memory. If no variables are listed, all variables in the program unit are stored. If more than one variable is specified, use a comma to separate vars.

At the point at which !DIR$ SUPPRESS appears in the source code, variables in registers are stored to memory (to be read out at their next reference), and expressions containing any of the affected variables are recomputed at their next reference after !DIR$ SUPPRESS. The effect on optimization is equivalent to that of an external subroutine call with an argument list that includes the variables specified by !DIR$ SUPPRESS (or, if no variable list is included, all variables in the program unit).

SUPPRESS takes effect only if it is on an execution path. Optimization proceeds normally if the directive path is not executed because of a GOTO or IF.

Example:

    SUBROUTINE SUB (L)
    LOGICAL L
    A = 1.0               ! A is local
    IF (L) THEN
    !DIR$ SUPPRESS        ! Has no effect if L is false
      CALL ROUTINE()
    ELSE
      PRINT *, A
    END IF
    END

In this example, optimization replaces the reference to A in the PRINT statement with the constant 1.0, even though !DIR$ SUPPRESS appears between A=1.0 and the PRINT statement.
The IF statement can cause the execution path to bypass !DIR$ SUPPRESS. If SUPPRESS appears before the IF statement, A in PRINT * is not replaced by the constant 1.0.

5.6 Local Use of Compiler Features

The following directives provide local control over specific compiler features.

• BOUNDS and NOBOUNDS
• FREE and FIXED

The -f and -R command line options apply to an entire compilation, but these directives override any command line specifications for source form or bounds checking. The following subsections describe these directives.

5.6.1 Check Array Bounds: BOUNDS and NOBOUNDS

Array bounds checking provides a check of most array references at both compile time and run time to ensure that each subscript is within the array's declared size.

Note: Bounds checking behavior differs with the optimization level. Complete checking is guaranteed only when optimization is turned off by specifying -O 0 on the ftn command line.

The -R command line option controls bounds checking for a whole compilation. The BOUNDS and NOBOUNDS directives toggle the feature on and off within a program unit. Either directive can specify particular arrays or can apply to all arrays. The formats of these directives are as follows:

    !DIR$ BOUNDS [ array [, array ] ... ]
    !DIR$ NOBOUNDS [ array [, array ] ... ]

array    The name of an array. The name cannot be a subobject of a derived type. When no array name is specified, the directive applies to all arrays.

BOUNDS remains in effect for a given array until the appearance of a NOBOUNDS directive that applies to that array, or until the end of the program unit. Bounds checking can be enabled and disabled many times in a single program unit.

Note: To be effective, these directives must follow the declarations for all affected arrays. It is suggested that they be placed at the end of a program unit's specification statements unless they are meant to control particular ranges of code.
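As a hedged illustration of the placement rules above, the following sketch enables checking for one array after the declarations and disables it again for a later range of code. The program name, array names, and loop bounds are invented for illustration. Compilers that do not recognize the !DIR$ lines treat them as comments, so the program also runs unchanged elsewhere (without the checking).

```fortran
program bounds_demo
  implicit none
  integer, parameter :: n = 10
  real :: a(n), b(n)
  integer :: i

  ! The directives follow the declarations of every array they name.
!DIR$ BOUNDS a
  do i = 1, n
     a(i) = real(i)        ! references to a are checked in this range
  end do

!DIR$ NOBOUNDS a
  do i = 1, n
     b(i) = 2.0 * a(i)     ! checking of a is off again; b was never named,
  end do                   ! so it follows the -R command line setting

  ! Self-check: sum of 2*i for i = 1..10 is 110.
  if (abs(sum(b) - 110.0) > 1.0e-5) error stop "unexpected sum"
  print *, "sum(b) =", sum(b)
end program bounds_demo
```

Naming the array on both directives keeps the effect local to a; a bare !DIR$ BOUNDS would apply to every array in the program unit.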
The bounds checking feature detects any reference to an array element whose subscript exceeds the array's declared size. For example:

          REAL A(10)
    C     DETECTED AT COMPILE TIME:
          A(11) = X
    C     DETECTED AT RUN TIME IF IFUN(M) EXCEEDS 10:
          A(IFUN(M)) = W

The compiler generates an error message when it detects an out-of-bounds subscript. If the compiler cannot detect the out-of-bounds subscript (for example, if the subscript includes a function reference), a message is issued for out-of-bound subscripts when your program runs, but the program is allowed to complete execution.

Bounds checking does not inhibit vectorization but typically increases program run time. If an array's last dimension declarator is *, checking is not performed on the last dimension's upper bound. Arrays in formatted WRITE and READ statements are not checked.

Note: Array bounds checking does not prevent operand range errors that result when operand prefetching attempts to access an invalid address outside an array. Bounds checking is needed when very large values are used to calculate addresses for memory references.

If bounds checking detects an out-of-bounds array reference, a message is issued for only the first out-of-bounds array reference in the loop. For example:

          DIMENSION A(10)
          MAX = 20
          A(MAX) = 2
          DO 10 I = 1, MAX
             A(I) = I
     10   CONTINUE
          CALL TWO(MAX,A)
          END
          SUBROUTINE TWO(MAX,A)
          REAL A(*)        ! NO UPPER BOUNDS CHECKING DONE
          END

The following messages are issued for the preceding program:

    lib-1961 a.out: WARNING
      Subscript 20 is out of range for dimension 1 for array
      'A' at line 3 in file 't.f' with bounds 1:10.
    lib-1962 a.out: WARNING
      Subscript 1:20:1 is out of range for dimension 1 for array
      'A' at line 5 in file 't.f' with bounds 1:10.

5.6.2 Specify Source Form: FREE and FIXED

The FREE and FIXED directives specify whether the source code in the program unit is written in free source form or fixed source form.
The FREE and FIXED directives override the -f option, if specified, on the command line. The formats of these directives are as follows:

    !DIR$ FREE
    !DIR$ FIXED

These directives apply to the source file in which they appear, and they allow you to switch source forms within a source file.

You can change source form within an INCLUDE file. After the INCLUDE file has been processed, the source form reverts back to the source form that was being used prior to processing of the INCLUDE file.

5.7 Storage Directives

The following directives specify aspects of storing common blocks, variables, or arrays:

• BLOCKABLE
• BLOCKINGSIZE and NOBLOCKING
• STACK

The following sections describe these directives.

5.7.1 Permit Cache Blocking: BLOCKABLE Directive

The BLOCKABLE directive specifies that it is legal to cache block the subsequent loops. The format of this directive is as follows:

    !DIR$ BLOCKABLE (do_variable,do_variable [,do_variable]...)

where do_variable specifies the do_variable names of two or more loops. The loops identified by the do_variable names must be adjacent and nested within each other, although they need not be perfectly nested.

This directive tells the compiler that these loops can be involved in a blocking situation with each other, even if the compiler would consider such a transformation illegal. The loops must also be interchangeable and unrollable. This directive does not instruct the compiler on which of these transformations to apply.

5.7.2 Declare Cache Blocking: BLOCKINGSIZE and NOBLOCKING Directives

The BLOCKINGSIZE and NOBLOCKING directives assert that the loop following the directive either is (or is not) involved in a cache blocking for the primary or secondary cache. The formats of these directives are as follows:

    !DIR$ BLOCKINGSIZE(n1[,n2])
    !DIR$ NOBLOCKING

n1,n2    An integer number that indicates the block size.
If the loop is involved in a blocking, it will have a block size of n1 for the primary cache and n2 for the secondary cache. The compiler attempts to include this loop within such a block, but it cannot guarantee this. For n1, specify a value such that n1 .GE. 0. For n2, specify a value such that n2 .LE. 2**30. If n1 or n2 is 0, the loop is not blocked, but the entire loop is inside the block.

Example:

    SUBROUTINE AMAT(X,Y,Z,N,M,MM)
    REAL(KIND=8) X(100,100), Y(100,100), Z(100,100)
    DO K = 1, N
    !DIR$ BLOCKABLE(J,I)
    !DIR$ BLOCKINGSIZE(20)
       DO J = 1, M
    !DIR$ BLOCKINGSIZE(20)
          DO I = 1, MM
             Z(I,K) = Z(I,K) + X(I,J)*Y(J,K)
          END DO
       END DO
    END DO
    END

For the preceding code, the compiler makes 20 x 20 blocks when blocking, but it could block the loop nest such that loop K is not included in the tile. If it did not, add a BLOCKINGSIZE(0) directive just before loop K to specify that the compiler should generate a loop such as the following:

    SUBROUTINE AMAT(X,Y,Z,N,M,MM)
    REAL(KIND=8) X(100,100), Y(100,100), Z(100,100)
    DO JJ = 1, M, 20
       DO II = 1, MM, 20
          DO K = 1, N
             DO J = JJ, MIN(M, JJ+19)
                DO I = II, MIN(MM, II+19)
                   Z(I,K) = Z(I,K) + X(I,J)*Y(J,K)
                END DO
             END DO
          END DO
       END DO
    END DO
    END

Note that an INTERCHANGE directive can be applied to the same loop nest as a BLOCKINGSIZE directive. The BLOCKINGSIZE directive applies to the loop it directly precedes; it moves with that loop when an interchange is applied.

The NOBLOCKING directive prevents the compiler from involving the subsequent loop in a cache blocking situation.

5.7.3 Request Stack Storage: STACK

The STACK directive causes storage to be allocated to the stack in the program unit that contains the directive. This directive overrides the -ev command line option in specific program units of a compilation unit. For more information about the -ev command line option, see Section 3.5, page 18.
The format of this directive is as follows:

    !DIR$ STACK

Data specified in the specification part of a module or in a DATA statement is always allocated to static storage. This directive has no effect on this static storage allocation.

All SAVE statements are honored in program units that also contain a STACK directive. This directive does not override the SAVE statement. If the compiler finds a STACK directive and a SAVE statement without any objects specified in the same program unit, a warning message is issued.

The following rules apply when using this directive:

• It must be specified within the scope of a program unit.
• If it is specified in the specification part of a module, a message is issued. The STACK directive is allowed in the scope of a module procedure.
• If it is specified within the scope of an interface body, a message is issued.

5.8 Miscellaneous Directives

The following directives allow you to use several different compiler features:

• CONCURRENT
• FUSION and NOFUSION
• ID
• IGNORE_TKR
• NAME
• PREPROCESS
• WEAK

5.8.1 Specify Array Dependencies: CONCURRENT

The CONCURRENT directive conveys array dependency information to the compiler. This directive affects the loop that immediately follows it. The CONCURRENT directive is useful when vectorization or MSP (X1 only) optimization is specified by the command line.

The format of this directive is as follows:

    !DIR$ CONCURRENT [ SAFE_DISTANCE=n]

n    An integer number that represents the number of additional consecutive loop iterations that can be executed in parallel without danger of data conflict. n must be an integral constant > 0.

If SAFE_DISTANCE=n is not specified, the distance is assumed to be infinite, and the compiler ignores all cross-iteration data dependencies.

The CONCURRENT directive is ignored if the SAFE_DISTANCE argument is used and MSP optimizations, streaming (X1 only), or vectorization is requested on the command line.

Example.
Consider the following code:

    !DIR$ CONCURRENT SAFE_DISTANCE=3
    DO I = K+1, N
       X(I) = A(I) + X(I-K)
    ENDDO

The CONCURRENT directive in this example informs the optimizer that the relationship K > 3 is true. This allows the compiler to load all of the following array references safely during the Ith loop iteration:

    X(I-K)
    X(I-K+1)
    X(I-K+2)
    X(I-K+3)

5.8.2 Fuse Loops: FUSION and NOFUSION

The FUSION and NOFUSION directives allow you to fine-tune the selection of which DO loops the compiler should attempt to fuse. If there are only a few loops out of many that you want to fuse, then use the FUSION directive with the -O fusion1 option to confine loop fusion to these few loops. If there are only a few loops out of many that you do not want to fuse, use the NOFUSION directive with the -O fusion2 option to specify no fusion for these loops.

These are the formats of the directives:

    !DIR$ FUSION
    !DIR$ NOFUSION

The FUSION directive should be placed immediately before the DO statement of the loop that should be fused.

For more information about loop fusion and its benefits, see Optimizing Applications on Cray X1 Series Systems and Optimizing Applications on Cray X2 Systems.

5.8.3 Create Identification String: ID

The ID directive inserts a character string into the file.o produced for a Fortran source file. The format of this directive is as follows:

    !DIR$ ID "character_string"

character_string    The character string to be inserted into file.o. The syntax box shows quotation marks as the character_string delimiter, but you can use either apostrophes (' ') or quotation marks (" ").

The character_string can be obtained from file.o in one of the following ways:

• Method 1: Using the what command. To use the what command to retrieve the character string, begin the character string with the characters @(#).
For example, assume that id.f contains the following source code:

    !DIR$ ID '@(#)file.f 03 February 1999'
    PRINT *, 'Hello, world'
    END

The next step is to use file id.o as the argument to the what command, as follows:

    % what id.o
    % id.o:
    % file.f 03 February 1999

Note that what does not include the special sentinel characters in the output.

In the following example, character_string does not begin with the characters @(#). The output shows that what does not recognize the string. Input file id2.f contains the following:

    !DIR$ ID 'file.f 03 February 1999'
    PRINT *, 'Hello, world'
    END

The what command generates the following output:

    % what id2.o
    % id2.o:

• Method 2: Using strings or od. The following example shows how to obtain output using the strings command. Input file id.f contains the following:

    !DIR$ ID "File: id.f Date: 03 February 1999"
    PRINT *, 'Hello, world'
    END

The strings command generates the following output:

    % strings id.o
    02/03/9913:55:52f90
    3.3cn
    $MAIN
    @CODE
    @DATA
    @WHAT
    $MAIN
    $STKOFEN
    f$init
    _FWF
    $END
    *?$F(6(
    Hello, world
    $MAIN
    File: id.f Date: 03 February 1999

    % od -tc id.o
    ... portion of dump deleted
    0000000001600 \0 \0 \0 \0 \0 \0 \0 \n F i l e : i d
    0000000001620 . f D a t e : 0 3 F e b
    0000000001640 r u a r y 1 9 9 9 \0 \0 \0 \0 \0 \0
    ... portion of dump deleted

5.8.4 Disregard Dummy Argument Type, Kind, and Rank: IGNORE_TKR

The IGNORE_TKR directive directs the compiler to ignore the type, kind, and/or rank (TKR) of specified dummy arguments in a procedure interface.

The format for this directive is as follows:

    !DIR$ IGNORE_TKR [ [(letter) dummy_arg] ... ]

letter    The letter can be T, K, or R, or any combination of these letters (for example, TK or KR). The letter applies only to the dummy argument it precedes. If letter appears, dummy_arg must appear.

dummy_arg    If specified, it indicates the dummy arguments for which TKR rules should be ignored.
If not specified, TKR rules are ignored for all dummy arguments in the procedure that contains the directive.

The directive causes the compiler to ignore the type, kind, and/or rank of the specified dummy arguments when resolving a generic call to a specific call. The compiler also ignores the type, kind, and/or rank on the specified dummy arguments when checking all the specifics in a generic call for ambiguities.

Example: The following directive instructs the compiler to ignore type, kind, and/or rank rules for the dummy arguments of the following subroutine fragment:

    subroutine example(A,B,C,D)
    !DIR$ IGNORE_TKR A, (R) B, (TK) C, (K) D

Table 8 indicates what is ignored for each dummy argument.

Table 8. Explanation of Ignored TKRs

    Dummy Argument    Ignored
    A                 Type, kind, and rank are ignored
    B                 Only rank is ignored
    C                 Type and kind are ignored
    D                 Only kind is ignored

5.8.5 External Name Mapping: NAME

The NAME directive allows you to specify a case-sensitive external name, or a name that contains characters outside of the Fortran character set, in a Fortran program. The case-sensitive external name is specified on the NAME directive, in the following format:

    !DIR$ NAME (fortran_name="external_name" [, fortran_name="external_name" ] ... )

fortran_name    The name used for the object throughout the Fortran program.

external_name    The external form of the name.

Rules for Fortran naming do not apply to the external_name string; any character sequence is valid. You can use this directive, for example, when writing calls to C routines.

Example:

    PROGRAM MAIN
    !DIR$ NAME (FOO="XyZ")
    CALL FOO          ! XyZ is really being called
    END PROGRAM

Note: The Fortran standard BIND feature provides some of the capability of the NAME directive.
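The Note above mentions the standard BIND feature. A minimal sketch of that alternative, using only standard Fortran 2003 interoperability syntax, might look as follows. The names impl, foo, and name_demo and the integer argument are hypothetical, chosen to mirror the XyZ example; the sketch illustrates BIND(C) only and makes no claim about how the NAME directive is implemented.

```fortran
! External procedure whose external (linker) name is the
! case-sensitive string "XyZ". The Fortran name impl is illustrative.
subroutine impl(k) bind(C, name="XyZ")
  use, intrinsic :: iso_c_binding, only: c_int
  implicit none
  integer(c_int), intent(out) :: k
  k = 42_c_int
end subroutine impl

program name_demo
  use, intrinsic :: iso_c_binding, only: c_int
  implicit none
  interface
     ! FOO is the name used on the Fortran side; the binding label
     ! selects the external name XyZ at link time.
     subroutine foo(k) bind(C, name="XyZ")
       import :: c_int
       integer(c_int), intent(out) :: k
     end subroutine foo
  end interface
  integer(c_int) :: k

  call foo(k)                       ! XyZ is really being called
  if (k /= 42_c_int) error stop "XyZ was not reached"
  print *, "foo returned", k
end program name_demo
```

In practice the procedure bound to XyZ would usually be a C routine; it is written in Fortran here only so that the sketch is self-contained.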
5.8.6 Preprocess Include File: PREPROCESS

The PREPROCESS directive allows an include file to be preprocessed when the compilation does not specify the preprocessing command line option. This directive does not cause preprocessing of included files, unless they too use the directive. If the preprocessing command line option is used, preprocessing occurs normally for all files.

To take effect, the directive must be the first line in the include file and in each included file that needs to be preprocessed.

This is the format of the PREPROCESS directive:

    !DIR$ PREPROCESS [expand_macros]

The optional expand_macros clause allows the compiler to expand all macros within the include files. Without this clause, macro expansion occurs only within preprocessing directives.

5.8.7 Specify Weak Procedure Reference: WEAK

Sometimes, the code path of a program never executes at run time because of some condition. If this code path references a procedure that is external to the program (for example, a library procedure), the linker will add the binary for the procedure to the compiled program, resulting in a larger program. The WEAK directive can prevent the loader from adding the binary to your program, resulting in a smaller program and less use of memory.

The WEAK directive is used with procedures and variables to declare weak objects. The use of a weak object is referred to as a weak reference. The existence of a weak reference does not cause the loader to add the appropriate binaries into a compiled program, so executing a weak reference will cause the program to fail. The compiler support for determining if the binary of a weak object is loaded is deferred. To cause the loader to add the binaries so the weak reference will work, you must have a strong reference (a normal reference) somewhere in the program.

The following example illustrates the reason the WEAK directive is used.
The startup code, which is compiled into every Fortran program, calls the SHMEM initialization routine, which causes the linker to add the binary of the initialization routine to every compiled program if a strong reference to the routine is used. This binary is unnecessary if a program does not use SHMEM. To avoid linking unnecessary code, the startup code uses the WEAK directive for the initialization routine. In this manner, if the program does not use SHMEM, the linker does not add the binary of the initialization routine even though the startup code calls it. However, if the program calls the SHMEM routines using strong references, the linker adds the necessary binaries, including the initialization binary into the compiled program.

The WEAK directive has two forms:

    !DIR$ WEAK procedure_name [, procedure_name] ...
    !DIR$ WEAK procedure_name = stub_name[, procedure_name1 = stub_name1] ...

The first form allows you to specify one or more weak objects. This form requires you to implement code that senses that the procedure_name procedure is loaded before calling it.

The second form allows you to point a weak reference (procedure_name) to a stub procedure that exists in your code. This allows you to call the stub if a strong reference to procedure_name does not exist. If a strong reference to procedure_name exists, it is called instead of the stub. The stub_name procedure must have the same name and dummy argument list as procedure_name.

Note: The linker does not issue an unresolved reference error message for weak procedure references.

Cray Streaming Directives (CSDs) (X1 only) [6]

The Cray Streaming Directives (CSDs) are nonadvisory directives that allow you to more closely control multistreaming for key loops. Nonadvisory means that the compiler must honor these directives. The intention of these directives is not to create an additional parallel programming style or demand large effort in code development.
They are meant to assist the compiler in multistreaming your program. On its own, the compiler should perform multistreaming correctly in most cases. However, if multistreaming for key loops is not occurring as you desire, then use CSDs to override the compiler.

CSDs are modeled after the OpenMP directives and are compatible with Pthreads and all distributed-memory parallel programming models on Cray X1 series systems. Multistreaming advisory directives (MSP directives) and CSDs cannot be mixed within the same block of code.

Before explaining the guidelines and other issues, you will need an understanding of these items:

• CSD parallel regions. (Section 6.1, page 144)
• PARALLEL and END PARALLEL: Starts and ends the CSD parallel region. (Section 6.2, page 144)
• DO and END DO: Multistreams a DO loop. (Section 6.3, page 146)
• PARALLEL DO and END PARALLEL DO: Combine the CSD parallel and do directives into one directive pair. (Section 6.4, page 149)
• SYNC: Synchronizes all SSPs within an MSP. (Section 6.5, page 150)
• CRITICAL and END CRITICAL: Defines a critical section of code. (Section 6.6, page 150)
• ORDERED and END ORDERED: Specifies SSPs execute in order. (Section 6.7, page 151)
• NOCSD: Suppresses recognition of CSDs. (Section 6.8, page 152)

When you are familiar with the directives, these topics will be beneficial to you:

• Nested CSDs within Cray programming models (Section 6.9, page 153)
• CSD placement (Section 6.10, page 153)
• Protection of shared data (Section 6.11, page 154)
• Dynamic memory allocation for CSD parallel regions (Section 6.12, page 155)
• Compiler options affecting CSDs (Section 6.13, page 155)

Note: Sometimes the length of a CSD statement can be longer than the maximum allowable line length.
To continue the statement, you can use an ampersand character as shown in this example:

    !csd$ parallel do private (ii,jj,kk,
    !csd$& ll,mm,nn)

6.1 CSD Parallel Regions

CSDs are applied to a block of code (for example, a loop), which is referred to as the CSD parallel region. All CSDs must be used within this region. You must not branch into or out of the region.

Multiple CSD parallel regions can exist within a program, but only one parallel region will be active at any given time. For example, if a parallel region calls a procedure containing a parallel region, the procedure will execute as if it did not contain a parallel region.

The CSD parallel region can contain loops and nonloop constructs, but only loops are multistreamed. Parallel execution of nonloop constructs, such as initializing variables for the targeted loop, is performed redundantly on all SSPs. Procedures called from the region will be multistreamed, but you must guarantee that the procedure does not cause any side effects. Parallel execution of the procedure is independent and redundant on all SSPs, except for code blocks containing stand-alone CSDs. See Section 6.10, page 153.

6.2 Start and End Multistreaming: PARALLEL and END PARALLEL

The PARALLEL and END PARALLEL directives define the CSD parallel region, tell the compiler to multistream the region, and optionally specify private data objects. All other CSDs must be used within the region. You cannot place the PARALLEL or END PARALLEL directive in the middle of a construct.

This is the form of the parallel directives:

    !CSD$ PARALLEL [PRIVATE(list)] [ORDERED]
    structured-block
    !CSD$ END PARALLEL

The PRIVATE clause allows you to specify data objects that are private to each SSP within the CSD parallel region; that is, each SSP has its own copy of that object, which is not shared with other SSPs.
The main reason for making objects private is that updating them as shared objects within the CSD parallel region could cause incorrect updates because of race conditions on their addresses. The list argument specifies a comma-separated list of objects to make private.

By default, the variables used only for loop indexing, implied-do indices, and FORALL indices are assumed to be private. Other variables, unless specified in the PRIVATE clause, are assumed to be shared.

You may need to take special steps when using private variables. If a data object existed before the parallel region is entered and the object is made private, the object may not have the same contents inside of the region as it did outside the region. The same is true when exiting the parallel region: the object may not have the same content outside of the region as it did within the region. Therefore, if you desire that a private object keep the same value when transitioning in and out of the parallel region, copy its value to a protected shared object so you can copy it back into the private object later.

The ORDERED clause is needed if the CSD parallel region contains, outside of any CSD DO loop, calls to procedures containing stand-alone CSD ORDERED directives. The clause is not needed if, within the CSD parallel region, only CSD DO loops contain calls to functions with stand-alone CSD ORDERED directives. If the clause is used and there are no called procedures containing a CSD ORDERED directive, the results produced by the code will be correct, but performance of that code will be slightly degraded. If the ORDERED clause is missing and there is a called procedure containing a CSD ORDERED directive, your results will be incorrect.

The following example shows when the ORDERED clause is needed:

    !CSD$ PARALLEL ORDERED
    call par_sub      ! par_sub contains a stand-alone ORDERED directive.
    !CSD$ DO
    ...
    !No calls to procedures containing stand-alone ORDERED directives
    !CSD$ END DO
    !CSD$ END PARALLEL

The END PARALLEL directive marks the end of the CSD parallel region and has an implicit barrier synchronization. The implicit barrier protects an SSP from prematurely accessing shared data.

Note: At the start of the PARALLEL directive, all SSPs are enabled; when the END PARALLEL directive is encountered, all SSPs are disabled.

This example shows how to use the PARALLEL directive:

    !CSD$ PARALLEL PRIVATE(jx)
    x = 2 * PI               !This line is computed on all SSPs
    do I = 1,NN
       jx = y(i) * z(i)**x   !jx is private to each SSP
       ...
    end do
    !CSD$ END PARALLEL

6.3 Do Loops: DO and END DO

The compiler distributes among the SSPs the iterations of DO loops encapsulated by the CSD DO and END DO directives. Iterations of DO loops not contained by the CSD DO directives are not distributed among the SSPs, but are all executed redundantly by all SSPs. See Section 6.10, page 153 for placement restrictions of the CSD DO directive.

This is the form of the CSD DO directive:

    !CSD$ DO [ORDERED] [SCHEDULE(STATIC [, chunk_size])] [ORDERED]
    Do loop block
    [!CSD$ END DO [NOWAIT]]

The SCHEDULE clause specifies how the loop iterations are distributed among the SSPs. This iteration distribution is fixed (STATIC) at compile time and cannot be changed by run time events.

The iteration distribution is calculated by you or the compiler. You or the compiler will divide the number of iterations into groups or chunks. The compiler will then statically assign the chunks to the 4 SSPs in a round-robin fashion in iteration order. An SSP could have one or more chunks. The number of iterations per chunk is called the chunk size, which is specified by the chunk_size argument.

The chunk_size argument specifies the maximum number of iterations a chunk can have.
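The round-robin assignment described above can be sketched in standard Fortran as a small mapping from iteration number to SSP. This is only an arithmetic illustration under the stated rule (chunks dealt to 4 SSPs in iteration order); the module and function names are hypothetical, and the sketch says nothing about how the compiler implements the distribution.

```fortran
module chunk_map
  implicit none
  integer, parameter :: n_ssp = 4   ! SSPs per MSP, as stated in the text
contains
  ! SSP (numbered 0..3) that executes iteration i of a loop whose first
  ! iteration is first, when chunks of chunk_size iterations are dealt
  ! out round-robin in iteration order.
  pure integer function ssp_of(i, first, chunk_size)
    integer, intent(in) :: i, first, chunk_size
    ssp_of = mod((i - first) / chunk_size, n_ssp)
  end function ssp_of
end module chunk_map

program chunk_demo
  use chunk_map
  implicit none
  integer :: i

  ! chunk_size = 1: SSP0 gets iterations 1, 5, 9, ...; SSP1 gets 2, 6, ...
  if (ssp_of(1, 1, 1) /= 0 .or. ssp_of(5, 1, 1) /= 0) error stop "chunk 1"
  if (ssp_of(2, 1, 1) /= 1) error stop "chunk 1, SSP1"

  ! chunk_size = 3: iterations 1-3 on SSP0, 4-6 on SSP1, and the fifth
  ! chunk (13-15) wraps around to SSP0 again.
  if (ssp_of(3, 1, 3) /= 0 .or. ssp_of(4, 1, 3) /= 1) error stop "chunk 3"
  if (ssp_of(13, 1, 3) /= 0) error stop "wrap-around"

  do i = 1, 16
     print '(a,i3,a,i2)', "iteration", i, " -> SSP", ssp_of(i, 1, 3)
  end do
end program chunk_demo
```

With 16 iterations and a chunk size of 3, SSP0 receives two full chunks while SSP1 receives one full chunk plus the final single iteration, which shows how a chunk size that does not suit the iteration count leaves the SSPs unevenly loaded.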
You can use these tips to calculate the chunk size:

• Balance the parallel work load across all 4 SSPs (the number of SSPs in an MSP) by dividing the number of iterations by 4. If you have a remainder, add one to the chunk size. Using 4 chunks gives you the best performance, because multiple chunks per SSP increase the overhead caused by the CSD DO directive. That is, the fewer chunks per SSP (minimum 1), the better the performance.
• The workload distribution among the SSPs will be imbalanced if the chunk size is greater than one fourth of the total number of iterations.
• If the chunk size is greater than the total number of iterations, the first SSP (SSP0) will do all the work.

The compiler calculates the iteration distribution (chunk_size) if the SCHEDULE clause or chunk_size argument is not specified. The value used is dependent on the conditions shown in Table 9.

Table 9. Compiler-calculated Chunk Size

    Calculated chunk size    Condition
    1                        When a CSD SYNC, CRITICAL, or ORDERED directive or a procedural call appears in the loop.
    Iterations / 4           The number of iterations is divided as evenly as possible into 4 chunks if these CSDs are not present in the CSD parallel region: SYNC, CRITICAL, or ORDERED directive or a procedural call. The maximum chunk size is 64.

The ORDERED clause is needed if the DO loop encapsulated by the CSD DO directive calls any procedure containing a stand-alone CSD ORDERED directive. If the clause is used and there are no called procedures containing a stand-alone CSD ORDERED directive, the results produced by the code encapsulated by the directive will be correct, but performance of that code will be slightly degraded. If the ORDERED clause is missing and there is a called procedure containing a stand-alone CSD ORDERED directive, the results produced by the code encapsulated by the directive will be incorrect.
The following example shows when the ORDERED clause is needed:

    !CSD$ PARALLEL
    !CSD$ DO ORDERED
    do i = 1, n
       call do_sub(i)     !do_sub contains ORDERED directive
    end do
    !CSD$ END DO
    !CSD$ END PARALLEL

The end of the DO loop or the presence of the optional CSD END DO directive marks the end of the streamed CSD DO region. An implicit barrier synchronization occurs at the end of the DO region, unless the NOWAIT clause is also specified. The implicit barrier protects an SSP from prematurely accessing shared data. The NOWAIT clause assumes that you are guaranteeing that consumption-before-production cannot occur.

The following examples illustrate compiler and user-calculated chunk sizes. The compiler calculates the chunk size as 1 for this example, because of the subprogram call (consequently, the first SSP performs iterations 1, 5, 9, ...; the second SSP performs 2, 6, 10, ...; etc.):

    !CSD$ DO
    DO I = 1, NUM_SAMPLES
       CALL PROCESS_SAMPLE(SAMPLE(I))
    END DO
    !CSD$ END DO

For this example, because there are no SYNC, CRITICAL, or ORDERED directives or subprogram calls, the compiler calculates the chunk size as MIN(64, (ARRAY_SIZE + 3) / 4):

    !CSD$ DO
    DO I = 1, ARRAY_SIZE
       PRODUCT(I) = OPERAND1(I) * OPERAND2(I)
    END DO
    !CSD$ END DO

Adding 3 to the array size produces an optimal chunk size by grouping the maximum number of iterations into 4 chunks.

This example specifies the SCHEDULE clause and a chunk size of 128:

    !CSD$ DO SCHEDULE(STATIC, 128)
    DO I = 1, ARRAY_SIZE
       PRODUCT(I) = OPERAND1(I) * OPERAND2(I)
    END DO
    !CSD$ END DO

In the above example, the compiler uses a chunk size of MIN(ARRAY_SIZE, 128). If the specified chunk size is larger than the array size, the compiler uses the array size as the chunk size. If this is the case, then all the work will be done by SSP0.
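The two chunk-size expressions quoted in this section, MIN(64, (ARRAY_SIZE + 3) / 4) and MIN(ARRAY_SIZE, 128), can be checked with a few sample values. The sketch below simply evaluates those formulas in standard Fortran; the function names default_chunk and user_chunk are hypothetical, and the code is not a description of the compiler's internals.

```fortran
program chunk_size_demo
  implicit none

  ! Compiler-calculated case: MIN(64, (n + 3) / 4) with integer division.
  if (default_chunk(10)   /= 3)  error stop "n = 10"    ! (10+3)/4 = 3
  if (default_chunk(16)   /= 4)  error stop "n = 16"    ! (16+3)/4 = 4
  if (default_chunk(1000) /= 64) error stop "n = 1000"  ! capped at 64

  ! User-specified case: SCHEDULE(STATIC, 128) gives MIN(n, 128).
  if (user_chunk(100, 128)  /= 100) error stop "chunk exceeds n"
  if (user_chunk(1000, 128) /= 128) error stop "chunk below n"

  print *, "default chunk for 10 iterations:  ", default_chunk(10)
  print *, "default chunk for 1000 iterations:", default_chunk(1000)
  print *, "user chunk 128 on 100 iterations: ", user_chunk(100, 128)

contains

  ! Chunk size when no SCHEDULE clause or chunk_size is given and the loop
  ! contains no SYNC, CRITICAL, or ORDERED directive or procedure call.
  pure integer function default_chunk(n)
    integer, intent(in) :: n
    default_chunk = min(64, (n + 3) / 4)
  end function default_chunk

  ! Effective chunk size when the user specifies chunk_size.
  pure integer function user_chunk(n, chunk_size)
    integer, intent(in) :: n, chunk_size
    user_chunk = min(n, chunk_size)
  end function user_chunk

end program chunk_size_demo
```

For 10 iterations the default chunk size of 3 yields chunks of 3, 3, 3, and 1 iterations, one per SSP. For 100 iterations with a requested chunk size of 128, the single chunk of 100 iterations lands on SSP0, which is the imbalance the text warns about.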
6.4 Parallel Do Loops: PARALLEL DO and END PARALLEL DO

The PARALLEL DO directive combines most of the functionality of the PARALLEL and DO directives into one directive. The PARALLEL DO directive is used on a single DO loop, with or without nested loops, and is equivalent to the following statements:

    !CSD$ PARALLEL [PRIVATE(list)]
    !CSD$ DO [SCHEDULE(STATIC [, chunk])] [ORDERED]
    Do_loop_block
    !CSD$ END DO
    !CSD$ END PARALLEL

The differences between the PARALLEL DO and its counterparts include the lack of the NOWAIT clause, because it is not needed.

This is the form of the PARALLEL DO directive:

    !CSD$ PARALLEL DO [PRIVATE(list)] [SCHEDULE(STATIC [, chunk_size])]
    Do loop block
    !CSD$ END PARALLEL DO

For a description of the syntax of the PARALLEL DO directive, refer to the PARALLEL and DO directives at Section 6.2, page 144 and Section 6.3, page 146.

6.5 Synchronize SSPs: SYNC

The SYNC directive synchronizes all SSPs within a multistreaming processor (MSP) and may under certain conditions synchronize memory with physical storage by calling MSYNC. The SYNC directive is normally used where additional intra-MSP synchronization is needed to prevent race conditions caused by forced multistreaming.

The SYNC directive can appear anywhere within the CSD parallel region, even within the CSD DO and PARALLEL DO directives. If the SYNC directive appears within a CSD parallel region but outside of an enclosed CSD DO directive, then it performs an MSYNC on all four SSPs.

This example shows how to use the SYNC directive:

    !CSD$ PARALLEL DO PRIVATE(J)
    DO I = 1, 4
       DO J = 1, 100000
          X(J, I) = ...             ! Produce X
       END DO
       . . .
    !CSD$ SYNC
       DO J = 1, 100000
          ... = X(J, 5-I) * ...     ! Consume X
       END DO
    END DO
    !CSD$ END PARALLEL DO

The two inner loops provide a producer and consumer pair for array X. The SYNC directive prevents the use of the array by the second inner loop before it is completely populated.
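One way to read the SYNC example: because the directive acts as a barrier across the SSPs, every column of X is produced before any column is consumed. The serial sketch below emulates that ordering with two separate loops. It uses no CSDs at all, and the array size, the fill values, and the result array Y are invented for illustration; it shows the intended ordering only, not the streamed execution itself.

```fortran
program sync_phases
  implicit none
  integer, parameter :: n = 8        ! stands in for 100000 in the example
  integer, parameter :: n_ssp = 4    ! one outer iteration per SSP
  real :: x(n, n_ssp), y(n, n_ssp)
  integer :: i, j

  ! Phase 1 (the code before the SYNC): iteration i produces column i.
  do i = 1, n_ssp
     do j = 1, n
        x(j, i) = real(10*i + j)
     end do
  end do

  ! The barrier sits here: all four columns exist before any is read.

  ! Phase 2 (the code after the SYNC): iteration i consumes column 5-i,
  ! which was produced by a different iteration.
  do i = 1, n_ssp
     do j = 1, n
        y(j, i) = 2.0 * x(j, 5 - i)
     end do
  end do

  ! Iteration 1 reads column 4, whose first element is 41.0.
  if (y(1, 1) /= 82.0) error stop "column 4 was not ready"
  print *, "y(1,1) =", y(1, 1)
end program sync_phases
```

Running the original loop nest serially without the split would read column 4 during iteration 1, before iteration 4 had produced it, which is the hazard the SYNC directive removes in the streamed version.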
6.6 Specify Critical Regions: CRITICAL and END CRITICAL

The CRITICAL and END CRITICAL directives specify a critical region where only one SSP at a time will execute the enclosed region. This is the form of the CRITICAL directive:

    !CSD$ CRITICAL
    Block of code
    !CSD$ END CRITICAL

This example performs a streamed sum reduction of A and uses the CRITICAL directive to calculate the complete sum:

    SUM = 0                  ! Shared variable
    !CSD$ PARALLEL PRIVATE(PRIVATE_SUM)
    PRIVATE_SUM = 0
    !CSD$ DO
    DO I = 1, A_SIZE
       PRIVATE_SUM = PRIVATE_SUM + A(I)
    END DO
    !CSD$ END DO NOWAIT
    !CSD$ CRITICAL
    SUM = SUM + PRIVATE_SUM
    !CSD$ END CRITICAL
    !CSD$ END PARALLEL

6.7 Define Order of SSP Execution: ORDERED and END ORDERED

The CSD ORDERED and END ORDERED directives allow you to multistream loops with particular dependencies by ensuring the execution order of the SSPs and that only one SSP at a time executes the code. That is, SSP0 first completes execution of the block of code surrounded by the ORDERED directive, then SSP1 completes execution of that block of code, and so on.

If a stand-alone CSD ORDERED directive is placed in a procedure that is called from a CSD parallel region, the CSD PARALLEL, PARALLEL DO, or DO directive that most closely encloses the call must specify the ORDERED clause to ensure correct results. See the appropriate CSD for more information.

This is the format of the ORDERED directive:

    !CSD$ ORDERED
    Block of code
    !CSD$ END ORDERED

In the following example, successive iterations of the loop depend upon previous iterations, because of A(I-1) and A(I-2) on the right side of the first assignment statement.
The ORDERED directive ensures that each computation of A(I) is complete before the next iteration (which occurs on the next SSP) uses this value as its A(I-1), and similarly for A(I-2):

    !CSD$ PARALLEL DO SCHEDULE(STATIC, 1)
    DO I = 3, A_SIZE
    !CSD$ ORDERED
       A(I) = A(I-1) + A(I-2)
    !CSD$ END ORDERED
       ...                   ! other processing
    END DO
    !CSD$ END PARALLEL DO

If the execution time for the code indicated by the other processing comment is large compared to the time to compute the assignment within the ORDERED directives, then the loop will mostly run concurrently on the 4 SSPs, even though the ORDERED directives are used.

6.8 Suppress CSDs: [NO]CSD

The NOCSD directive suppresses recognition of CSDs. It takes effect after the appearance of the directive and applies to the rest of the program unit unless it is superseded by a !DIR$ CSD statement. CSDs are also ignored if multistreaming optimization is disabled by the -O stream0 option.

If the !DIR$ CSD statement follows a !DIR$ NOCSD statement within the same program unit, the compiler resumes recognition of CSDs.

These are the formats of the directives:

    !DIR$ CSD
    !DIR$ NOCSD

6.9 Nested CSDs within Cray Parallel Programming Models

CSDs can be used within all Cray programming models on Cray X1 series systems with the CSDs at the deepest level. These are the nesting levels:

1. Distributed memory models (MPI, SHMEM, UPC, and Fortran co-arrays)
2. Shared memory models (OpenMP and Pthreads)
3. Nonadvisory directives (CSDs)

If the shared or distributed programming model is used, then you can nest the CSDs within either one, but these models cannot be nested within the CSDs. If both programming models are nested, then the CSDs must be nested within the shared model, and the shared model nested within the distributed model.

6.10 CSD Placement

CSDs must be used within the CSD parallel region as defined by the parallel directives (PARALLEL and END PARALLEL).
Some must be used where the parallel directives are used; that is, used within the same block of code. Other CSDs can be used in the same block of code or be placed in a procedure and called from the parallel region (in effect, appearing as if they were within the parallel region). These CSDs will be referred to as stand-alone CSDs.

The CSD DO directive is the only one that must be used within the same block of code, as this example shows:

    !CSD$ PARALLEL
    ...
    !CSD$ DO
    Do loop block
    ...
    !CSD$ END DO
    !CSD$ END PARALLEL

The stand-alone CSDs are SYNC, CRITICAL, and ORDERED. If stand-alone CSDs are placed in a procedure and the procedure is not called from a parallel region, the code will execute as if no CSD requests were present.

6.11 Protection of Shared Data

Updates to shared data by procedures called from a CSD parallel region must be protected against simultaneous access by SSPs used for the CSD parallel region. Shared data includes statically allocated data objects (such as data defined in a COMMON block or static files), dynamically allocated data objects pointed to by more than one SSP, and subprogram formal arguments where corresponding actual arguments are shared.

You can protect shared data by using the CRITICAL directive or the DO loop indices. The CRITICAL directive can protect writes to shared data by ensuring that only one SSP at any one time can execute the enclosed code that accesses the shared data.

Using the DO loop indices when accessing array elements is another way to protect your shared data. Within a CSD parallel region, iterations of a DO loop are distributed among the SSPs. This distribution can be used to divide the array among the SSPs, if the iterations of the DO loop are used to access the array. If each SSP accesses only its portion of the array, then in a sense, that portion of the array is private to the SSP. The following example illustrates this principle.
The example performs a sum reduction on the entire shared A array by doing an intermediate sum reduction on all SSPs to the shared INTER_SUM vector and a final reduction on a single SSP to the SUM scalar. The INTER_SUM array is the shared array to consider.

    INTEGER A(SIZE1, SIZE2)
    INTEGER INTER_SUM(SIZE2)
    INTEGER SUM

    !CSD$ PARALLEL DO PRIVATE(J)
    DO I = 1, SIZE2
       INTER_SUM(I) = 0
       DO J = 1, SIZE1
          INTER_SUM(I) = INTER_SUM(I) + A(J, I)
       END DO
    END DO
    !CSD$ END PARALLEL DO

    SUM = 0
    DO I = 1, SIZE2
       SUM = SUM + INTER_SUM(I)
    END DO

Although the INTER_SUM array is shared within the parallel region, the accesses to it are private, because all accesses are indexed by the loop control variable of the loop to which the CSD DO was applied.

6.12 Dynamic Memory Allocation for CSD Parallel Regions

There are certain precautions you should remember as you allocate or deallocate dynamic memory for private or shared data objects.

Calls to the ALLOCATE and DEALLOCATE intrinsic procedures from within CSD parallel regions must be made by only one SSP at a time. In general, this requires that the calls be made from CSD critical regions. This requirement may be relaxed in a future release.

Dynamic memory for private data objects specified by the PRIVATE list of the PARALLEL directive must be allocated and deallocated within the CSD parallel region. Dynamic memory cannot be allocated for private objects before entry into the CSD parallel region and then made private.

Dynamic memory can be allocated to shared data objects outside or within the CSD parallel region. If memory for the shared object is allocated or deallocated within the CSD parallel region, you must ensure that it is allocated or deallocated by only one SSP.

If the shared or private data object does not have the SAVE attribute, its memory will be automatically deallocated at the end of the procedure containing the CSD parallel region.
For private objects, this automatic deallocation may cause an error because deallocation occurs outside of the parallel region. Therefore, you must ensure that memory allocated to private objects is deallocated before exiting the CSD parallel region.

6.13 Compiler Options Affecting CSDs

To enable CSDs, compile your code with the -O streamn option with n set to 1 or greater. Also, specify the -O gen_private_callee option to compile procedures called from the CSD parallel region.

To disable CSDs, compile with the -O stream0, -x all, or -x csd option.

Source Preprocessing [7]

Source preprocessing can help you port a program from one platform to another by allowing you to specify source text that is platform specific.

For a source file to be preprocessed automatically, it must have an uppercase extension, either .F (for a file in fixed source form), or .F90 or .FTN (for a file in free source form). To specify preprocessing of source files with other extensions, including lowercase ones, use the -eP or -eZ options described in Section 7.4, page 166.

7.1 General Rules

You can alter the source code through source preprocessing directives. These directives are fully explained in Section 7.2, page 158. The directives must be used according to the following rules:

• Do not use source preprocessor (#) directives within multiline compiler directives (CDIR$, !DIR$, CSD$, !CSD$, C$OMP, or !$OMP).
• You cannot include a source file that contains an #if directive without a balancing #endif directive within the same file. The #if directive includes the #ifdef and #ifndef directives.
• If a directive is too long for one source line, the backslash character (\) is used to continue the directive on successive lines. Successive lines of the directive can begin in any column. The backslash character (\) can appear in any location within a directive in which white space can occur.
  A backslash character (\) in a comment is treated as a comment character. It is not recognized as signaling continuation.
• Every directive begins with the pound character (#), and the pound character (#) must be in column 1.
• Blank and tab (HT) characters can appear between the pound character (#) and the directive keyword.
• You cannot write form feed (FF) or vertical tab (VT) characters to separate tokens on a directive line. That is, a source preprocessing line must be continued, by using a backslash character (\), if it spans source lines.
• Blanks are significant, so the use of spaces within a source preprocessing directive is independent of the source form of the file. The fields of a source preprocessing directive must be separated by blank or tab (HT) characters.
• Any user-specified identifier that is used in a directive must follow Fortran rules for identifier formation. The exceptions to this rule are as follows:
  – The first character in a source preprocessing name (a macro name) can be an underscore character (_).
  – Source preprocessing names are significant in their first 132 characters whereas a typical Fortran identifier is significant only in its first 63 characters.
• Source preprocessing identifier names are case sensitive.
• Numeric literal constants must be integer literal constants or real literal constants, as defined for Fortran.
• Comments written in the style of the C language, beginning with /* and ending with */, can appear anywhere within a source preprocessing directive in which blanks or tabs can appear. The comment, however, must begin and end on a single source line.
• Directive syntax allows an identifier to contain the ! character. Therefore, placing the ! character to start a Fortran comment on the same line as the directive should be avoided.

7.2 Directives

The blanks shown in the syntax descriptions of the source preprocessing directives are significant.
The tab character (HT) can be used in place of a blank. Multiple blanks can appear wherever a single blank appears in a syntax description.

7.2.1 #include Directive

The #include directive directs the system to use the content of a file. Just as with the INCLUDE line path processing defined by the Fortran standard, an #include directive effectively replaces that directive line by the content of filename. This directive has the following formats:

    #include "filename"
    #include <filename>

filename
    A file or directory to be used.

    In the first form, if filename does not begin with a slash (/) character, the system searches for the named file, first in the directory of the file containing the #include directive, then in the sequence of directories specified by the -I option(s) on the ftn command line, and then the standard (default) sequence. If filename begins with a slash (/) character, it is used as is and is assumed to be the full path to the file.

    The second form directs the search to begin in the sequence of directories specified by the -I option(s) on the ftn command line and then search the standard (default) sequence.

The Fortran standard prohibits recursion in INCLUDE files, so recursion is also prohibited in the #include form.

The #include directives can be nested.

When the compiler is invoked to do only source preprocessing, not compilation, text will be included by #include directives but not by Fortran INCLUDE lines. For information about the source preprocessing command line options, see Section 7.4, page 166.

7.2.2 #define Directive

The #define directive lets you declare a variable and assign a value to the variable. It also allows you to define a function-like macro. This directive has the following formats:

    #define identifier value
    #define identifier(dummy_arg_list) value

The first format defines an object-like macro (also called a source preprocessing variable), and the second defines a function-like macro.
In the second format, the left parenthesis that begins the dummy_arg_list must immediately follow the identifier, with no intervening white space.

identifier
    The name of the variable or macro being defined. Rules for Fortran variable names apply; that is, the name cannot have a leading underscore character (_). For example, ORIG is a valid name, but _ORIG is invalid.

dummy_arg_list
    A list of dummy argument identifiers.

value
    The value is a sequence of tokens. The value can be continued onto more than one line using backslash (\) characters.

If a preprocessor identifier appears in a subsequent #define directive without being the subject of an intervening #undef directive, and the value in the second #define directive is different from the value in the first #define directive, then the preprocessor issues a warning message about the redefinition. The second directive's value is used. For more information about the #undef directive, see Section 7.2.3, page 161.

When an object-like macro's identifier is encountered as a token in the source file, it is replaced with the value specified in the macro's definition. This is referred to as an invocation of the macro.

The invocation of a function-like macro is more complicated. It consists of the macro's identifier, immediately followed by a left parenthesis with no intervening white space, then a list of actual arguments separated by commas, and finally a terminating right parenthesis. There must be the same number of actual arguments in the invocation as there are dummy arguments in the #define directive. Each actual argument must be balanced in terms of any internal parentheses. The invocation is replaced with the value given in the macro's definition, with each occurrence of any dummy argument in the definition replaced with the corresponding actual argument in the invocation.

For example, the following program prints Hello, world.
when compiled with the -F option and then run:

    PROGRAM P
    #define GREETING 'Hello, world.'
    PRINT *, GREETING
    END PROGRAM P

The following program prints Hello, Hello, world. when compiled with the -F option and then run:

    PROGRAM P
    #define GREETING(str1, str2) str1, str1, str2
    PRINT *, GREETING('Hello, ', 'world.')
    END PROGRAM P

7.2.3 #undef Directive

The #undef directive sets the definition state of identifier to an undefined value. If identifier is not currently defined, the #undef directive has no effect. This directive has the following format:

    #undef identifier

identifier
    The name of the variable or macro being undefined.

7.2.4 # (Null) Directive

The null directive simply consists of the pound character (#) in column 1 with no significant characters following it. That is, the remainder of the line is typically blank or is a source preprocessing comment. This directive is generally used for spacing out other directive lines.

7.2.5 Conditional Directives

Conditional directives cause lines of code to either be produced by the source preprocessor or to be skipped. The conditional directives within a source file form if-groups. An if-group begins with an #if, #ifdef, or #ifndef directive, followed by lines of source code that you may or may not want skipped. Several similarities exist between the Fortran IF construct and if-groups:

• The #elif directive corresponds to the ELSE IF statement.
• The #else directive corresponds to the ELSE statement.
• Just as an IF construct must be terminated with an END IF statement, an if-group must be terminated with an #endif directive.
• Just as with an IF construct, any of the blocks of source statements in an if-group can be empty.

For example, you can write the following directives:

    #if MIN_VALUE == 1
    #else
    ...
    #endif

Determining which group of source lines (if any) to compile in an if-group is essentially the same as the Fortran determination of which block of an IF construct should be executed.

7.2.5.1 #if Directive

The #if directive has the following format:

    #if expression

expression
    An expression. The values in expression must be integer literal constants or previously defined preprocessor variables. The expression is an integer constant expression as defined by the C language standard. All the operators in the expression are C operators, not Fortran operators. The expression is evaluated according to C language rules, not Fortran expression evaluation rules.

    Note that unlike the Fortran IF construct and IF statement logical expressions, expression in an #if directive need not be enclosed in parentheses.

The #if expression can also contain the unary defined operator, which can be used in either of the following formats:

    defined identifier
    defined(identifier)

When the defined subexpression is evaluated, the value is 1 if identifier is currently defined, and 0 if it is not.

All currently defined source preprocessing variables in expression, except those that are operands of defined unary operators, are replaced with their values. During this evaluation, all source preprocessing variables that are undefined evaluate to 0.

Note that the following two directive forms are not equivalent:

• #if X
• #if defined(X)

In the first case, the condition is true if X has a nonzero value. In the second case, the condition is true only if X has been defined (has been given a value that could be 0).

7.2.5.2 #ifdef Directive

The #ifdef directive is used to determine if identifier is predefined by the source preprocessor, has been named in a #define directive, or has been named in a ftn -D command line option. For more information about the -D option, see Section 7.4, page 166.
This directive has the following format:

    #ifdef identifier

The #ifdef directive is equivalent to either of the following two directives:

• #if defined identifier
• #if defined(identifier)

7.2.5.3 #ifndef Directive

The #ifndef directive tests whether identifier is not defined. This directive has the following format:

    #ifndef identifier

This directive is equivalent to either of the following two directives:

• #if ! defined identifier
• #if ! defined(identifier)

7.2.5.4 #elif Directive

The #elif directive serves the same purpose in an if-group as does the ELSE IF statement of a Fortran IF construct. This directive has the following format:

    #elif expression

expression
    The expression follows all the rules of the integer constant expression in an #if directive.

7.2.5.5 #else Directive

The #else directive serves the same purpose in an if-group as does the ELSE statement of a Fortran IF construct. This directive has the following format:

    #else

7.2.5.6 #endif Directive

The #endif directive serves the same purpose in an if-group as does the END IF statement of a Fortran IF construct. This directive has the following format:

    #endif

7.3 Predefined Macros

The Cray Fortran compiler source preprocessing supports a number of predefined macros. They are divided into groups as follows:

• Macros that are based on the host machine
• Macros that are based on UNICOS/mp and UNICOS/lc system targets

The following predefined macros are based on the host system (the system upon which the compilation is being done):

unix, __unix, __unix__
    Always defined. (The leading characters in the second form consist of 2 consecutive underscores; the third form consists of 2 leading and 2 trailing underscores.)

The following predefined macros are based on UNICOS/mp and UNICOS/lc systems as targets:

__crayx1
    Defined as 1 on all Cray X1 series systems.

__crayx1e
    Defined as 1 on all Cray X1E systems.
__crayx2
    Defined as 1 on all Cray X2 systems.

_UNICOSMP
    Defined as 1 on all Cray X1 series systems.

cray, CRAY, _CRAY
    (X1 only) These macros are defined for UNICOS/mp systems as targets.

_CRAYIEEE
    Defined as 1 on all Cray X1 series and X2 systems as targets.

_MAXVL
    Defined as the hardware vector register length (64 for the Cray X1 and 128 for the Cray X2).

_ADDR64
    Defined for UNICOS/mp and UNICOS/lc systems as targets. The target system must have 64-bit address registers.

The following predefined macros are based on the source file:

__line__, __LINE__
    Defined to be the line number of the current source line in the source file.

__file__, __FILE__
    Defined to be the name of the current source file.

__date__, __DATE__
    Defined to be the current date in the form mm/dd/yy.

__time__, __TIME__
    Defined to be the current time in the form hh:mm:ss.

7.4 Command Line Options

The following ftn command line options affect source preprocessing.

• The -D identifier[=value] option, which defines variables used for source preprocessing. For more information about this option, see Section 3.6, page 26.
• The -eP option, which performs source preprocessing on file.f[90], file.F[90], file.ftn, or file.FTN but does not compile. The -eP option produces file.i. For more information about this option, see Section 3.5, page 18.
• The -eZ option, which performs source preprocessing and compilation on file.f[90], file.F[90], file.ftn, or file.FTN. The -eZ option produces file.i. For more information about this option, see Section 3.5, page 18.
• The -F option, which enables macro expansion throughout the source file. For more information about this option, see Section 3.8, page 26.
• The -U identifier [, identifier] ... option, which undefines variables used for source preprocessing. For more information about this option, see Section 3.28, page 76.

The -D identifier [=value], -F, and -U identifier [, identifier] ...
options are ignored unless one of the following is true:

• The Fortran input source file is specified as either file.F, file.F90, or file.FTN.
• The -eP or -eZ options have been specified.

OpenMP Fortran API [8]

OpenMP Fortran is a parallel programming model that is portable across shared memory architectures from Cray and other vendors. The Cray Fortran compiler supports the OpenMP Fortran Application Program Interface, version 2.5 standard. All OpenMP library procedures and directives, except for limitations in a few directive clauses, are supported.

All OpenMP directives and library procedures are documented by the OpenMP Fortran specification, which is accessible at http://www.openmp.org/drupal/node/view/8.

For Cray-specific OpenMP Fortran information, such as implementation differences, see the following sections:

• Cray Implementation Differences (Section 8.1, page 167)
• OMP_THREAD_STACK_SIZE Environment Variable (Section 8.2, page 169)
• OpenMP Optimizations (Section 8.3, page 170)
• Compiler Options that Affect OpenMP (Section 8.4, page 172)
• OpenMP Program Execution (Section 8.5, page 172)

8.1 Cray Implementation Differences

The OpenMP Fortran Application Program Interface specification defines areas of implementation that have vendor-specific behaviors. This section documents those areas and other areas not defined by the specification.

These OpenMP items have Cray-specific behaviors in areas defined as implementation-dependent by the OpenMP specification:

• Implementation-dependent areas of parallel region constructs:
  – If a parallel region is encountered while dynamic adjustment of the number of threads is disabled, and the number of threads specified for the parallel region exceeds the number that the run-time system can supply, the program will terminate.
  – The number of physical processors actually hosting the threads at any given time is fixed at program startup and is specified by the aprun -d depth option.
• Implementation-dependent areas of DO and PARALLEL DO directives:
  – SCHEDULE(GUIDED,chunk): The size of the initial chunk for the master thread and other team members is approximately equal to the trip count divided by the number of threads.
  – SCHEDULE(RUNTIME): The schedule type and chunk size can be chosen at run time by setting the OMP_SCHEDULE environment variable. If this environment variable is not set, the schedule type and chunk size default to GUIDED and 1, respectively.
  – Default schedule: In the absence of the SCHEDULE clause, the default schedule is STATIC and the default chunk size is roughly the number of iterations divided by the number of threads.
• Implementation-dependent area of the THREADPRIVATE directive: If the dynamic threads mechanism is enabled, the definition and association status of a thread's copy of the variable is undefined, and the allocation status of an allocatable array is undefined.
• Implementation-dependent area of the PRIVATE clause: If a variable is declared as PRIVATE, and the variable is referenced in the definition of a statement function, and the statement function is used within the lexical extent of the directive construct, then the statement function references the PRIVATE version of the variable.
• Implementation-dependent areas of the ATOMIC directive: The ATOMIC directive is replaced with a critical section that encloses the statement.
• Implementation-dependent areas of OpenMP library functions:
  – OMP_GET_NESTED: This procedure always returns .FALSE. because nested parallel regions are always serialized.
  – OMP_GET_NUM_THREADS: If the number of threads has not been explicitly set by the user, the default is the depth value defined through the aprun -d depth option. If this option is not set, the aprun command defaults depth to 1, which sets the number of threads to one, and that is the value OMP_GET_NUM_THREADS returns.
  – OMP_SET_NUM_THREADS: If dynamic adjustment of the number of threads is disabled, the number_of_threads argument sets the number of threads for all subsequent parallel regions until this procedure is called again with a different value.
  – OMP_SET_DYNAMIC: The default for dynamic thread adjustment is on.
  – OMP_SET_NESTED: Calls to this function are ignored since nested parallel regions are always serialized.
• Implementation-dependent areas of OpenMP environment variables:
  – OMP_DYNAMIC: The default value is .TRUE.
  – OMP_NESTED: This environment variable is ignored because nested parallel regions are always serialized and executed by a team of one thread.
  – OMP_NUM_THREADS: The default value is the value of depth as defined by the aprun -d depth option or 1 if the option is not specified. If the requested value of OMP_NUM_THREADS is more than the number of threads an implementation can support, the behavior of the program depends on the value of the OMP_DYNAMIC environment variable. If OMP_DYNAMIC is .FALSE., the program terminates; otherwise, it uses up to 16 threads on the Cray X1 series and X2 systems.
  – OMP_SCHEDULE: The default values for this environment variable are GUIDED for schedule and 1 for chunk size.
• Implementation-dependent areas of OpenMP library routines that have generic interfaces: If an OMP run-time library routine interface is defined to be generic by an implementation, use of arguments of kind other than those specified by the OMP_*_KIND constants is undefined.

These OpenMP features have Cray-specific behaviors in areas not defined as implementation-dependent by the OpenMP specification:

• If the omp_lib module is not used and the kind of the actual argument does not match the kind of the dummy argument, the behavior of the procedure is undefined.
• The omp_get_wtime and omp_get_wtick procedures return REAL(KIND=8) values instead of DOUBLE PRECISION values.
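The last two points can be illustrated with a short timing program. This is a minimal sketch under stated assumptions, not an excerpt from Cray documentation: the program name, the loop, and the problem size n are hypothetical, and the code uses only standard OpenMP library procedures (omp_get_wtime, omp_get_wtick, omp_get_max_threads). It must be compiled with OpenMP directive recognition enabled.

```fortran
program omp_kind_sketch
  use omp_lib                     ! explicit interfaces, so argument and result kinds are checked
  implicit none
  integer, parameter :: n = 1000000      ! hypothetical problem size
  real(kind=8) :: t0, t1                 ! matches the REAL(KIND=8) results described above
  real(kind=8) :: total
  integer :: i

  total = 0.0_8
  t0 = omp_get_wtime()

  ! Each thread accumulates a private partial sum; OpenMP combines them at the end.
  !$omp parallel do reduction(+:total)
  do i = 1, n
     total = total + real(i, kind=8)
  end do
  !$omp end parallel do

  t1 = omp_get_wtime()

  print *, 'max threads  =', omp_get_max_threads()
  print *, 'sum          =', total
  print *, 'elapsed (s)  =', t1 - t0
  print *, 'timer tick   =', omp_get_wtick()
end program omp_kind_sketch
```

Using the omp_lib module rather than declaring the library functions by hand lets the compiler verify that t0 and t1 have the kind the functions actually return, which avoids the undefined behavior mentioned in the first point.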
8.2 OMP_THREAD_STACK_SIZE Environment Variable

OMP_THREAD_STACK_SIZE is a Cray-specific OpenMP environment variable that affects programs at run time. It changes the size of the thread stack from the default size of 16 MB to the specified size. The size of the thread stack should be increased when private variables may use more than 16 MB of memory.

(X1 only) The requested thread stack space is allocated from the local heap when the threads are created. The amount of space used by each thread for thread stacks depends on whether you are using MSP or SSP mode. In MSP mode, the memory used is 5 times the specified thread stack size because each SSP is assigned one thread stack and one thread stack is used as the MSP common stack. For SSP mode, the memory used is one times the specified thread stack size.

(X1 only) Since memory is allocated from the local heap, you may want to consider how increasing the size of the thread stacks will affect available space in the local heap. To adjust the size of the local heap, see the X1_HEAP_SIZE and X1_LOCAL_HEAP_SIZE environment variables in the memory(7) man page.

(X2 only) The heaps on X2 do not have to be sized statically as on X1; their sizes are adjusted as needed.

This is the format for the OMP_THREAD_STACK_SIZE environment variable:

    OMP_THREAD_STACK_SIZE n

where n is a hex, octal, or decimal integer specifying the amount of memory, in bytes, to allocate for a thread's stack.

8.3 OpenMP Optimizations

A certain amount of overhead is associated with multiprocessing a loop. If the work occurring in the loop is small, the loop can actually run slower by multiprocessing than by single processing. To avoid this, make the amount of work inside the multiprocessed region as large as possible, as is shown in the following examples.
For more information about optimization, see Optimizing Applications on Cray X1 Series Systems (for the X1 series) and Optimizing Applications on Cray X2 Systems (for the X2 systems).

Example 1: Loop interchange. Consider the following code:

    DO K = 1, N
       DO I = 1, N
          DO J = 1, N
             A(I,J) = A(I,J) + B(I,K) * C(K,J)
          END DO
       END DO
    END DO

For the preceding code fragment, you can parallelize the J loop or the I loop. You cannot parallelize the K loop because different iterations of the K loop read and write the same values of A(I,J). Try to parallelize the outermost DO loop if possible, because it encloses the most work. In this example, that is the I loop. For this example, use the technique called loop interchange. Although the parallelizable loops are not the outermost ones, you can reorder the loops to make one of them outermost.

Thus, loop interchange would produce the following code fragment:

    !$OMP PARALLEL DO PRIVATE(I, J, K)
    DO I = 1, N
       DO K = 1, N
          DO J = 1, N
             A(I,J) = A(I,J) + B(I,K) * C(K,J)
          END DO
       END DO
    END DO

Now the parallelizable loop encloses more work and shows better performance. In practice, relatively few loops can be reordered in this way. However, it does occasionally happen that several loops in a nest of loops are candidates for parallelization. In such a case, it is usually best to parallelize the outermost one.

Occasionally, the only loop available to be parallelized has a fairly small amount of work. It may be worthwhile to force certain loops to run without parallelism or to select between a parallel version and a serial version, on the basis of the length of the loop.

Example 2: Conditional parallelism. The loop is worth parallelizing if N is sufficiently large. To overcome the parallel loop overhead, N needs to be around 1000, depending on the specific hardware and the context of the program. The optimized version would use an IF clause on the PARALLEL DO directive:

    !$OMP PARALLEL DO IF (N .GE. 1000), PRIVATE(I)
    DO I = 1, N
       A(I) = A(I) + X*B(I)
    END DO

8.4 Compiler Options that Affect OpenMP

These Cray Fortran compiler options enable or disable the OpenMP directives or determine the type of processing elements each thread runs on:

• Enable OpenMP directive recognition: -O task1 (default)
• Disable OpenMP directive recognition: -O 0, -O task0, or -x omp
• (X1 only) Compile the code to allow the threads to run on MSPs or SSPs: -O msp (default), -O ssp

8.5 OpenMP Program Execution

The -d depth option of the aprun command is required to reserve more than one physical processor for an OpenMP process. For best performance, depth should be the same as the maximum number of threads the program uses. This example shows how to reserve the physical processors:

    aprun -d depth ompProgram

(X1 only) If the program is compiled for MSP mode, depth must be less than or equal to 4; for SSP mode, it must be less than or equal to 16. If depth is not specified, the aprun command defaults depth to 1.

(X2 only) If the program is compiled for X2 systems, depth must be less than or equal to 4, the size of an X2 SMP node.

If the OMP_NUM_THREADS environment variable is not set, the program behaves as if OMP_NUM_THREADS is set to the same value as depth.

The aprun options -n processes and -N processes_per_node are compatible with OpenMP but do not directly affect the execution of OpenMP programs.

Cray Fortran Defined Externals [9]

This chapter describes global variables used by the Cray Fortran compiler targeting UNICOS/mp and UNICOS/lc systems.

9.1 Conformance Checks

The amount of error checking of edit descriptors with input/output (I/O) list items during formatted READ and WRITE statements can be selected through a loader option or through an environment variable. By default, only limited error checking is performed.

Use the loader options to choose the table to be used for the conformance check.
The table is then part of the executable and no environment variable is required to run the executable. The loader options allow a choice of checking or no checking with a particular version of the Fortran standard for formatted READ and WRITE. See the following tables: Table 17, page 202, Table 18, page 203, Table 19, page 203, and Table 20, page 203.

The environment variable FORMAT_TYPE_CHECKING is evaluated during execution. The environment variable overrides a table chosen through the loader option. The environment variable provides an intermediate type of checking that is not provided by the loader option. The environment variable FORMAT_TYPE_CHECKING is described in section 4.1.3.

To select the least amount of checking, use one or more of the following ftn command line options.

• On UNICOS/mp systems with formatted READ, use:

    ftn -Wl,-equiv,_RCHK=_RNOCHK ...

• On UNICOS/mp systems with formatted WRITE, use:

    ftn -Wl,-equiv,_WCHK=_WNOCHK *.f

• On UNICOS/mp systems with both formatted READ and WRITE, use:

    ftn -Wl,-equiv,_WCHK=_WNOCHK -Wl,-equiv,_RCHK=_RNOCHK *.f

• On UNICOS/lc systems with formatted READ, use (note the double dashes that precede defsym):

    ftn -Wl,--defsym,_RCHK=_RNOCHK *.f

• On UNICOS/lc systems with formatted WRITE, use:

    ftn -Wl,--defsym,_WCHK=_WNOCHK *.f

• On UNICOS/lc systems with both formatted READ and WRITE, use:

    ftn -Wl,--defsym,_WCHK=_WNOCHK -Wl,--defsym,_RCHK=_RNOCHK *.f

To select strict checking for either FORTRAN 77 or Fortran 90, use one or more of the following ftn command line options.

• On UNICOS/mp systems with formatted READ, use:

    ftn -Wl,-equiv,_RCHK=_RCHK77 *.f
    ftn -Wl,-equiv,_RCHK=_RCHK90 *.f

• On UNICOS/mp systems with formatted WRITE, use:

    ftn -Wl,-equiv,_WCHK=_WCHK77 *.f
    ftn -Wl,-equiv,_WCHK=_WCHK90 *.f

• On UNICOS/mp systems with both formatted READ and WRITE, use:

    ftn -Wl,-equiv,_WCHK=_WCHK77 -Wl,-equiv,_RCHK=_RCHK77 *.f
    ftn -Wl,-equiv,_WCHK=_WCHK90 -Wl,-equiv,_RCHK=_RCHK90 *.f

• On UNICOS/lc systems with formatted READ, use:

    ftn -Wl,--defsym,_RCHK=_RCHK77 *.f
    ftn -Wl,--defsym,_RCHK=_RCHK90 *.f

• On UNICOS/lc systems with formatted WRITE, use:

    ftn -Wl,--defsym,_WCHK=_WCHK77 *.f
    ftn -Wl,--defsym,_WCHK=_WCHK90 *.f

• On UNICOS/lc systems with both formatted READ and WRITE, use:

    ftn -Wl,--defsym,_WCHK=_WCHK77 -Wl,--defsym,_RCHK=_RCHK77 *.f
    ftn -Wl,--defsym,_WCHK=_WCHK90 -Wl,--defsym,_RCHK=_RCHK90 *.f

Part II: Cray Fortran and Fortran 2003 Differences

The Cray Fortran compiler is based on the Fortran 2003 standard. Part II documents only the differences between the Cray Fortran implementation and the Fortran standard. It is divided into the following chapters:

• Cray Fortran Language Extensions (Chapter 10, page 179)
• Cray Fortran Obsolete Features (Chapter 11, page 229)
• Cray Fortran Deferred Implementation and Optional Features (Chapter 12, page 257)
• Cray Fortran Implementation Specifics (Chapter 13, page 259)

Cray Fortran Language Extensions [10]

The Cray Fortran compiler supports several features beyond those specified by the standard. These features are referred to as extensions. The extensions described in this chapter include extensions widely implemented in other compilers and facilities designed to provide access to hardware features of the Cray X1 series and X2 systems. Also included are extensions that might become features in a future Fortran standard.
The implementation of such features in the compiler might be modified as needed in the future to conform to the new standard.

For information about obsolete features, see Obsolete Features (Chapter 11, page 229). The listings provided by the compiler identify language extensions when the -e n command line option is specified.

10.1 Characters, Lexical Tokens, and Source Form

10.1.1 Low-level Syntax

10.1.1.1 Characters Allowed in Names

Variables, named constants, program units, common blocks, procedures, arguments, constructs, derived types (types for structures), namelist groups, structure components, dummy arguments, and function results are among the elements in a program that have a name. As extensions, the Cray Fortran compiler permits the following characters in names:

    alphanumeric_character  is  currency_symbol
                            or  at_sign
    currency_symbol         is  $
    at_sign                 is  @

A name must begin with a letter and can consist of letters, digits, and underscores. The Cray Fortran compiler allows you to use the at sign (@) and dollar sign ($) in a name, but they cannot be the first character of a name.

Cray does not recommend using @ and $ in user names because they could cause conflicts with the names of internal variables or library routines.

10.1.1.2 Switching Source Forms

The Cray Fortran compiler allows you to switch between fixed and free source forms within a file or include file by using the FIXED and FREE compiler directives.

10.1.1.3 Continuation Line Limit

The Cray Fortran compiler allows a statement to have an unlimited number of continuation lines. The Fortran standard allows only 255 continuation lines.

10.1.1.4 D Lines in Fixed Source Form

The Cray Fortran compiler allows a D or d character to occur in column one in fixed source form. Typically, the compiler treats a line with a D or d character in column one as a comment line.
When the -e d command line option is in effect, however, the compiler replaces the D or d character with a blank and treats the rest of the line as a source statement. This can be used, for example, for debugging purposes if the rest of the line contains a PRINT statement.

This functionality is controlled through the -e d and -d d options on the compiler command line. For more information about these options, see the ftn(1) man page.

10.2 Types

10.2.1 The Concept of Type

The Cray Fortran compiler supports the following additional data types. This preserves compatibility with other vendors' systems.

• Cray pointer
• Cray character pointer
• Boolean (or typeless)

The Cray Fortran compiler also supports the TYPEALIAS statement as a means of creating alternate names for existing types and supports an expanded form of the ENUM statement.

10.2.1.1 Alternate Form of LOGICAL Constants

The Cray Fortran compiler accepts .T. and .F. as alternate forms of .true. and .false., respectively.

10.2.1.2 Cray Pointer Type

The Cray POINTER statement declares one variable to be a Cray pointer (that is, to have the Cray pointer data type) and another variable to be its pointee. The value of the Cray pointer is the address of the pointee. This POINTER statement has the following format:

    POINTER (pointer_name, pointee_name [ (array_spec) ])
            [, (pointer_name, pointee_name [ (array_spec) ]) ] ...

pointer_name
Pointer to the corresponding pointee_name. pointer_name contains the address of pointee_name. Only a scalar variable can be declared type Cray pointer; constants, arrays, statement functions, and external functions cannot.

pointee_name
Pointee of the corresponding pointer_name. Must be a variable name, array declarator, or array name. The value of pointer_name is used as the address for any reference to pointee_name; therefore, pointee_name is not assigned storage.
If pointee_name is an array declarator, it can be explicit-shape (with either constant or nonconstant bounds) or assumed-size.

array_spec
If present, this must be either an explicit_shape_spec_list (with either constant or nonconstant bounds) or an assumed_size_spec.

Fortran pointers are declared as follows:

    POINTER :: [ object-name-list ]

Cray Fortran pointers and Fortran standard pointers cannot be mixed.

Example:

    POINTER(P,B),(Q,C)

This statement declares Cray pointer P and its pointee B, and Cray pointer Q and its pointee C; the pointer's current value is used as the address of the pointee whenever the pointee is referenced.

An array that is named as a pointee in a Cray POINTER statement is a pointee array. Its array declarator can appear in a separate type or DIMENSION statement or in the pointer list itself. In a subprogram, the dimension declarator can contain references to variables in a common block or to dummy arguments. As with nonconstant bound array arguments to subprograms, the size of each dimension is evaluated on entrance to the subprogram, not when the pointee is referenced. For example:

    POINTER(IX, X(N,0:M))

In addition, pointees must not be deferred-shape or assumed-shape arrays. An assumed-size pointee array is not allowed in a main program unit.

You can use pointers to access user-managed storage by dynamically associating variables and arrays to particular locations in a block of storage. Cray pointers do not provide convenient manipulation of linked lists because, for optimization purposes, it is assumed that no two pointers have the same value. Cray pointers also allow the accessing of absolute memory locations.

The range of a Cray pointer or Cray character pointer depends on the size of memory for the machine in use.

Restrictions on Cray pointers are as follows:

• A Cray pointer variable should only be used to alias memory locations by using the LOC intrinsic.
• A Cray pointer cannot be pointed to by another Cray or Fortran pointer; that is, a Cray pointer cannot also be a pointee or a target.
• A Cray pointer cannot appear in a PARAMETER statement or in a type declaration statement that includes the PARAMETER attribute.
• A Cray pointer variable cannot be declared to be of any other data type.
• A Cray character pointer cannot appear in a DATA statement. For more information about Cray character pointers, see Section 10.2.1.3, page 186.
• An array of Cray pointers is not allowed.
• A Cray pointer cannot be a component of a structure.

Restrictions on Cray pointees are as follows:

• A Cray pointee cannot appear in a SAVE, STATIC, DATA, EQUIVALENCE, COMMON, AUTOMATIC, or PARAMETER statement or Fortran pointer statement.
• A Cray pointee cannot be a dummy argument; that is, it cannot appear in a FUNCTION, SUBROUTINE, or ENTRY statement.
• A function value cannot be a Cray pointee.
• A Cray pointee cannot be a structure component.
• An equivalence object cannot be a Cray pointee.

Note: Cray pointees can be of type character, but their Cray pointers are different from other Cray pointers; the two kinds cannot be mixed in the same expression.

The Cray pointer is a variable of type Cray pointer and can appear in a COMMON list or be a dummy argument in a subprogram.

The Cray pointee does not have an address until the value of the Cray pointer is defined; the pointee is stored starting at the location specified by the pointer. Any change in the value of a Cray pointer causes subsequent references to the corresponding pointee to refer to the new location.

Cray pointers can be assigned values in the following ways:

• A Cray pointer can be set as an absolute address. For example:

    Q = 0

• Cray pointers can have integer expressions added to or subtracted from them and can be assigned to or from integer variables. For example:

    P = Q + 100

However, Cray pointers are not integers.
For example, assigning a Cray pointer to a real variable is not allowed.

The (nonstandard) LOC(3i) intrinsic function generates the address of a variable and can be used to define a Cray pointer, as follows:

    P = LOC(X)

The following example uses Cray pointers in the ways just described:

    SUBROUTINE SUB(N)
    INTEGER WORDS
    COMMON POOL(100000), WORDS(1000)
    INTEGER BLK(128), WORD64
    REAL A(1000), B(N), C(100000-N-1000)
    POINTER(PBLK,BLK), (IA,A), (IB,B), &
           (IC,C), (ADDRESS,WORD64)
    ADDRESS = LOC(WORDS) + 64*KIND(WORDS)
    PBLK = LOC(WORDS)
    IA = LOC(POOL)
    IB = IA + 1000*KIND(POOL)
    IC = IB + N*KIND(POOL)

BLK is an array that is another name for the first 128 words of array WORDS. A is an array of length 1000; it is another name for the first 1000 elements of POOL. B follows A and is of length N. C follows B. A, B, and C are associated with POOL. WORD64 is the same as BLK(65) because BLK(1) is at the initial address of WORDS.

If a pointee is of a noncharacter data type that is one machine word or longer, the address stored in a pointer is a word address. If the pointee is of type character or of a data type that is less than one word, the address is a byte address.

The following example also uses Cray pointers:

    PROGRAM TEST
    REAL X(*), Y(*), Z(*), A(10)
    POINTER (P_X,X)
    POINTER (P_Y,Y)
    POINTER (P_Z,Z)
    INTEGER*8 I,J

    !USE LOC INTRINSIC TO SET POINTER MEMORY LOCATIONS
    !*** RECOMMENDED USAGE, AS PORTABLE CRAY POINTERS ***
    P_X = LOC(A(1))
    P_Y = LOC(A(2))

    !USE POINTER ARITHMETIC TO DEMONSTRATE COMPILER AND COMPILER
    !FLAG DIFFERENCES
    !*** USAGE NOT RECOMMENDED, HIGHLY NON-PORTABLE ***
    P_Z = P_X + 1
    I = P_Y
    J = P_Z
    IF ( I .EQ. J ) THEN
      PRINT *, 'NOT A BYTE-ADDRESSABLE MACHINE'
    ELSE
      PRINT *, 'BYTE-ADDRESSABLE MACHINE'
    ENDIF
    END

On Cray X1 series and X2 systems, this prints the following:

    BYTE-ADDRESSABLE MACHINE

Note: Cray does not recommend the use of pointer arithmetic because it is not portable.
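As a minimal sketch of the recommended LOC-only usage, the following program associates two pointee arrays with separate regions of one storage block without any pointer arithmetic. The names POOL_DEMO, POOL, A, B, IA, and IB are hypothetical and chosen only for illustration. The program reads results back through the pointees rather than through POOL, since the compiler may assume that a pointee does not overlay another variable.

```fortran
PROGRAM POOL_DEMO
! Hypothetical names; a sketch, not taken from the Cray manual.
REAL POOL(2000)
REAL A(1000), B(1000)
POINTER (IA, A), (IB, B)
INTEGER I

! Associate each pointee with its own region of POOL by taking
! the address of the first element of that region with LOC.
IA = LOC(POOL(1))
IB = LOC(POOL(1001))

! Define the storage through the pointees only.
DO I = 1, 1000
  A(I) = REAL(I)
  B(I) = -REAL(I)
END DO

! Read the values back through the pointees, not through POOL.
PRINT *, A(1), B(1000)
END PROGRAM POOL_DEMO
```

Because each address comes from LOC applied to an array element, the code does not depend on whether the machine uses word or byte addresses.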
For purposes of optimization, the compiler assumes that the storage of a pointee is never overlaid on the storage of another variable; that is, it assumes that a pointee is not associated with another variable or array. This kind of association occurs when a Cray pointer has two pointees, or when two Cray pointers are given the same value. Although these practices are sometimes used deliberately (such as for equivalencing arrays), results can differ depending on whether optimization is turned on or off. You are responsible for preventing such association. For example:

    POINTER(P,B), (P,C)
    REAL X, B, C
    P = LOC(X)
    B = 1.0
    C = 2.0
    PRINT *, B

Because B and C have the same pointer, the assignment of 2.0 to C gives the same value to B; therefore, B will print as 2.0 even though it was assigned 1.0.

As with a variable in common storage, a pointee, pointer, or argument to a LOC(3i) intrinsic function is stored in memory before a call to an external procedure and is read out of memory at its next reference. The variable is also stored before a RETURN or END statement of a subprogram.

10.2.1.3 Cray Character Pointer Type

If a pointee is declared as character type, its Cray pointer is a Cray character pointer.

Restrictions for Cray pointers also apply to Cray character pointers. In addition, the following restrictions apply:

• When included in an I/O statement iolist, a Cray character pointer is treated as an integer.
• If the length of the pointee is explicitly declared (that is, not of an assumed length), any reference to that pointee uses the explicitly declared length.
• If a pointee is declared with an assumed length (that is, as CHARACTER(*)), the length of the pointee comes from the associated Cray character pointer.
• A Cray character pointer can be used in a relational operation only with another Cray character pointer. Such an operation applies only to the character address and bit offset; the length field is not used.
10.2.1.4 Boolean Type

A Boolean constant represents the literal constant of a single storage unit. There are no Boolean variables or arrays, and there is no Boolean type statement. Binary, octal, and hexadecimal constants are used to represent Boolean values. For more information about Boolean expressions, see Section 10.4.1, page 191.

10.2.1.5 Alternate Form of ENUM Statement

An enumeration defines the name of a group of related values and the name of each value within the group. The Cray Fortran compiler allows the following additional form for enum_def (enumerations):

    enum_def_stmt  is  ENUM, [,BIND(C)] [[::] type_alias_name]
                   or  ENUM [ kind_selector ] [[ :: ] type_alias_name]

• kind_selector. If it is not specified, the compiler uses the default integer kind.
• type_alias_name is the name you assign to the group. This name is treated as a type alias name.

10.2.1.6 TYPEALIAS Statement

A TYPEALIAS statement allows you to define another name for an intrinsic data type or user-defined data type. Thus, the type alias and the type specification it aliases are interchangeable. Type aliases do not define a new type.

This is the form for type aliases:

    type_alias_stmt  is  TYPEALIAS :: type_alias_list
    type_alias       is  type_alias_name => type_spec

This example shows how a type alias can define another name for an intrinsic type, a user-defined type, and another type alias:

    TYPEALIAS :: INTEGER_64 => INTEGER(KIND = 8), &
                 TYPE_ALIAS => TYPE(USER_DERIVED_TYPE), &
                 ALIAS_OF_TYPE_ALIAS => TYPE(TYPE_ALIAS)

    INTEGER(KIND = 8) :: I
    TYPE(INTEGER_64) :: X, Y
    TYPE(TYPE_ALIAS) :: S
    TYPE(ALIAS_OF_TYPE_ALIAS) :: T

You can use a type alias or the data type it aliases interchangeably. That is, explicit or implicit declarations that use a type alias have the same effect as if the data type being aliased were used. For example, the above declarations of I, X, and Y are the same. Also, S and T are the same.
If the type being aliased is a derived type, the type alias name can be used to declare a structure constructor for the type.

The following are allowed as the type_spec in a TYPEALIAS statement:

• Any intrinsic type defined by the Cray Fortran compiler.
• Any type alias in the same scoping unit.
• Any derived type in the same scoping unit.

10.3 Data Object Declarations and Specifications

The Cray Fortran compiler accepts the following extensions to declarations.

10.3.1 Attribute Specification Statements

10.3.1.1 BOZ Constants in DATA Statements

The Cray Fortran compiler permits a default real object to be initialized with a BOZ, typeless, or character (used as Hollerith) constant in a DATA statement. BOZ constants are formatted in binary, octal, or hexadecimal. No conversion of the BOZ value, typeless value, or character constant takes place.

The Cray Fortran compiler permits an integer object to be initialized with a BOZ, typeless, or character (used as Hollerith) constant in a type declaration statement. The Cray Fortran compiler also allows an integer object to be initialized with a typeless or character (used as Hollerith) constant in a DATA statement.

If the last item in the data_object_list is an array name, the value list can contain fewer values than the number of elements in the array. Any element that is not assigned a value is undefined.

The following alternate forms of BOZ constants are supported:

    literal-constant               is  typeless-constant
    typeless-constant              is  octal-typeless-constant
                                   or  hexadecimal-typeless-constant
    octal-typeless-constant        is  digit [ digit... ] B
                                   or  " digit [ digit... ] "O
                                   or  ' digit [ digit... ] 'O
    hexadecimal-typeless-constant  is  X' hex-digit [ hex-digit... ] '
                                   or  X" hex-digit [ hex-digit... ] "
                                   or  ' hex-digit [ hex-digit... ] 'X
                                   or  " hex-digit [ hex-digit... ] "X

10.3.1.2 Attribute Respecification

The Cray Fortran compiler permits an attribute to appear more than once in a given type declaration.
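The DATA statement extensions of Section 10.3.1.1 can be sketched as follows. This is an illustration only: the program and variable names are hypothetical, and the comment about the real value assumes that the default real is a 32-bit IEEE value, which depends on the compiler options in effect.

```fortran
PROGRAM BOZ_DEMO
! Hypothetical names; a sketch of the DATA extensions only.
REAL R
INTEGER I, J, K

! Extension: a default real initialized with a BOZ constant.
! The bits are stored with no conversion. Z'3F800000' is the
! IEEE single-precision bit pattern of 1.0 (assumes a 32-bit
! default real).
DATA R / Z'3F800000' /

! Alternate typeless forms from the syntax above. Octal 17 and
! hexadecimal F both denote the decimal value 15.
DATA I / 17B /
DATA J / '17'O /
DATA K / 'F'X /

PRINT *, R
PRINT *, I, J, K
END PROGRAM BOZ_DEMO
```

These forms are extensions, so code that must also compile with other compilers should initialize such objects in another way.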
10.3.1.3 AUTOMATIC Attribute and Statement

The Cray Fortran AUTOMATIC attribute specifies stack-based storage for a variable or array. Such variables and arrays are undefined upon entering and exiting the procedure. The following is the format for the AUTOMATIC specification:

    type, AUTOMATIC [ , attribute-list ] :: entity-list

    automatic-stmt  is  AUTOMATIC [ :: ] entity-list

entity-list
For entity-list, specify a variable name or an array declarator. If an entity-list item is an array, it must be declared with an explicit-shape-spec with constant bounds. If an entity-list item is a pointer, it must be declared with a deferred-shape-spec.

If an entity-list item has the same name as the function in which it is declared, the entity-list item must be scalar and of type integer, real, logical, complex, or double precision.

If the entity-list item is a pointer, the AUTOMATIC attribute applies to the pointer itself and not to any target that may become associated with the pointer.

Subject to the rules governing combinations of attributes, attribute-list can contain the following:

    DIMENSION
    TARGET
    POINTER
    VOLATILE

The following entities cannot have the AUTOMATIC attribute:

• Pointers or arrays used as function results
• Dummy arguments
• Statement functions
• Automatic array or character data objects

An entity-list item cannot have the following characteristics:

• It cannot be defined in the scoping unit of a module.
• It cannot be a common block item.
• It cannot be specified more than once within the same scoping unit.
• It cannot be initialized with a DATA statement or with a type declaration statement.
• It cannot also have the SAVE or STATIC attribute.
• It cannot be specified as a Cray pointee.

10.3.2 IMPLICIT Statement

10.3.2.1 IMPLICIT Extensions

The Cray Fortran compiler accepts the IMPLICIT AUTOMATIC or IMPLICIT STATIC syntax.
It is recommended that none of the IMPLICIT extensions be used in new code.

10.3.3 Storage Association of Data Objects

10.3.3.1 EQUIVALENCE Statement Extensions

The Cray Fortran compiler allows equivalencing of character data with noncharacter data. The Fortran standard does not address this. It is recommended that you do not perform equivalencing in this manner, however, because alignment and padding differ across platforms, thus rendering your code less portable.

10.3.3.2 COMMON Statement Extensions

The Cray Fortran compiler treats named common blocks and blank common blocks identically, as follows:

• Variables in blank common and variables in named common blocks can be initialized.
• Named common blocks and blank common are always saved.
• Named common blocks of the same name and blank common can be of different sizes in different scoping units.

10.4 Expressions and Assignment

10.4.1 Expressions

In Fortran, calculations are specified by writing expressions. Expressions look much like algebraic formulas in mathematics, particularly when the expressions involve calculations on numerical values.

Expressions often involve nonnumeric values, such as character strings, logical values, or structures; these also can be considered to be formulas that involve nonnumeric quantities rather than numeric ones.

10.4.1.1 Rules for Forming Expressions

The Cray Fortran compiler supports exclusive disjunct expressions of the form:

    exclusive-disjunct-expr  is  [ exclusive-disjunct-expr .XOR. ] inclusive-disjunct-expr

10.4.1.2 Intrinsic and Defined Operations

Cray supports the following intrinsic operators as extensions:

    less_greater_op        is  .LG.
                           or  <>
    not_op                 is  .N.
    and_op                 is  .A.
    or_op                  is  .O.
    exclusive_disjunct_op  is  .XOR.
                           or  .X.

The Cray Fortran less than or greater than intrinsic operation is represented by the <> operator and the .LG. keyword.
This operation is suggested by the IEEE standard for floating-point arithmetic, and the Cray Fortran compiler supports this operator. Only values of type real can appear on either side of the <> or .LG. operators. If the operands are not of the same kind type value, the compiler converts them to equivalent kind types. The <> and .LG. operators perform a less-than-or-greater-than operation as specified in the IEEE standard for floating-point arithmetic. The Cray Fortran compiler allows abbreviations for the logical and masking operators. The abbreviations .A., .O., .N., and .X. are synonyms for .AND., .OR., .NOT., and .XOR., respectively. The masking of Boolean operators and their abbreviations, which are extensions to Fortran, can be redefined as defined operators. If you redefine a masking operator, your definition overrides the intrinsic masking operator definition. See Table 11, page 194, for a list of the operators. 192 S–3901–60Cray Fortran Language Extensions [10] 10.4.1.3 Intrinsic Operations In the following table, the symbols I, R, Z, C, L, B, and P stand for the types integer, real, complex, character, logical, Boolean, and Cray pointer, respectively. Where more than one type for x 2 is given, the type of the result of the operation is given in the same relative position in the next column. Boolean and Cray pointer types are extensions of the Fortran standard. Table 10. Operand Types and Results for Intrinsic Operations Intrinsic operator Type of x 1 Type of x 2 Type of result Unary +, - I, R, Z, B, P I, R, Z, I, P Binary +, -, *, /, ** I I, R, Z, B, P I, R, Z, I, P R I, R, Z, B R, R, Z, R Z I, R, Z Z, Z, Z B I, R, B, P I, R, B, P P I, B, P P, P, P (For Cray pointer, only + and - are allowed.) 
// C C C .EQ., ==, .NE., /= I I, R, Z, B, P L, L, L, L, L R I, R, Z, B, P L, L, L, L, L Z I, R, Z, B, P L, L, L, L, L B I, R, Z, B, P L, L, L, L, L P I, R, Z, B, P L, L, L, L, L C C L .GT., >, .GE., >=, .LT., <, .LE., <= I I, R, B, P L, L, L, L R I, R, B L, L, L C C L P I, P L, L .LG., <> R R L .NOT. L L S–3901–60 193Cray® Fortran Reference Manual Intrinsic operator Type of x 1 Type of x 2 Type of result I, R, B B .AND., .OR., .EQV., .NEQV., .XOR. L L L I, R, B I, R, B B The operators .NOT., .AND., .OR., .EQV., and .XOR. can also be used in the Cray Fortran compiler's bitwise masking expressions; these are extensions to the Fortran standard. The result is Boolean (or typeless) and has no kind type parameters. 10.4.1.4 Bitwise Logical Expressions A bitwise logical expression (also called a masking expression) is an expression in which a logical operator operates on individual bits within integer, real, Cray pointer, or Boolean operands, giving a result of type Boolean. Each operand is treated as a single storage unit. The result is a single storage unit, which is either 32 or 64 bits depending on the -s option specified during compilation. Boolean values and bitwise logical expressions use the same operators but are different from logical values and expressions. Table 11. Cray Fortran Intrinsic Bitwise Operators and the Allowed Types of their Operands Operator category Intrinsic operator Operand types Bitwise masking (Boolean) expressions .NOT., .AND., .OR., .XOR., .EQV., .NEQV. Integer, real, typeless, or Cray pointer. Bitwise logical operators can also be written as functions; for example A .AND. B can be written as IAND(A,B) and .NOT. A can be written as NOT(A). Table 12, page 195 shows which data types can be used together in bitwise logical operations. 194 S–3901–60Cray Fortran Language Extensions [10] Table 12. Data Types in Bitwise Logical Operations x 1 x 2 1 Integer Real Boolean Pointer Logical Character Integer Masking operation, Boolean result. 
Masking operation, Boolean result. Masking operation, Boolean result. Masking operation, Boolean result. Not valid Not valid2 Real Masking operation, Boolean result. Masking operation, Boolean result. Masking operation, Boolean result. Masking operation, Boolean result. Not valid Not valid2 Boolean Masking operation, Boolean result. Masking operation, Boolean result. Masking operation, Boolean result. Masking operation, Boolean result. Not valid Not valid2 Pointer Masking operation, Boolean result. Masking operation, Boolean result. Masking operation, Boolean result. Masking operation, Boolean result. Not valid Not valid2 Logical Not valid2 Not valid2 Not valid2 Not valid2 Logical operation logical result Not valid2 Character Not valid2 Not valid2 Not valid2 Not valid2 Not valid Not valid2 Bitwise logical expressions can be combined with expressions of Boolean or other types by using arithmetic, relational, and logical operators. Evaluation of an arithmetic or relational operator processes a bitwise logical expression with no type conversion. Boolean data is never automatically converted to another type. 1 x 1 and x 2 represent operands for a logical or bitwise expression, using operators .NOT., .AND., .OR., .XOR., .NEQV., and .EQV.. 2 Indicates that if the operand is a character operand of 32 or fewer characters, the operand is treated as a Hollerith constant and is allowed. S–3901–60 195Cray® Fortran Reference Manual A bitwise logical expression performs the indicated logical operation separately on each bit. The interpretation of individual bits in bitwise multiplication-exprs, summation-exprs, and general expressions is the same as for logical expressions. The results of binary 1 and 0 correspond to the logical results TRUE and FALSE, respectively, in each of the bit positions. These values are summarized as follows: .NOT. 1100 1100 1100 1100 1100 =0011 .AND. 1010 .OR. 1010 .XOR. 1010 .EQV. 
1010 ---- ---- ---- ---- 1000 1110 0110 1001 10.4.2 Assignment 10.4.2.1 Assignment The Cray Fortran compiler supports Boolean and Cray pointer intrinsic assignments. The Cray Fortran compiler supports type Boolean or BOZ constants in assignment statements in which the variable is of type integer or real. The bits specified by the constant are moved into the variable with no type conversion. 10.5 Execution Control 10.5.1 STOP Code Extension The STOP statement terminates the program whenever and wherever it is executed. The STOP statement is defined as follows: stop-stmt is STOP [stop_code] stop-code is scalar_char_constant or digit ... The character constant or list of digits identifying the STOP statement is optional and is called a stop-code. When the stop-code is a string of digits, leading zeros are not significant; 10 and 010 are the same stop-code. The Cray Fortran compiler accepts 1 to 80 digits; the standard accepts up to 5 digits. 196 S–3901–60Cray Fortran Language Extensions [10] The stop code is accessible following program termination. The Cray Fortran compiler sends it to the standard error file (stderr). The following are examples of STOP statements: STOP STOP 'Error #823' STOP 20 10.6 Input/Output Statements The Fortran standard does not specifically describe the implementation of I/O processing. This section provides information about processor-dependent areas and the implementation of the support for I/O. 10.6.1 File Connection 10.6.1.1 OPEN Statement The OPEN statement specifies the connection properties between the file and the unit, using keyword specifiers, which are described in this section. Table 13 indicates the Cray Fortran compiler extension in an OPEN statement. Table 13. 
Values for Keyword Specifier Variables in an OPEN Statement

    Specifier   Possible values   Default value
    FORM=       SYSTEM            Unformatted with no record marks

The FORM= specifier has the following format:

    FORM= scalar-char-expr

A file opened with SYSTEM is unformatted and has no record marks.

10.7 Error, End-of-record, and End-of-file Conditions

10.7.1 End-of-file Condition and the END-specifier

10.7.1.1 Multiple End-of-file Records

The file position prior to data transfer depends on the method of access: sequential or direct. Although the Fortran standard does not allow files that contain an end-of-file to be positioned after the end-of-file prior to data transfer, the Cray Fortran compiler permits more than one end-of-file for some file structures.

10.8 Input/Output Editing

10.8.1 Data Edit Descriptors

10.8.1.1 Integer Editing

The Cray Fortran compiler allows w to be zero for the G edit descriptor, and it permits w to be omitted for the I, B, O, Z, or G edit descriptors. The Cray Fortran compiler allows signed binary, octal, or hexadecimal values as input. If the minimum digits (m) field is specified, the default field width is increased, if necessary, to allow for that minimum width.

Note: UNICOS/mp and UNICOS/lc systems support 1- and 2-byte data types when the -eh compiler option is enabled. Cray discourages the use of this option because it can severely degrade performance. For more information about the -eh option, see Section 3.5, page 18.

10.8.1.2 Real Editing

The Cray Fortran compiler allows the use of B, O, and Z edit descriptors of REAL data items. The Cray Fortran compiler accepts the D[w.dEe] edit descriptor.

The Cray Fortran compiler accepts the ZERO_WIDTH_PRECISION environment variable, which can be used to modify the default size of the width w field. This environment variable is examined only upon program startup.
Changing the value of the environment variable during program execution has no effect. For more information about the ZERO_WIDTH_PRECISION environment variable, see Section 4.1.9, page 85.

The Cray Fortran compiler allows w to be zero or omitted for the D, E, EN, ES, or G edit descriptors. The Cray Fortran compiler does not restrict the use of Ew.d and Dw.d to an exponent less than or equal to 999. The Ew.dEe form must be used.

Table 14. Default Fractional and Exponent Digits

    Data size and representation   w    d    e
    4-byte (32-bit) IEEE           17   9    2
    8-byte (64-bit) IEEE           26   17   3
    16-byte (128-bit) IEEE         46   36   4

10.8.1.3 Logical Editing

The Cray Fortran compiler allows w to be zero or omitted on the L or G edit descriptors.

10.8.1.4 Character Editing

The Cray Fortran compiler allows w to be zero or omitted on the G edit descriptor.

10.8.2 Control Edit Descriptors

10.8.2.1 Q Editing

The Cray Fortran compiler supports the Q edit descriptor. The Q edit descriptor is used to determine the number of characters remaining in the input record. It has the following format:

    Q

When a Q edit descriptor is encountered during execution of an input statement, the corresponding input list item must be of type integer. Interpretation of the Q edit descriptor causes the input list item to be defined with a value that represents the number of characters remaining to be read in the formatted record. For example, if c is the character position within the current record of the next character to be read, and the record consists of n characters, then the item is defined with the value MAX(n-c+1,0). If no characters have yet been read, then the item is defined as n (the length of the record). If all the characters of the record have been read (c>n), then the item is defined as zero. The Q edit descriptor must not be encountered during the execution of an output statement.
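The value rule for the Q edit descriptor can be checked with a small calculation. The following is only a sketch in standard Fortran, not compiler source: the module name q_edit_value and the function chars_remaining are hypothetical names introduced for illustration, and the function merely evaluates MAX(n-c+1,0) for a record of n characters whose next unread character is at position c.

```fortran
module q_edit_value
  implicit none
contains
  ! Hypothetical helper (not a Cray intrinsic): the value a Q edit
  ! descriptor would define for a record of n characters when the next
  ! character to be read is at position c.
  pure function chars_remaining(n, c) result(q)
    integer, intent(in) :: n   ! record length in characters
    integer, intent(in) :: c   ! position of the next character to read
    integer :: q
    q = max(n - c + 1, 0)
  end function chars_remaining
end module q_edit_value
```

Under these assumptions, an 80-character record gives chars_remaining(80, 1) = 80 before anything is read, chars_remaining(80, 11) = 70 after ten characters are read, and chars_remaining(80, 81) = 0 once the record is exhausted.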
The following example code uses Q on input:

    INTEGER N
    CHARACTER LINE * 80
    READ (*, FMT='(Q,A)') N, LINE(1:N)

10.8.3 List-directed Formatting

10.8.3.1 List-directed Input

Input values are generally accepted as list-directed input if they are the same as those required for explicit formatting with an edit descriptor. The exceptions are as follows:

• When the data list item is of type integer, the constant must be of a form suitable for the I edit descriptor. The Cray Fortran compiler permits binary, octal, and hexadecimal based values in a list-directed input record to correspond to I edit descriptors.

10.8.4 Namelist Formatting

10.8.4.1 Namelist Extensions

The Cray Fortran compiler has extended the namelist feature. The following additional rules govern namelist processing:

• An ampersand (&) or dollar sign ($) can precede the namelist group name or terminate namelist group input. If an ampersand precedes the namelist group name, either the slash (/) or the ampersand must terminate the namelist group input. If the dollar sign precedes the namelist group name, either the slash or the dollar sign must terminate the namelist group input.

• Octal and hexadecimal constants are allowed as input to integer and single-precision real namelist group items. An error is generated if octal and hexadecimal constants are specified as input to character, complex, or double-precision real namelist group items.

  Octal constants must be of the following form:
  – O"123"
  – O'123'
  – o"123"
  – o'123'

  Hexadecimal constants must be of the following form:
  – Z"1a3"
  – Z'1a3'
  – z"1a3"
  – z'1a3'

10.8.5 I/O Editing

Usually, data is stored in memory as the values of variables in some binary form. On the other hand, formatted data records in a file consist of characters. Thus, when data is read from a formatted record, it must be converted from characters to the internal representation.
When data is written to a formatted record, it must be converted from the internal representation into a string of characters.

Table 15 and Table 16 list the control and data edit descriptor extensions supported by the Cray Fortran compiler and provide a brief description of each.

Table 15. Summary of Control Edit Descriptors

    Descriptor   Description
    $ or \       Suppress carriage control

Table 16. Summary of Data Edit Descriptors

    Descriptor   Description
    Q            Return number of characters left in record

For more information about the Q edit descriptor, see Section 10.8.2.1, page 199.

The following tables show the use of the Cray Fortran compiler's edit descriptors with all intrinsic data types. In these tables:

• NA indicates invalid usage that is not allowed.
• I,O indicates that usage is allowed for both input and output.
• I indicates legal usage for input only.

Table 17. Default Compatibility Between I/O List Data Types and Data Edit Descriptors

    Data types   Q    Z    R    O    L    I    G    F    ES   EN   E    D    B    A
    Integer      I    I,O  I,O  I,O  NA   I,O  I,O  NA   NA   NA   NA   NA   I,O  I,O
    Real         NA   I,O  I,O  I,O  NA   NA   I,O  I,O  I,O  I,O  I,O  I,O  I,O  I,O
    Complex      NA   I,O  I,O  I,O  NA   NA   I,O  I,O  I,O  I,O  I,O  I,O  I,O  I,O
    Logical      NA   I,O  I,O  I,O  I,O  NA   I,O  NA   NA   NA   NA   NA   I,O  I,O
    Character    NA   NA   NA   NA   NA   NA   I,O  NA   NA   NA   NA   NA   NA   I,O

Table 18, page 203 shows the restrictions for the various data types that are allowed when you set the FORMAT_TYPE_CHECKING environment variable to RELAXED. Not all data edit descriptors support all data sizes; for example, you cannot read/write a 16-byte real variable with an I edit descriptor.

Table 18.
RELAXED Compatibility Between Data Types and Data Edit Descriptors

    Data types   Q    Z    R    O    L    I    G    F    ES   EN   E    D    B    A
    Integer      I    I,O  I,O  I,O  I,O  I,O  I,O  I,O  I,O  I,O  I,O  NA   I,O  I,O
    Real         NA   I,O  I,O  I,O  I,O  I,O  I,O  I,O  I,O  I,O  I,O  I,O  I,O  I,O
    Complex      NA   I,O  I,O  I,O  NA   NA   I,O  I,O  I,O  I,O  I,O  I,O  I,O  I,O
    Logical      NA   I,O  I,O  I,O  I,O  I,O  I,O  I,O  I,O  I,O  I,O  NA   I,O  I,O
    Character    NA   NA   NA   NA   NA   NA   I,O  NA   NA   NA   NA   NA   NA   I,O

Table 19 shows the restrictions for the various data types that are allowed when you set the FORMAT_TYPE_CHECKING environment variable to STRICT77.

Table 19. STRICT77 Compatibility Between Data Types and Data Edit Descriptors

    Data types   Q    Z    R    O    L    I    G    F    ES   EN   E    D    B    A
    Integer      NA   I,O  NA   I,O  NA   I,O  NA   NA   NA   NA   NA   NA   I,O  NA
    Real         NA   NA   NA   NA   NA   NA   I,O  I,O  NA   NA   I,O  I,O  NA   NA
    Complex      NA   NA   NA   NA   NA   NA   I,O  I,O  NA   NA   I,O  I,O  NA   NA
    Logical      NA   NA   NA   NA   I,O  NA   NA   NA   NA   NA   NA   NA   NA   NA
    Character    NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   I,O

Table 20 shows the restrictions for the various data types that are allowed when you set the FORMAT_TYPE_CHECKING environment variable to STRICT90 or STRICT95.

Table 20. STRICT90 and STRICT95 Compatibility Between Data Types and Data Edit Descriptors

    Data types   Q    Z    R    O    L    I    G    F    ES   EN   E    D    B    A
    Integer      NA   I,O  NA   I,O  NA   I,O  I,O  NA   NA   NA   NA   NA   I,O  NA
    Real         NA   NA   NA   NA   NA   NA   I,O  I,O  I,O  I,O  I,O  I,O  NA   NA
    Complex      NA   NA   NA   NA   NA   NA   I,O  I,O  I,O  I,O  I,O  I,O  NA   NA
    Logical      NA   NA   NA   NA   I,O  NA   I,O  NA   NA   NA   NA   NA   NA   NA
    Character    NA   NA   NA   NA   NA   NA   I,O  NA   NA   NA   NA   NA   NA   I,O

10.9 Program Units

10.9.1 Main Program

10.9.1.1 Program Statement Extension

The Cray Fortran compiler supports the use of a parenthesized list of args at the end of a program statement. The compiler ignores any args specified after program-name.

10.9.2 Block Data Program Units

10.9.2.1 Block Data Program Unit Extension

The Cray Fortran compiler permits named common blocks to appear in more than one block data program unit.
10.10 Procedures

10.10.1 Procedure Interface

10.10.1.1 Interface Duplication

The Cray Fortran compiler allows you to specify an interface body for the program unit being compiled if the interface body matches the program unit definition.

10.10.2 Procedure Definition

10.10.2.1 Recursive Function Extension

The Cray Fortran compiler allows direct recursion for functions that do not specify a RESULT clause on the FUNCTION statement.

10.10.2.2 Empty CONTAINS Sections

The Cray Fortran compiler allows a CONTAINS statement with no internal or module procedure following. This is proposed for the 2008 Fortran standard.

10.11 Intrinsic Procedures and Modules

10.11.1 Standard Generic Intrinsic Procedures

10.11.1.1 Intrinsic Procedures

The Cray Fortran compiler has implemented intrinsic procedures in addition to the ones required by the standard. These procedures have the status of intrinsic procedures, but programs that use them may not be portable. It is recommended that such procedures be declared INTRINSIC to allow other processors to diagnose whether or not they are intrinsic for those processors.

The nonstandard intrinsic procedures supported by the Cray Fortran compiler that are not obsolete are summarized in the following list. For more information about a particular procedure, see its man page.
    ACOSD            Arccosine, value in degrees
    ADD_CARRY@       Add vectors with carry
    ADD_CARRY_S@     Add scalars with carry
    AMO_AADD         Atomic memory add
    AMO_AFADD        Atomic memory add, return old
    AMO_AAX          Atomic memory logicals
    AMO_AFAX         Atomic memory logicals, return old
    AMO_ACSWAP       Atomic compare and swap
    ASIND            Arcsine, value in degrees
    ATAND            Arctangent, value in degrees
    ATAN2D           Arctangent, value in degrees
    COSD             Cosine, argument in degrees
    COT              Cotangent
    DSHIFTL          Double word left shift
    DSHIFTR          Double word right shift
    END_CRITICAL     End of a critical region
    EXIT             Program termination
    FREE             Free Cray pointee memory
    GET_BORROW@      Get vector borrow bits
    GET_BORROW_S@    Get scalar borrow bit
    GSYNC            Complete outstanding memory references
    IBCHNG           Reverse bit within a word
    ILEN             Length in bits of an integer
    INT_MULT_UPPER   Upper bits of integer product
    LEADZ            Number of leading 0 bits
    LOC              Address of argument
    LOG2_IMAGES      Logarithm base 2 of number of images
    M@CLR            Clears BML bit
    M@LD             Bit matrix load
    M@LDMX           Combined bit matrix load and multiply
    M@MOR            Bit matrix inclusive or
    M@MX             Bit matrix multiply
    M@UL             Bit matrix unload
    MALLOC           Allocate Cray pointee memory
    MASK             Creates a bit mask in a word
    NUMARG           Number of arguments in a call
    NUM_IMAGES       Number of executing images
    POPCNT           Number of 1 bits in a word
    POPPAR           XOR reduction of bits in a word
    QPROD            Quad precision product
    REM_IMAGES       Mod (num_images(), 2**log2_images())
    SET_BORROW@      Set vector borrow bits
    SET_BORROW_S@    Set scalar borrow bits
    SET_CARRY@       Set vector carry bits
    SET_CARRY_S@     Set scalar carry bits
    SHIFTA           Arithmetic right shift
    SHIFTL           Left shift, zero fill
    SHIFTR           Right shift, zero fill
    SIND             Sine, argument in degrees
    SIZEOF           Size of argument in bytes
    SSPID@           SSP number within an MSP (0..3) (X1 only)
    START_CRITICAL   Begin critical region
    STREAMING@       Indicates if streaming is allowed (X1 only)
    SUB_BORROW@      Subtract vector with borrow
    SUB_BORROW_S@    Subtract scalar with borrow
    SYNC_ALL         Synchronize all images
    SYNC_FILE        Synchronize file access among images
    SYNC_IMAGES      Synchronize indicated images
    SYNC_MEMORY      Memory barrier (same as GSYNC)
    SYNC_TEAM        Synchronize a team of images
    TAND             Tangent, argument in degrees
    THIS_IMAGE       Image number of executing image

All Cray Fortran intrinsic procedures are described in man pages that can be accessed online through the man(1) command.

Many intrinsic procedures have both a vector and a scalar version. If a vector version of an intrinsic procedure exists, and the intrinsic is called within a vectorizable loop, the compiler uses the vector version of the intrinsic. For information about which intrinsic procedures vectorize, see intro_intrin(3i).

10.12 Exceptions and IEEE Arithmetic

10.12.1 The Exceptions

10.12.1.1 IEEE Intrinsic Module Extensions

The intrinsic module IEEE_EXCEPTIONS supplied with the Cray Fortran compiler contains three named constants in addition to those specified by the standard. These are of type IEEE_STATUS_TYPE and can be used as arguments to the IEEE_SET_STATUS subroutine. Their definitions correspond to common combinations of settings and allow for simple and fast changes to the IEEE mode settings. The constants are:

Table 21.
Cray Fortran IEEE Intrinsic Module Extensions

    Name                    Effect of CALL IEEE_SET_STATUS (Name)

    ieee_cri_silent_mode    • Clears all currently set exception flags
                            • Disables halting for all exceptions
                            • Disables setting of all exception flags
                            • Sets rounding mode to round_to_nearest

    ieee_cri_nostop_mode    • Clears all currently set exception flags
                            • Disables halting for all exceptions
                            • Enables setting of all exception flags
                            • Sets rounding mode to round_to_nearest

    ieee_cri_default_mode   • Clears all currently set exception flags
                            • Enables halting for overflow, divide_by_zero, and invalid
                            • Disables halting for underflow and inexact
                            • Enables setting of all exception flags
                            • Sets rounding mode to round_to_nearest

10.13 Interoperability With C

10.13.1 Interoperability Between Fortran and C Entities

10.13.1.1 BIND(C) Syntax

The proc-language-binding-spec specification allows Fortran programs to interoperate with C objects. The optional commas in SUBROUTINE name(), BIND(C) and FUNCTION name(), BIND(C) are Cray extensions to the Fortran standard.

10.14 Co-arrays

The Cray Fortran compiler implements co-arrays as a mechanism for data exchange in parallel programs. Data passing has proven itself to be an effective method for programming single-program-multiple-data (SPMD) parallel computation. Its chief advantage over message passing is lower latency for data transfers, which leads to better scalability of parallel applications. Co-arrays are a syntactic extension to the Fortran Language that offers a method for programming data passing.

Data passing can also be accomplished by using the shared memory (SHMEM) library routines. Using SHMEM, the program transfers data from an object on one processing element to an object on another via subroutine calls. This technique is often referred to as one-sided communication. Co-arrays provide an alternative syntax for specifying these transfers.
With co-arrays, the concept of a processing element is replaced by the concept of an image. When data objects are declared as co-arrays, the corresponding co-arrays on different images can be referenced or defined in a fashion similar to the way in which arrays are referenced or defined in Fortran. This is done by adding additional dimensions, or co-dimensions, within brackets ([ ]) to an object's declarations and references. These extra dimensions express the image upon which the object resides. Since no subroutine calls are involved in data passing using co-arrays, this technique is referred to as zero-sided communication.

Co-arrays offer the following advantages over SHMEM:

• Co-arrays are syntax-based, so programs that use them can be analyzed and optimized by the compiler. This offers greater opportunity for hiding data transfer latency.
• Co-array syntax can eliminate the need to create and copy data to local temporary arrays.
• Co-arrays express data transfer naturally through the syntax of the language, making the code more readable and maintainable.
• The unique bracket syntax allows you to scan for and to identify communication in a program easily.

Consider the following SHMEM code fragment from a finite differencing algorithm:

    CALL SHMEM_REAL_GET(T1, U, NROW, LEFT)
    CALL SHMEM_REAL_GET(T2, U, NROW, RIGHT)
    NU(1:NROW) = NU(1:NROW) + T1(1:NROW) + T2(1:NROW)

Co-arrays can be used to express this fragment simply as:

    NU(1:NROW) = NU(1:NROW) + U(1:NROW)[LEFT] + U(1:NROW)[RIGHT]

Notice that the resulting code is more concise and easier to read, and that the copies to local temporary objects T1 and T2 are eliminated.

Co-arrays can interoperate with the other message passing and data passing models. This interoperability allows you to introduce co-arrays gradually into codes that presently use the Message Passing Interface (MPI) or SHMEM.
This chapter describes the syntax and semantics of the co-array extension to the Cray Fortran compiler.

The following technical papers may be of use to you when using co-arrays:

• R. W. Numrich and J. Reid, Co-array Fortran for Parallel Programming, vol. 17, Number 2 (ACM Fortran Forum, 1998), 1-31
  You can also access the document at this address: ftp://matisa.cc.rl.ac.uk/pub/reports/nrRAL98060.ps.gz
• R. W. Numrich, J. L. Steidel, B. H. Johnson, B. D. de Dinechin, G. W. Elsesser, G. S. Fischer, and T. A. MacDonald, Definition of the F-- Extension to Fortran 90, Proceedings of the 10th International Workshop on Languages and Compilers for Parallel Computers, Lectures on Computer Science Series, Number 1366, (Springer-Verlag, 1998), 282-306

10.14.1 Execution Model and Images

Programs with Cray Fortran co-arrays use the single-program-multiple-data (SPMD) execution model. In the SPMD model, the program and all its data are replicated and executed asynchronously. Each replication of the program is an image. Each image is executed on a processing element. Images are numbered consecutively starting with one.

Note: (X1 only) The processing element type that an image runs on (multistreaming processor (MSP) or single streaming processor (SSP)) is determined at the command line of the Cray Fortran compiler. See Section 10.15.1, page 225.

The total number of images that are executing can be accessed through the NUM_IMAGES intrinsic function. An image can access its own image number through the THIS_IMAGE intrinsic function. Images can synchronize through the SYNC_ALL intrinsic subroutine.

10.14.2 Specifying Co-arrays

A co-array is a data object that is identically allocated on each image and, more significantly, can be directly referenced by any other image syntactically.

A co-array specification consists of the local object specification and the co-dimensions specification.
The local object is the data object to be replicated on each image. The co-dimensions are the dimensions of the co-array, which are specified within brackets ([ ]) and appended to the specification for the local object.

Example 1. The following statements show co-array declarations:

    REAL, DIMENSION(20)[8,*] :: A, C
    REAL :: B(20)[8,*], D[*], E[0:*]
    INTEGER :: IB(10)[*]

Note: Generally, a co-dimension specification in brackets takes the same form as a dimension specification in parentheses. The exception is that for co-dimensions, the upper bound of the right-most co-dimension must be an asterisk (*). This is because co-array objects are replicated on all images, so co-size is always equal to NUM_IMAGES.

Elements of co-arrays on other images can be referenced by appending square brackets to the end of a reference to the local object. As the following shows, the brackets contain subscripts, one for each co-dimension:

    A(5)[7,3] = IB(5)[3]
    D[11] = E
    A(:)[2,3] = C(1)[1,1]

The co-dimension specification of a co-array creates a mapping of subscripts to images. This mapping is identical to the mapping that parenthesized array dimensions create between subscripts and elements of an array. For example, the following table lists the image number for some references of the objects declared in Example 1:

    Reference    Image
    IB(5)[3]     3
    A(5)[7,3]    23
    D[11]        11
    E[11]        12

The terms local rank, local size, and local shape refer to the rank, size, and shape of the local object of a co-array. The terms co-rank, co-size, and co-shape refer to those properties implied by the co-dimensions of a co-array. For example, for co-array A declared in Example 1, its local rank is 1; its local size is 20; its co-rank is 2; and its co-size is equal to NUM_IMAGES. The co-rank of a co-array cannot exceed 7.
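The mapping of co-subscripts to image numbers described above follows the usual column-major rule for array subscripts, and it can be sketched in standard Fortran. This is an illustration under stated assumptions, not a Cray intrinsic: the module name coarray_image_map, the function image_of, and its argument names are hypothetical. The function takes the co-subscripts, the lower co-bounds, and the extents of every co-dimension except the last (which is always *).

```fortran
module coarray_image_map
  implicit none
contains
  ! Hypothetical helper (not part of the Cray compiler): returns the
  ! image number selected by a set of co-subscripts, using the same
  ! column-major rule that maps array subscripts to array elements.
  !   cosub(k)    : co-subscript for co-dimension k
  !   colower(k)  : lower co-bound of co-dimension k
  !   coextent(k) : extent of co-dimension k, for k = 1 .. co-rank - 1
  !                 (the last co-dimension is "*" and has no fixed extent)
  pure function image_of(cosub, colower, coextent) result(image)
    integer, intent(in) :: cosub(:), colower(:), coextent(:)
    integer :: image
    integer :: k, stride

    image = 1        ! images are numbered starting with one
    stride = 1
    do k = 1, size(cosub)
      image = image + (cosub(k) - colower(k)) * stride
      ! Only the leading co-dimensions contribute to the stride.
      if (k < size(cosub)) stride = stride * coextent(k)
    end do
  end function image_of
end module coarray_image_map
```

For the declarations in Example 1, A has co-dimensions [8,*], so image_of([7,3], [1,1], [8]) evaluates 1 + (7-1) + (3-1)*8 = 23. E has co-dimensions [0:*], so image_of([11], [0], [integer ::]) evaluates 1 + (11-0) = 12.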
The local object of a co-array can be of a derived type, but a co-array cannot be a component within a derived type. For example:

    TYPE DTYPE1
       REAL :: X
       REAL :: Y
    END TYPE DTYPE1
    TYPE(DTYPE1) :: DT(100)[*]   ! PERMITTED: CO-ARRAY OF DERIVED TYPE

    TYPE DTYPE2
       REAL :: X
       REAL :: Y[*]              ! NOT PERMITTED:
                                 ! CO-ARRAY IN DERIVED TYPE
    END TYPE DTYPE2

Most objects can be the local object of a co-array, but the following list indicates restrictions on co-array specifications:

• Co-arrays with assumed-size local size are not supported. For example:

    REAL :: Y(*)[*]                 ! NOT SUPPORTED: LOCAL OBJECT ASSUMED SIZE

• Co-arrays with deferred-shape local shape or co-shape are supported, but the co-array must be allocatable. Co-array pointers are not supported. For example:

    REAL, ALLOCATABLE :: WA(:)[:]   ! SUPPORTED: ALLOCATABLE
    REAL, POINTER :: WP(:)[:]       ! NOT SUPPORTED: POINTER

• Co-arrays with assumed-shape local shape or co-shape are not supported. For example:

    SUBROUTINE S1( Z1, Z2 )
    REAL :: Z1(:)[*]                ! NOT SUPPORTED: ASSUMED-SHAPE LOCAL SHAPE
    REAL :: Z2(:)[:]                ! NOT SUPPORTED: ASSUMED-SHAPE CO-SHAPE

• Automatic co-arrays are not supported. For example:

    SUBROUTINE S2( A, N )
    REAL :: A(N)[*]                 ! SUPPORTED: CO-ARRAY DUMMY ARGUMENT
    REAL :: W(N)[*]                 ! NOT SUPPORTED: AUTOMATIC LOCAL OBJECT

10.14.3 Referencing Co-arrays

Co-arrays can be referenced two ways: with brackets and without brackets.

When brackets are omitted, the object local to the invoking image is referenced; this is called a local reference. For example:

    REAL, DIMENSION(100)[*] :: A, B, C, D, E
    A(I) = B(I) + C(I)      ! LOCAL REFERENCES TO A, B, C
    D = E                   ! LOCAL REFERENCES TO D, E

When brackets are specified, the object on the image specified by the subscripts within the brackets is referenced. This is called a bracket reference. For example:

    A(I)[IP] = B(I) + C(I)  ! REFERENCE TO A ON IMAGE "IP";
                            ! LOCAL REFERENCES TO B, C
    D(:) = E(:)[IP2]        ! REFERENCES TO E ON IMAGE "IP2"
                            ! LOCAL REFERENCES TO D

Components of derived type co-arrays are specified by appending the component specification after the brackets. For example:

    TYPE DTYPE3
       REAL :: X(100)
       INTEGER :: ICNT
    END TYPE DTYPE3
    TYPE (DTYPE3) :: DT3[*]
    DT3%ICNT = DT3[IP]%ICNT   ! SUPPORTED: BRACKET IN DERIVED TYPE
    DT3%X(J) = DT3[IP]%X(J)   ! COMPONENT REFERENCES

The co-subscripts of a co-array reference must translate to an image number between 1 and NUM_IMAGES; otherwise the behavior of the reference is undefined.

There is a restriction for co-array references. Specification of subscripts for co-dimensions generally follows the specification of subscripts within parentheses. However, triplet subscript notation within brackets is not supported. For example:

    D(K)[1:N:2] = E(K)[1:N:2]   ! NOT SUPPORTED:
                                ! TRIPLET NOTATION IN []S

10.14.4 Initializing Co-arrays

Co-arrays can be initialized using the DATA statement, but only the initialization of the local object can be specified. Bracket references are not allowed in a DATA statement. For example:

    REAL :: AI(100)[*]
    DATA AI(3) /1.0/        ! PERMITTED
    DATA AI(3)[11] /1.0/    ! NOT PERMITTED

When the program is executed, the co-array local objects on every image are initialized identically, as specified.

10.14.5 Using Co-arrays with Procedure Calls

If a procedure with a co-array dummy argument is called, the called procedure must have an explicit interface, and the actual argument must be a local reference to a co-array. If the actual argument has subscripts, their values should be the same across all images; otherwise the program behavior is undefined. For example:

    INTERFACE
       SUBROUTINE S3( A, N )
       REAL :: A(N)[*]
       END SUBROUTINE S3
    END INTERFACE
    REAL :: X(100,100), Y(100,100)[*]
    CALL S3( X(1,K), 100 )   ! NOT PERMITTED:
                             ! LOCAL ACTUAL, CO-ARRAY DUMMY
    CALL S3( Y(1,K), 100 )   ! PERMITTED: CO-ARRAY ACTUAL AND DUMMY;
                             ! UNDEFINED IF "K" NOT SAME VALUE ON
                             ! ALL IMAGES

Bracket references cannot appear as actual arguments in subroutine calls or function calls. For example:

    CALL S3( Y(1,K)[IP], 100 )       ! NOT PERMITTED: BRACKET ACTUAL

Co-array bracket references can appear within an actual argument, but only as part of an expression that is passed as the actual argument. Parentheses can be used to turn a bracket reference into an expression. For example:

    CALL S3( ( Y(1,K)[IP] ), 100 )   ! PERMITTED: ACTUAL IS EXPRESSION

The rules of resolving generic procedure references are the same as those in the Fortran standard.

The following restrictions affect co-arrays used in procedures:

• A function result is not permitted to be a co-array.
• A pure procedure is not permitted to contain any co-arrays.

10.14.6 Specifying Co-arrays in COMMON and EQUIVALENCE Statements

Co-arrays can be specified in COMMON statements. For example:

    COMMON /CCC/ W1(100)[*], W2(100)[16,*]   ! PERMITTED:
                                             ! CO-ARRAYS IN COMMON

The layout of the common block on any one image is as if all objects of the common block were declared without co-dimensions. Data objects that are not co-array data objects can appear in the same common block as co-arrays.

Co-arrays can be specified in EQUIVALENCE statements, but bracket references cannot appear in EQUIVALENCE statements. For example:

    REAL :: V1(100)[*], V2(100)[*], V3(100)
    EQUIVALENCE ( V1(50), V2(1) )          ! PERMITTED: CO-ARRAYS
    EQUIVALENCE ( V1(1)[16], V2(1)[1] )    ! NOT PERMITTED:
                                           ! SQUARE BRACKETS

Data objects that are not co-array data objects cannot be equivalenced to co-array data objects. For example:

    EQUIVALENCE (V1(50), V3(1))            ! NOT PERMITTED: V3 NOT
                                           ! CO-ARRAY OBJECT

10.14.7 Allocatable Co-arrays

A co-array can be allocatable. Co-dimensions are specified by appending brackets containing the co-dimension specification to the co-array local specification in an ALLOCATE statement.
For example:

    REAL, ALLOCATABLE :: A1(:)[:], A2(:)[:,:]
    ALLOCATE ( A1(10)[*] )          ! PERMITTED: ALLOCATABLE CO-ARRAY
    ALLOCATE ( A2(24)[0:7,0:*] )

As with the specification of statically allocated co-arrays, the upper bound of the final co-dimension must be an asterisk (*) and the values of all other bounds must be identical across all images.

Caution: Execution of ALLOCATE and DEALLOCATE statements containing co-array objects causes an implicit barrier synchronization of all images. All images must participate in the execution of these statements, or deadlock can occur.

10.14.8 Pointer Components in Derived Type Co-arrays

A pointer cannot be declared as a co-array, but a co-array can be of a derived type containing a pointer component. This enables construction of irregularly sized data structures across images and indirect addressing of non-co-array data. For example:

    TYPE DTYPE4
       INTEGER :: LEN
       REAL, POINTER :: AP(:)
    END TYPE DTYPE4
    TYPE(DTYPE4) :: D4[*]   ! PERMITTED: CO-ARRAY OF DERIVED
                            ! TYPE CONTAINING POINTER

To help prevent the possibility of pointers being assigned invalid data, co-array bracket references cannot appear in pointer assignment statements. For example:

    REAL :: Q(100)
    D4[IP]%AP => Q          ! NOT PERMITTED: BRACKET IN
    Q => D4[IP]%AP          ! POINTER ASSIGNMENT

Pointer components of a co-array can be associated only with a local target, either through pointer assignment or allocation.

10.14.9 Allocatable Components in Derived Type Co-arrays

Co-array derived types are allowed to have allocatable components. This enables construction of irregularly sized data structures across images. For example:

    TYPE DTYPE4
       INTEGER :: LEN
       REAL, ALLOCATABLE :: AP(:)
    END TYPE DTYPE4
    TYPE(DTYPE4) :: D4[*]   ! PERMITTED: CO-ARRAY OF DERIVED
                            ! TYPE CONTAINING ALLOCATABLE COMPONENT

A bracket reference to an allocatable component in a derived type co-array returns the value from the object on the specified image.
For example, the reference D4[7]%AP(22) returns the value of D4%AP(22) as evaluated on image 7. Allocatable components are allocated independently on each image. The allocation must not include square brackets.

10.14.10 Intrinsic Procedures

These co-array intrinsics return information about images:

• LOG2_IMAGES returns the base 2 logarithm of the number of executing images, truncated to an integer
• NUM_IMAGES returns the total number of Co-array Fortran images executing
• REM_IMAGES returns MOD(NUM_IMAGES(), 2**LOG2_IMAGES())
• THIS_IMAGE returns the index of, or co-subscripts related to, the invoking image

Only NUM_IMAGES, LOG2_IMAGES, and REM_IMAGES can appear in specification statements. None of the intrinsics are permitted in initialization expressions.

These co-array intrinsic subroutines synchronize access to co-array data among the images:

• SYNC_ALL
• SYNC_TEAM
• SYNC_MEMORY
• START_CRITICAL and END_CRITICAL
• SYNC_FILE

The following sections contain more information about these intrinsic procedures.

10.14.11 Program Synchronization

Co-arrays provide synchronization procedures which allow you to ensure that access to co-array data in primary memory is coherent (reliable) across all images or a group of images called a team. That is, any image that modifies co-array data that is expected to be read by another image, or any image that reads data that is modified by another image, must call the co-array synchronization intrinsic functions to ensure valid data is accessed by other images.

10.14.11.1 SYNC_ALL

The SYNC_ALL intrinsic guarantees to all images executing a corresponding call to SYNC_ALL that the calling procedure has completed all preceding accesses to co-array data. The access must either be a direct read or write of the data or a procedure call that references the data. The SYNC_ALL intrinsic returns when all images have made a corresponding call to SYNC_ALL.
For example, consider the following subroutine: SUBROUTINE TST(A,B,C,D,N,IP) REAL :: A(N)[*], B(N)[*], C(N)[*], D(N) A(:) = B(:)[IP] CALL SUB1(C,N) D(:) = 0.0 CALL SYNC_ALL() END 220 S–3901–60Cray Fortran Language Extensions [10] When an image executes the SYNC_ALL call as in the preceding example, it guarantees to all images executing a corresponding SYNC_ALL call that its access to A and B are complete and that all accesses to C by SUB1 are complete. It does not guarantee that its accesses to D are complete, since D is not declared as a co-array. This is true even if the actual argument for D is a co-array. Access behavior to the same data by different images without such corresponding synchronization calls is undefined. This is the syntax of the SYNC_ALL intrinsic: CALL SYNC_ALL([wait]) Calling SYNC_ALL without the wait argument is the same as calling SYNC_TEAM(all), where all has the value (/ (I,I=1,num_images()) /). Calling SYNC_ALL(wait) is the same as calling SYNC_TEAM(all, wait). See Section 10.14.11.2, page 221 for more information about the wait argument. Calling SYNC_ALL implies a call to SYNC_MEMORY function. 10.14.11.2 SYNC_TEAM The SYNC_TEAM intrinsic function can be used to synchronize a subset (or team) of images. The syntax is: CALL SYNC_TEAM(team [,wait]) The team argument specifies the members of the team. It has the INTENT(IN) attribute and can be either an integer array of rank one or an integer scalar. To create a team of two or more, pass an integer array containing the image numbers of all members of the team, including the image calling SYNC_TEAM. Valid values for each element are from 1 through NUM_IMAGES inclusive. The array must not contain duplicate image numbers. This example synchronizes a team consisting of images 2, 4, and 6: Note: The calling image must be either image 2, 4 or 6. 
    CALL SYNC_TEAM((/2,4,6/))

You can also pass an integer scalar to create a team of two where the image calling SYNC_TEAM is an implied member of the team and the scalar integer specifies the image number of the other team member.

For example, this code synchronizes a team consisting of the executing image and image 4:

    CALL SYNC_TEAM(4)

The presence of the optional wait argument tells an image to wait only for a subset of the team members to make a corresponding call. The wait argument has the INTENT(IN) attribute and is either an integer array or integer scalar. For the array case, it contains the numbers of the images to wait for. It should contain no duplicate entries. The scalar case is treated as if the argument were the array (/wait/).

For example, this code synchronizes a team consisting of images 2, 4, 6, and 8, while waiting for images 4 and 6:

    CALL SYNC_TEAM((/2,4,6,8/), (/4,6/))

All images participating in a SYNC_TEAM call must call with identical arguments. Otherwise the results are undefined.

10.14.11.3 SYNC_MEMORY

The SYNC_MEMORY intrinsic guarantees to other images that the calling image has completed all preceding accesses to co-array data. This is the syntax of the intrinsic:

    CALL SYNC_MEMORY()

10.14.11.4 START_CRITICAL and END_CRITICAL

The START_CRITICAL and END_CRITICAL intrinsics mark the beginning and end of a critical region. Only one image at a time may execute statements in a critical region. If an image executes a START_CRITICAL intrinsic while another image is in the critical region, it waits. Also, both intrinsics, like the SYNC_MEMORY intrinsic, ensure that the calling image has completed all preceding accesses to co-array data.
This is the syntax of the intrinsics:

    CALL START_CRITICAL()
    CALL END_CRITICAL()

Example 4: Using START CRITICAL and END CRITICAL

Source code:

    program critical
    implicit none
    real :: sum_local, median_local
    integer:: mype

    ! Distribute work and calculate sum and median values for each image
    ! For the sake of simplicity, we just assign values to sum_local and
    ! median_local
    mype = this_image()
    select case(mype)
    case (1)
      sum_local = 1000.
      median_local = 1234.
    case(2)
      sum_local = 2000.
      median_local = 2345.
    end select

    ! By putting these write statements in a critical region, you will get
    ! readable contiguous output on stdout. Without the critical region,
    ! lines of output from various images could be intermixed and unreadable.
    call start_critical()
    write (*,*) "********** Results for end of pass 1 on image ",mype," *********"
    write (*,*) " sum = ",sum_local
    write (*,*) "median = ",median_local
    write (*,*) "-----------------------------------------------------------"
    write (*,*)
    call end_critical()
    end program

Commands to compile and run program:

    % ftn -o caf_critical -Z caf_critical.ftn
    % module load pbs
    % qsub -I -l mppe=2
    qsub: waiting for job to start
    % aprun -n 2 /ptmp/user1/caf_critical

Output:

    ********** Results for end of pass 1 on image 2 *********
    sum = 2000.
    median = 2345.
    -----------------------------------------------------------

    ********** Results for end of pass 1 on image 1 *********
    sum = 1000.
    median = 1234.
    -----------------------------------------------------------

10.14.11.5 SYNC_FILE

To synchronize file accesses among images, use the SYNC_FILE intrinsic. The intrinsic flushes data to a file to ensure that all images have access to valid data. The intrinsic affects only the I/O unit connected to an image. If the unit is not connected or does not exist, the intrinsic has no effect.
If the unit is connected for sequential access, a call to SYNC_FILE causes all WRITE requests to advance input or output.

This is the syntax of SYNC_FILE:

    CALL SYNC_FILE(unit)

The unit argument is a scalar integer with the INTENT(IN) attribute. It specifies a Fortran I/O unit.

10.14.12 I/O with Co-arrays

An image can perform input only on the portion of a co-array that is local to that image. An image can perform output on any portion of a co-array. For example:

    REAL :: X(100)[*]
    ...
    READ *, X(I)        ! PERMITTED: LOCAL CO-ARRAY REFERENCE
    READ *, X(I)[IP]    ! NOT PERMITTED: BRACKET CO-ARRAY REFERENCE
    PRINT *, X(I)[IP]   ! PERMITTED: OUTPUT OF BRACKET CO-ARRAY REFERENCE

Each image has its own set of independent I/O units. A file can be opened on one image when it is already open on another, but only the BLANK, DELIM, PAD, ERR, and IOSTAT specifiers can have values that differ from those in effect on other images.

Caution: For a unit identified by an asterisk (*) in a READ or WRITE statement, there is a single position for all images. Only one image executes a statement for such a unit at any one time. The system introduces synchronization when necessary. Otherwise, each image positions each file independently. If the access order is important, the program must provide its own synchronization between images.

10.15 Compiling and Executing Programs Containing Co-arrays

There are various commands, tools, and products available in the programming environment to use for compiling and executing programs containing co-arrays.

10.15.1 ftn and aprun Options Affecting Co-arrays

The -Z compiler option on the ftn command line must be specified in order for co-array syntax to be recognized and translated. Otherwise, the co-array syntax generates ERROR messages.
Upon execution of an a.out file that has been compiled and loaded with the -Z option, an image is created and executed on every processing element assigned to the job. Images 1 through NUM_IMAGES are assigned to processing elements 0 through N$PES-1, consecutively.

You can set the number of processing elements assigned to a job at compile time by specifying the -X option on the ftn command. The number of processing elements can also be set at run time by executing the a.out file with the aprun command and its -n option.

(X1 only) Processing elements are either MSPs or SSPs. To run the images on SSPs, you must specify the -O ssp compiler option. To run on MSPs, do not specify this option. For more information about SSP and MSP mode, see Section 3.19.21, page 55.

Bounds checking is enabled by specifying the -Rb option on the ftn command line. This feature is not implemented for co-dimensions of co-arrays.

For more information about the ftn and aprun commands, see the ftn(1) and aprun(1) man pages.

10.15.2 Using the CrayTools Tool Set with Co-array Programs

The CrayTools tool set, which includes TotalView and the Cray performance analyzer tool (CrayPat), does not contain special support for co-arrays and does not support the bracket notation. In most cases, however, these tools can still be used effectively to analyze programs containing co-arrays. The following sections discuss issues related to the interaction of these tools with programs containing co-arrays.

10.15.2.1 Debugging Programs Containing Co-arrays (Deferred implementation)

The totalview debugger does not support the bracket notation. Co-arrays generally appear as their corresponding local object with co-dimensions stripped off. Co-array data can be viewed and referenced by switching the totalview Process window to the processing element corresponding to the desired image and accessing the co-array with local references.
10.15.2.2 Analyzing Co-array Program Performance

To the CrayTools performance tools, which include CrayPat, co-arrays generally appear as their corresponding local object with co-dimensions stripped off. For more information about CrayPat, see Optimizing Applications on Cray X1 Series Systems.

Caution: References to co-arrays on different images appear to the performance tools as local data references. This may skew the remote reference statistics of these tools.

10.15.3 Interoperating with Other Message Passing and Data Passing Models

Co-arrays can interoperate with all other message and data passing models: MPI and SHMEM. This allows you to introduce co-arrays into existing application codes incrementally.

These models are implemented through procedure calls, so the language interaction between co-arrays and these models is well defined. For more information about passing co-arrays to procedure calls, see Section 10.14.5, page 216.

Caution: MPI and SHMEM generally use processing element numbers, which start at zero, but the co-array model generally deals with image numbers, which start at one. For information about the mapping between processing elements and image numbers, see Section 10.15.1, page 225.

Co-arrays are symmetric for the purposes of SHMEM programming. Pointers in co-arrays of derived type, however, may not necessarily point to symmetric data.

For more information about the other message passing and data passing models, see one of the following publications:

• Message Passing Toolkit Release Overview
• intro_shmem(3) command and man page

10.15.4 Optimizing Programs with Co-arrays

Programs containing co-arrays benefit from all the usual steps you can take to improve run-time performance of code that runs on a single image.

Loops containing references to co-arrays can and should be vectorized. On UNICOS/mp systems such loops may also be multistreamed.
On UNICOS/mp systems, if a co-array vector memory reference references multiple images, you may receive a "No Forward Progress" exception. In this case, you should try vectorizing along a different dimension of the co-array or running the application in accelerated mode (aprun -A).

Obsolete Features [11]

The Cray Fortran compiler supports legacy features to allow the continued use of existing codes. In general, these features should not be used in new codes. The obsolete features are divided into two groups. The first is the set of features identified in Annex B of the Fortran standard as deleted. These were part of the Fortran language but their usage is explicitly discouraged in new codes. The second group is the set of legacy extensions supported in the Cray compiler for which preferred alternatives now exist. The obsolete features and their preferred alternatives are listed in Table 22.

Table 22. Obsolete Features and Preferred Alternatives

Obsolete feature                            Preferred alternative
------------------------------------------  ------------------------------------------
IMPLICIT UNDEFINED                          IMPLICIT NONE
Type statements with *n                     Type statements with KIND= parameters
BYTE data type                              INTEGER(KIND=1)
DOUBLE COMPLEX statement                    COMPLEX statement with KIND parameter
STATIC attribute and statement              SAVE attribute and statement
Slash data initialization                   Standard initialization syntax
DATA statement features                     Standard conforming DATA statements
Hollerith data                              Character data
PAUSE statement                             READ statement
ASSIGN, assigned GOTO statements and        Standard branching constructs
  assigned format specifiers
Two-branch IF statements                    IF construct or statement
Real and double precision DO variables      Integer DO variables
Nested loop termination                     Separate END DO statements
Branching into a block                      Restructure code
ENCODE and DECODE statements                WRITE and READ with internal file
BUFFER IN and BUFFER OUT statements         Asynchronous I/O statements
Asterisk character constant delimiters      Use standard character delimiters
Negative-valued X descriptor                TL descriptor
A descriptor used for noncharacter          Character type and other conventional
  conventional data and R descriptor          matchings of data and descriptors
H edit descriptor                           Character constants
Obsolete intrinsic procedures               For list and replacements, see
                                              Section 11.21, page 250
Initialization using long strings           Replace the numeric target with a
                                              character item. Replace a Hollerith
                                              constant with a character constant

11.1 IMPLICIT UNDEFINED

The Cray Fortran compiler accepts the IMPLICIT UNDEFINED statement. It is equivalent to the IMPLICIT NONE statement.

11.2 Type statement with *n

The Cray Fortran compiler defines the following additional forms of type_declaration_stmt:

    type_spec  is  INTEGER* length_value
               or  REAL* length_value
               or  DOUBLE PRECISION* length_value
               or  COMPLEX* length_value
               or  LOGICAL* length_value

• length_value is the size of the data object in bytes.

Data type declarations that include the data length are outmoded. The Cray Fortran compiler recognizes this usage in type statements, IMPLICIT statements, and FUNCTION statements, mapping these numbers onto kind values appropriate for the target machine.

11.3 BYTE Data Type

The BYTE statement and data type declares a 1-byte value. This data type is equivalent to the INTEGER(KIND=1) and INTEGER*1 declarations.

11.4 DOUBLE COMPLEX Statement

The DOUBLE COMPLEX statement is used to declare an item to be of type double complex. The format for the DOUBLE COMPLEX statement is as follows:

    DOUBLE COMPLEX [ , attribute-list :: ] entity-list

Items declared as DOUBLE COMPLEX contain two double precision entities.

When the -dp option is in effect, double complex entities are affected as follows:

• The nonstandard DOUBLE COMPLEX declaration is treated as a single-precision complex type.
• Double precision intrinsic procedures are changed to the corresponding single-precision intrinsic procedures.

The -ep or -dp specification is used for all source files compiled with a single invocation of the Cray Fortran compiler command. If a module is compiled separately from a program unit that uses the module, they both shall be compiled with the same -ep or -dp specification.

11.5 STATIC Attribute and Statement

The STATIC attribute and statement provides the same effect as the SAVE attribute and statement. Variables with the Cray Fortran STATIC attribute retain their value and their definition, association, and allocation status after the subprogram in which they are declared completes execution. Variables without this attribute cannot be depended on to retain their value and status, although the Cray Fortran compiler treats named common blocks as if they had this attribute. This attribute should always be specified for an object or the object's named common block, if it is necessary for the object to retain its value and status.

In Cray's implementation, the system retains the value of an object that is in a module whether or not the STATIC specifier is used.

Objects declared in recursive subprograms can be given the attribute. Such objects are shared by all instances of the subprogram.

Any object that is data initialized (in a DATA statement or a type declaration statement) has the STATIC attribute by default.

The following is a format for a type declaration statement with the attribute:

    type, STATIC [, attribute-list ] :: entity-decl-list

    static-stmt    is  STATIC [ [ :: ] static-entity-list ]
    static-entity  is  data-object-name
                   or  / common-block-name /

A statement without an entity list is treated as though it contained the names of all items that could be saved in the scoping unit. The Cray Fortran compiler allows you to insert multiple statements without entity lists in a scoping unit.
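As an illustration of the retention behavior described above, the following is a minimal sketch that uses the standard SAVE attribute, which this section names as the equivalent of STATIC. The program and function names (save_demo, next_count) are hypothetical and chosen only for this example.

```fortran
program save_demo
  implicit none
  integer :: i

  ! Call the function three times; the saved counter survives each return.
  do i = 1, 3
    print *, next_count()
  end do

contains

  integer function next_count()
    ! SAVE is the standard spelling of the Cray STATIC attribute.
    ! The initialization alone would also imply the attribute.
    integer, save :: calls = 0

    calls = calls + 1
    next_count = calls
  end function next_count

end program save_demo
```

Because calls keeps its value between invocations, the three calls return 1, 2, and 3 in that order.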
If STATIC appears in a main program as an attribute or a statement, it has no effect. The following objects must not be saved:

• A procedure
• A function result
• A dummy argument
• A named constant
• An automatic data object
• An object in a common block
• A namelist group

A variable in a common block cannot be saved individually; the entire named common block must be saved if you want any variables in it to be saved.

A named common block saved in one scoping unit of a program is saved throughout the program.

If a named common block is specified in a main program, it is available to any scoping unit of the program that specifies the named common block; it does not need to be saved.

The statement also confers the attribute. It is subject to the same rules and restrictions as the attribute.

The following example shows an entity-oriented declaration:

    CHARACTER(LEN = 12), SAVE :: NAME
    CHARACTER(LEN = 12), STATIC :: NAME

The following example shows an attribute-oriented declaration:

    CHARACTER*12 NAME
    STATIC NAME    !Use SAVE OR STATIC, but not both on the same name

The following example shows saving objects and named common blocks:

    STATIC A, B, /BLOCKA/, C, /BLOCKB/

11.6 Slash Data Initialization

The Fortran type declaration statements provide a means for data initialization. For example, the following two methods are standard means for initializing integer data:

• Method 1:

    INTEGER :: I=3

• Method 2:

    INTEGER I
    DATA I /3/

The Cray Fortran compiler supports an additional method for each data type. The following example shows the additional, nonstandard method, used to define integer data:

• Method 3:

    INTEGER [::] I /3/

11.7 DATA Statement Features

The DATA statement has the following outmoded features:

• A constant need not exist for each element of a whole array named in a data-stmt-object-list if the array is the last item in the list.
• A Hollerith or character constant can initialize more than one element of an integer or single-precision real array if the array is specified without subscripts.

Example 1: If the -s default32 compiler option is used (default) and an array is declared by INTEGER A(2), the following DATA statements have the same effect:

    DATA A /'12345678'/
    DATA A /'1234','5678'/

Example 2: If the -s default64 compiler option is specified and an array is declared by INTEGER A(2), the following DATA statements have the same effect:

    DATA A /'1234567890123456'/
    DATA A /'12345678','90123456'/

An integer or single-precision real array can be defined in the same way in a DATA implied-DO statement.

11.8 Hollerith Data

Before the character data type was added to the Fortran 77 standard, Hollerith data provided a method of supplying character data.

11.8.1 Hollerith Constants

A Hollerith constant is expressed in one of three forms. The first of these is specified as a nonzero integer constant followed by the letter H, L, or R and as many characters as equal the value of the integer constant. The second form of Hollerith constant specification delimits the character sequence between a pair of apostrophes followed by the letter H, L, or R. The third form is like the second, except that quotation marks replace apostrophes. For example:

    Character sequence:  ABC 12
    Form 1:              6HABC 12
    Form 2:              'ABC 12'H
    Form 3:              "ABC 12"H

Two adjacent apostrophes or quotation marks appearing between delimiting apostrophes or quotation marks are interpreted and counted by the compiler as a single apostrophe or quotation mark within the sequence. Thus, the sequence DON'T USE "*" would be specified with apostrophe delimiters as 'DON''T USE "*"'H, and with quotation mark delimiters as "DON'T USE ""*"""H.

Each character of a Hollerith constant is represented internally by an 8-bit code, with up to 32 such codes allowed.
This limit corresponds to the size of the largest numeric type, COMPLEX(KIND = 16). The ultimate size and makeup of the Hollerith data depends on the context. If the Hollerith constant is larger than the size of the type implied by context, the constant is truncated to the appropriate size. If the Hollerith constant is smaller than the size of the type implied by context, the constant is padded with a character dependent on the Hollerith indicator.

When an H Hollerith indicator is used, the truncation and padding is done on the right end of the constant. The pad character is the blank character code (hexadecimal 20).

Null codes can be produced in place of blank codes by substituting the letter L for the letter H in the Hollerith forms described above. The truncation and padding is also done on the right end of the constant, with the null character code (00) as the pad character.

Using the letter R instead of the letter H as the Hollerith indicator means truncation and padding is done on the left end of the constant with the null character code (00) used as the pad character.

All of the following Hollerith constants yield the same Hollerith constant and differ only in specifying the content and placement of the unused portion of the single 64-bit entity containing the constant (pad codes are shown in hexadecimal):

    Hollerith     Internal byte, beginning on bit:
    constant      0    8    16   24   32   40   48   56
    ------------  ---  ---  ---  ---  ---  ---  ---  ---
    6HABCDEF      A    B    C    D    E    F    20   20
    'ABCDEF'H     A    B    C    D    E    F    20   20
    "ABCDEF"H     A    B    C    D    E    F    20   20
    6LABCDEF      A    B    C    D    E    F    00   00
    'ABCDEF'L     A    B    C    D    E    F    00   00
    "ABCDEF"L     A    B    C    D    E    F    00   00
    6RABCDEF      00   00   A    B    C    D    E    F
    'ABCDEF'R     00   00   A    B    C    D    E    F
    "ABCDEF"R     00   00   A    B    C    D    E    F

A Hollerith constant is limited to 32 characters except when specified in a CALL statement, a function argument list, or a DATA statement. An all-zero computer word follows the last word containing a Hollerith constant specified as an actual argument in an argument list.
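The padding and truncation rules above can be mimicked with standard character operations. The following is a minimal sketch, not compiler behavior: it assumes ASCII character codes and a single 8-byte word, and the helper names (layout, show) are hypothetical.

```fortran
program hollerith_layout
  implicit none
  integer, parameter :: w = 8   ! bytes in one 64-bit word (assumption)

  call show('H', layout('ABCDEF', 'H'))
  call show('L', layout('ABCDEF', 'L'))
  call show('R', layout('ABCDEF', 'R'))

contains

  ! Hypothetical helper: fill one word the way each indicator describes.
  function layout(text, indicator) result(word)
    character(len=*), intent(in) :: text
    character(len=1), intent(in) :: indicator
    character(len=w) :: word
    integer :: n

    n = min(len(text), w)
    select case (indicator)
    case ('L')      ! left-justified, null padded, truncated on the right
      word = repeat(achar(0), w)
      word(1:n) = text(1:n)
    case ('R')      ! right-justified, null padded, truncated on the left
      word = repeat(achar(0), w)
      word(w-n+1:w) = text(len(text)-n+1:len(text))
    case default    ! 'H': left-justified, blank padded, truncated on the right
      word = text
    end select
  end function layout

  ! Print the indicator and the eight byte codes in hexadecimal.
  subroutine show(indicator, word)
    character(len=1), intent(in) :: indicator
    character(len=w), intent(in) :: word
    integer :: i

    print '(a, ":", 8(1x, z2.2))', indicator, (iachar(word(i:i)), i = 1, w)
  end subroutine show

end program hollerith_layout
```

With ASCII codes, the letters A through F print as 41 through 46; the H line ends in 20 20, the L line ends in 00 00, and the R line begins with 00 00, matching the byte positions in the table.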
A character constant of 32 or fewer characters is treated as if it were a Hollerith constant in situations where a character constant is not allowed by the standard but a Hollerith constant is allowed by the Cray Fortran compiler. If the character constant appears in a DATA statement value list, it can be longer than 32 characters.

11.8.2 Hollerith Values

A Hollerith value is a Hollerith constant or a variable that contains Hollerith data. A Hollerith value is limited to 32 characters.

A Hollerith value can be used in any operation in which a numeric constant can be used. It can also appear on the right-hand side of an assignment statement in which a numeric constant can be used. It is truncated or padded to be the correct size for the type implied by the context.

11.8.3 Hollerith Relational Expressions

Used with a relational operator, the Hollerith value e1 is less than e2 if its value precedes the value of e2 in the collating sequence and is greater if its value follows the value of e2 in the collating sequence.

The following examples are evaluated as true if the integer variable LOCK contains the Hollerith characters K, E, and Y in that order and left-justified with five trailing blank character codes:

    3HKEY.EQ.LOCK
    'KEY'.EQ.LOCK
    LOCK.EQ.LOCK
    'KEY1'.GT.LOCK
    'KEY0'H.GT.LOCK

11.9 PAUSE Statement

Execution of a PAUSE statement requires operator or system-specific intervention to resume execution. In most cases, the same functionality can be achieved as effectively and in a more portable way with the use of an appropriate READ statement that awaits some input data.

The execution of the PAUSE statement suspends the execution of a program. This is now redundant, because a WRITE statement can be used to send a message to any device, and a READ statement can be used to wait for and receive a message from the same device.
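The WRITE and READ replacement described above might look like the following minimal sketch. The message text and the program name are illustrative assumptions; the prompt goes to the standard error unit only to mirror where this compiler sends the PAUSE stop code.

```fortran
program pause_replacement
  use, intrinsic :: iso_fortran_env, only: error_unit
  implicit none
  character(len=1) :: reply
  integer :: ios

  ! Send the message that PAUSE would have carried as its stop code.
  write (error_unit, '(a)') 'Wait #823: press Enter to continue'

  ! Wait for operator input; a nonzero ios (for example, end of file)
  ! simply lets the program continue.
  read (*, '(a)', iostat=ios) reply

  write (*, '(a)') 'Resuming execution'
end program pause_replacement
```

Unlike PAUSE, this form needs no operator or system-specific intervention beyond supplying a line of input.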
The PAUSE statement is defined as follows:

    pause-stmt  is  PAUSE [ stop-code ]

The character constant or list of digits identifying the PAUSE statement is called the stop-code because it follows the same rules as those for the STOP statement's stop code. The stop code is accessible following program suspension. The Cray Fortran compiler sends the stop-code to the standard error file (stderr). The following are examples of PAUSE statements:

    PAUSE
    PAUSE 'Wait #823'
    PAUSE 100

11.10 ASSIGN, Assigned GO TO Statements, and Assigned Format Specifiers

The ASSIGN statement assigns a statement label to an integer variable. During program execution, the variable can be assigned labels of branch target statements, providing a dynamic branching capability in a program. The unsatisfactory property of these statements is that the integer variable name can be used to hold both a label and an ordinary integer value, leading to errors that can be hard to discover and programs that can be difficult to read.

A frequent use of the ASSIGN statement and assigned GO TO statement is to simulate internal procedures, using the ASSIGN statement to record the return point after a reusable block of code has completed. The internal procedure mechanism of Fortran now provides this capability.

A second use of the ASSIGN statement is to simulate dynamic format specifications by assigning labels corresponding to different format statements to an integer variable and using this variable in I/O statements as a format specifier. This use can be accomplished in a clearer way by using character strings as format specifications. Thus, it is no longer necessary to use either the ASSIGN statement or the assigned GO TO statement.

Execution of an ASSIGN statement causes the variable in the statement to become defined with a statement label value.

When a numeric storage unit becomes defined, all associated numeric storage units of the same type become defined.
Variables associated with the variable in an ASSIGN statement, however, become undefined as integers when the ASSIGN statement is executed. When an entity of double precision real type becomes defined, all totally associated entities of double precision real type become defined.

Execution of an ASSIGN statement causes the variable in the statement to become undefined as an integer. Variables that are associated with the variable also become undefined.

11.10.1 Form of the ASSIGN and Assigned GO TO Statements

Execution of an ASSIGN statement assigns a label to an integer variable. Subsequently, this value can be used by an assigned GO TO statement or by an I/O statement to reference a FORMAT statement. The ASSIGN statement is defined as follows:

    assign-stmt  is  ASSIGN label TO scalar-int-variable

The term default integer type in this section means that the integer variable shall occupy a full word in order to be able to hold the address of the statement label. Programs that contain an ASSIGN statement and are compiled with -s default32 shall ensure that the scalar-int-variable is declared as INTEGER(KIND=8). This ensures that it occupies a full word.

The variable shall be a named variable of default integer type. It shall not be an array element, an integer component of a structure, or an object of nondefault integer type.

The label shall be the label of a branch target statement or the label of a FORMAT statement in the same scoping unit as the ASSIGN statement.

When defined with an integer value, the integer variable cannot be used as a label.

When assigned a label, the integer variable cannot be used as anything other than a label.

When the integer variable is used in an assigned GO TO statement, it shall be assigned a label.
As the following example shows, the variable can be redefined during program execution with either another label or an integer value:

    ASSIGN 100 TO K

Execution of the assigned GO TO statement causes a transfer of control to the branch target statement with the label that had previously been assigned to the integer variable.

The assigned GO TO statement is defined as follows:

    assigned-goto-stmt  is  GO TO scalar-int-variable [ [ , ] (label-list) ]

The variable shall be a named variable of default integer type. That is, it shall not be an array element, a component of a structure, or an object of nondefault integer type.

The variable shall be assigned the label of a branch target statement in the same scoping unit as the assigned GO TO statement.

If a label list appears, such as in the following examples, the variable shall have been assigned a label value that is in the list:

    GO TO K
    GO TO K (10, 20, 100)

The ASSIGN statement also allows the label of a FORMAT statement to be dynamically assigned to an integer variable, which can later be used as a format specifier in READ, WRITE, or PRINT statements. This hinders readability, permits inconsistent usage of the integer variable, and can be an obscure source of error.

This functionality is available through character variables, arrays, and constants.

11.10.2 Assigned Format Specifiers

When an I/O statement containing the integer variable as a format specifier is executed, the integer variable can be defined with the label of a FORMAT specifier.

11.11 Two-branch IF Statements

Outmoded IF statements are the two-branch arithmetic IF and the indirect logical IF.

11.11.1 Two-branch Arithmetic IF

A two-branch arithmetic IF statement transfers control to statement s1 if expression e is evaluated as nonzero or to statement s2 if e is zero.
The arithmetic expression should be replaced with a relational expression, and the statement should be changed to an IF statement or an IF construct. This format is as follows:

    IF ( e ) s1 , s2

    e    Integer, real, or double precision expression
    s    Label of an executable statement in the same program unit

Example:

    IF (I+J*K) 100,101

11.11.2 Indirect Logical IF

An indirect logical IF statement transfers control to statement st if logical expression le is true and to statement sf if le is false. An IF construct or an IF statement should be used in place of this outmoded statement. This format is as follows:

    IF ( le ) st , sf

    le        Logical expression
    st , sf   Labels of executable statements in the same program unit

Example:

    IF(X.GE.Y)148,9999

11.12 Real and Double Precision DO Variables

The Cray Fortran compiler allows real variables and values as the DO variable and limits in DO statements. The preferred alternative is to use integer values and compute the desired real value.

11.13 Nested Loop Termination

Older Cray Fortran compilers allowed nested DO loops to terminate on a single END DO statement if the END DO statement had a statement label. The END DO statement is included in the Fortran standard. The Fortran standard specifies that a separate END DO statement shall be used to terminate each DO loop, so allowing nested DO loops to end on a single, labeled END DO statement is an outmoded feature.

11.14 Branching into a Block

Although the standard does not permit branching into the code block for a DO construct from outside of that construct, the Cray Fortran compiler permits branching into the code block for a DO or DO WHILE construct. By default, the Cray Fortran compiler issues an error for this situation. Cray does not recommend branching into a DO construct, but if you specify the ftn -eg command, the code will compile.
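The preferred alternatives named in Sections 11.11 through 11.13 can be sketched in standard Fortran as follows. This is an illustrative sketch only: the variable values, the printed messages, and the real loop bounds shown in the comment are assumptions made for the example.

```fortran
program preferred_alternatives
  implicit none
  integer :: i, j, k, n
  real :: x, y, t

  i = 1; j = 2; k = 3
  ! Instead of the two-branch arithmetic IF:  IF (I+J*K) 100,101
  if (i + j*k /= 0) then
    print *, 'expression is nonzero'
  else
    print *, 'expression is zero'
  end if

  x = 2.0; y = 1.5
  ! Instead of the indirect logical IF:  IF(X.GE.Y)148,9999
  if (x >= y) then
    print *, 'x is greater than or equal to y'
  else
    print *, 'x is less than y'
  end if

  ! Instead of a real DO variable (for example DO T = 0.0, 1.0, 0.25),
  ! count with an integer and compute the real value from it.
  do n = 0, 4
    t = 0.25 * real(n)
    print *, t
  end do

  ! Each nested loop is terminated by its own END DO.
  do i = 1, 2
    do j = 1, 3
      print *, i, j
    end do
  end do
end program preferred_alternatives
```

Computing the real value from an integer counter also fixes the trip count at five, which a real DO variable does not guarantee when the increment is not exactly representable.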
11.15 ENCODE and DECODE Statements

A formatted I/O operation defines entities by transferring data between I/O list items and records of a file. The file can be on an external medium or in internal storage.

The Fortran standard provides READ and WRITE statements for both formatted external and internal file I/O. This is the preferred method for formatted internal file I/O. It is the only method for list-directed internal file I/O.

The ENCODE and DECODE statements are an alternative to standard Fortran READ and WRITE statements for formatted internal file I/O.

An internal file in standard Fortran I/O shall be declared as character, while the internal file in ENCODE and DECODE statements can be any data type. A record in an internal file in standard Fortran I/O is either a scalar character variable or an array element of a character array. The record size in an internal file in an ENCODE or DECODE statement is independent of the storage size of the variable used as the internal file. If the internal file is a character array in standard Fortran I/O, multiple records can be read or written with internal file I/O. The alternative form does not provide the multiple record capability.

11.15.1 ENCODE Statement

The ENCODE statement provides a method of converting or encoding the internal representation of the entities in the output list to a character representation. The format of the ENCODE statement is as follows:

    ENCODE ( n, f, dest ) [ elist ]

    n       Number of characters to be processed. Nonzero integer expression not to exceed the maximum record length for formatted records. This is the record size for the internal file.
    f       Format identifier. It cannot be an asterisk.
    dest    Name of internal file. It can be a variable or array of any data type. It cannot be an array section, a zero-sized array, or a zero-sized character variable.
    elist   Output list to be converted to character during the ENCODE statement.
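For comparison, the standard internal-file form that Section 11.15 describes as the preferred method might look like the following minimal sketch. The variable names and data values are illustrative assumptions, and the internal file is a character variable, as the standard requires.

```fortran
program internal_file_io
  implicit none
  character(len=20) :: record      ! the internal file: one 20-character record
  character(len=8)  :: words(5)
  character(len=5)  :: fields(4)
  integer :: i

  words = [character(len=8) :: 'THIS', 'MUST', 'HAVE', 'FOUR', 'CHAR']

  ! Internal WRITE: A4 keeps the leftmost four characters of each element.
  write (record, '(5A4)') words
  print '(3a)', '"', record, '"'

  ! Internal READ: split a 20-character record into four 5-character fields.
  record = 'WHILETHIS HAS  FIVE'
  read (record, '(4A5)') fields
  do i = 1, 4
    print '(a, i2, 3a)', 'fields(', i, ')="', fields(i), '"'
  end do
end program internal_file_io
```

The internal WRITE leaves THISMUSTHAVEFOURCHAR in record, and the internal READ yields the fields WHILE, THIS, HAS, and FIVE, each blank padded to five characters.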
The output list items are converted using format f to produce a sequence of n characters that are stored in the internal file dest. The n characters are packed 8 characters per word.

An ENCODE statement transfers one record of length n to the internal file dest. If format f attempts to write a second record, ENCODE processing repositions the current record position to the beginning of the internal file and begins writing at that position.

An error is issued when the ENCODE statement attempts to write more than n characters to the record of the internal file. If dest is a noncharacter entity and n is not a multiple of 8, the last word of the record is padded with blanks to a word boundary. If dest is a character entity, the last word of the record is not padded with blanks to a word boundary.

Example 1: The following example assumes a machine word length of 64 bits and uses the underscore character (_) as a blank:

      INTEGER ZD(5), ZE(3)
      ZD(1)='THIS____'
      ZD(2)='MUST____'
      ZD(3)='HAVE____'
      ZD(4)='FOUR____'
      ZD(5)='CHAR____'
    1 FORMAT(5A4)
      ENCODE(20,1,ZE)ZD
      DO 10 I=1,3
         PRINT 2,'ZE(',I,')="',ZE(I),'"'
   10 CONTINUE
    2 FORMAT(A,I2,A,A8,A)
      END

The output is as follows:

    >ZE( 1)="THISMUST"
    >ZE( 2)="HAVEFOUR"
    >ZE( 3)="CHAR____"

11.15.2 DECODE Statement

The DECODE statement provides a method of converting or decoding from a character representation to the internal representation of the entities in the input list. The format of the DECODE statement is as follows:

    DECODE ( n, f, source ) [ dlist ]

n       Number of characters to be processed. Nonzero integer expression not to exceed the maximum record length for formatted records. This is the record size for the internal file.
f       Format identifier. It cannot be an asterisk.
source  Name of internal file. It can be a variable or array of any data type. It cannot be an array section, a zero-sized array, or a zero-sized character variable.
dlist   Input list to be converted from character during the DECODE statement.

The input list items are converted using format f from a sequence of n characters in the internal file source to an internal representation and stored in the input list entities. If the internal file source is noncharacter, the internal file is assumed to be a multiple of 8 characters.

Example 1: An example of a DECODE statement is as follows:

      INTEGER ZD(4), ZE(3)
      ZE(1)='WHILETHI'
      ZE(2)='S HAS  F'
      ZE(3)='IVE     '
    3 FORMAT(4A5)
      DECODE(20,3,ZE)ZD
      DO 10 I=1,4
         PRINT 2,'ZD(',I,')="',ZD(I),'"'
   10 CONTINUE
    2 FORMAT(A,I2,A,A8,A)
      END

The output is as follows:

    >ZD( 1)="WHILE   "
    >ZD( 2)="THIS    "
    >ZD( 3)="HAS     "
    >ZD( 4)="FIVE    "

11.16 BUFFER IN and BUFFER OUT Statements

You can use the BUFFER IN and BUFFER OUT statements to transfer data. Data can be transferred while allowing the subsequent execution sequence to proceed concurrently. This is called asynchronous I/O. Asynchronous I/O may require the use of nondefault file formats or FFIO layers, as discussed in Chapter 15, page 295. BUFFER IN and BUFFER OUT operations may proceed concurrently on several units or files. If they do not proceed asynchronously, they will use synchronous I/O.

BUFFER IN is for reading, and BUFFER OUT is for writing. A BUFFER IN or BUFFER OUT operation includes only data from a single array or a single common block.

Either statement initiates a data transfer between a specified file or unit (at the current record) and memory. If the unit or file is completing an operation initiated by any earlier BUFFER IN or BUFFER OUT statement, the current BUFFER IN or BUFFER OUT statement suspends the execution sequence until the earlier operation is complete. When the unit's preceding operation terminates, execution of the BUFFER IN or BUFFER OUT statement completes as if no delay had occurred.
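For comparison, the same pattern of starting a transfer, doing unrelated work, and then waiting can be written with the asynchronous I/O features of standard Fortran 2003 (the ASYNCHRONOUS= specifier and the WAIT statement). The sketch below is not taken from this manual and makes no claim about how BUFFER IN is implemented. The module name, the procedure name, the unit number 21, and the use of an unformatted stream scratch file are all assumptions made for illustration.

```fortran
module async_read_demo
  implicit none
contains

  ! Write n reals to a scratch file, read them back with an
  ! asynchronous READ, and block on WAIT before using the data.
  subroutine round_trip(n, total)
    integer, intent(in) :: n
    real, intent(out) :: total
    real, allocatable, asynchronous :: buf(:)
    real, allocatable :: src(:)
    integer, parameter :: u = 21   ! hypothetical unit number
    integer :: i

    allocate(src(n), buf(n))
    do i = 1, n
      src(i) = real(i)
    end do

    open(unit=u, status='scratch', form='unformatted', &
         access='stream', asynchronous='yes')
    write(u) src
    rewind(u)

    ! Start the transfer. The processor may overlap it with the
    ! statements that follow, or it may complete it synchronously.
    read(u, asynchronous='yes') buf

    ! Work that does not reference buf could be placed here.

    ! Block until the pending transfer on unit u has completed.
    wait(u)
    total = sum(buf)
    close(u)
  end subroutine round_trip

end module async_read_demo
```

The buffer must not be referenced between the READ and the WAIT, which is the same discipline BUFFER IN imposes on the array being filled.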
You can use the UNIT(3i) or LENGTH(3i) intrinsic procedures to delay the execution sequence until the BUFFER IN or BUFFER OUT operation is complete. These functions can also return information about the I/O operation at its termination.

The general format of the BUFFER IN and BUFFER OUT statements follows:

buffer_in_stmt    is  BUFFER IN (io_unit, mode) (start_loc, end_loc)
buffer_out_stmt   is  BUFFER OUT (io_unit, mode) (start_loc, end_loc)
io_unit           is  external_file_unit
                  or  file_name_expr
mode              is  scalar_integer_expr
start_loc         is  variable
end_loc           is  variable

In the preceding definition, the variable specified for start_loc and end_loc cannot be of a derived type if you are performing implicit data conversion. The data items between start_loc and end_loc must be of the same type.

The BUFFER IN and BUFFER OUT statements are defined as follows.

    BUFFER IN (io_unit, mode) (start_loc, end_loc)
    BUFFER OUT (io_unit, mode) (start_loc, end_loc)

io_unit
        An identifier that specifies a unit. The I/O unit is a scalar integer expression with a nonnegative value, an asterisk (*), or a character literal constant (external name). The I/O unit forms indicate that the unit is a formatted sequential access external unit.

mode
        Mode identifier. This integer expression controls the record position following the data transfer. The mode identifier is ignored on files that do not contain records; only full record processing is available.

start_loc, end_loc
        Symbolic names of the variables, arrays, or array elements that mark the beginning and ending locations of the BUFFER IN or BUFFER OUT operation. These names must be either elements of a single array (or equivalenced to an array) or members of the same common block. If start_loc or end_loc is of type character, then both must be of type character. If start_loc and end_loc are noncharacter, then the item length of each must be equal.
For example, if the internal length of the data type of start_loc is 64 bits, the internal length of the data type of end_loc must be 64 bits. To ensure that the size of start_loc and end_loc are the same, use the same data type for both.

The mode identifier, mode, controls the position of the record at unit io_unit after the data transfer is complete. The values of mode have the following effects:

• Specifying mode = 0 causes full record processing. File and record positioning works as with conventional I/O. The record position following such a transfer is always between the current record (the record with which the transfer occurred) and the next record. Specifying BUFFER OUT with mode = 0 ends a series of partial-record transfers.

• Specifying mode < 0 causes partial record processing. In BUFFER IN, the record is positioned to transfer its (n+1)th word if the nth word was the last transferred. In BUFFER OUT, the record is left positioned to receive additional words.

The amount of data to be transferred is specified in words without regard to types or formats. However, the data type of end_loc affects the exact ending location of a transfer. If end_loc is of a multiple-word data type, the location of the last word in its multiple-word form of representation marks the ending location of the data transfer.

BUFFER OUT with start_loc = end_loc + 1 and mode = 0 causes a zero-word transfer and concludes the record being created. Except for terminating a partial record, start_loc following end_loc in a storage sequence causes a run-time error.

Example:

      PROGRAM XFR
      DIMENSION A(1000), B(2,10,100), C(500)
      ...
      BUFFER IN(32,0) (A(1),A(1000))
      ...
      DO 9 J=1,100
         B(1,1,J) = B(1,1,J) + B(2,1,J)
    9 CONTINUE
      BUFFER IN(32,0) (C(1),C(500))
      BUFFER OUT(22,0) (A(1),A(1000))
      ...
      END

The first BUFFER IN statement in this example initiates a transfer of 1000 words from unit 32.
If asynchronous I/O is available, processing unrelated to that transfer proceeds. When this is complete, a second BUFFER IN is encountered, which causes a delay in the execution sequence until the last of the 1000 words is received. A transfer of another 500 words is initiated from unit 32 as the execution sequence continues. BUFFER OUT begins a transfer of the first 1000 words to unit 22. In all cases mode = 0, indicating full record processing.

11.17 Asterisk Delimiters

The asterisk was allowed to delimit a literal character constant. It has been replaced by the apostrophe and quotation mark.

    *h1 h2 ... hn*

*       Delimiter for a literal character string
h       Any ASCII character indicated by a C that is capable of internal representation

Example:

    *AN ASTERISK EDIT DESCRIPTOR*

11.18 Negative-valued X Descriptor

A negative value could be used with the X descriptor to indicate a move to the left. This has been replaced by the TL descriptor.

    [-b]X

b       Any nonzero, unsigned integer constant
X       Indicates a move of as many positions as indicated by b

Example:

    -55X   ! Moves current position 55 spaces left

11.19 A and R Descriptors for Noncharacter Types

The Rw descriptor and the use of the Aw descriptor for noncharacter data are available primarily for programs that were written before a true character type was available. Other uses include adding labels to binary files and the transfer of data whose type is not known in advance.

List items can be of type real, integer, complex, or logical. For character use, the binary form of the data is converted to or from ASCII codes. The numeric list item is assumed to contain ASCII characters when used with these edit descriptors.

Complex items use two storage units and require two A descriptors, for the first and second storage units respectively.

The Aw descriptor works with noncharacter list items containing character data in essentially the same way as described in the Fortran standard.
The Rw descriptor works like Aw with the following exceptions:

• Characters in an incompletely filled input list item are right-justified with the remainder of that list item containing binary zeros.

• Partial output of an output list item is from its rightmost character positions.

The following example shows the Aw and Rw edit descriptors for noncharacter data types:

      INTEGER IA
      LOGICAL LA
      REAL RA
      DOUBLE PRECISION DA
      COMPLEX CA
      CHARACTER*52 CHC
      CHC='ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
      READ(CHC,3) IA, LA, RA, DA, CA
    3 FORMAT(A4,A8,A10,A17,A7,A6)
      PRINT 4, IA, LA, RA, DA, CA
    4 FORMAT(1x,3(A8,'-'),A16,'-',2A8)
      READ(CHC,5) IA, LA, RA
    5 FORMAT(R2,R8,R9)
      PRINT 4, IA, LA, RA
      END

The output of this program would be as follows:

    > ABCD    -EFGHIJKL-OPQRSTUV-XYZabcdefghijklm-nopqrst uvwxyz
    > ooooooAB-CDEFGHIJ-LMNOPQRS

The arrow (>) indicates leading blanks in the use of the A edit descriptor. The lowercase letter o is used to indicate where binary zeros have been written with the R edit descriptor. The binary zeros are not printable characters, so the printed output simply contains the characters without the binary zeros.

11.20 H Edit Descriptor

This edit descriptor can be a source of error because the number of characters following the descriptor can be miscounted easily. The same functionality is available using the character constant edit descriptor, for which no count is required.

The following information pertains to the H edit descriptor:

Table 23. Summary of String Edit Descriptors

Descriptor   Description
H            Transfer of text character to output record
'text'       Transfer of a character literal constant to output record
"text"       Transfer of a character literal constant to output record

11.21 Obsolete Intrinsic Procedures

The Cray Fortran compiler supports many intrinsic procedures that have been used in legacy codes, but that are now obsolete.
The following table indicates the obsolete procedures and the preferred alternatives. For more information about a particular procedure, see its man page.

Table 24. Obsolete Procedures and Alternatives

Obsolete Intrinsic Procedure  Replacement
AND                           IAND
BITEST                        BTEST
BJTEST                        BTEST
BKTEST                        BTEST
CDABS                         ABS
CDCOS                         COS
CDEXP                         EXP
CDLOG                         LOG
CDSIN                         SIN
CDSQRT                        SQRT
CLOC                          LOC or C_LOC
COMPL                         NOT
COTAN                         COT
CQABS                         ABS
CQDEXP                        EXP
CQSIN                         SIN
CQSQRT                        SQRT
CSMG                          MERGE
CVMGM                         MERGE
CVMGN                         MERGE
CVMGP                         MERGE
CVMGZ                         MERGE
CVMGT                         MERGE
DACOSD                        ACOSD
DASIND                        ASIND
DATAN2D                       ATAN2D
DATAND                        ATAND
DCMPLX                        CMPLX
DCONJG                        CONJG
DCOSD                         COSD
DCOT                          COT
DCOTAN                        COT
DFLOAT                        REAL
DFLOATI                       REAL
DFLOATJ                       REAL
DFLOATK                       REAL
DIMAG                         AIMAG
DREAL                         REAL
DSIND                         SIND
DTAND                         TAND
EQV                           NOT, IEOR
FCD                           (none)
FLOATI                        REAL
FLOATJ                        REAL
FLOATK                        REAL
FP_CLASS                      IEEE_CLASS
IDATE                         DATE_AND_TIME
IEEE_REAL                     REAL
IIABS                         ABS
IIAND                         IAND
IIBCHNG                       IBCHNG
IIBCLR                        IBCLR
IIBITS                        IBITS
IIBSET                        IBSET
IIEOR                         IEOR
IIDIM                         DIM
IIDINT                        INT
IIFIX                         INT
IINT                          INT
IIOR                          IOR
IIQINT                        INT
IISHA                         SHIFTA
IISHC                         ISHFTC
IISHFT                        ISHFT
IISHFTC                       ISHFTC
IISHL                         ISHFT
IISIGN                        SIGN
IMAG                          AIMAG
IMOD                          MOD
ININT                         NINT
INT2                          INT
INT4                          INT
INT8                          INT
INOT                          NOT
IQNINT                        NINT
IRTC                          SYSTEM_CLOCK
ISHA                          SHIFTA
ISHC                          ISHFTC
ISHL                          ISHFT
ISNAN                         IEEE_IS_NAN
JDATE                         DATE_AND_TIME
JFIX                          INT
JIABS                         ABS
JIAND                         IAND
JIBCHNG                       IBCHNG
JIBCLR                        IBCLR
JIBITS                        IBITS
JIBSET                        IBSET
JIEOR                         IEOR
JIDIM                         DIM
JIDINT                        INT
JIFIX                         INT
JINT                          INT
JIOR                          IOR
JIQINT                        INT
JISHA                         SHIFTA
JISHC                         ISHFTC
JISHFT                        ISHFT
JISHFTC                       ISHFTC
JISHL                         ISHFT
JISIGN                        SIGN
JMOD                          MOD
JNINT                         NINT
JNOT                          NOT
KIABS                         ABS
KIAND                         IAND
KIBCHNG                       IBCHNG
KIBCLR                        IBCLR
KIBITS                        IBITS
KIBSET                        IBSET
KIEOR                         IEOR
KIDIM                         DIM
KIDINT                        INT
KINT                          INT
KIOR                          IOR
KIQINT                        INT
KISHA                         SHIFTA
KISHC                         ISHFTC
KISHFT                        ISHFT
KISHFTC                       ISHFTC
KISHL                         ISHFT
KISIGN                        SIGN
KMOD                          MOD
KNINT                         NINT
KNOT                          NOT
LENGTH                        (none)
LONG                          INT
LSHIFT                        ISHFT or SHIFTL
MY_PE                         THIS_IMAGE
MEMORY_BARRIER                SYNC_MEMORY
NEQV                          IEOR
OR                            IOR
QABS                          ABS
QACOS                         ACOS
QACOSD                        ACOSD
QASIN                         ASIN
QASIND                        ASIND
QATAN                         ATAN
QATAN2                        ATAN2
QATAN2D                       ATAN2D
QATAND                        ATAND
QCMPLX                        CMPLX
QCONJG                        CONJG
QCOS                          COS
QCOSD                         COSD
QCOSH                         COSH
QCOT                          COT
QCOTAN                        COT
QDIM                          DIM
QEXP                          EXP
QEXT                          REAL
QFLOAT                        REAL
QFLOATI                       REAL
QFLOATJ                       REAL
QFLOATK                       REAL
QIMAG                         AIMAG
QINT                          AINT
QLOG                          LOG
QLOG10                        LOG10
QMAX1                         MAX
QMIN1                         MIN
QMOD                          MOD
QNINT                         ANINT
QREAL                         REAL
QSIGN                         SIGN
QSIN                          SIN
QSIND                         SIND
QSINH                         SINH
QSQRT                         SQRT
QTAN                          TAN
QTAND                         TAND
QTANH                         TANH
RAN                           RANDOM_NUMBER
RANF                          RANDOM_NUMBER
RANGET                        RANDOM_SEED
RANSET                        RANDOM_SEED
REMOTE_WRITE_BARRIER          SYNC_MEMORY
RSHIFT                        ISHFT or SHIFTR
RTC                           SYSTEM_CLOCK
SECNDS                        CPU_TIME
SHIFT                         ISHFTC
SHORT                         INT
SNGLQ                         REAL
TIME                          DATE_AND_TIME
UNIT                          WAIT statement
WRITE_MEMORY_BARRIER          SYNC_MEMORY
XOR                           IEOR

Cray Fortran Deferred Implementation and Optional Features [12]

The PE 6.0 release of the Cray Fortran compiler supports most of the features specified by the Fortran standard. One supported feature must be turned on with an option. This chapter identifies the Fortran 2003 features that are not fully supported. It is expected that these remaining features will be implemented in future releases of the Cray Fortran compiler.

12.1 ISO_10646 Character Set

The Fortran 2003 features related to supporting the ISO_10646 character set are not supported. This includes declarations, constants, and operations on variables of character(kind=4) and I/O operations.

12.2 Finalizers

Type bound FINAL routines are not supported for polymorphic objects, and code is not generated to invoke final routines of polymorphic objects.
12.3 Restrictions on Unlimited Polymorphic Variables

If the -e h option is specified to cause packed storage for short integers and logicals, unlimited polymorphic variables whose dynamic types are integer(1), integer(2), logical(1), or logical(2) are not supported.

12.4 Enhanced Expressions in Initializations and Specifications

The Fortran 2003 standard greatly expands the list of Fortran intrinsic functions that may be referenced in initialization and specification expressions, used mainly to create constants in declarations. Support for using some of these intrinsics, including the trigonometric intrinsic functions, is included in the PE 6.0 release, but the full list is not yet implemented.

12.5 User-defined, Derived Type I/O

User-defined, derived type I/O routines are not supported.

12.6 ENCODING= in I/O Statements

The ENCODING= specifier in I/O statements is accepted by the compiler but has no effect in the PE 6.0 release.

12.7 Allocatable Assignment (Optionally Enabled)

The Fortran 2003 standard allows an allocatable variable in an intrinsic assignment statement (variable = expression) to have a shape different from the expression. If the shapes are different, the variable is automatically deallocated and reallocated with the shape of the expression. This feature is available in the PE 6.0 Cray Fortran compiler but is not enabled by default because of potential adverse effects on performance. The new behavior is enabled by the -e w command line option.

Cray Fortran Implementation Specifics [13]

The Fortran standard specifies the rules for writing a standard conforming Fortran program. Many of the details of how such a program is compiled and executed are intentionally not specified or are explicitly specified as being processor-dependent. This chapter describes the implementation used by the Cray Fortran compiler.
Included are descriptions of the internal representations used for data objects and the values of processor-dependent language parameters.

13.1 Companion Processor

For the purpose of C interoperability, the Fortran standard refers to a "companion processor." The companion processor for the Cray Fortran compiler is the Cray C compiler.

13.2 INCLUDE Line

There is no limit to the nesting level for INCLUDE lines. The character literal constant in an INCLUDE line is interpreted as the name of the file to be included. This case-sensitive name may be prefixed with additional characters based on the -I compiler command line option.

13.3 INTEGER Kinds and Values

INTEGER kind type parameters of 1, 2, 4, and 8 are supported. The default kind type parameter is 4 unless the -s default64 or -s integer64 command line option is specified, in which case the default kind type parameter is 8. The interpretation of kinds 1 and 2 depends on whether the -e h command line option is specified. Integer values are represented as two's complement binary values.

13.4 REAL Kinds and Values

REAL kind type parameters of 4, 8, and 16 are supported. The default kind type parameter is 4 unless the -s default64 or -s real64 command line option is specified, in which case the default kind type parameter is 8. Real values are represented in the format specified by the IEEE 754 standard, with kinds 4, 8, and 16 corresponding to the 32, 64, and 128 bit IEEE representations.

13.5 DOUBLE PRECISION Kinds and Values

The DOUBLE PRECISION type is an alternate specification of a REAL type. The kind type parameter of that REAL type is twice the value of the kind type parameter for default REAL unless the -dp command line option is specified, in which case the kind type parameter for DOUBLE PRECISION and default REAL are the same, and REAL constants with a D exponent are treated as if the D were an E.
Note that if the -dp option is specified, the compiler is not standard conforming.

13.6 LOGICAL Kinds and Values

LOGICAL kind type parameters of 1, 2, 4, and 8 are supported. The default kind type parameter is 4 unless the -s default64 or -s integer64 command line option is specified, in which case the default kind type parameter is 8. The interpretation of kinds 1 and 2 depends on whether the -e h command line option is specified. Logical values are represented by a bit sequence in which the low order bit is set to 1 for the value .true. and to 0 for .false., and the other bits in the representation are set to 0.

13.7 CHARACTER Kinds and Values

The CHARACTER kind type parameter of 1 is supported. The default kind type parameter is 1. Character values are represented using the 8-bit ASCII character encoding.

13.8 Cray Pointers

Cray pointers are 64-bit objects.

13.9 ENUM Kind

An enumerator that specifies the BIND(C) attribute creates values with a kind type parameter of 4.

13.10 Storage Issues

This section describes how the Cray Fortran compiler uses storage, including how this compiler accommodates programs that use overindexing of blank common.

13.10.1 Storage Units and Sequences

The size of the numeric storage units is 32 bits, unless the -s default64 option is specified, in which case the numeric storage unit is 64 bits. If the -s real64 or -s integer64 option is specified alone, or the -dp option is specified in addition to -s default64 or -s real64, the relative sizes of the storage assigned for default intrinsic types do not conform to the standard. In this case, storage sequence associations involving variables declared with default intrinsic noncharacter types may be invalid and should be avoided.
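Because the default kinds and the relative storage sizes depend on the command line options described above, a program that relies on storage association may want to check them at run time. The following is a minimal sketch using only standard intrinsics (KIND, TRANSFER, SIZE); the module and procedure names are hypothetical and are not part of the Cray Fortran compiler or its libraries.

```fortran
module kind_report_demo
  implicit none
contains

  ! Kind type parameters of the default INTEGER, default REAL,
  ! DOUBLE PRECISION, and default LOGICAL types, in that order.
  subroutine default_kinds(k)
    integer, intent(out) :: k(4)
    k(1) = kind(0)
    k(2) = kind(0.0)
    k(3) = kind(0.0d0)
    k(4) = kind(.true.)
  end subroutine default_kinds

  ! True when default INTEGER, REAL, and LOGICAL occupy the same amount
  ! of storage and DOUBLE PRECISION occupies twice that amount, which
  ! is what the standard's storage sequence rules assume.
  logical function storage_units_conform()
    integer :: i1(1)
    real    :: r1(1)
    logical :: l1(1)
    i1 = 0
    r1 = 0.0
    l1 = .false.
    ! SIZE(TRANSFER(x, mold_array)) is the number of mold elements
    ! needed to hold x; checking both directions proves equal size.
    storage_units_conform = &
         size(transfer(0.0, i1)) == 1 .and. &
         size(transfer(0, r1)) == 1 .and. &
         size(transfer(.true., i1)) == 1 .and. &
         size(transfer(0, l1)) == 1 .and. &
         size(transfer(0.0d0, r1)) == 2
  end function storage_units_conform

end module kind_report_demo
```

A program built with one of the nonconforming option combinations listed in Section 13.10.1 would be expected to get .false. from storage_units_conform, which is a cue to avoid EQUIVALENCE and COMMON layouts that mix default types.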
13.10.2 Static and Stack Storage

The Cray Fortran compiler allocates variables to storage according to the following criteria:

• Variables in common blocks are always allocated in the order in which they appear in COMMON statements.

• Data in modules are statically allocated.

• User variables that are defined or referenced in a program unit, and that also appear in SAVE or DATA statements, are allocated to static storage, but not necessarily in the order shown in your source program.

• Other referenced user variables are assigned to the stack. If -ev is specified on the Cray Fortran compiler command line, referenced variables are allocated to static storage. This allocation does not necessarily depend on the order in which the variables appear in your source program.

• Compiler-generated variables are assigned to a register or to memory (to the stack or heap), depending on how the variable is used. Compiler-generated variables include DO-loop trip counts, dummy argument addresses, temporaries used in expression evaluation, argument lists, and variables storing adjustable dimension bounds at entries.

• Automatic objects may be allocated to either the stack or to the heap, depending on how much stack space is available when the objects are allocated.

• Heap or stack allocation can be used for TASK COMMON variables and some compiler-generated temporary data such as automatic arrays and array temporaries.

• Unsaved variables may be assigned to a register by optimization and not allocated storage.

• Unreferenced user variables not appearing in COMMON statements are not allocated storage.

13.10.3 Dynamic Memory Allocation

Many FORTRAN 77 programs contain a memory allocation scheme that expands an array in a common block located in central memory at the end of the program.
This practice of expanding a blank common block or expanding a dynamic common block (sometimes referred to as overindexing) causes conflicts between user management of memory and the dynamic memory requirements of UNICOS/mp and UNICOS/lc libraries. It is recommended that you modify programs rather than expand blank common blocks, particularly when migrating from other environments.

Figure 3 shows the structure of a program under the UNICOS/mp and UNICOS/lc operating systems in relation to expanding a blank common block. In both diagrams, the user area includes code, data, and common blocks.

[Figure 3. Memory Use. Two diagrams, "Without an expandable common block" and "With an expandable common block", with regions labeled Heap, User area, and Dynamic area, and Address 0 marked.]

13.11 Finalization

A finalizable object in a module is not finalized in the event that there is no longer any active procedure referencing the module.

A finalizable object that is allocated via pointer allocation is not finalized in the event that it later becomes unreachable due to all pointers to that object having their pointer association status changed.

13.12 ALLOCATE Error Status

If an error occurs during the execution of an ALLOCATE statement with a stat= specifier, subsequent items in the allocation list are not allocated.

13.13 DEALLOCATE Error Status

If an error occurs during the execution of a DEALLOCATE statement with a stat= specifier, subsequent items in the deallocation list are not deallocated.

13.14 ALLOCATABLE Module Variable Status

An unsaved allocatable module variable remains allocated if it is allocated when the execution of an END or RETURN statement results in no active program unit having access to the module.
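The stat= behavior in Sections 13.12 and 13.13 means that, after a failed multi-item ALLOCATE, some items may be allocated and others not. One defensive pattern, sketched below under the assumption that the caller wants an all-or-nothing result, is to query each item with ALLOCATED before releasing it. The module and subroutine names are hypothetical.

```fortran
module alloc_status_demo
  implicit none
contains

  ! Allocate a and b in one statement. On return, ierr holds the STAT=
  ! value. If the statement failed, any item that is allocated at that
  ! point is released so that the caller sees a consistent state.
  subroutine allocate_pair(a, b, n, ierr)
    real, allocatable, intent(inout) :: a(:), b(:)
    integer, intent(in) :: n
    integer, intent(out) :: ierr

    allocate(a(n), b(n), stat=ierr)
    if (ierr /= 0) then
      ! Items that follow the failing one are skipped, so an item may
      ! or may not be allocated here; test each one before freeing it.
      if (allocated(a)) deallocate(a)
      if (allocated(b)) deallocate(b)
    end if
  end subroutine allocate_pair

end module alloc_status_demo
```

Calling allocate_pair a second time on arrays that are already allocated is one simple way to trigger the error path, because allocating an allocated array is an error condition.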
13.15 Kind of a Logical Expression

For an expression such as x1 op x2 where op is a logical intrinsic binary operator and the operands are of type logical with different kind type parameters, the kind type parameter of the result is the larger kind type parameter of the operands.

13.16 STOP Code Availability

If a STOP code is specified in a STOP statement, its value is output to the stderr file when the STOP statement is executed.

13.17 Stream File Record Structure and Position

A formatted file written with stream access may be later read as a record file. In that case, embedded newline characters (char(10)) indicate the end of a record and the terminating newline character is not considered part of the record.

The file storage unit for a formatted stream file is a byte. The position is the ordinal byte number in the file; the first byte is position 1. Positions corresponding to newline characters (char(10)) that were inserted by the I/O library as part of record output do not correspond to positions of user-written data.

13.18 File Unit Numbers

The values of INPUT_UNIT, OUTPUT_UNIT, and ERROR_UNIT defined in the ISO_Fortran_env module are 100, 101, and 102, respectively. These three unit numbers are reserved and may not be used for other purposes. The files connected to these units are the same files used by the companion C processor for standard input (stdin), output (stdout), and error (stderr). An asterisk (*) specified as the unit for a READ statement specifies unit 100. An asterisk specified as the unit for a WRITE statement, and the unit for PRINT statements is unit 101. All positive default integer values are available for use as unit numbers.

13.19 OPEN Specifiers

If the ACTION= specifier is omitted from an OPEN statement, the default value is determined by the protections associated with the file. If both reading and writing are permitted, the default value is READWRITE.
If the ENCODING= specifier is omitted or specified as DEFAULT in an OPEN statement for a formatted file, the encoding used is ASCII.

The case of the name specified in a FILE= specifier in an OPEN statement is significant.

If the FILE= specifier is omitted, fort. is prepended to the unit number.

If the RECL= specifier is omitted from an OPEN statement for a sequential access file, the default value for the maximum record length is 1024.

If the file is connected for unformatted I/O, the length is measured in 8-bit bytes.

The FORM= specifier may also be SYSTEM for unformatted files.

If the ROUND= specifier is omitted from an OPEN statement, the default value is NEAREST. Specifying a value of PROCESSOR_DEFINED is equivalent to specifying NEAREST.

If the STATUS= specifier is omitted or specified as UNKNOWN in an OPEN statement, the specification is equivalent to OLD if the file exists, otherwise, it is equivalent to NEW.

13.20 FLUSH Statement

Execution of a FLUSH statement causes memory resident buffers to be flushed to the physical file. Output to the unit specified by ERROR_UNIT in the ISO_Fortran_env module is never buffered; execution of FLUSH on that unit has no effect.

13.21 Asynchronous I/O

The ASYNCHRONOUS= specifier may be set to YES to allow asynchronous I/O for a unit or file. Asynchronous I/O is used if the FFIO layer attached to the file provides asynchronous access.

13.22 REAL I/O of an IEEE NaN

An IEEE NaN may be used as an I/O value for the F, E, D, or G edit descriptor or for list-directed or namelist I/O.

13.22.1 Input of an IEEE NaN

The form of NaN is an optional sign followed by the string 'NAN' optionally followed by a hexadecimal digit string enclosed in parentheses. The input is case insensitive.
Some examples are:

    NaN              - quiet NaN
    nAN()            - quiet NaN
    -nan(ffffffff)   - quiet NaN
    NAn(7f800001)    - signalling NaN
    NaN(ffc00001)    - quiet NaN
    NaN(ff800001)    - signalling NaN

The internal value for the NaN will become a quiet NaN if the hexadecimal string is not present or is not a valid NaN.

A '+' or '-' preceding the NaN on input will be used as the high order bit of the corresponding READ input list item. An explicit sign overrides the sign bit from the hexadecimal string. The internal value becomes the hexadecimal string if it represents an IEEE NaN in the internal data type. Otherwise, the form of the internal value is undefined.

13.22.2 Output of an IEEE NaN

The form of an IEEE NaN for the F, E, D, or G edit descriptor or for list-directed or namelist output is:

1. If the field width w is absent, zero, or greater than or equal to (5 + 1/4 of the size of the internal value in bits), the output consists of the string 'NaN' followed by the hexadecimal representation of the internal value within a set of parentheses. An example of the output field is:

       NaN(7fc00000)

2. If the field width w is at least 3 but less than (5 + 1/4 of the size of the internal value in bits), the string 'NaN' will be right-justified in the field with blank fill on the left.

3. If the field width w is 1 or 2, the field is filled with asterisks.

The output field has no '+' or '-'; the sign is contained in the hexadecimal string.

To get the same internal value for a NaN, write it with a list-directed write statement and read it with a list-directed read statement.

To write and then read the same NaN, the field width w in D, E, F, or G must be at least the number of hexadecimal digits of the internal datum plus 5.

    REAL(4):  w >= 13
    REAL(8):  w >= 21
    REAL(16): w >= 37

13.23 List-directed and NAMELIST Output Default Formats

The length of the output value in NAMELIST and list-directed output depends on the value being written.
Blanks and unnecessary trailing zeroes are removed unless the -w option to the assign command is specified, which turns off this compression.

By default, full-precision printing is assumed unless a precision is specified by the LISTIO_PRECISION environment variable (for more information about the LISTIO_PRECISION environment variable, see Section 4.1.5, page 83).

13.24 Random Number Generator

A linear congruential generator is used to produce the output of the RANDOM_NUMBER intrinsic subroutine. The seed array contains two 32-bit integer values.

13.25 Timing Intrinsics

A call to the SYSTEM_CLOCK intrinsic subroutine with the COUNT argument present translates into inline instructions that directly access the hardware clock register. See the description of the -e s and -d s command line options for information about the values returned for the count and count rate. For fine-grained timing, Cray recommends using a kind = 8 count variable.

The CPU_TIME subroutine obtains the value of its argument from the getrusage system call. Its execution time is significantly longer than for the SYSTEM_CLOCK routine, but the values returned are closer to those used by system accounting utilities.

13.26 IEEE Intrinsic Modules

The IEEE intrinsic modules IEEE_EXCEPTIONS, IEEE_ARITHMETIC, and IEEE_FEATURES are supplied. Denormal numbers are not supported on Cray X1 or X2 hardware. The IEEE_SUPPORT_DENORMAL inquiry function returns .false. for all kinds of arguments.

At the start of program execution, the IEEE halting modes are set such that overflow, divide_by_zero, and invalid exceptions cause a trap, while traps are disabled for underflow and inexact.

Part III: Cray Fortran Application Programmer's I/O Reference

Part III describes advanced Fortran input/output (I/O) techniques for use on Cray X1 series systems.
It includes the following chapters:

• Using the Assign Environment (Chapter 14)
• Using FFIO (Chapter 15)
• FFIO Layer Reference (Chapter 16)
• Creating a User Layer (Chapter 17)
• Numeric File Conversion Routines (Chapter 18)
• Named Pipe Support (Chapter 19)

The reader should be familiar with the information presented in the following Cray man pages:

• The assign(1), assign(3f), and ffassign(3f) man pages
• The intro_ffio(1) man page, which describes the FFIO system and performance options available with the FFIO layers

For additional information about I/O, see Optimizing Applications on Cray X1 Series Systems.

Using the Assign Environment [14]

Fortran programs require the ability to alter many details of a Fortran file connection. You may need to specify device residency, an alternative file name, a file space allocation scheme, file structure, or data conversion properties of a connected file. These details comprise the assign environment.

In addition, Cray X1 series and X2 systems support flexible file I/O (FFIO), which uses layered I/O to implement sophisticated I/O strategies. When used in the context of the assign environment, FFIO enables you to implement different I/O techniques and realize significant improvements in I/O performance without modifying source code.

This chapter describes the assign(1) command and the assign(3f) library routine, which together define the assign environment. The FFIO system is described in Chapter 15, page 295. The ffassign(3c) routine provides an interface to assign processing from C/C++. See the ffassign(3c) man page for details about its use.

14.1 assign Basics

The assign command information is stored in the assign environment file, .assign, or in a shell environment variable. To begin using the assign environment to control a program's I/O behavior, follow these steps.

1.
Set the FILENV environment variable to the desired path:

    set FILENV environment-file

2. Run the assign command to define the current assign environment:

    assign arguments assign-object

   For example:

    assign -F cachea g:su

3. Run your program:

    ./a.out arguments

4. If you are not satisfied with the I/O performance observed during program execution, return to step 2, use the assign command to adjust the assign environment, and try again.

The assign(1) command passes information to Fortran open statements and to the ffopen(3c) routine to identify the following elements:

• A list of unit numbers
• File names
• File name patterns that have attributes associated with them

The assign object is the file name, file name pattern, unit number, or type of I/O open request to which the assign environment applies. When the unit or file is opened from Fortran, the environment defined by the assign command is used to establish the properties of the connection.

14.1.1 Assign Objects and Open Processing

The I/O library routines apply options to a file connection for all related assign objects.

If the assign object is a unit, the application of options to the unit occurs whenever that unit becomes connected.

If the assign object is a file name or pattern, the application of options to the file connection occurs whenever a matching file name is opened from a Fortran program.

When any of the library I/O routines opens a file, it uses the specified assign environment options for any assign objects that apply to the open request. Any of the following assign objects or categories might apply to a given open request:

• g:all options apply to any open request.
• g:su, g:du, g:sf, g:df, and g:ff all apply to types of open requests. These equate to sequential unformatted, direct unformatted, sequential formatted, direct formatted, or ffopen, respectively.
• u:unit-number applies whenever unit-number is opened.
• p:pattern applies whenever a file whose name matches pattern is opened. The assign environment can contain only one p:assign-object that matches the current open file. The exception is that the p:%pattern (which uses the % wildcard character) is silently ignored if a more specific pattern also matches the current file name being opened.
• f:filename applies whenever a file with the name filename is opened.

Options from the assign objects in these categories are collected to create the complete set of options used for any particular open. The options are collected in the listed order, with options collected later in the list of assign objects overriding those collected earlier.

14.1.2 The assign Command

Here is the syntax for the assign command:

    assign [-I] [-O] [-a actualfile] [-b bs] [-f fortstd] [-m setting] [-s ft]
           [-t] [-u bufcnt] [-y setting] [-B setting] [-C charcon] [-D fildes]
           [-F spec[,specs]] [-N numcon] [-R] [-S setting] [-T setting]
           [-U setting] [-V] [-W setting] [-Y setting] [-Z setting] assign-object

The following specifications cannot be used with any other options:

    assign -R [assign-object]
    assign -V [assign-object]

A summary of the assign command options follows. For details, see the assign(1) and intro_ffio(3f) man pages.

Here are the assign command control options:

-I    Specifies an incremental use of assign. All attributes are added to the attributes already assigned to the current assign-object. This option and the -O option are mutually exclusive.

-O    Specifies a replacement use of assign. This is the default control option. All currently existing assign attributes for the current assign-object are replaced. This option and the -I option are mutually exclusive.

-R    Removes all assign attributes for assign-object. If assign-object is not specified, all currently assigned attributes for all assign-objects are removed.

-V    Views attributes for assign-object.
If assign-object is not specified, all currently assigned attributes for all assign-objects are printed.

Here are the assign command attribute options:

-a actualfile    The file= specifier or the actual file name.

-b bs    Library buffer size in 4096-byte (512-word) blocks.

-f fortstd    Specifies compatibility with a Fortran standard, where fortstd is either 2003 for the current Cray Fortran or 95 for Cray Fortran 95. If the value 95 is set, the list-directed and namelist output of a floating-point zero will remain 0.E+0.

-m setting    Special handling of a direct access file that will be accessed concurrently by several processes or tasks. Special handling includes skipping the check that only one Fortran unit be connected to a file, suppressing file truncation to true size by the I/O buffering routines, and ensuring that the file is not truncated by the I/O buffering routines. Enter either on or off for setting.

-s ft    File type. Enter text, cos, blocked, unblocked, u, sbin, or bin for ft. The default is text.

-t    Temporary file.

-u bufcnt    Buffer count. Specifies the number of buffers to be allocated for a file.

-y setting    Suppresses repeat counts in list-directed output. setting can be either on or off. The default setting is off.

-B setting    Activates or suppresses the passing of the O_DIRECT flag to the open(2) system call. Enter either on or off for setting. This is an important feature for I/O optimization; if this is on, it enables reads and writes directly to and from the user program buffer.

-C charcon    Character set conversion information. Enter ascii or ebcdic for charcon. If you specify the -C option, you must also specify the -F option.

-D fildes    Specifies a connection to a standard file. Enter stdin, stdout, or stderr for fildes.

-F spec[,specs]    Flexible file I/O (FFIO) specification. See the assign(1) man page for details about allowed values for spec and for details about hardware platform support.
See the intro_ffio(3f) man page for details about specifying the FFIO layers.

-N numcon    Foreign numeric conversion specification. See the assign(1) man page for details about allowed values for numcon and for details about hardware platform support.

-S setting    Suppresses use of a comma as a separator in list-directed output. Enter either on or off for setting. The default setting is off.

-T setting    Activates or suppresses truncation after write for sequential Fortran files. Enter either on or off for setting.

-U setting    Produces a non-UNICOS form of list-directed output. This is a global setting that sets the value for the -y, -S, and -W options. Enter either on or off for setting. The default setting is off.

-W setting    Suppresses compressed width in list-directed output. Enter either on or off for setting. The default setting is off.

-Y setting    Skips unmatched namelist groups in a namelist input record. Enter either on or off for setting. The default setting is off.

-Z setting    Recognizes -0.0 for IEEE floating-point systems and writes the minus sign for edit-directed, list-directed, and namelist output. Enter either on or off for setting. The default setting is on.

assign-object    Specify either a file name or a unit number for assign-object. The assign command associates the attributes with the file or unit specified. These attributes are used during the processing of Fortran open statements or during implicit file opens.

Use one of the following formats for assign-object:

• f:file-name (for example, f:file1)
• g:io-type; io-type can be su, sf, du, df, or ff (for example, g:ff for ffopen(3C))
• p:pattern (for example, p:file%)
• u:unit-number (for example, u:9)
• file-name (for example, myfile)

When the p:pattern form is used, the % and _ wildcard characters can be used. The % matches any string of 0 or more characters. The _ matches any single character. The % performs like the * when doing file name matching in shells.
However, the % character also matches strings of characters containing the / character.

14.1.3 Assign Library Routines

The assign(3f), asnunit(3f), asnfile(3f), and asnrm(3f) routines can be called from a Fortran program to access and update the assign environment. The assign routine provides an easy interface to assign processing from a Fortran program. The asnunit and asnfile routines assign attributes to units and files, respectively. The asnrm routine removes all entries currently in the assign environment.

The calling sequences for the assign library routines are as follows:

    call assign (cmd, ier)
    call asnunit (iunit,astring,ier)
    call asnfile (fname,astring,ier)
    call asnrm (ier)

cmd    Fortran character variable that contains a complete assign command in the format that is also acceptable to the pxfsystem routine.

ier    Integer variable that is assigned the exit status on return from the library interface routine.

iunit    Integer variable or constant that contains the unit number to which attributes are assigned.

astring    Fortran character variable that contains any attribute options and option values from the assign command. Control options -I, -O, and -R can also be passed.

fname    Character variable or constant that contains the file name to which attributes are assigned.

A status of 0 indicates normal return and a status of greater than 0 indicates a specific error status. Use the explain command to determine the meaning of the error status. For more information about the explain command, see the explain(1) man page.
The following calls are equivalent to the assign -s u f:file command:

    call assign('assign -s u f:file',ier)
    call asnfile('file','-s u',ier)

The following call is equivalent to executing the assign -I -n 2 u:99 command:

    iun = 99
    call asnunit(iun,'-I -n 2',ier)

The following call is equivalent to executing the assign -R command:

    call asnrm(ier)

14.2 assign and Fortran I/O

Assign processing lets you tune file connections. This section describes several areas of assign command usage and provides examples of each use.

14.2.1 Alternative File Names

The -a option specifies the actual file name to which a connection is made. This option allows files to be created in different directories without changing the FILE= specifier on an OPEN statement.

For example, consider the following assign command issued to open unit 1:

    assign -a /tmp/mydir/tmpfile u:1

The program then opens unit 1 with any of the following statements:

    WRITE(1) variable          ! implicit open
    OPEN(1)                    ! unnamed open
    OPEN(1,FORM='FORMATTED')   ! unnamed open

Unit 1 is connected to file /tmp/mydir/tmpfile. Without the -a attribute, unit 1 would be connected to file fort.1.

When the -a attribute is associated with a file, any Fortran open that is set to connect to the file causes a connection to the actual file name.
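The two -a cases described above (an unnamed open of a unit, and a named open of a file) can be pictured with a small Python model. This is only an illustrative sketch, not the I/O library's logic: the function name connected_name and the dictionary that stands in for the assign environment are hypothetical, the fort.<unit> default is generalized from the fort.1 case in the text, and interactions between several assign objects that match the same open are deliberately not modeled.

```python
def connected_name(assign_env, unit, file_spec=None):
    """Return the file name an OPEN would connect to in this toy model.

    assign_env maps an assign object such as "u:1" or "f:ftfile" to the
    value of its -a attribute. The dictionary is a hypothetical stand-in
    for the real assign environment.
    """
    if file_spec is None:
        # Unnamed or implicit open: the unit's -a attribute is used if
        # present; otherwise the default name fort.<unit> applies.
        return assign_env.get(f"u:{unit}", f"fort.{unit}")
    # Named open: an assign-by-filename -a attribute redirects the
    # connection to the actual file; otherwise the FILE= name is used.
    return assign_env.get(f"f:{file_spec}", file_spec)


env = {"u:1": "/tmp/mydir/tmpfile"}
print(connected_name(env, 1))            # unit 1 has an -a attribute
print(connected_name(env, 2))            # unit 2 falls back to the default
print(connected_name(env, 3, "ftfile"))  # named open without an -a attribute
```

The model keeps the unit case and the file case separate on purpose, because the text describes them separately.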
An assign command of the following form causes a connection to file $FILENV/joe:

    assign -a $FILENV/joe ftfile

This is true when the following statement is executed in a program:

    OPEN(IUN,FILE='ftfile')

If the following assign command is issued and is in effect, any Fortran INQUIRE statement whose FILE= specification is foo refers to the file named actual instead of the file named foo for purposes of the EXIST=, OPENED=, or UNIT= specifiers:

    assign -a actual f:foo

If the following assign command is issued and is in effect, the -a attribute does not affect INQUIRE statements with a UNIT= specifier:

    assign -a actual ftfile

When the following OPEN statement is executed, INQUIRE(UNIT=n,NAME=fname) returns a value of ftfile in fname, as if no assign had occurred:

    OPEN(n,file='ftfile')

The I/O library routines use only the actual file (-a) attributes from the assign environment when processing an INQUIRE statement. During an INQUIRE statement that contains a FILE= specifier, the I/O library searches the assign environment for a reference to the file name that the FILE= specifier supplies. If an assign-by-filename exists for the file name, the I/O library determines whether an actual name from the -a option is associated with the file name. If the assign-by-filename supplied an actual name, the I/O library uses that name to return values for the EXIST=, OPENED=, and UNIT= specifiers; otherwise, it uses the file name. The name returned for the NAME= specifier is the file name supplied in the FILE= specifier. The actual file name is not returned.

14.2.2 File Structure Selection

A file structure defines the way records are delimited and how the end-of-file is represented. The assign command supports two mutually exclusive file structure options:

• To select a structure using an FFIO layer, use assign -F
• To select a structure explicitly, use assign -s

Using FFIO layers is far more flexible than selecting structures explicitly.
FFIO allows nested file structures, buffer size specifications, and support for file structures that are not available through the -s option. You will also realize better I/O performance by using the -F option and FFIO layers. For more information about the -F option and FFIO layers, see Chapter 15, page 295.

The remainder of this section covers the -s option.

Fortran sequential unformatted I/O uses four different file structures: f77 blocked structure, text structure, unblocked structure, and COS blocked structure. By default, the f77 blocked structure is used unless a file structure is selected at open time. If an alternative file structure is needed, the user can select a file structure by using the -s or -F option on the assign command.

The -s and -F options are mutually exclusive. The following list summarizes how to select the different file structures with different options to the assign command:

    Structure      assign command
    F77 blocked    assign -F f77
    text           assign -F text
                   assign -s text
    unblocked      assign -F system
                   assign -s unblocked
    COS blocked    assign -F cos
                   assign -s cos

The following examples address file structure selection:

• To select an unblocked file structure for a sequential unformatted file:

    IUN = 1
    CALL ASNUNIT(IUN,'-s unblocked',IER)
    OPEN(IUN,FORM='UNFORMATTED',ACCESS='SEQUENTIAL')

• You can use the assign -s u command to specify the unblocked file structure for a sequential unformatted file. When this option is selected, the I/O is unbuffered.
Each Fortran READ or WRITE statement results in a read(2) or write(2) system call. For example:

    CALL ASNFILE('fort.1','-s u',IER)
    OPEN(1,FORM='UNFORMATTED',ACCESS='SEQUENTIAL')

• Use the following command to assign unit 10 a COS blocked structure:

    assign -s cos u:10

The full set of options allowed with the assign -s command is as follows:

• bin (not recommended)
• blocked
• cos
• sbin
• text
• unblocked

Table 25 summarizes the Fortran access methods and options.

Table 25. Fortran access methods and options

    Access and form               assign -s ft defaults   assign -s ft options
    Sequential unformatted,       blocked / cos / f77     bin, sbin, u, unblocked
    BUFFER IN and BUFFER OUT
    Direct unformatted            unblocked               bin, sbin, u, unblocked
    Sequential formatted          text                    blocked, cos, sbin/text
    Direct formatted              text                    sbin/text

14.2.2.1 Unblocked File Structure

A file with an unblocked file structure contains undelimited records. Because it does not contain any record control words, it does not have record boundaries. The unblocked file structure can be specified for a file that is opened with either unformatted sequential access or unformatted direct access. It is the default file structure for a file opened as an unformatted direct-access file.

Do not reposition a file with unblocked file structure with a BACKSPACE statement. You cannot reposition the file to a previous record when record boundaries do not exist.

BUFFER IN and BUFFER OUT statements can specify a file that has an unbuffered and unblocked file structure. If the file is specified with assign -s u, BUFFER IN and BUFFER OUT statements can perform asynchronous unformatted I/O.

You can specify the unblocked data file structure by using the assign(1) command in several ways.
All methods result in a similar file structure but with different library buffering styles, use of truncation on a file, alignment of data, and recognition of an end-of-file record in the file. The following unblocked data file structure specifications are available:

    Specification         Structure
    assign -s unblocked   Library-buffered
    assign -F system      No library buffering
    assign -s sbin        Buffering that is compatible with standard I/O;
                          for example, both library and system buffering.

The type of file processing for an unblocked data file structure depends on the assign -s ft option declared or assumed for a Fortran file.

For more information about buffering, see Section 14.2.3, page 286.

An I/O request for a file specified using the assign -s unblocked command does not need to be a multiple of a specific number of bytes. Such a file is truncated after the last record is written to the file. Padding occurs for files specified with the assign -s bin command and the assign -s unblocked command. Padding usually occurs when noncharacter variables follow character variables in an unformatted direct-access file.

No padding is done in an unformatted sequential access file. An unformatted direct-access file created by a Fortran program on UNICOS/mp and UNICOS/lc systems contains records that are the same length. The end-of-file record is recognized in sequential-access files.

14.2.2.2 assign -s sbin File Processing (not recommended)

You can use an assign -s sbin specification for a Fortran file that is opened with either unformatted direct access or unformatted sequential access. The file does not contain record delimiters. The file created for assign -s sbin in this instance has an unblocked data file structure and uses unblocked file processing.

The assign -s sbin option can be specified for a Fortran file that is declared as formatted sequential access.
Because the file contains records that are delimited with the new-line character, it is not an unblocked data file structure. It is the same as a text file structure.

The assign -s sbin option is compatible with the standard C I/O functions.

Note: Cray discourages the use of assign -s sbin because of poor I/O performance. If you cannot use an FFIO layer, use assign -s text for formatted files and assign -s unblocked for unformatted files.

14.2.2.3 assign -s bin File Processing

An I/O request for a file that is specified with assign -s bin does not need to be a multiple of a specific number of bytes. Padding occurs when noncharacter variables follow character variables in an unformatted record.

The I/O library uses an internal buffer for the records. If opened for sequential access, a file is not truncated after each record is written to the file.

14.2.2.4 assign -s u File Processing

The assign -s u command specifies undefined or unknown file processing. An assign -s u specification can be specified for a Fortran file that is declared as unformatted sequential or direct access. Because the file does not contain record delimiters, it has an unblocked data file structure. Both synchronous and asynchronous BUFFER IN and BUFFER OUT processing can be used with u file processing.

Fortran sequential files declared by using assign -s u are not truncated after the last word written. The user must execute an explicit ENDFILE statement on the file.

14.2.2.5 text File Structure

The text file structure consists of a stream of 8-bit ASCII characters. Every record in a text file is terminated by a newline character (\n, ASCII 012). Some utilities may omit the newline character on the last record, but the Fortran library will treat such an occurrence as a malformed record.

This file structure can be specified for a file that is declared as formatted sequential access or formatted direct access. It is the default file structure for formatted sequential access files.
It is also the default file structure for formatted direct access files.

The assign -s text command specifies the library-buffered text file structure. Both library and system buffering are done for all text file structures.

An I/O request for a file using assign -s text does not need to be a multiple of a specific number of bytes.

You cannot use BUFFER IN and BUFFER OUT statements with this structure. You can use a BACKSPACE statement to reposition a file with this structure.

14.2.2.6 cos or blocked File Structure

The cos or blocked file structure uses control words to mark the beginning of each sector and to delimit each record. You can specify this file structure for a file that is declared as unformatted sequential access. Synchronous BUFFER IN and BUFFER OUT statements can create and access files with this file structure.

You can specify this file structure with one of the following assign(1) commands:

    assign -s cos
    assign -s blocked
    assign -F cos
    assign -F blocked

These four assign commands result in the same file structure.

An I/O request on a blocked file is library buffered.

In a cos file structure, one or more ENDFILE records are allowed. BACKSPACE statements can be used to reposition a file with this structure.

A blocked file is a stream of words that contains control words called Block Control Words (BCW) and Record Control Words (RCW) to delimit records. Each record is terminated by an EOR (end-of-record) RCW. At the beginning of the stream, and every 512 words thereafter (including any RCWs), a BCW is inserted. An end-of-file (EOF) control word marks a special record that is always empty. Fortran considers this empty record to be an endfile record. The end-of-data (EOD) control word is always the last control word in any blocked file. The EOD is always immediately preceded by an EOR, or an EOF and a BCW.
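As an illustration of how these control words carry their information, the following Python sketch splits a 64-bit control word into the fields listed in the BCW and RCW field tables of this section. It is a sketch under stated assumptions rather than library code: it assumes that bit 0 in those tables is the most significant bit of the word, and the names field and decode_control_word are hypothetical helpers introduced here.

```python
# Control word type codes from the RCW table (octal); 0 marks a BCW.
M_BCW, M_EOR, M_EOF, M_EOD = 0o0, 0o10, 0o16, 0o17
TYPE_NAMES = {M_BCW: "BCW", M_EOR: "EOR", M_EOF: "EOF", M_EOD: "EOD"}


def field(word, first_bit, last_bit):
    """Extract bits first_bit..last_bit of a 64-bit word.

    Assumption: bit 0 is the most significant bit of the word.
    """
    width = last_bit - first_bit + 1
    shift = 63 - last_bit
    return (word >> shift) & ((1 << width) - 1)


def decode_control_word(word):
    """Split a 64-bit control word into the fields named in the tables."""
    m = field(word, 0, 3)
    info = {"type": TYPE_NAMES.get(m, "unknown"), "fwi": field(word, 55, 63)}
    if m == M_BCW:
        info["bdf"] = field(word, 11, 11)   # bad data flag
        info["bn"] = field(word, 31, 54)    # block number
    else:
        info["ubc"] = field(word, 4, 9)     # unused bit count
        info["pfi"] = field(word, 20, 39)   # previous file index
        info["pri"] = field(word, 40, 54)   # previous record index
    return info


# Hand-built examples: a BCW for block 5 with 3 data words before the next
# control word, and an EOR whose record left 16 unused bits in its last word.
bcw = (5 << 9) | 3
eor = (M_EOR << 60) | (16 << 54)
print(decode_control_word(bcw))
print(decode_control_word(eor))
```

A reader of a blocked file could apply decode_control_word to the first word of the stream and then use the fwi count to find the next control word, as the forward index description indicates.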
Each control word contains a count of the number of data words to be found between it and the next control word. In the case of the EOD, this count is 0. Because there is a BCW every 512 words, these counts never point forward more than 511 words.

A record always begins at a word boundary. If a record ends in the middle of a word, the rest of that word is zero filled; the ubc field of the closing RCW contains the number of unused bits in the last word.

The following illustration and table represent the structure of a BCW.

    Field:  m    unused  bdf  unused  bn    fwi
    Width:  (4)  (7)     (1)  (19)    (24)  (9)

    Field   Bits    Description
    m       0–3     Type of control word; 0 for BCW
    bdf     11      Bad Data flag (1-bit, 1=bad data)
    bn      31–54   Block number (modulo 2^24)
    fwi     55–63   Forward index; the number of words to next control word

The following illustration and table represent the structure of an RCW.

    Field:  m    ubc  tran  bdf  srs  unused  pfi   pri   fwi
    Width:  (4)  (6)  (1)   (1)  (1)  (7)     (20)  (15)  (9)

    Field   Bits    Description
    m       0–3     Type of control word; 10 (octal) for EOR, 16 (octal) for EOF, and 17 (octal) for EOD.
    ubc     4–9     Unused bit count; number of unused low-order bits in last word of previous record.
    tran    10      Transparent record field (unused).
    bdf     11      Bad data flag (unused).
    srs     12      Skip remainder of sector (unused).
    pfi     20–39   Previous file index; offset modulo 2^20 to the block where the current file starts (as defined by the last EOF).
    pri     40–54   Previous record index; offset modulo 2^15 to the block where the current record starts.
    fwi     55–63   Forward index; the number of words to next control word.

14.2.3 Buffer Specifications

A buffer is a temporary storage location for data while the data is being transferred. A buffer is often used for the following purposes:

• Small I/O requests can be collected into a buffer, and the overhead of making many relatively expensive system calls can be greatly reduced.
• Many data file structures such as cos contain control words. During the write process, a buffer can be used as a work area where control words can be inserted into the data stream (a process called blocking). The blocked data is then written to the device. During the read process, the same buffer work area can be used to remove the control words before passing the data on to the user (called deblocking).
• When data access is random, the same data may be requested many times. A cache is a buffer that keeps old requests in the buffer in case these requests are needed again. A cache that is sufficiently large or efficient can avoid a large part of the physical I/O by having the data ready in a buffer. When the data is often found in the cache buffer, the cache is said to have a high hit rate. For example, if the entire file fits in the cache and the file is present in the cache, no more physical requests are required to perform the I/O. In this case, the hit rate is 100%.
• Running the I/O devices and the processors in parallel often improves performance; therefore, it is useful to keep processors busy while data is being moved. To do this when writing, data can be transferred to the buffer at memory-to-memory copy speed and then written with an asynchronous I/O request. Control is then immediately returned to the program, which continues to execute as if the I/O were complete (a process called write-behind). A similar process called read-ahead can be used while reading; in this process, data is read into a buffer before the actual request is issued for it. When it is needed, it is already in the buffer and can be transferred to the user at very high speed. This is another use of a cache.
• When direct I/O is enabled (assign -B on), data is staged in the system buffer cache. While this can yield improved performance, it also means that performance is affected by program competition for system buffer cache.
To minimize this effect, avoid public caches when possible.
• In many cases, the best asynchronous I/O performance can be realized by using the FFIO cachea layer (assign -F cachea). This layer supports read-ahead, write-behind, and improved cache reuse.

The size of the buffer used for a Fortran file can have a substantial effect on I/O performance. A larger buffer size usually decreases the system time needed to process sequential files. However, large buffers increase a program's memory usage; therefore, optimizing the buffer size for each file accessed in a program on a case-by-case basis can help increase I/O performance and minimize memory usage.

The -b option on the assign command specifies a buffer size, in blocks, for the unit. The -b option can be used with the -s option, but it cannot be used with the -F option. Use the -F option to provide I/O path specifications that include buffer sizes; the -b and -u options do not apply when -F is specified.

For more information about the selection of buffer sizes, see the assign(1) man page.

The following examples of buffer size specification illustrate using the assign -b and assign -F options:

• If unit 1 is a large sequential file for which many Fortran READ or WRITE statements are issued, you can increase the buffer size to a large value, using the following assign command:

    assign -b buffer-size u:1

• If file foo is a small file or is accessed infrequently, minimize the buffer size using the following assign command:

    assign -b 1 f:foo

14.2.3.1 Default Buffer Sizes

The Fortran I/O library automatically selects default buffer sizes according to file access type as shown in Table 26. You can override the defaults by using the assign(1) command. The following subsections describe the default buffer sizes on various systems.

Note: One block is 4,096 bytes on UNICOS/mp and UNICOS/lc systems.

The default buffer sizes are as follows:

Table 26. Default Buffer Sizes for Fortran I/O Library Routines

    Access Type              Default Buffer Size
    Sequential formatted     16 blocks (65,536 bytes)
    Sequential unformatted   128 blocks (524,288 bytes)
    Direct formatted         The smaller of:
                             • The record length in bytes + 1
                             • 16 blocks (65,536 bytes)
    Direct unformatted       The larger of:
                             • The record length
                             • 16 blocks (65,536 bytes)

Four buffers of default size are allocated. For more information, see the description of the cachea layer in the intro_ffio(3F) man page.

14.2.3.2 Library Buffering

The term library buffering refers to a buffer that the I/O library associates with a file. When a file is opened, the I/O library checks the access, form, and any attributes declared on the assign command to determine the type of processing that should be used on the file. Buffers are an integral part of the processing.

If the file is assigned with one of the following assign(1) options, library buffering is used:

    -s blocked
    -F spec (buffering as defined by spec)
    -s cos
    -s bin
    -s unblocked

The -F option specifies flexible file I/O (FFIO), which uses library buffering if the specifications selected include a need for buffering. In some cases, more than one set of buffers might be used in processing a file. For example, the -F bufa,cos option specifies two library buffers for a read of a blank compressed COS blocked file. One buffer handles the blocking and deblocking associated with the COS blocked control words, and the second buffer is used as a work area to process blank compression. In other cases (for example, -F system), no library buffering occurs.

14.2.3.3 System Cache

The operating system uses a set of buffers in kernel memory for I/O operations. These are collectively called the system cache. The I/O library uses system calls to move data between the user memory space and the system buffer.
The system cache ensures that the actual I/O to the logical device is well formed, and it tries to remember recent data in order to reduce physical I/O requests.

The following assign(1) command options can be expected to use system cache:

    -s sbin
    -F spec (FFIO, depends on spec)

For the assign -F cachea command, a library buffer ensures that the actual system calls are well formed and the system buffer cache is bypassed. This is not true for the assign -s u option. If you plan to use assign -s u to bypass the system cache, all requests must be well formed.

14.2.3.4 Unbuffered I/O

The simplest form of buffering is none at all; this unbuffered I/O is known as direct I/O. For sufficiently large, well-formed requests, buffering is not necessary and can add unnecessary overhead and delay. The following assign(1) command specifies unbuffered I/O:

    assign -s u ...

Use this assign command to bypass both library buffering and the system cache for all well-formed requests. The data is transferred directly between the user data area and the logical device. Requests that are not well formed will result in I/O errors.

14.2.4 Foreign File Format Specification

The Fortran I/O library can read and write files with record blocking and data formats native to operating systems from other vendors. The assign -F command specifies a foreign record blocking; the assign -C command specifies the type of character conversion; the -N option specifies the type of numeric data conversion. When -N or -C is specified, the data is converted automatically during the processing of Fortran READ and WRITE statements.
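To make the idea of character and numeric conversion concrete, the following Python sketch decodes a record that holds four EBCDIC characters followed by one integer. It is only an approximation of what such a conversion amounts to, not a description of the Cray library: it assumes that Python's cp037 code page is an acceptable stand-in for "ebcdic", that the foreign integer is a 32-bit big-endian two's complement value, and that record blocking has already been removed. The function name decode_foreign_record is hypothetical.

```python
import struct


def decode_foreign_record(payload):
    """Decode 4 EBCDIC characters followed by a 32-bit big-endian integer.

    Assumptions: code page cp037 stands in for "ebcdic", and the integer
    is 32-bit big-endian two's complement. Record blocking is not
    modeled; payload is the bare record data.
    """
    ch = payload[:4].decode("cp037")
    (value,) = struct.unpack(">i", payload[4:8])
    return ch, value


# "ABCD" in EBCDIC is C1 C2 C3 C4; the integer 42 is 00 00 00 2A.
record = bytes([0xC1, 0xC2, 0xC3, 0xC4, 0x00, 0x00, 0x00, 0x2A])
print(decode_foreign_record(record))  # ('ABCD', 42)
```

With assign -C and -N in effect, the Fortran program itself contains no such decoding step; the READ statement receives native values directly.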
For example, assume that a record in file fgnfile contains the following character and integer data:

      character*4 ch
      integer int
      open(iun,FILE='fgnfile',FORM='UNFORMATTED')
      read(iun) ch, int

Use the following assign command to specify foreign record blocking and foreign data formats for character and integer data:

    assign -F ibm.vbs -N ibm -C ebcdic fgnfile

14.2.5 Memory Resident Files

The assign -F mr command specifies that a file will be memory resident. Because the mr flexible file I/O layer does not define a record-based file structure, it must be nested beneath a file structure layer when record blocking is needed.

For example, if unit 2 is a sequential unformatted file that is to be memory resident, the following Fortran statements connect the unit:

      CALL ASNUNIT (2,'-F cos,mr',IER)
      OPEN(2,FORM='UNFORMATTED')

The -F cos,mr specification selects COS blocked structure with memory residency.

14.2.6 Fortran File Truncation

The assign -T option activates or suppresses truncation after the writing of a sequential Fortran file. The -T on option specifies truncation; this behavior is consistent with the Fortran standard and is the default setting for most assign -s fs specifications.

The assign(1) man page lists the default setting of the -T option for each -s fs specification. It also indicates if suppression or truncation is allowed for each of these specifications.

FFIO layers that are specified by using the -F option vary in their support for suppression of truncation with -T off.

Figure 4 summarizes the available access methods and the default buffer sizes.
[Figure 4. Access Methods and Default Buffer Sizes. Columns (access method and assign option): Blocked (-s cos, -F f77), Text (-s text), Undef (-s u), Binary (-s bin), Unblocked (-s unblocked). Rows: Formatted sequential I/O (WRITE(9,20), PRINT); Formatted direct I/O (WRITE(9,20,REC=)); Unformatted sequential I/O (WRITE(9)); Unformatted direct I/O (WRITE(9,REC=)); Buffer in/buffer out; Control words; Library buffering; System cached; BACKSPACE; Record size; Default library buffer size. Footnotes: default library buffer sizes are in units of 4096 bytes, unless otherwise specified; "Cached if not well-formed"; "No guarantee when physical size not 512 words".]

14.3 The Assign Environment File

The assign command information is stored in the assign environment file. The location of the active assign environment file must be provided by setting the FILENV environment variable to the desired path and file name.

14.4 Local Assign Mode

The assign environment information is usually stored in the .assign environment file. Programs that do not require the use of the global .assign environment file can activate local assign mode. If you select local assign mode, the assign environment will be stored in memory. Thus, other processes cannot adversely affect the assign environment used by the program.

The ASNCTL(3f) routine selects local assign mode when it is called by using one of the following command lines:

      CALL ASNCTL('LOCAL',1,IER)
      CALL ASNCTL('NEWLOCAL',1,IER)

Example 5: Local assign mode

In the following example, a Fortran program activates local assign mode and then specifies an unblocked data file structure for a unit before opening it.
The -I option is passed to ASNUNIT to ensure that any assign attributes continue to have an effect at the time of file connection.

C     Switch to local assign environment
      CALL ASNCTL('LOCAL',1,IER)
      IUN = 11
C     Assign the unblocked file structure
      CALL ASNUNIT(IUN,'-I -s unblocked',IER)
C     Open unit 11
      OPEN(IUN,FORM='UNFORMATTED')

If a program contains all necessary assign statements as calls to ASSIGN, ASNUNIT, and ASNFILE, or if a program requires total shielding from any assign commands, use the second form of a call to ASNCTL, as follows:

C     New (empty) local assign environment
      CALL ASNCTL('NEWLOCAL',1,IER)
      IUN = 11
C     Assign a large buffer size
      CALL ASNUNIT(IUN,'-b 336',IER)
C     Open unit 11
      OPEN(IUN,FORM='UNFORMATTED')

Using FFIO [15]

This chapter provides an overview of the capabilities of the flexible file I/O (FFIO) system and describes how to use FFIO with common file structures to enhance code performance without changing source code. Flexible file I/O, sometimes called layered I/O, is used to perform many I/O-related tasks. For details about each individual I/O layer, see Chapter 16, page 311.

15.1 Introduction to FFIO

The FFIO system is based on the concept that for all I/O a list of processing steps must be performed to transfer the user data between the user's memory and the desired I/O device. I/O can be the slowest part of a computational process, and the speed of I/O access methods varies depending on computational processes.

Figure 5 illustrates the typical flow of data from the user's variables to and from the I/O device.

[Figure 5. Typical Data Flow (box labels: User’s, job, System call, Kernel)]

It is useful to think of each of these boxes as a stopover for the data, and each transition between stopovers as a processing step.
It is also important to realize that the actual I/O path can skip one or more steps in this process, depending on the I/O features used at a given point in a given program.

Each transition has benefits and costs. Different applications might use the total I/O system in different ways. For example, if I/O requests are large, the library buffer is unnecessary because the buffer is used primarily to avoid making system calls for every small request. You can achieve better I/O throughput with large I/O requests by not using library buffering.

If library buffering is not used, I/O requests should be large; otherwise, I/O performance will be degraded. On the other hand, if all I/O requests are small, the library buffer is essential to avoid making a costly system call for each I/O request.

It is useful to be able to modify the I/O process to prevent intermediate steps (such as buffering of data) for existing programs without requiring that the source code be changed. The assign(1) command lets you modify the total user I/O path by establishing an I/O environment.

The FFIO system lets you specify each stopover. You can specify a comma-separated list of one or more processing steps by using the assign -F command:

    assign -F spec1,spec2,spec3...

Each spec in the list is a processing step that requests one I/O layer, or logical grouping of layers. The layer specifies the operations that are performed on the data as it is passed between the user and the I/O device. A layer refers to the specific type of processing being done. In some cases, the name corresponds directly to the name of one layer. In other cases, however, specifying one layer invokes the routines used to pass the data through multiple layers. See the intro_ffio(3f) man page for details about using the -F option to the assign command.
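As a minimal sketch of how such a comma-separated specification list is put together, the following shell fragment joins layer names in the order given and prints the resulting assign command instead of running it, since assign(1) may not be available where the script is tried out. The join_specs function is a hypothetical helper introduced only for illustration; the cos,mr list and the fort.1 file name are example values.

```shell
#!/bin/sh
# join_specs SPEC...: join FFIO specs with commas, in the order given.
# Hypothetical helper for illustration; it is not part of assign(1).
join_specs() {
    out=""
    for spec in "$@"; do
        if [ -z "$out" ]; then
            out="$spec"
        else
            out="$out,$spec"
        fi
    done
    printf '%s\n' "$out"
}

# Print the command rather than executing it.
printf 'assign -F %s %s\n' "$(join_specs cos mr)" "fort.1"
```

Keeping the list construction in one place makes it easier to add or drop a layer in a script without retyping the whole specification.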
Processing steps are ordered as if the -F side (the left side) is the user and the system/device is the right side, as in the following example:

    assign -F user,bufa,system

With this specification, a WRITE operation first performs the user operation on the data, then performs the bufa operation, and then sends the data to the system. In a READ operation, the process is performed from right to left. The data moves from the system to the user. The layers closest to the user are higher-level layers; those closer to the system are lower-level layers.

The FFIO system has an internal model of the world of data, which it maps to any given actual logical file type. Four of these concepts are basic to understanding the inner workings of the layers.

Concept              Definition
Data                 Data is a stream of bits.
Record marks         End-of-record (EOR) marks are boundaries between logical records.
File marks           End-of-file (EOF) marks are special types of record marks that exist in some file formats.
End-of-data (EOD)    An end-of-data (EOD) is a point immediately beyond the last data bit, EOR, or EOF in the file.

All files are streams of 0 or more bits that may contain record and/or file marks.

Individual layers have varying rules about which of these things can appear and in which order they can appear in a file.

Fortran programmers and C programmers can use the capabilities described in this document. Fortran users can use the assign(1) command to specify these FFIO options. For C users, the FFIO layers are available only to programs that call the FFIO routines directly (ffopen(3c), ffread(3c), and ffwrite(3c)).

You can use FFIO with the Fortran I/O forms listed in the following table. For each form, the equivalent assign command is shown.
Fortran I/O Form             Equivalent assign Command
Buffer I/O                   assign -F f77
Unformatted sequential       assign -F f77
Unformatted direct access    assign -F cache
Formatted sequential         assign -F text
Namelist                     assign -F text
List-directed                assign -F text

15.2 Using Layered I/O

The specification list on the assign -F command comprises all of the processing steps that the I/O system performs. If assign -F is specified, any default processing is overridden. For example, unformatted sequential I/O is assigned a default structure of f77 on UNICOS/mp and UNICOS/lc systems. The -F f77 option provides the same structure. The FFIO system provides detailed control over I/O processing requests. However, to effectively use the f77 option (or any FFIO option), you must understand the I/O processing details.

As a very simple example, suppose you were making large I/O requests and did not require buffering or blocking on your data. You could specify:

    assign -F system

The system layer is a generic system interface that chooses an appropriate layer for your file. If the file is on a disk, it chooses the syscall layer, which maps each user I/O request directly to the corresponding system call. A Fortran READ statement is mapped to one or more read(2) system calls and a Fortran WRITE statement to one or more write(2) system calls.

If you want your file to be F77 blocked (the default blocking for Fortran unformatted I/O on UNICOS/mp and UNICOS/lc systems), you can specify:

    assign -F f77

If you want your file to be COS blocked, you can specify:

    assign -F cos

Note: In all assign -F specifications, the system layer is the implied last layer. The above example is functionally identical to assign -F cos,system.

These two specs request that each WRITE request first be blocked (blocking adds control words to the data in the file to delimit records). The blocking layer (cos in this example) then sends the blocked data to the system layer. The system layer passes the data to the device.
The process is reversed for READ requests. The system layer retrieves blocked data from the file. The blocked data is passed to the next higher layer, the blocking layer (cos in this example), where it is deblocked. The deblocked data is then presented to the user.

15.2.1 I/O Layers

Several different layers are available for the spec argument. Each layer invokes one or more layers, which then handle the data they are given in an appropriate manner. For example, the syscall layer essentially passes each request to an appropriate system call. The mr layer tries to hold an entire file in a buffer that can change size as the size of the file changes; it also limits actual I/O to lower layers so that I/O occurs only at open, close, and overflow.

Table 27 defines the classes you can specify for the spec argument to the assign -F option. For detailed information about each layer, see Chapter 16, page 311.

Table 27. FFIO Layers

Layer             Function
bufa              Asynchronous buffering layer
cache             Memory-cached I/O
cachea            Asynchronous memory-cached I/O
cos or blocked    COS blocking. This is the default for Fortran sequential unformatted I/O on UNICOS and UNICOS/mk systems.
event             I/O monitoring layer
f77               FORTRAN 77/UNIX Fortran record blocking. This is the default for Fortran sequential unformatted I/O on UNICOS/mp and UNICOS/lc systems and the common blocking format used by most FORTRAN 77 compilers on UNIX systems.
fd                File descriptor open
global            Distributed cache layer for MPI, SHMEM, OpenMP, and Co-array Fortran
ibm               IBM file formats
mr                Memory-resident file handlers
null              Syntactic convenience for users (does nothing)
site              User-defined site-specific layer
syscall           System call I/O
system            Generic system interface
text              Newline separated record formats
user              User-defined layer
vms               VAX/VMS file formats

15.2.2 Layered I/O Options

You can modify the behavior of each I/O layer.
The following spec format shows how you can specify a class and one or more opt and num fields:

    class.opt1.opt2:num1:num2:num3

For class, you can specify one of the layers listed in Table 27. Each layer has a different set of options and numeric parameter fields that can be specified. This is necessary because each layer performs different duties. The following rules apply to the spec argument:

• The class and opt fields are case-insensitive. For example, the following two specs are identical:

    Ibm.VBs:100:200
    IBM.vbS:100:200

• The opt and num fields are usually optional, but sufficient separators must be specified as placeholders to eliminate ambiguity. For example, the following specs are identical:

    cos..::40
    cos.::40
    cos::40

  In this example, opt1, opt2, num1, and num2 can assume default values.

• To specify more than one spec, use commas between specs. Within each spec, you can specify more than one opt and num. Use periods between opt fields, and use colons between num fields.

The following options all have the same effect. They all specify the vms layer and set the initial allocation to 100 blocks:

    -F vms:100
    -F vms.:100
    -F vms..:100

The following option contains one spec for a vms layer that has an opt field of scr (which requests scratch file behavior):

    -F vms.scr

The following option requests two classes with no opts:

    -F f77,vms

The following option contains two specs and requests two layers: cos and vms. The cos layer has no options; the vms layer has options scr and ovfl, which specify that the file is a scratch file that is allowed to overflow and that the maximum allocation is 1000 sectors:

    -F cos,vms.scr.ovfl::1000

When possible, the default settings of the layers are set so that optional fields are seldom needed.

15.3 FFIO and Common Formats

This section describes the use of FFIO with common file structures and the correlation between the common or default file structures and the FFIO usage that handles them.
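The spec syntax from Section 15.2.2 (class.opt1.opt2:num1:num2:num3) can be illustrated with a small shell sketch that splits one spec into its class, opt, and num parts and lowercases the case-insensitive fields. The describe_spec function is a hypothetical helper written for this illustration; it only manipulates strings and does not reproduce how the FFIO library itself parses or validates a spec.

```shell
#!/bin/sh
# describe_spec SPEC: split one FFIO spec into class, opts and nums.
# Hypothetical helper for illustration only; it performs no validation.
describe_spec() {
    spec=$1
    head=${spec%%:*}               # text before the first colon: class and opts
    case $spec in
        *:*) nums=${spec#*:} ;;    # text after the first colon: num fields
        *)   nums="" ;;
    esac
    class=${head%%.*}              # text before the first period
    case $head in
        *.*) opts=${head#*.} ;;    # period-separated opt fields
        *)   opts="" ;;
    esac
    # The class and opt fields are case-insensitive, so normalize them.
    class=$(printf '%s' "$class" | tr '[:upper:]' '[:lower:]')
    opts=$(printf '%s' "$opts" | tr '[:upper:]' '[:lower:]')
    printf 'class=%s opts=%s nums=%s\n' "$class" "$opts" "$nums"
}

describe_spec 'Ibm.VBs:100:200'
describe_spec 'vms.scr.ovfl::1000'
```

In the second call the nums part is ":1000": the empty field before the colon is the placeholder for num1, so 1000 is read as num2.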
15.3.1 Reading and Writing Text Files

Use the fdcp command to copy files while converting record blocking.

Most human-readable files are in text format; this format contains records comprised of ASCII characters with each record terminated by an ASCII line-feed character, which is the newline character in UNIX terminology. The FFIO specification that selects this file structure is assign -F text.

The FFIO package is seldom required to handle text files. In the following types of cases, however, using FFIO may be necessary:

• Optimizing text file access to reduce I/O wait time
• Handling multiple EOF records in text files
• Converting data files to and from other formats

I/O speed is important when optimizing text file access. Using assign -F text is expensive in terms of processor time, but it lets you use memory-resident files, which can reduce or eliminate I/O wait time.

The FFIO system also can process text files that have embedded EOF records. The ~e string alone in a text record is used as an EOF record. Editors such as sed(1) and other standard utilities can process these files, but it is sometimes easier with the FFIO system.

The text layer is also useful in conjunction with the fdcp(1) command. The text layer provides a standard output format. Many forms of data that are not considered foreign are sometimes encountered in a heterogeneous computing environment. If a record format can be described with an FFIO specification, it can usually be converted to text format by using the following script:

    OTHERSPEC=$1
    INFILE=$2
    OUTFILE=$3
    assign -F ${OTHERSPEC} ${INFILE}
    assign -F text ${OUTFILE}
    fdcp ${INFILE} ${OUTFILE}

If the name of the script is to.text, you can invoke it as follows:

    % to.text cos data_cos data_text

15.3.2 Reading and Writing Unblocked Files

The simplest data file format is the binary stream or unblocked data. It contains no record marks, file marks, or control words.
This is usually the fastest way to move large amounts of data, because it involves a minimal amount of processor and system overhead.

The FFIO package provides several layers designed specifically to handle a binary stream of data. These layers are syscall, mr, bufa, cache, cachea, and global. These layers behave the same from the user's perspective; they only use different system resources. The unblocked binary stream is usually used for unformatted data transfer. It is not usually useful for text files or when record boundaries or backspace operations are required. The complete burden is placed on the application to know the format of the file and the structure and type of the data contained in it.

This lack of structure also allows flexibility; for example, a file declared with one of these layers can be manipulated as a direct-access file with any record length.

In this context, fdcp can be called to do the equivalent of the cp(1) command only if the input file is a binary stream and to remove blocking information only if the output file is a binary stream.

15.3.3 Reading and Writing Fixed-length Records

The most common use for fixed-length record files is for Fortran direct access. Both unformatted and formatted direct-access files use a form of fixed-length records. The simplest way to handle these files with the FFIO system is with binary stream layers, such as system, syscall, cache, cachea, global, and mr. These layers allow any requested pattern of access and also work with direct-access files. The syscall and system layers, however, are unbuffered and do not give optimal performance for small records.

The FFIO system also directly supports some fixed-length record formats.

15.3.4 Reading and Writing Blocked Files

The f77 blocking format is the default file structure for all Fortran sequential unformatted files. The f77 layer is provided to handle these files.
The f77 layer is the default file structure on Cray X1 and X2 systems. If you specify another layer, such as mr, you may have to specify an f77 layer to get f77 blocking.

15.4 Enhancing Performance

FFIO can be used to enhance performance in a program without changing or recompiling the source code. This section describes some basic techniques used to optimize I/O performance. Additional optimization options are discussed in Chapter 16, page 311.

15.4.1 Buffer Size Considerations

In the FFIO system, buffering is the responsibility of the individual layers; therefore, you must understand the individual layers in order to control the use and size of buffers.

The cos layer has high payoff potential to the user who wants to extract top performance by manipulating buffer sizes. As the following example shows, the cos layer accepts a buffer size as the first numeric parameter:

    assign -F cos:42 u:1

If the buffer is sufficiently large, the cos layer also lets you keep an entire file in the buffer and avoid almost all I/O operations.

15.4.2 Removing Blocking

I/O optimization usually consists of reducing overhead. One part of the overhead in doing I/O is the processor time spent in record blocking. For many files in many programs, this blocking is unnecessary. If this is the case, the FFIO system can be used to deselect record blocking and thus obtain appropriate performance advantages.

The following layers offer unblocked data transfer:

Layer      Definition
syscall    System call I/O
bufa       Buffering layer
cachea     Asynchronous cache layer
cache      Memory-resident buffer cache
global     SHMEM and MPI cache layer
mr         Memory-resident (MR) I/O

You can use any of these layers alone for any file that does not require the existence of record boundaries. This includes any applications that are written in C that require a byte stream file.
15.4.2.1 The syscall Layer

The syscall layer offers a simple, direct system interface with a minimum of system and library overhead. If requests are larger than approximately 64 K, this method can be appropriate.

15.4.2.2 The bufa and cachea Layers

The bufa and cachea layers permit efficient file processing. Both layers provide asynchronous buffering managed by the library, and the cachea layer allows recently accessed parts of a file to be cached in memory.

The number of buffers and the size of each buffer are tunable. In the bufa:bs:nbufs or cachea:bs:nbufs FFIO specifications, the bs argument specifies the size in 4096-byte blocks of each buffer. The default depends on the st_blksize field returned from a stat(2) system call of the file; if this return value is 0, the default is 8 for all files. The nbufs argument specifies the number of buffers to use. bufa defaults to 2 buffers, while cachea defaults to 512 buffers.

15.4.2.3 The mr Layer

The mr layer lets you use main memory as an I/O device for many files. Used in combination with the other layers, cos blocked files, text files, and direct-access files can all reside in memory without recoding. This can result in excellent performance for a file, or part of a file, that can reside in memory.

The mr layer features both scr and save mode, and it directs overflow to the next lower layer automatically.

The assign -F command specifies the entire set of processing steps that are performed when I/O is requested. If a file is blocked, you must specify the appropriate layer for the handling of block and record control words as in the following examples:

    assign -F f77,mr u:1
    assign -F cos,mr fort.1

Section 15.5, page 307 contains several mr program examples.

15.4.2.4 The global Layer (Deferred Implementation)

The global layer is a caching layer that distributes data across multiple SHMEM or MPI processes.
Open and close operations require participation by all processes that access the file; all other operations are performed independently by one or more processes. File positions can be private to a process or global to all processes. You can specify both the cache size and the number of cache pages to use. Since this layer is used by parallel processes, the actual number of cache pages used is the number specified times the number of processes.

15.4.2.5 The cache Layer

The cache layer permits efficient file processing for repeated access to one or more regions of a file. It is a library-managed buffer cache that contains a tunable number of pages of tunable size.

To specify the cache layer, use the following option:

    assign -F cache[:[bs][:[nbufs]]]

The bs argument specifies the size in 4096-byte blocks of each cache page; the default is 16. The nbufs argument specifies the number of cache pages to use. The default is 4. You can achieve improved I/O performance by using one or more of the following strategies:

• Use a cache page size that is a multiple of the user's record size. This ensures that no user record straddles two cache pages. If this is not possible or desirable, it is best to allocate a few additional cache pages (nbufs).

• Use a number of cache pages that is greater than or equal to the number of file regions the code accesses at one time.

If the number of regions accessed within a file is known, the number of cache pages can be chosen first. To determine the cache page size, divide the amount of memory to be used by the number of cache pages.
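The division described above, together with the advice to make the page size a multiple of the record size, can be sketched with shell integer arithmetic. The page_blocks and round_up_blocks functions are hypothetical helpers, and the inputs (4,000,000 bytes of buffer memory, 10 cache pages, a record that occupies 10 blocks, unit 11) are assumed example values; the sketch prints an assign command rather than running one.

```shell
#!/bin/sh
BLOCK_BYTES=4096    # size of one block for the bs argument

# page_blocks MEM_BYTES NPAGES: whole 4096-byte blocks available per cache page.
page_blocks() {
    echo $(( $1 / $2 / $BLOCK_BYTES ))
}

# round_up_blocks BLOCKS REC_BLOCKS: smallest multiple of REC_BLOCKS >= BLOCKS.
round_up_blocks() {
    echo $(( ( ($1 + $2 - 1) / $2 ) * $2 ))
}

bs=$(page_blocks 4000000 10)       # memory divided by pages, in blocks
bs=$(round_up_blocks "$bs" 10)     # make the page a multiple of the record size
echo "assign -F cache:${bs}:10 u:11"
```

Rounding up rather than down keeps each record inside a single cache page, at the cost of slightly more memory than the original budget.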
For example, suppose a program uses direct access to read 10 vectors from a file and then writes the sum to a different file:

      integer VECTSIZE, NUMCHUNKS, CHUNKSIZE
      parameter(VECTSIZE=1000*512)
      parameter(NUMCHUNKS=100)
      parameter(CHUNKSIZE=VECTSIZE/NUMCHUNKS)
      real a(CHUNKSIZE), sum(CHUNKSIZE)
      open(11,access='direct',recl=CHUNKSIZE*8)
      call asnunit (2,'-s unblocked',ier)
      open (2,form='unformatted')
      do i = 1,NUMCHUNKS
         sum = 0.0
         do j = 1,10
            read(11,rec=(j-1)*NUMCHUNKS+i)a
            sum=sum+a
         enddo
         write(2) sum
      enddo
      end

If 4 MB of memory are allocated for buffers for unit 11, 10 cache pages should be used, each of the following size:

    4MB/10 = 400000 bytes = 97 blocks

Make the buffer size an even multiple of the record length of 40960 bytes (CHUNKSIZE*8) by rounding it up to 100 blocks (= 409600 bytes), then use the following assign command:

    assign -F cache:100:10 u:11

15.5 Sample Programs

The following examples illustrate the use of the mr layers.

Example 6: Unformatted direct mr with unblocked file

In the following example, batch job ex8 contains a program that uses unformatted direct-access I/O with an mr layer:

    #QSUB -r ex8 -lT 10 -lQ 500000
    #QSUB -eo -o ex8.out
    date
    set -x
    cd $TMPDIR
    cat > ex8.f <

= num1 + 8
vb     9    32,760    32,760    Must be >= num1 + 8
vbs    9    32,760    32,760

Table 42. Data Manipulation: ibm Layer

Granularity                8 bits
Data model                 Record
Truncate on write          No for f and fb records. Yes for v, vb, and vbs records.
Implementation strategy    f records for f and fb. v records for u, v, vb, and vbs.

FFIO Layer Reference [16]

Table 43. Supported Operations: ibm Layer

Supported operations        Required of next lower level?
Operation    Supported         Comments                                 Used    Comments
ffopen       Yes                                                        Yes
ffread       Yes                                                        Yes
ffreadc      Yes                                                        No
ffwrite      Yes                                                        Yes
ffwritec     Yes                                                        No
ffclose      Yes                                                        Yes
ffflush      Yes                                                        No
ffweof       Passed through                                             Yes
ffweod       Yes                                                        Yes
ffseek       Yes               seek(fd, 0, 0) only (equals rewind)      Yes     seek(fd,0,0) only
ffbksp       No                                                         No

16.11 The mr Layer

The memory-resident (mr) layer lets users declare that all or part of a file will reside in memory. This can improve performance for relatively small files that are heavily accessed or for larger files where the first part of the file is heavily accessed (for example, a file which contains a frequently updated directory at the beginning). The mr layer tries to allocate a buffer large enough to hold the entire file.

Note: It is generally more advantageous to configure the layer preceding the mr layer to make the file buffer-resident, assuming that layer can support buffers of sufficient size.

The options are as follows:

    mr[.type[.subtype]]:num1:num2:num3

The keyword syntax is as follows:

    mr[.type[.subtype]][.start_size=num1][.max_size=num2][.inc_size=num3]

The type field specifies whether the file in memory is intended to be saved or is considered a scratch file. This argument accepts the following values:

Value    Definition
save     Default. The file is loaded into memory when opened and written back to the next lower layer when closed. The save option also modifies the behavior of overflow processing.
scr      Scratch file. The file is not read into memory when opened and not written when closed.

The subtype field specifies the action to take when the data can no longer fit in the allowable memory space. It accepts the following values:

Value    Definition
ovfl     Default. Data which does not fit (overflows) the maximum specified memory allocation is written to the next lower layer, which is typically a disk file. An informative message is written to stderr on the first overflow.
ovflnomsg    Identical to ovfl, except that no message is issued when the data overflows the memory-resident buffer.
novfl        If data does not fit in memory, then subsequent write(1) operations fail.

The num1, num2, and num3 fields are nonnegative integer values that state the number of 4096-byte blocks to use in the following circumstances:

Field    Definition
num1     The initial size of the memory allocation, specified in 4,096-byte blocks. The default is 0.
num2     The maximum size of the memory allocation, specified in 4,096-byte blocks. The default is either num1 or 256 blocks (1 MB), whichever is larger.
num3     Increment size of the memory allocation, specified in 4,096-byte blocks. This value is used when allocating additional memory space. The default is 256 blocks (1 MB) or (num2-num1), whichever is smaller.

The num1 and num3 fields represent best-effort values. They are intended for tuning purposes and usually do not cause failure if they are not satisfied precisely as specified. For example, if the available memory space is only 100 blocks and the chosen num3 value is 200 blocks, growth is allowed to use the 100 available blocks rather than failing to grow, because the full 200 blocks requested for the increment are unavailable.

Caution: When using the mr layer, you must ensure that the size of the memory-resident portions of the files is limited to reasonable values. Unrestrained and unmanaged growth of such file portions can cause heap fragmentation, exhaustion of all available memory, and program abort. If this growth has consumed all available memory, the program may not abort gracefully, making such a condition difficult to diagnose.

Large memory-resident files may reduce I/O performance for sites that provide memory scheduling that favors small processes over large processes. Check with your system administrator if I/O performance is diminished.

Increment sizes which are too small can also contribute to heap fragmentation.
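The default rules for num2 and num3 stated above can be written out as a short shell sketch. The mr_defaults function is a hypothetical helper that only restates those rules as arithmetic on block counts; it does not query the FFIO library, and the example arguments are assumed values.

```shell
#!/bin/sh
# mr_defaults [NUM1 [NUM2]]: print "num1 num2 num3" in 4096-byte blocks,
# filling in the documented defaults for any field that is not given.
# Hypothetical helper for illustration only.
mr_defaults() {
    num1=${1:-0}                   # initial size defaults to 0
    num2=${2:-}
    if [ -z "$num2" ]; then
        # Maximum size: the larger of num1 and 256 blocks (1 MB).
        if [ "$num1" -gt 256 ]; then num2=$num1; else num2=256; fi
    fi
    # Increment: the smaller of 256 blocks (1 MB) and (num2 - num1).
    diff=$(( num2 - num1 ))
    if [ "$diff" -lt 256 ]; then num3=$diff; else num3=256; fi
    echo "$num1 $num2 $num3"
}

mr_defaults            # no fields given
mr_defaults 100        # only num1 given
mr_defaults 0 1000     # num1 and num2 given
```

Working the defaults out by hand like this can help when choosing explicit values that keep the memory-resident portion of a file within a known bound.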
Memory allocation is done by using the malloc(3c) and realloc(3c) library routines. The file space in memory is always allocated contiguously. When allocating new chunks of memory space, the num3 argument is used in conjunction with realloc as a minimum first try for reallocation.

Table 44. Data Manipulation: mr Layer

Primary function     Avoid I/O to the extent possible, by holding the file in memory.
Granularity          8 bits
Data model           Stream (mimics UNICOS/mp system calls)
Truncate on write    No

Table 45. Supported Operations: mr Layer

             Supported operations                                            Required of next lower level?
Operation    Supported    Comments                                           Used    Comments
ffopen       Yes                                                             Yes     Sometimes delayed until overflow
ffread       Yes                                                             Yes     Only on open
ffreadc      Yes                                                             No
ffwrite      Yes                                                             Yes     Only on close, overflow
ffwritec     Yes                                                             No
ffclose      Yes                                                             Yes
ffflush      Yes          No-op                                              No
ffweof       No           No representation                                  No      No representation
ffweod       Yes                                                             Yes
ffseek       Yes          Full support (absolute, relative, and from end)    Yes     Used in open and close processing
ffbksp       No           No records                                         No

16.12 The null Layer

The null layer is a syntactic convenience for users; it has no effect. This layer is commonly used to simplify the writing of a shell script when a shell variable is used to specify an FFIO layer specification. For example, the following line is from a shell script that assigns a file on which overlying blocking is expected (as specified by BLKTYP):

    assign -F $BLKTYP,cos fort.1

If BLKTYP is undefined, the illegal specification list ,cos results. The existence of the null layer lets the programmer set BLKTYP to null as a default, and simplify the script, as in:

    assign -F null,cos fort.1

This is identical to the following command:

    assign -F cos fort.1

When used as the last layer above the system or syscall layer, the null layer supports the assign -B option to enable or disable direct I/O.
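One way a script might apply the null default for BLKTYP is the shell's parameter-default expansion, sketched below. The blocking_spec function is a hypothetical helper, vms.v.tape is only an example value for BLKTYP, and the command is printed rather than executed so that the sketch runs on systems without assign(1).

```shell
#!/bin/sh
# blocking_spec: build the spec list, falling back to the null layer
# when BLKTYP is unset or empty. Hypothetical helper for illustration.
blocking_spec() {
    printf '%s,cos\n' "${BLKTYP:-null}"
}

unset BLKTYP
echo "assign -F $(blocking_spec) fort.1"    # BLKTYP not set: null,cos

BLKTYP=vms.v.tape
echo "assign -F $(blocking_spec) fort.1"    # BLKTYP set: vms.v.tape,cos
```

With the default in place, the script never produces the illegal list ,cos, whether or not the caller exports BLKTYP.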
16.13 The syscall Layer

The syscall layer directly maps each request to an appropriate system call. The layer does not accept any options.

Table 46. Data Manipulation: syscall Layer

Granularity          8 bits (1 byte)
Data model           Stream (UNICOS/mp system calls)
Truncate on write    No

Table 47. Supported Operations: syscall Layer

Operation    Supported    Comments
ffopen       Yes          open
ffread       Yes          read
ffreadc      Yes          read plus code
ffwrite      Yes          write
ffwritec     Yes          write plus code
ffclose      Yes          close
ffflush      Yes          None
ffweof       No           None
ffweod       Yes          trunc(2)
ffseek       Yes          lseek(2)
ffbksp       No

Lower-level layers are not allowed.

16.14 The system Layer

The system layer is implicitly appended to all specification lists, if not explicitly added by the user (unless the syscall or fd layer is specified). It maps requests to appropriate system calls.

For a description of options, see the syscall layer. Lower-level layers are not allowed.

16.15 The text Layer

The text layer performs text blocking by terminating each record with a newline character. It can also recognize and represent the EOF mark. The text layer is used with character files and does not work with binary data. The general specification follows:

    text[.type]:[num1]:[num2]

The keyword syntax is as follows:

    text[.type][.newline=num1][.bufsize=num2]

The type field can have one of the following values:

Value    Definition
nl       Newline-separated records.
eof      Newline-separated records with a special string such as ~e. More than one EOF in a file is allowed.

The num1 field is the decimal value of a single character that represents the newline character. The default value is 10 (octal 012, ASCII line feed).

The num2 field specifies the working buffer size (in decimal bytes). If any lower-level layers are record oriented, this is also the block size.

Table 48. Data Manipulation: text Layer

Granularity          8 bits
Data model           Record
Truncate on write    No

Table 49.
Supported Operations: text Layer

             Supported operations          Required of next lower level?
Operation    Supported         Comments    Used    Comments
ffopen       Yes                           Yes
ffread       Yes                           Yes
ffreadc      Yes                           No
ffwrite      Yes                           Yes
ffwritec     Yes                           No
ffclose      Yes                           Yes
ffflush      Yes                           No
ffweof       Passed through                Yes     Only if explicitly requested
ffweod       Yes                           Yes
ffseek       Yes                           Yes
ffbksp       No                            No

16.16 The user and site Layers

The user and site layers let users and site administrators build user-defined or site-specific layers to meet special needs. The syntax follows:

    user[num1]:[num2]
    site:[num1]:[num2]

Open processing passes the num1 and num2 arguments to the layer, and the layer interprets them. See Chapter 17, page 337 for an example of how to create a user FFIO layer.

16.17 The vms Layer

The vms layer handles record blocking for three common record types on VAX/VMS operating systems. The general format of the specification follows:

    vms.[type.subtype]:[num1]:[num2]

The following is the alternate keyword syntax for this layer:

    vms.[type.subtype][.recsize=num1][.mbs=num2]

The following type values are supported:

Value    Definition
f        VAX/VMS fixed-length records
v        VAX/VMS variable-length records
s        VAX/VMS variable-length segmented records

In addition to the record type, you must specify a record subtype, which has one of the following four values:

Value    Definition
bb       Format used for binary blocked transfers
disk     Same as binary blocked
tr       Transparent format, for files transferred as a bit stream to and from the VAX/VMS system
tape     VAX/VMS labeled tape

The num1 field is the maximum record size that may be read or written. It is ignored by the s record type.

Table 50.
Values for Record Size: vms Layer

Field     Minimum    Maximum    Default    Comments
v.bb      1          32,767     32,767
v.tape    1          9995       2043
v.tr      1          32,767     2044
s.bb      1          None       None       No maximum record size
s.tape    1          None       None       No maximum record size
s.tr      1          None       None       No maximum record size

The num2 field is the maximum segment or block size that is allowed on input and is produced on output. For vms.f.tr and vms.f.bb, num2 should be equal to the record size (num1). Because vms.f.tape places one or more records in each block, vms.f.tape num2 must be greater than or equal to num1.

Table 51. Values for Maximum Block Size: vms Layer

Field     Minimum    Maximum    Default    Comments
v.bb      1          32,767     32,767
v.tape    6          32,767     2,048
v.tr      3          32,767     32,767     N/A
s.bb      5          32,767     2,046
s.tape    7          32,767     2,048
s.tr      5          32,767     2,046      N/A

For vms.v.bb and vms.v.disk records, num2 is a limit on the maximum record size. For vms.v.tape records, it is the maximum size of a block on tape; more specifically, it is the maximum size of a record that will be written to the next lower layer. If that layer is tape, num2 is the tape block size. If it is cos, it will be a COS record that represents a tape block. One or more records are placed in each block.

For segmented records, num2 is a limit on the block size that will be produced. No limit on record size exists. For vms.s.tr and vms.s.bb, the block size is an upper limit on the size of a segment. For vms.s.tape, one or more segments are placed in a tape block. It functions as an upper limit on the size of a segment and a preferred tape block size.

Table 52. Data Manipulation: vms Layer

Granularity                8 bits
Data model                 Record
Truncate on write          No for f records. Yes for v and s records.
Implementation strategy    f records for f formats. v records for v formats.

Table 53. Supported Operations: vms Layer

             Supported operations                                                        Required of next lower level?
Operation    Supported                 Comments                                          Used    Comments
ffopen       Yes                                                                         Yes
ffread       Yes                                                                         Yes
ffreadc      Yes                                                                         No
ffwrite      Yes                                                                         Yes
ffwritec     Yes                                                                         No
ffclose      Yes                                                                         Yes
ffflush      Yes                                                                         No
ffweof       Yes and passed through    Yes for s records; passed through for others      Yes     Only if explicitly requested
ffweod       Yes                                                                         Yes
ffseek       Yes                       seek(fd,0,0) only (equals rewind)                 Yes     seek(fd,0,0) only
ffbksp       No                                                                          No

Creating a user Layer [17]

This chapter explains some of the internals of the FFIO system and explains the ways in which you can put together a user or site layer.

17.1 Internal Functions

The FFIO system has an internal model of data that maps to any given actual logical file type based on the following concepts:

• Data is a stream of bits. Layers must declare their granularity by using the fffcntl(3c) call.
• Record marks are boundaries between logical records.
• End-of-file (EOF) marks are a special type of record that exists in some file structures.
• End-of-data (EOD) is a point immediately beyond the last data bit, EOR, or EOF in the file. You cannot read past or write after an EOD. In a case when a file is positioned after an EOD, a write operation (if valid) immediately moves the EOD to a point after the last data bit, end-of-record (EOR), or EOF produced by the write.

All files are streams that contain zero or more data bits that may contain record or file marks.

No inherent hierarchy or ordering is imposed on the file structures. Any number of data bits or EOR and EOF marks may appear in any order. The EOD, if present, is by definition last. Given the EOR, EOF, and EOD return statuses from read operations, only EOR may be returned along with data. When data bits are immediately followed by EOF, the record is terminated implicitly.

Individual layers can impose restrictions for specific file structures that are more restrictive than the preceding rules. For instance, in COS blocked files, an EOR always immediately precedes an EOF.
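The ordering rules above can be expressed as a small check over a sequence of items. The following is a minimal sketch, not FFIO code: the enum item type and both function names are hypothetical and exist only to illustrate the general rule (EOD, if present, is last) and one stricter layer-specific rule (the COS restriction on EOF).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical item kinds for modelling a file as a sequence of marks. */
enum item { ITEM_DATA, ITEM_EOR, ITEM_EOF, ITEM_EOD };

/*
 * General rule from the text: data, EOR, and EOF may appear in any
 * order, but an EOD, if present, must be the last item.
 * Returns 1 if the sequence obeys that rule, 0 otherwise.
 */
int model_is_valid(const enum item *seq, size_t n)
{
    size_t i;

    assert(seq != NULL || n == 0);
    for (i = 0; i < n; i++)
        if (seq[i] == ITEM_EOD && i + 1 != n)
            return 0;
    return 1;
}

/*
 * Example of a stricter per-layer rule: in COS blocked files an EOR
 * always immediately precedes an EOF.
 */
int cos_is_valid(const enum item *seq, size_t n)
{
    size_t i;

    if (!model_is_valid(seq, n))
        return 0;
    for (i = 0; i < n; i++)
        if (seq[i] == ITEM_EOF && (i == 0 || seq[i - 1] != ITEM_EOR))
            return 0;
    return 1;
}
```

A sequence of data, EOF, EOD passes the general check but fails the COS check, because the data is terminated only implicitly by the EOF.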
Successful mappings were used for all logical file types supported, except formats that have more than one type of partitioning for files (such as end-of-group or more than one level of EOF). For example, some file formats have level numbers in the partitions. FFIO maps level 017 to an EOF. No other handling is provided for these level numbers.

Internally, there are two main protocol components: the operations and the stat structure.

17.1.1 The Operations Structure

Many of the operations try to mimic the UNICOS/mp and UNICOS/lc system calls. In the man pages for ffread(3c), ffwrite(3c), and others, the calls can be made without the optional parameters and appear like the system calls. Internally, all parameters are required.

Table 54 provides a brief synopsis of the interface routines that are supported at the user level. Each of these ff entry points checks the parameters and issues the corresponding internal call. Each interface routine provides defaults and dummy arguments for those optional arguments that the user does not provide.

Each layer must have an internal entry point for all of these operations, although in some cases the entry point may simply issue an error or do nothing. For example, the syscall layer uses _ff_noop for the ffflush entry point because it has no buffer to flush, and it uses _ff_err2 for the ffweof entry point because it has no representation for EOF. No optional parameters for calls to the internal entry points exist. All arguments are required.

Table 54 lists the variables for the internal entry points and the variable definitions. An internal entry point must be provided for all of these operations:

Table 54. C Program Entry Points

Variable    Definition
fd          The FFIO pointer (struct fdinfo *)fd.
file        A char* file.
flags       File status flag for open, such as O_RDONLY.
buf         Bit pointer to the user data.
nb          Number of bytes.
ret         The status returned; >=0 is valid, <0 is error.
stat        A pointer to the status structure.
fulp        The value FULL or PARTIAL defined in ffio.h for full or partial-record mode.
&ubc        A pointer to the unused bit count; this ranges from 0 to 7 and represents the bits not used in the last byte of the operation. It is used for both input and output.
pos         A byte position in the file.
opos        The old position of the file, just like the system call.
whence      The same as the syscall.
cmd         The command request to the fffcntl(3c) call.
arg         A generic pointer to the fffcntl argument.
mode        Bit pattern denoting file's access permissions.
argp        A pointer to the input or output data.
len         The length of the space available at argp. It is used primarily on output to avoid overwriting the available memory.

17.1.2 FFIO and the stat Structure

The stat structure contains four fields in the current implementation. They mimic the iows structure of the UNICOS/mp and UNICOS/lc ASYNC syscalls to the extent possible. All operations are expected to update the stat structure on each call. The SETSTAT and ERETURN macros are provided in the ffio.h file for this purpose.

The fields in the stat structure are as follows:

Status field     Description
stat.sw_flag     0 indicates outstanding; 1 indicates I/O complete.
stat.sw_error    0 indicates no error; otherwise, the error number.
stat.sw_count    Number of bytes transferred in this request. This number is rounded up to the next integral value if a partial byte is transferred.
stat.sw_stat     This tells the status of the I/O operation. The FFSTAT(stat) macro accesses this field. The following values are valid:

                 FFBOD: At beginning-of-data (BOD).
                 FFCNT: Request terminated by count (either the count of bytes before EOF or EOD in the file or the count of the request).
                 FFEOR: Request termination by EOR or a full record mode read was processed.
                 FFEOF: EOF encountered.
                 FFEOD: EOD encountered.
                 FFERR: Error encountered.
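As a rough illustration of how a count and a status travel together, the following sketch simulates reads on a byte stream that ends in an EOD. Everything here is hypothetical: the sk_ names are stand-ins invented for this example, and the real status values and macros come from ffio.h. The sketch assumes that a read which reaches the end of the data is terminated by count, and that the end-of-data status is reported by a later read that transfers nothing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for two sw_stat values; not the ffio.h constants. */
enum sk_stat { SK_CNT, SK_EOD };

struct sk_stream {
    size_t size;   /* total bytes in the simulated file */
    size_t pos;    /* current read position             */
};

/*
 * Simulated read on a byte stream: returns the number of bytes
 * "transferred" and stores the status in *st.  Data is returned with
 * SK_CNT; SK_EOD is reported only by a read that transfers nothing.
 */
size_t sk_read(struct sk_stream *s, size_t nbytes, enum sk_stat *st)
{
    size_t left;

    assert(s != NULL && st != NULL && s->pos <= s->size);
    left = s->size - s->pos;
    if (left == 0) {
        *st = SK_EOD;
        return 0;
    }
    if (nbytes > left)
        nbytes = left;
    s->pos += nbytes;
    *st = SK_CNT;
    return nbytes;
}

/* A byte count for a bit transfer: a partial byte rounds up. */
size_t sk_bits_to_count(size_t nbits)
{
    return (nbits + 7) / 8;
}
```

With a 100-byte stream, a request for 500 bytes yields a count of 100 with the count status, and the following request yields a count of 0 with the end-of-data status.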
If count is satisfied simultaneously with EOR, the FFEOR is returned.

The EOF and EOD status values must never be returned with data. This means that if a byte-stream file is being traversed and the file contains 100 bytes and then an EOD, a read of 500 bytes will return with a stat value of FFCNT and a return byte count of 100. The next read operation returns FFEOD and a count of 0. A FFEOF or FFEOD status is always returned with a 0-byte transfer count.

17.2 user Layer Example

This section gives a complete and working user layer. It traces I/O at a given level. All operations are passed through to the next lower-level layer, and a trace record is sent to the trace file.

The first step in generating a user layer is to create a table that contains the addresses for the routines that will fulfill the required functions described in Section 17.1.1, page 338 and Section 17.1.2, page 340. The format of the table can be found in struct xtr_s, which is found in the file. No restriction is placed on the names of the routines, but the table must be called _usr_ffvect for it to be recognized as a user layer. In the example, the declaration of the table can be found with the code in the _usr_open routine.

To use this layer, you must take advantage of the weak external files in the library. The following script fragment is suggested for UNICOS/mp and UNICOS/lc systems:

    # -D_LIB_INTERNAL is required to obtain the
    # declaration of struct fdinfo in
    #
    cc -c -D_LIB_INTERNAL -hcalchars usr*.c
    cat usr*.o > user.o
    #
    # Note that the -F option is selected that loads
    # and links the entries despite not having any
    # hard references.
    cc -o abs user.o myprog.o
    assign -F user,others... fort.1
    ./abs

static char USMID[] = "@(#)code/usrbksp.c 1.0 ";
/* COPYRIGHT CRAY INC.
 * UNPUBLISHED -- ALL RIGHTS RESERVED UNDER
 * THE COPYRIGHT LAWS OF THE UNITED STATES.
 */
#include <ffio.h>
#include "usrio.h"

/*
 * trace backspace requests
 */
int _usr_bksp(struct fdinfo *fio, struct ffsw *stat)
{
    struct fdinfo *llfio;
    int ret;

    llfio = fio->fioptr;
    _usr_enter(fio, TRC_BKSP);
    _usr_pr_2p(fio, stat);
    ret = XRCALL(llfio, backrtn) llfio, stat);
    _usr_exit(fio, ret, stat);
    return(0);
}

static char USMID[] = "@(#)code/usrclose.c 1.0 ";
/* COPYRIGHT CRAY INC.
 * UNPUBLISHED -- ALL RIGHTS RESERVED UNDER
 * THE COPYRIGHT LAWS OF THE UNITED STATES.
 */
#include <stdlib.h>
#include <unistd.h>
#include <ffio.h>
#include "usrio.h"

/*
 * trace close requests
 */
int _usr_close(struct fdinfo *fio, struct ffsw *stat)
{
    struct fdinfo *llfio;
    struct trace_f *pinfo;
    int ret;

    llfio = fio->fioptr;
    /*
     * lyr_info is a place in the fdinfo block that holds
     * a pointer to the layer's private information.
     */
    pinfo = (struct trace_f *)fio->lyr_info;
    _usr_enter(fio, TRC_CLOSE);
    _usr_pr_2p(fio, stat);
    /*
     * close file
     */
    ret = XRCALL(llfio, closertn) llfio, stat);
    /*
     * Write the exit trace record and close the trace file while
     * the private information is still valid.
     */
    _usr_exit(fio, ret, stat);
    (void) close(pinfo->usrfd);
    /*
     * It is the layer's responsibility to clean up its mess.
     */
    free(pinfo->name);
    pinfo->name = NULL;
    free(pinfo);
    return(0);
}

static char USMID[] = "@(#)code/usrfcntl.c 1.0 ";
/* COPYRIGHT CRAY INC.
 * UNPUBLISHED -- ALL RIGHTS RESERVED UNDER
 * THE COPYRIGHT LAWS OF THE UNITED STATES.
 */
#include <ffio.h>
#include "usrio.h"

/*
 * trace fcntl requests
 *
 * Parameters:
 *    fd   - fdinfo pointer
 *    cmd  - command code
 *    arg  - command specific parameter
 *    stat - pointer to status return word
 *
 * This fcntl routine passes the request down to the next lower
 * layer, so it provides nothing of its own.
 *
 * When writing a user layer, the fcntl routine must be provided,
 * and must provide correct responses to one essential function and
 * two desirable functions.
 *
 * FC_GETINFO: (essential)
 * If the 'cmd' argument is FC_GETINFO, the 'arg' argument is
 * considered a pointer to an ffc_info_s structure, and the fields
 * must be filled.
 * The most important of these is the ffc_flags
 * field, whose bits are defined in . (Look for FFC_STRM
 * through FFC_NOTRN)
 * FC_STAT: (desirable)
 * FC_RECALL: (desirable)
 */
int _usr_fcntl(struct fdinfo *fio, int cmd, void *arg, struct ffsw *stat)
{
    struct fdinfo *llfio;
    struct trace_f *pinfo;
    int ret;

    llfio = fio->fioptr;
    pinfo = (struct trace_f *)fio->lyr_info;
    _usr_enter(fio, TRC_FCNTL);
    _usr_info(fio, "cmd=%d ", cmd);
    ret = XRCALL(llfio, fcntlrtn) llfio, cmd, arg, stat);
    _usr_exit(fio, ret, stat);
    return(ret);
}

static char USMID[] = "@(#)code/usropen.c 1.0 ";
/* COPYRIGHT CRAY INC.
 * UNPUBLISHED -- ALL RIGHTS RESERVED UNDER
 * THE COPYRIGHT LAWS OF THE UNITED STATES.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <ffio.h>
#include "usrio.h"

#define SUFFIX ".trc"

/*
 * trace open requests;
 * The following routines compose the user layer. They are declared
 * in "usrio.h"
 */

/*
 * Create the _usr_ffvect structure. Note the _usr_err inclusion to
 * account for the listiortn, which is not supported by this user
 * layer
 */
struct xtr_s _usr_ffvect = {
    _usr_open, _usr_read, _usr_reada, _usr_readc,
    _usr_write, _usr_writea, _usr_writec, _usr_close,
    _usr_flush, _usr_weof, _usr_weod, _usr_seek,
    _usr_bksp, _usr_pos, _usr_err, _usr_fcntl
};

_ffopen_t
_usr_open(
    const char *name,
    int flags,
    mode_t mode,
    struct fdinfo * fio,
    union spec_u *spec,
    struct ffsw *stat,
    long cbits,
    int cblks,
    struct gl_o_inf *oinf)
{
    union spec_u *nspec;
    struct fdinfo *llfio;
    struct trace_f *pinfo;
    char *ptr = NULL;
    int namlen, usrfd;
    _ffopen_t nextfio;
    char buf[256];

    namlen = strlen(name);
    ptr = malloc(namlen + strlen(SUFFIX) + 1);
    if (ptr == NULL) goto badopen;
    pinfo = (struct trace_f *)malloc(sizeof(struct trace_f));
    if (pinfo == NULL) goto badopen;
    fio->lyr_info = (char *)pinfo;
    /*
     * Now, build the name of the trace info file, and open it.
     */
    strcpy(ptr, name);
    strcat(ptr, SUFFIX);
    usrfd = open(ptr, O_WRONLY | O_APPEND | O_CREAT, 0666);
    /*
     * Put the file info into the private data area.
     */
    pinfo->name = ptr;
    pinfo->usrfd = usrfd;
    ptr[namlen] = '\0';
    /*
     * Log the open call
     */
    _usr_enter(fio, TRC_OPEN);
    sprintf(buf,"(\"%s\", %o, %o...);\n", name, flags, mode);
    _usr_info(fio, buf, 0);
    /*
     * Now, open the lower layers
     */
    nspec = spec;
    NEXT_SPEC(nspec);
    nextfio = _ffopen(name, flags, mode, nspec, stat, cbits, cblks,
        NULL, oinf);
    _usr_exit_ff(fio, nextfio, stat);
    if (nextfio != _FFOPEN_ERR) {
        DUMP_IOB(fio); /* debugging only */
        return(nextfio);
    }
    /*
     * End up here only on an error
     *
     */
badopen:
    if (ptr != NULL) free(ptr);
    if (fio->lyr_info != NULL) free(fio->lyr_info);
    _SETERROR(stat, FDC_ERR_NOMEM, 0);
    return(_FFOPEN_ERR);
}

int _usr_err(struct fdinfo *fio)
{
    _usr_info(fio,"ERROR: not expecting this routine\n",0);
    return(0);
}

static char USMID[] = "@(#)code/usrpos.c 1.1 ";
/* COPYRIGHT CRAY INC.
 * UNPUBLISHED -- ALL RIGHTS RESERVED UNDER
 * THE COPYRIGHT LAWS OF THE UNITED STATES.
 */
#include <ffio.h>
#include "usrio.h"

/*
 * trace positioning requests
 */
_ffseek_t
_usr_pos(struct fdinfo *fio, int cmd, void *arg, int len, struct ffsw *stat)
{
    struct fdinfo *llfio;
    struct trace_f *usr_info;
    _ffseek_t ret;

    llfio = fio->fioptr;
    usr_info = (struct trace_f *)fio->lyr_info;
    _usr_enter(fio,TRC_POS);
    _usr_info(fio, " ", 0);
    ret = XRCALL(llfio, posrtn) llfio, cmd, arg, len, stat);
    _usr_exit_sk(fio, ret, stat);
    return(ret);
}

static char USMID[] = "@(#)code/usrprint.c 1.1 ";
/* COPYRIGHT CRAY INC.
 * UNPUBLISHED -- ALL RIGHTS RESERVED UNDER
 * THE COPYRIGHT LAWS OF THE UNITED STATES.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <ffio.h>
#include "usrio.h"

/*
 * Operation names, indexed by the TRC_ codes defined in "usrio.h"
 * (TRC_OPEN is 1 and TRC_FCNTL is 16).
 */
static char *name_tab[] = {
    "???",
    "ffopen",
    "ffread",
    "ffreada",
    "ffreadc",
    "ffwrite",
    "ffwritea",
    "ffwritec",
    "ffclose",
    "ffflush",
    "ffweof",
    "ffweod",
    "ffseek",
    "ffbksp",
    "ffpos",
    "fflistio",
    "fffcntl",
};

/*
 * trace printing stuff
 */
int _usr_enter(struct fdinfo *fio, int opcd)
{
    char buf[256], *op;
    struct trace_f *usr_info;

    op = name_tab[opcd];
    usr_info = (struct trace_f *)fio->lyr_info;
    sprintf(buf, "TRCE: %s ",op);
    write(usr_info->usrfd, buf, strlen(buf));
    return(0);
}

void _usr_info(struct fdinfo *fio, char *str, int arg1)
{
    char buf[256];
    struct trace_f *usr_info;

    usr_info = (struct trace_f *)fio->lyr_info;
    sprintf(buf, str, arg1);
    write(usr_info->usrfd, buf, strlen(buf));
}

void _usr_exit(struct fdinfo *fio, int ret, struct ffsw *stat)
{
    char buf[256];
    struct trace_f *usr_info;

    usr_info = (struct trace_f *)fio->lyr_info;
    fio->ateof = fio->fioptr->ateof;
    fio->ateod = fio->fioptr->ateod;
    sprintf(buf, "TRCX: ret=%d, stat=%d, err=%d\n",
        ret, stat->sw_stat, stat->sw_error);
    write(usr_info->usrfd, buf, strlen(buf));
}

void _usr_exit_ss(struct fdinfo *fio, ssize_t ret, struct ffsw *stat)
{
    char buf[256];
    struct trace_f *usr_info;

    usr_info = (struct trace_f *)fio->lyr_info;
    fio->ateof = fio->fioptr->ateof;
    fio->ateod = fio->fioptr->ateod;
    sprintf(buf, "TRCX: ret=%ld, stat=%d, err=%d\n",
        ret, stat->sw_stat, stat->sw_error);
    write(usr_info->usrfd, buf, strlen(buf));
}

void _usr_exit_ff(struct fdinfo *fio, _ffopen_t ret, struct ffsw *stat)
{
    char buf[256];
    struct trace_f *usr_info;

    usr_info = (struct trace_f *)fio->lyr_info;
    sprintf(buf, "TRCX: ret=%d, stat=%d, err=%d\n",
        ret, stat->sw_stat, stat->sw_error);
    write(usr_info->usrfd, buf, strlen(buf));
}

void _usr_exit_sk(struct fdinfo *fio, _ffseek_t ret, struct ffsw *stat)
{
    char buf[256];
    struct trace_f *usr_info;

    usr_info = (struct trace_f *)fio->lyr_info;
    fio->ateof =
        fio->fioptr->ateof;
    fio->ateod = fio->fioptr->ateod;
    sprintf(buf, "TRCX: ret=%ld, stat=%d, err=%d\n",
        ret, stat->sw_stat, stat->sw_error);
    write(usr_info->usrfd, buf, strlen(buf));
}

void _usr_pr_rwc(
    struct fdinfo *fio,
    bitptr bufptr,
    size_t nbytes,
    struct ffsw *stat,
    int fulp)
{
    char buf[256];
    struct trace_f *usr_info;

    usr_info = (struct trace_f *)fio->lyr_info;
    sprintf(buf,"(fd /* %lx */, &memc[%lx], %ld, &statw[%lx], ",
        fio, BPTR2CP(bufptr), nbytes, stat);
    write(usr_info->usrfd, buf, strlen(buf));
    if (fulp == FULL)
        sprintf(buf,"FULL");
    else
        sprintf(buf,"PARTIAL");
    write(usr_info->usrfd, buf, strlen(buf));
}

void _usr_pr_rww(
    struct fdinfo *fio,
    bitptr bufptr,
    size_t nbytes,
    struct ffsw *stat,
    int fulp,
    int *ubc)
{
    char buf[256];
    struct trace_f *usr_info;

    usr_info = (struct trace_f *)fio->lyr_info;
    sprintf(buf,"(fd /* %lx */, &memc[%lx], %ld, &statw[%lx], ",
        fio, BPTR2CP(bufptr), nbytes, stat);
    write(usr_info->usrfd, buf, strlen(buf));
    if (fulp == FULL)
        sprintf(buf,"FULL");
    else
        sprintf(buf,"PARTIAL");
    write(usr_info->usrfd, buf, strlen(buf));
    sprintf(buf,", &conubc[%d]; ", *ubc);
    write(usr_info->usrfd, buf, strlen(buf));
}

void _usr_pr_2p(struct fdinfo *fio, struct ffsw *stat)
{
    char buf[256];
    struct trace_f *usr_info;

    usr_info = (struct trace_f *)fio->lyr_info;
    sprintf(buf,"(fd /* %lx */, &statw[%lx], ", fio, stat);
    write(usr_info->usrfd, buf, strlen(buf));
}

static char USMID[] = "@(#)code/usrread.c 1.0 ";
/* COPYRIGHT CRAY INC.
 * UNPUBLISHED -- ALL RIGHTS RESERVED UNDER
 * THE COPYRIGHT LAWS OF THE UNITED STATES.
 */
#include <ffio.h>
#include "usrio.h"

/*
 * trace read requests
 *
 * Parameters:
 *    fio    - Pointer to fdinfo block
 *    bufptr - bit pointer to where data is to go.
 *    nbytes - Number of bytes to be read
 *    stat   - pointer to status return word
 *    fulp   - full or partial read mode flag
 *    ubc    - pointer to unused bit count
 */
ssize_t
_usr_read(
    struct fdinfo *fio,
    bitptr bufptr,
    size_t nbytes,
    struct ffsw *stat,
    int fulp,
    int *ubc)
{
    struct fdinfo *llfio;
    char *str;
    ssize_t ret;

    llfio = fio->fioptr;
    _usr_enter(fio, TRC_READ);
    _usr_pr_rww(fio, bufptr, nbytes, stat, fulp, ubc);
    ret = XRCALL(llfio, readrtn) llfio, bufptr, nbytes, stat,
        fulp, ubc);
    _usr_exit_ss(fio, ret, stat);
    return(ret);
}

/*
 * trace reada (asynchronous read) requests
 *
 * Parameters:
 *    fio    - Pointer to fdinfo block
 *    bufptr - bit pointer to where data is to go.
 *    nbytes - Number of bytes to be read
 *    stat   - pointer to status return word
 *    fulp   - full or partial read mode flag
 *    ubc    - pointer to unused bit count
 */
ssize_t
_usr_reada(
    struct fdinfo *fio,
    bitptr bufptr,
    size_t nbytes,
    struct ffsw *stat,
    int fulp,
    int *ubc)
{
    struct fdinfo *llfio;
    char *str;
    ssize_t ret;

    llfio = fio->fioptr;
    _usr_enter(fio, TRC_READA);
    _usr_pr_rww(fio, bufptr, nbytes, stat, fulp, ubc);
    ret = XRCALL(llfio,readartn)llfio,bufptr,nbytes,stat,fulp,ubc);
    _usr_exit_ss(fio, ret, stat);
    return(ret);
}

/*
 * trace readc requests
 *
 * Parameters:
 *    fio    - Pointer to fdinfo block
 *    bufptr - bit pointer to where data is to go.
 *    nbytes - Number of bytes to be read
 *    stat   - pointer to status return word
 *    fulp   - full or partial read mode flag
 */
ssize_t
_usr_readc(
    struct fdinfo *fio,
    bitptr bufptr,
    size_t nbytes,
    struct ffsw *stat,
    int fulp)
{
    struct fdinfo *llfio;
    char *str;
    ssize_t ret;

    llfio = fio->fioptr;
    _usr_enter(fio, TRC_READC);
    _usr_pr_rwc(fio, bufptr, nbytes, stat, fulp);
    ret = XRCALL(llfio, readcrtn)llfio, bufptr, nbytes, stat,
        fulp);
    _usr_exit_ss(fio, ret, stat);
    return(ret);
}

/*
 * _usr_seek()
 *
 * The user seek call should mimic the UNICOS/mp lseek system call as
 * much as possible.
 */
_ffseek_t
_usr_seek(
    struct fdinfo *fio,
    off_t pos,
    int whence,
    struct ffsw *stat)
{
    struct fdinfo *llfio;
    _ffseek_t ret;
    char buf[256];

    llfio = fio->fioptr;
    _usr_enter(fio, TRC_SEEK);
    sprintf(buf,"pos %ld, whence %d\n", pos, whence);
    _usr_info(fio, buf, 0);
    ret = XRCALL(llfio, seekrtn) llfio, pos, whence, stat);
    _usr_exit_sk(fio, ret, stat);
    return(ret);
}

static char USMID[] = "@(#)code/usrwrite.c 1.0 ";
/* COPYRIGHT CRAY INC.
 * UNPUBLISHED -- ALL RIGHTS RESERVED UNDER
 * THE COPYRIGHT LAWS OF THE UNITED STATES.
 */
#include <ffio.h>
#include "usrio.h"

/*
 * trace write requests
 *
 * Parameters:
 *    fio    - Pointer to fdinfo block
 *    bufptr - bit pointer to the data to be written.
 *    nbytes - Number of bytes to be written
 *    stat   - pointer to status return word
 *    fulp   - full or partial write mode flag
 *    ubc    - pointer to unused bit count (not used for IBM)
 */
ssize_t
_usr_write(
    struct fdinfo *fio,
    bitptr bufptr,
    size_t nbytes,
    struct ffsw *stat,
    int fulp,
    int *ubc)
{
    struct fdinfo *llfio;
    ssize_t ret;

    llfio = fio->fioptr;
    _usr_enter(fio, TRC_WRITE);
    _usr_pr_rww(fio, bufptr, nbytes, stat, fulp, ubc);
    ret = XRCALL(llfio, writertn) llfio, bufptr, nbytes, stat,
        fulp,ubc);
    _usr_exit_ss(fio, ret, stat);
    return(ret);
}

/*
 * trace writea requests
 *
 * Parameters:
 *    fio    - Pointer to fdinfo block
 *    bufptr - bit pointer to the data to be written.
 *    nbytes - Number of bytes to be written
 *    stat   - pointer to status return word
 *    fulp   - full or partial write mode flag
 *    ubc    - pointer to unused bit count (not used for IBM)
 */
ssize_t
_usr_writea(
    struct fdinfo *fio,
    bitptr bufptr,
    size_t nbytes,
    struct ffsw *stat,
    int fulp,
    int *ubc)
{
    struct fdinfo *llfio;
    ssize_t ret;

    llfio = fio->fioptr;
    _usr_enter(fio, TRC_WRITEA);
    _usr_pr_rww(fio, bufptr, nbytes, stat, fulp, ubc);
    ret = XRCALL(llfio, writeartn) llfio, bufptr, nbytes, stat,
        fulp,ubc);
    _usr_exit_ss(fio, ret, stat);
    return(ret);
}

/*
 * trace writec requests
 *
 * Parameters:
 *    fio    - Pointer to fdinfo block
 *    bufptr - bit pointer to the data to be written.
 *    nbytes - Number of bytes to be written
 *    stat   - pointer to status return word
 *    fulp   - full or partial write mode flag
 */
ssize_t
_usr_writec(
    struct fdinfo *fio,
    bitptr bufptr,
    size_t nbytes,
    struct ffsw *stat,
    int fulp)
{
    struct fdinfo *llfio;
    ssize_t ret;

    llfio = fio->fioptr;
    _usr_enter(fio, TRC_WRITEC);
    _usr_pr_rwc(fio, bufptr, nbytes, stat, fulp);
    ret = XRCALL(llfio, writecrtn)llfio,bufptr, nbytes, stat,
        fulp);
    _usr_exit_ss(fio, ret, stat);
    return(ret);
}

/*
 * Flush the buffer and clean up
 * This routine should return 0, or -1 on error.
 */
int _usr_flush(struct fdinfo *fio, struct ffsw *stat)
{
    struct fdinfo *llfio;
    int ret;

    llfio = fio->fioptr;
    _usr_enter(fio, TRC_FLUSH);
    _usr_info(fio, "\n",0);
    ret = XRCALL(llfio, flushrtn) llfio, stat);
    _usr_exit(fio, ret, stat);
    return(ret);
}

/*
 * trace WEOF calls
 *
 * The EOF is a very specific concept. Don't confuse it with the
 * UNICOS/mp EOF, or the truncate(2) system call.
 */
int _usr_weof(struct fdinfo *fio, struct ffsw *stat)
{
    struct fdinfo *llfio;
    int ret;

    llfio = fio->fioptr;
    _usr_enter(fio, TRC_WEOF);
    _usr_info(fio, "\n",0);
    ret = XRCALL(llfio, weofrtn) llfio, stat);
    _usr_exit(fio, ret, stat);
    return(ret);
}

/*
 * trace WEOD calls
 *
 * The EOD is a specific concept. Don't confuse it with the UNICOS/mp
 * EOF. It is usually mapped to the truncate(2) system call.
 */
int _usr_weod(struct fdinfo *fio, struct ffsw *stat)
{
    struct fdinfo *llfio;
    int ret;

    llfio = fio->fioptr;
    _usr_enter(fio, TRC_WEOD);
    _usr_info(fio, "\n",0);
    ret = XRCALL(llfio, weodrtn) llfio, stat);
    _usr_exit(fio, ret, stat);
    return(ret);
}

/* USMID @(#)code/usrio.h 1.1 */
/* COPYRIGHT CRAY INC.
 * UNPUBLISHED -- ALL RIGHTS RESERVED UNDER
 * THE COPYRIGHT LAWS OF THE UNITED STATES.
 */
#define TRC_OPEN    1
#define TRC_READ    2
#define TRC_READA   3
#define TRC_READC   4
#define TRC_WRITE   5
#define TRC_WRITEA  6
#define TRC_WRITEC  7
#define TRC_CLOSE   8
#define TRC_FLUSH   9
#define TRC_WEOF    10
#define TRC_WEOD    11
#define TRC_SEEK    12
#define TRC_BKSP    13
#define TRC_POS     14
#define TRC_UNUSED  15
#define TRC_FCNTL   16

struct trace_f {
    char *name;   /* name of the file */
    int usrfd;    /* file descriptor of trace file */
};

/*
 * Prototypes
 */
extern int _usr_bksp(struct fdinfo *fio, struct ffsw *stat);
extern int _usr_close(struct fdinfo *fio, struct ffsw *stat);
extern int _usr_fcntl(struct fdinfo *fio, int cmd, void *arg,
    struct ffsw *stat);
extern _ffopen_t _usr_open(const char *name, int flags,
    mode_t mode, struct fdinfo * fio, union spec_u *spec,
    struct ffsw *stat, long cbits, int cblks,
    struct gl_o_inf *oinf);
extern int _usr_flush(struct fdinfo *fio, struct ffsw *stat);
extern _ffseek_t _usr_pos(struct fdinfo *fio, int cmd, void *arg,
    int len, struct ffsw *stat);
extern ssize_t _usr_read(struct fdinfo *fio, bitptr bufptr,
    size_t nbytes, struct ffsw *stat, int fulp, int *ubc);
extern ssize_t _usr_reada(struct fdinfo *fio, bitptr bufptr,
    size_t nbytes, struct ffsw *stat, int fulp, int *ubc);
extern ssize_t _usr_readc(struct fdinfo *fio, bitptr bufptr,
    size_t nbytes, struct ffsw *stat, int fulp);
extern _ffseek_t _usr_seek(struct fdinfo *fio, off_t pos, int whence,
    struct ffsw *stat);
extern ssize_t _usr_write(struct fdinfo *fio, bitptr bufptr,
    size_t nbytes, struct ffsw *stat, int fulp, int *ubc);
extern ssize_t _usr_writea(struct fdinfo *fio, bitptr bufptr,
    size_t nbytes, struct ffsw *stat, int fulp, int *ubc);
extern ssize_t _usr_writec(struct fdinfo *fio, bitptr bufptr,
    size_t nbytes, struct ffsw *stat, int fulp);
extern int _usr_weod(struct fdinfo *fio, struct ffsw *stat);
extern int _usr_weof(struct fdinfo *fio, struct ffsw *stat);
extern int _usr_err(struct fdinfo *fio);

/*
 * Prototypes for routines that are used by the user
 * layer.
 */
extern int _usr_enter(struct fdinfo *fio, int opcd);
extern void _usr_info(struct fdinfo *fio, char *str, int arg1);
extern void _usr_exit(struct fdinfo *fio, int ret, struct ffsw *stat);
extern void _usr_exit_ss(struct fdinfo *fio, ssize_t ret,
    struct ffsw *stat);
extern void _usr_exit_ff(struct fdinfo *fio, _ffopen_t ret,
    struct ffsw *stat);
extern void _usr_exit_sk(struct fdinfo *fio, _ffseek_t ret,
    struct ffsw *stat);
extern void _usr_pr_rwc(struct fdinfo *fio, bitptr bufptr,
    size_t nbytes, struct ffsw *stat, int fulp);
extern void _usr_pr_rww(struct fdinfo *fio, bitptr bufptr,
    size_t nbytes, struct ffsw *stat, int fulp, int *ubc);
extern void _usr_pr_2p(struct fdinfo *fio, struct ffsw *stat);

Numeric File Conversion Routines [18]

This chapter contains information about data conversion, moving data between machines, and implicit and explicit data conversion. It also explains the support provided for reading and writing files in foreign formats, including record blocking and numeric and character conversion.

These routines convert data (primarily floating-point data, but also integer and character data, as well as Fortran complex and logical data) from your system's native representation to a foreign representation, and vice versa.

18.1 Conversion Overview

Data can be transferred between UNICOS/mp and UNICOS/lc systems and other computer systems in several ways. These methods include the use of utilities built on TCP/IP (such as ftp). You can also use the data conversion library routines to convert data.

Cray X1 and X2 systems support the Institute of Electrical and Electronics Engineers (IEEE) format by default and also support conversion to and from IBM, VAX/VMS, and other formats. For each foreign file type, several file and record formats are supported, and either explicit or implicit data conversion can be used.

When processing foreign data, you must consider the interactions between the data formats and the method of data transfer. This section describes, in broad terms, the techniques available to do data conversion.
Explicit conversion is the process by which the user performs calls to subroutines that convert the native data to and from the foreign data formats. These routines are provided for many data formats. This is discussed in more detail in Section 18.3.1, page 365.

Implicit conversion is the process by which you declare that a particular file contains foreign data and/or record blocking and then request that the run-time library perform appropriate transformations on the data to make it useful to the program at I/O time. This method of record and data format conversion requires changes in command scripts. This is discussed in more detail in Section 18.3.2, page 365.

18.2 Transferring Data

This section describes several ways to transfer data, including the fdcp command and TCP/IP tools such as ftp.

18.2.1 Using fdcp to Transfer Files

The fdcp(1) command can handle data that is not a simple disk-resident byte stream. The fdcp command assumes that both the data and any record, including an end-of-file (EOF) record, can be copied from one file to another. Record structures can be preserved or removed. EOF records can be preserved either as EOF records in the output file or used to separate the delimited data in the input file into separate files.

The fdcp command does not perform data conversion; the only transformations done are on the record and file structures (fdcp transforms block, record, and file control words from one format to another).

If no assign(1) information is available for a file, the system layer is used. If the file being accessed is on disk and if no assign -F attribute is used, the syscall layer is used.

18.2.2 Using ftp to Move Data between Systems

When transferring a file to a foreign system, FFIO can create the file in the correct foreign format, but ftp cannot establish the right attributes on the file so that the foreign operating system can handle it correctly.
Therefore, ftp is not useful as a transfer agent on IBM and VMS systems for binary data. Its utility is limited to those systems that do not embed record attributes in the system file information.

18.3 Data Item Conversion

The UNICOS/mp operating system provides both implicit and explicit conversion of data items. Explicit conversion means that your code invokes the routines that convert between native and foreign representations. Options to the assign(1) command control implicit conversion. Implicit conversion is usually transparent to users and is available only to Fortran programmers.

The following sections describe these data conversion types and provide direction in choosing the best one for your situation.

18.3.1 Explicit Data Item Conversion

The Cray Fortran compiler library contains a set of subroutines that convert between Cray data formats and the formats of various vendors. These routines are callable from any programming language supported by Cray. The explicit conversion routines convert between IBM, VAX/VMS, or generic IEEE binary data formats and Cray 32-bit IEEE binary data formats. For complete details, see the individual man pages for each routine.

These subroutines provide an efficient way to convert data that was read into system central memory. Table 55 lists the explicit data conversion subroutines.

Table 55. Explicit Data Conversion Routines (Cray X1 and X2 Systems)

Name                     Foreign -> Cray   Cray -> Foreign
IBM                      IBM2IEG           IEG2IBM
VAX/VMS                  VAX2IEG           IEG2VAX
IEEE little-endian       IEU2IEG           IEG2IEU
Cray T3E IEEE (64-bit)   CRI2IEG           IEG2CRI
SGI MIPS                 MIPS2IEG          IEG2MIPS
User conversion          USR2IEG           IEG2USR
Site conversion          STE2IEG           IEG2STE

See the individual man pages for details about the syntax and arguments for each routine.
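To make the kind of work these routines do more concrete, the following is a minimal sketch of a single conversion written in standard Fortran, not with the Cray library. It decodes one 32-bit IBM System/360 hexadecimal floating-point word (a sign bit, a 7-bit excess-64 exponent in base 16, and a 24-bit fraction) into a native 64-bit real. The names ibm_sketch and ibm32_to_native are hypothetical and chosen only for illustration; this is not the IBM2IEG interface, whose actual arguments are described in its man page.

```fortran
module ibm_sketch
  implicit none
  private
  public :: ibm32_to_native
contains
  ! Hypothetical helper, not a Cray library routine.
  ! The IBM word is passed in the low 32 bits of a 64-bit integer so that
  ! the sign bit can be tested without integer overflow concerns.
  function ibm32_to_native(word) result(x)
    integer(kind=8), intent(in) :: word
    real(kind=8) :: x
    integer(kind=8) :: expo, frac

    frac = iand(word, 16777215_8)            ! low 24 bits: fraction
    expo = iand(ishft(word, -24), 127_8)     ! next 7 bits: excess-64 exponent
    ! value = 0.fraction * 16**(exponent - 64)
    x = real(frac, kind=8) / 16777216.0_8 * 16.0_8**(expo - 64_8)
    if (btest(word, 31)) x = -x              ! top bit: sign
  end function ibm32_to_native
end module ibm_sketch

program ibm_demo
  use ibm_sketch
  implicit none
  integer(kind=8) :: w1, w2
  ! Z'42640000' encodes +100.0 and Z'C1100000' encodes -1.0 in IBM format.
  data w1 /Z'42640000'/, w2 /Z'C1100000'/

  print *, 'first word decodes to  ', ibm32_to_native(w1)
  print *, 'second word decodes to ', ibm32_to_native(w2)
end program ibm_demo
```

A library routine would normally process whole arrays at once and also handle details this sketch ignores, such as packing several foreign items per native word.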
18.3.2 Implicit Data Item Conversion

Implicit data conversion in Fortran requires no explicit action by the program to convert the data in the I/O stream other than using the assign command to instruct the libraries to perform conversion. For details, see the assign(1) man page.

The implicit data conversion process is performed in two steps:

1. Record format conversion
2. Data conversion

Record format conversion interprets or converts the internal record blocking structures in the data stream to gain record-level access to the data. The data contained in the records can then be converted.

Using implicit conversion, you can select record blocking or deblocking alone, or you can request that the data items be converted automatically. When enabled, record format conversion and data item conversion occur transparently and simultaneously. Changes are usually not required in your Fortran code.

To enable conversion of foreign record formats, specify the appropriate record type with the assign -F command. The -N (numeric conversion) and -C (character conversion) assign options control conversion of data contained in a record. If -F is specified but -N and -C are not, the libraries interpret the record format but they do not convert data. You can obtain information about the type of data that will be converted (and, therefore, the type of conversion that will be performed) from the Fortran I/O list.

If -N is used and -C is not, an appropriate character conversion type is selected by default, as shown in Table 56.

Table 56. Implicit Data Conversion Types

-N option     -C default   Meaning
none          none         No numeric conversion
default       default      No numeric conversion; IEEE 32-bit
cray          ASCII        Cray “classic” floating-point
ibm           EBCDIC       IBM 360/370-style numeric conversion
vms           ASCII        VAX/VMS numeric conversion
ieee          ASCII        Generic IEEE data (no data conversion)
ieee_32       ASCII        Generic 32-bit IEEE data. No data conversion except for items
                           which are promoted via -s default64 (or -sreal64 or -sinteger64).
                           They are handled as if they had not been promoted. That is,
                           default sized variables will be read and written as if no -s
                           option is specified.
mips          ASCII        SGI MIPS IEEE numeric conversion (128-bit floating-point is
                           “double double” format)
ieee_64       ASCII        Cray 64-bit IEEE numeric conversion
ieee_le       ASCII        Little endian 32-bit IEEE numeric conversion
ultrix        ASCII        Alias for ieee_le
t3e           ASCII        Cray 64-bit IEEE numeric conversion; denormalized numbers
                           flushed to zero
t3d           ASCII        Alias for t3e
user          ASCII        User defined numeric conversion
site          ASCII        Site defined numeric conversion
ia            ASCII        Intel architecture
swap_endian   ASCII        The endian of data and control images is swapped during
                           unformatted input and output

Cray supports conversion of the supported formats and data types through standard Fortran formatted, unformatted, list-directed, and namelist I/O and through BUFFER IN and BUFFER OUT statements.

Generally, read, write, and rewind are supported for all record formats. Other capabilities such as backspace are usually not available but can be made to work if a blocking type can be used to support it. See the sections on the specific layers for complete details.

If you select the -N option, the libraries perform data conversion for Fortran unformatted statements and BUFFER IN and BUFFER OUT I/O statements. Data is converted between its native representation and a foreign representation, according to its Fortran data type.

If the value in a native element is too large to fit in the foreign element, the foreign element is set to the largest or smallest possible value; no error is generated. When converting from a native element to a smaller foreign element, precision is also lost due to truncation of the floating-point mantissa.
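The two effects described above, clamping of out-of-range values and loss of precision, can be imitated in standard Fortran when a 64-bit real is narrowed to 32 bits. The sketch below is only an illustration under stated assumptions: narrow_sketch and narrow_clamped are hypothetical names, the clamping is done explicitly with MIN and MAX, and the REAL intrinsic rounds as the processor sees fit (typically to nearest), whereas the library conversion described above truncates the mantissa.

```fortran
module narrow_sketch
  implicit none
contains
  ! Hypothetical helper: narrow a 64-bit real to 32 bits, clamping values
  ! that are out of range instead of raising an error.
  function narrow_clamped(d) result(s)
    real(kind=8), intent(in) :: d
    real(kind=4) :: s
    real(kind=8) :: big

    big = real(huge(1.0_4), kind=8)          ! largest finite 32-bit value
    s = real(max(-big, min(big, d)), kind=4) ! clamp first, then narrow
  end function narrow_clamped
end module narrow_sketch

program narrow_demo
  use narrow_sketch
  implicit none
  real(kind=8) :: d

  d = 1.0_8 / 3.0_8
  print *, 'original 64-bit value :', d
  print *, 'after narrowing       :', real(narrow_clamped(d), kind=8)
  print *, 'clamped 1.0e300       :', narrow_clamped(1.0e300_8)
  print *, 'largest 32-bit value  :', huge(1.0_4)
end program narrow_demo
```

Comparing the first two printed values shows how many digits of the 64-bit value survive the narrowing; the last two lines print the same number because the out-of-range input was clamped.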
If the assign -N user or assign -N site command is specified, the user or site must provide the numeric data conversion routines. They follow the same calling conventions as the other explicit routines.

For implicit conversion, specify format characteristics on an assign command. Files can be converted to either:

• A disk file
• A file transferred from a computer other than the Cray X1 or X2 system

When a Fortran I/O operation is performed on the file, the appropriate file format and data conversions are performed during the I/O operation. Data conversion is performed on each data item, based on the type of the Fortran variable in the I/O list.

For example, if the first read of a foreign file format is like the following example, the library interprets any blocking structures in the file that precede the first data record:

INTEGER(KIND=8) INT
REAL(KIND=8) FLOAT1, FLOAT2
READ (10) INT,FLOAT1,FLOAT2

These blocking structures vary depending on the file type and record format. The first 32 bits of data (in IBM format, for example) are extracted, sign-extended, and stored in the INT Fortran variable. The next 32 bits are extracted, converted to native floating-point format, and stored in the FLOAT1 Fortran variable. The next 32 bits are extracted, converted, and stored into the FLOAT2 Fortran variable. The library then skips to the end of the foreign logical record.

When writing from a native system to a foreign format (for example, if in the previous example WRITE(10) was used), precision is lost when converting from a 64-bit representation to 32-bit representation if the program was compiled with the -s default64 compiler option and the INT, FLOAT1, and FLOAT2 variables are default types.

18.3.3 Choosing a Conversion Method

As with any software process, the various options for data conversion have advantages and disadvantages, which are discussed in this section.
As a set, the various data conversion options provide choices in methods of file processing for front-end systems. No one option is best for all applications.

18.3.3.1 Explicit Conversion

Explicit data conversion has some distinct advantages, including:

• Providing direct control (including some options not available through implicit conversion) over data conversion
• Allowing programmers to control and schedule the conversion for a convenient and appropriate time
• Performing conversion on large data areas as vector operations, usually increasing performance

One disadvantage of using explicit conversion is that explicit routines require changes to the source code.

18.3.3.2 Implicit Conversion

An advantage when using implicit conversion is that you do not have to change the source code.

Disadvantages of using implicit conversion include:

• Requiring script changes to the assign(1) command
• Making conversion less efficient on a record-by-record basis
• Doing conversion at I/O time according to the declared data types, allowing little flexibility for nonstandard requirements

18.3.4 Disabling Conversion Types

The subroutines required to handle data conversion must be loaded into absolute binary files. By default, the run-time libraries include references to routines required to support the forms of implicit conversion enabled in the foreign data conversion configuration file, usually named fdcconfig.h.
18.4 Foreign Conversion Techniques

This section contains some tips and techniques for the following conversion types:

Conversion type       Convert data to/from
UNICOS files          Older Cray UNICOS systems
IBM conversion        IBM machines
IEEE conversion       Various types of workstations and different vendors that support IEEE floating-point format
VAX/VMS conversion    DEC VAX machines that run VMS

18.4.1 UNICOS/mp and UNICOS/lc Conversions

The UNICOS/mp and UNICOS/lc operating systems use f77 format as the default format for Fortran unformatted sequential files.

To swap the data and control images when accessing unformatted files created on a system with a different endianness, use the following command:

assign -N swap_endian f:filename

Previous UNICOS operating systems used COS blocking for all blocked files, so conversion is necessary when moving unformatted, blocked, sequential files from those Cray systems to the UNICOS/mp and UNICOS/lc operating systems. Two common COS file types require some conversion to make them useful on the UNICOS/mp and UNICOS/lc operating systems.

To read or write unformatted files from UNICOS systems, use one of the following commands:

• If moving a Cray floating point format file from a Cray SV1 series system, use the following command:

assign -F cos -N cray cosfile

• If moving an IEEE floating point format file from a Cray SV1 series system, use the following command:

assign -F cos -N ieee_64 cosfile

• If moving a file from a Cray T3E system, use the following command:

assign -F cos -N t3e cosfile

18.4.2 IBM Overview

To convert and transfer data between Cray X1 series or X2 systems and an IBM/MVS or VM (360/370 style) system, you must understand the differences between the UNICOS/mp and UNICOS/lc file system and file formats, and those on the IBM system(s). On both VM and MVS, the file system is record-oriented.
The most obvious form of data conversion is between the IBM EBCDIC character set and the ASCII character set used on UNICOS/mp and UNICOS/lc systems. Most of the utilities that transfer files to and from the IBM systems automatically convert both the record structures and character set to the UNICOS/mp and UNICOS/lc text format and to ASCII. For example, ftp performs these conversions and does not require any further conversion on UNICOS/mp and UNICOS/lc systems.

Binary data, however, is more complicated. You must first find a way to transfer the file and to preserve the record boundaries. If workstations are available, this is simple. Few problems are caused by transferring the file and preserving record boundaries.

Cray supports the following IBM record formats:

Format   Description
U        Undefined record format
F        Fixed-length records, one record per block
FB       Fixed-length, blocked records
V        Variable-length records
VB       Variable-length, blocked records
VBS      Variable-length, blocked, spanned records

18.4.3 IEEE Conversion

By default, Cray X1 series and X2 systems use 32-bit IEEE standard floating point, with two's-complement arithmetic and the ASCII character set. This standard is also used by many workstations and personal computers. The logical values in these implementations are usually the same for Fortran and C; they use zero for false and nonzero for true. It is also common to see the Fortran record blocking used by the Fortran run-time library on unformatted sequential files.

No IEEE record format exists, but the IEEE implicit and explicit data conversion routine facilities are provided with the assumption that many of these things are true.

Most computer systems that use the IEEE data formats run operating systems based on UNIX software and use f77 record blocking. You can use the rcp or ftp commands to transfer files.
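As an illustration of the f77 record blocking mentioned above, the following sketch writes one unformatted sequential record and then reopens the same file with stream access to expose the control words that bracket the data. It assumes that the compiler writes 4-byte record markers holding the record length in bytes, which is the common f77 convention but is not guaranteed for every compiler. The helper swap32 and the file name f77_sketch.dat are hypothetical; swap32 shows the byte reversal that a reader of the opposite endianness would have to apply to each 32-bit item, control words included.

```fortran
module swap_sketch
  implicit none
contains
  ! Hypothetical helper: reverse the four bytes of a 32-bit integer.
  function swap32(w) result(s)
    integer(kind=4), intent(in) :: w
    integer(kind=4) :: s
    integer(kind=1) :: b(4)

    b = transfer(w, b)            ! view the integer as four bytes
    s = transfer(b(4:1:-1), s)    ! reassemble them in reverse order
  end function swap32
end module swap_sketch

program f77_record_demo
  use swap_sketch
  implicit none
  integer(kind=4) :: head, tail, ival
  real(kind=4)    :: rval

  ! Write one record holding a 32-bit integer and a 32-bit real.
  open(10, file='f77_sketch.dat', form='unformatted', &
       access='sequential', status='replace')
  write(10) 42_4, 1.5_4
  close(10)

  ! Reread the same bytes as a raw stream: marker, data, marker.
  ! This layout is an assumption (4-byte markers); see the text above.
  open(10, file='f77_sketch.dat', form='unformatted', &
       access='stream', status='old')
  read(10) head, ival, rval, tail
  close(10, status='delete')

  print *, 'leading control word  :', head
  print *, 'data items            :', ival, rval
  print *, 'trailing control word :', tail
  print *, 'leading word, bytes reversed :', swap32(head)
end program f77_record_demo
```

On a system that matches the stated assumption, both control words hold the byte count of the record. A file produced on a machine of the opposite endianness would show the byte-reversed value instead, which is the situation the swap_endian conversion is meant to handle.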
In most cases, the following command should work:

assign -F f77 fort.1

When writing files in the Fortran format, remember that you can gain a large performance boost by ensuring that the records being written fit in the working buffer of the Fortran layer.

On Cray X1 series and X2 systems, data types can be declared as 32 bits in size and can then be read or written directly. This is the most direct and efficient method to read or write data files for IEEE workstations. The user can alter the declarations of the variables used in the Fortran I/O list to declare them as KIND=4 or as REAL*4 (or INTEGER*4).

For example, to read a file on a Cray X1 series or X2 system that has 32-bit integers and 32-bit floating-point numbers, consider the following code fragments.

To swap the unformatted data and control images when accessing unformatted files created on a system with a different endianness, use one of the following commands:

assign -N swap_endian u:unit
assign -N swap_endian f:filename

Existing program:

REAL RVAL               ! Default size (32-bits)
INTEGER IVAL            ! Default size (32-bits)
...
READ (1) IVAL, RVAL

This program will expect both the integer and floating-point data to be the same size (32 bits). However, it can be modified to explicitly declare the variables to be the same size as the expected data.

Modified program (#1):

REAL (KIND=4) RVAL      ! Explicit 32-bits
INTEGER (KIND=4) IVAL   ! Explicit 32-bits
...
READ (1) IVAL, RVAL

This program will correctly read the expected data. However, if this type of modification is too extensive, only the variables used in the I/O statement list need be modified.

Modified program (#2):

REAL RVAL               ! Default size (32-bits)
INTEGER IVAL            ! Default size (32-bits)
REAL (KIND=4) RTMP      ! Explicit 32-bits
INTEGER (KIND=4) ITMP   ! Explicit 32-bits
...
READ (1) ITMP, RTMP
! Change explicitly sized data to default sized data:
RVAL = RTMP
IVAL = ITMP

On some systems, data types can be declared as 32 bits in size and can then be read or written directly. This is the most direct and efficient method to read or write data files for Cray X1 series and X2 systems. The user can alter the declarations of the variables used in the Fortran I/O list to declare them as KIND=4 or as REAL*4 (or INTEGER*4).

Other IEEE data conversion variants are also available, but not all variants are available on all systems:

ieee or ieee_32     The default workstation conversion specification. Data sizes are based on 32-bit words.
ieee_64             The default IEEE specification on Cray T90/IEEE and Cray T3E systems. Data sizes are based on 64-bit words.
ieee_le or ultrix   Data sizes are based on 32-bit words and are little-endian.
mips                Data sizes are based on 32-bit words, except for 128-bit floating point data which uses a "double double" format.
ia                  IEEE data types with Intel-style little-endian.

18.4.4 VAX/VMS Conversion

Nine record types are supported for VAX/VMS record conversion. This includes a combination of three record types and the three types of storage medium, as defined in the following list:

Record type   Definition
f             Fixed-length records
v             Variable-length records
s             Segmented records

Media         Definition
tr            For transparent access to files
bb            For unlabeled tapes and bb station transfers
tape          For labeled tapes

Segmented records are mainly used by VAX/VMS Fortran. The following examples show some combinations of segmented records in different types of storage media:

Example       Definition
vms.s.tr      Use as an FFIO specification to read or write a file containing segmented records with transparent access. In the fetch and dispose commands, specify the -f tr option for the file.
vms.s.tape    Use as an FFIO specification to read or write a file containing segmented records on a labeled tape.
vms.s.bb      Use as an FFIO specification to read or write a file containing segmented records on an unlabeled tape. In the fetch and dispose commands, specify the -f bb option for the file if it is not a tape.

The VAX/VMS system stores its data as a stream of bytes on various devices. Cray X1 series and X2 systems number their bytes from the most-significant bits to the least-significant bits, while the VAX system numbers the bytes from lowest significance up. The Cray X1 series and X2 systems make this byte-ordering transparent when you use text files. When data conversion is used, byte swapping sometimes must be done.

Named Pipe Support [19]

Named pipes, or UNIX FIFO special files for I/O requests, are created with the mknod(2) system call; these special files allow any two processes to exchange information. The system call creates an inode for the named pipe and establishes it as a named pipe that can be read from or written to. It can then be used by standard Fortran I/O or C I/O. Piped I/O is faster than normal I/O and requires less memory than memory-resident files.

Fortran programs can communicate with each other using named pipes. After a named pipe is created, Fortran programs can access that pipe almost as if it were a normal file. The unique aspects of process communication using named pipes are discussed in the following list; the examples show how a Fortran program can use standard Fortran I/O on pipes:

• A named pipe must be created before a Fortran program opens it. The following syntax for the command creates a named pipe called fort.13. The p argument makes it a pipe.

/bin/mknod fort.13 p

A named pipe can be created from within a Fortran program by using the PXFSYSTEM routine.
The following example creates a named pipe:

INTEGER ILEN,IERROR
ILEN=0
CALL PXFSYSTEM ('/bin/mknod fort.13 p',ILEN,IERROR)

• Fortran programs can use two named pipes: one to read and one to write. A Fortran program can read from or write to any named pipe, but it cannot do both at the same time. This is a Fortran restriction on pipes, not a system restriction. It occurs because Fortran does not allow read and write access at the same time.

• I/O transfers through named pipes use memory for buffering. A separate buffer is created for each named pipe. The PIPE_BUF parameter defines the kernel buffer size in the /sys/param.h parameter file. The default value of PIPE_BUF is 8 blocks (8 * 512 words), but the full size may not be needed or used. I/O to named pipes does not transfer to or from a disk. However, if I/O transfers fill the buffer, the writing process waits for the receiving process to read the data before refilling the buffer. If the size of the PIPE_BUF parameter is increased, I/O performance may decrease because of buffer contention. If memory has already been allocated for buffers, more space will not be allocated.

• Binary data transferred between two processes through a named pipe must use the correct file structure. An undefined file structure (specified by assign -s u) should be specified for a pipe by the sending process. An unblocked structure (specified by assign -s unblocked) should be specified for a pipe by the receiving process. You can also select a file specification of system (assign -F system) for the sending process. The file structure of the receiving or read process can be set to either an undefined or an unblocked file structure. However, if the sending process writes a request that is larger than PIPE_BUF, it is essential for the receiving process to read the data from a pipe set to an unblocked file structure.
A read of a transfer larger than PIPE_BUF on an undefined file structure yields only the amount of data specified by PIPE_BUF. The receiving process does not wait to see whether the sending process is refilling the buffer. The amount of data in the pipe may be less than the value of PIPE_BUF.

For example, the following assign commands specify that the file structure of the named pipe (unit 13, file name pipe) for the sending process should be undefined (-s u). The named pipe (unit 15, file name pipe) is type unblocked (-s unblocked) for the read process.

assign -s u -a pipe u:13
assign -s unblocked -a pipe u:15

• A read from a pipe that is closed by the sender causes an end-of-file (EOF). To detect EOF on a named pipe, the pipe must be opened as read-only by the receiving process. The remainder of this chapter presents more information about detecting EOF.

19.1 Piped I/O Example without End-of-file Detection

In this example, two Fortran programs communicate without end-of-file (EOF) detection. Program writerd generates an array, which contains the elements 1 to 3, and writes the array to named pipe pipe1. Program readwt reads the three elements from named pipe pipe1, prints out the values, adds 1 to each value, and writes the new elements to named pipe pipe2. Program writerd reads the new values from named pipe pipe2 and prints them. The -a option of the assign command allows the two processes to access the same file with different assign characteristics.
Example 8: No EOF Detection: program writerd

      program writerd
      parameter(n=3)
      dimension ia(n)
      do 10 i=1,n
         ia(i)=i
10    continue
      write (10) ia
      read (11) ia
      do 20 i=1,n
         print*,'ia(',i,') is ',ia(i),' in writerd'
20    continue
      end

Example 9: No EOF Detection: program readwt

      program readwt
      parameter(n=3)
      dimension ia(n)
      read (15) ia
      do 10 i=1,n
         print*,'ia(',i,') is ',ia(i),' in readwt'
         ia(i)=ia(i)+1
10    continue
      write (16) ia
      end

The following command sequence executes the programs:

ftn -o readwt readwt.f
ftn -o writerd writerd.f
/bin/mknod pipe1 p
/bin/mknod pipe2 p
assign -s u -a pipe1 u:10
assign -s unblocked -a pipe2 u:11
assign -s unblocked -a pipe1 u:15
assign -s u -a pipe2 u:16
readwt &
writerd

This is the output of the two programs:

ia(1) is 1 in readwt
ia(2) is 2 in readwt
ia(3) is 3 in readwt
ia(1) is 2 in writerd
ia(2) is 3 in writerd
ia(3) is 4 in writerd

19.2 Detecting End-of-file on a Named Pipe

The following conditions must be met to detect end-of-file on a read from a named pipe within a Fortran program:

• The program that sends data must open the pipe in a specific way, and the program that receives the data must open the pipe as read-only.
• The program that sends or writes the data must open the named pipe as read and write or write-only. Read and write is the default because the /bin/mknod command creates a named pipe with read and write permission.
• The program that receives or reads the data must open the pipe as read-only. A read from a named pipe that is opened as read and write waits indefinitely for the data.

19.3 Piped I/O Example with End-of-file Detection

This example uses named pipes for communication between two Fortran programs with end-of-file detection. The programs in this example are similar to the programs used in the preceding section. This example shows that program readwt can detect the EOF.
Program writerd generates array ia and writes the data to the named pipe pipe1. Program readwt reads the data from the named pipe pipe1, prints the values, adds one to each value, and writes the new elements to named pipe pipe2. Program writerd reads the new values from pipe2 and prints them. Finally, program writerd closes pipe1 and causes program readwt to detect the EOF.

This command sequence executes these programs:

ftn -o readwt readwt.f
ftn -o writerd writerd.f
assign -s u -a pipe1 u:10
assign -s unblocked -a pipe2 u:11
assign -s unblocked -a pipe1 u:15
assign -s u -a pipe2 u:16
/bin/mknod pipe1 p
/bin/mknod pipe2 p
readwt &
writerd

Example 10: EOF Detection: program writerd

      program writerd
      parameter(n=3)
      dimension ia(n)
      do 10 i=1,n
         ia(i)=i
10    continue
      write (10) ia
      read (11) ia
      do 20 i=1,n
         print*,'ia(',i,') is',ia(i),' in writerd'
20    continue
      close (10)
      end

Example 11: EOF Detection: program readwt

      program readwt
      parameter(n=3)
      dimension ia(n)
C     open the pipe as read-only
      open(15,form='unformatted', action='read')
      read (15,end = 101) ia
      do 10 i=1,n
         print*,'ia(',i,') is ',ia(i),' in readwt'
         ia(i)=ia(i)+1
10    continue
      write (16) ia
      read (15,end = 101) ia
      goto 102
101   print *,'End of file detected'
102   continue
      end

This is the output of the two programs:

ia(1) is 1 in readwt
ia(2) is 2 in readwt
ia(3) is 3 in readwt
ia(1) is 2 in writerd
ia(2) is 3 in writerd
ia(3) is 4 in writerd
End of file detected

Glossary

absolute address
1. A unique, explicit identification of a memory location, a peripheral device, or a location within a peripheral device. 2. A precise memory location that is an actual address number rather than an expression from which the address can be calculated.

accelerated mode
One of two modes of execution for an application on UNICOS/mp systems; the other mode is flexible mode.
Applications running in accelerated mode perform in a predictable period of processor time, though their wall clock time may vary depending on I/O usage, network use, and/or whether any oversubscription occurs on the relevant nodes. Due to the characteristics of the memory address space, accelerated applications must run on logically contiguous nodes. See also flexible mode.

application node
For UNICOS/mp systems, a node that is used to run user applications. Application nodes are best suited for executing parallel applications and are managed by the strong application placement scheduling and gang scheduling mechanism Psched. See also node; node flavor.

array assignment statement
See array syntax statement.

array syntax statement
A Fortran statement that allows you to use the array name (or the array name with a section subscript) to specify actions on all the elements of an array (or array section) without using DO loops. For example, the A = B array syntax statement assigns all the values of array B to array A. Sometimes called an array assignment statement.

assign environment
The set of information used in Fortran to alter the details of a Fortran connection. This information includes a list of unit numbers, file names, and file name patterns that have attributes associated with them. Any file name, file name pattern, or unit number to which assign options are attached is called an assign object. When the unit or file is opened from Fortran, the options are used to set up the properties of the connection.

asynchronous I/O
I/O operation during which the program performs other operations that do not involve the data in the I/O operation. Control is returned to the calling program after the I/O is initiated. The program may perform calculations unrelated to the previous I/O request, or it may issue another unrelated I/O request while waiting for the first I/O request to complete.
An operation is complete when all data has been moved.

barrier
An obstacle within a program that provides a mechanism for synchronizing tasks. When a task encounters a barrier, it must wait until all specified tasks reach the barrier.

barrier synchronization
1. An event initiated by software that prevents cooperating tasks from continuing to issue new program instructions until all of the tasks have reached the same point in the program. 2. A feature that uses a barrier to synchronize the processors within a partition. All processors must reach the barrier before they can continue the program.

basic block
A section of a program that does not cross any conditional branches, loop boundaries, or other transfers of control. There is a single entry point and a single exit point. Many compiler optimizations occur within basic blocks.

binary blocked
A file format that describes blocked, nontranslatable data.

binary stream
An ordered sequence of characters that can transparently record internal data. Data read in from a binary stream equals data that was written earlier out to that stream under the same implementation.

binding
The way in which one component in a resource specification is related to another component.

block data
A type of Fortran program unit. A block data program unit contains only data definitions. It specifies initial values for a restricted set of data objects.

blocking
An optimization that involves changing the iteration order of loops that access large arrays so that groups of array elements are processed as many times as possible while they reside in cache.

C interoperability
A Fortran feature that allows Fortran programs to call C functions and access C global objects and also allows C programs to call Fortran procedures and access Fortran global objects.

cache line
A division of cache. Each cache line can hold multiple data items.
For Cray X1 and X2 systems, a cache line is 32 bytes, which is the maximum size of a hardware message.

co-array
A syntactic extension to Fortran that offers a method for programming data passing; a data object that is identically allocated on each image and can be directly referenced syntactically by any other image.

co-dimensions
The dimensions of a co-array; specified within brackets ([ ]). A co-array specification consists of the local object specification and the co-dimensions specification.

common block
An area of memory, or block, that can be referenced by any program unit. In Fortran, a named common block has a name specified in a Fortran COMMON or TASKCOMMON statement, along with specified names of variables or arrays stored in the block. A blank common block, sometimes referred to as blank common, is declared in the same way but without a name.

compute module
For Cray X1 and X2 series mainframes, the physical, configurable, scalable building block. Each compute module contains either one node with 4 MCMs/4MSPs (Cray X1 modules) or two nodes with 4 MCMs/8MSPs (Cray X1E modules). Sometimes referred to as a node module. See also node.

construct
A sequence of statements in Fortran that starts with a SELECT CASE, DO, IF, or WHERE statement and ends with the corresponding terminal statement.

Cray Fortran Compiler
The compiler that translates Fortran programs into Cray object files. The Cray Fortran Compiler fully supports the Fortran language through the Fortran 2003 Standard, ISO/IEC 1539-1:2004.

Cray pointee
See Cray pointer.

Cray pointer
A variable whose value is the address of another entity, which is called a pointee. The Cray pointer type statement declares both the pointer and its pointee. The Cray pointee does not have an address until the value of the Cray pointer is defined; the pointee is stored starting at the location specified by the pointer.
Cray Programming Environment Server (CPES)
A server for the Cray X1 and X2 series systems that runs the Programming Environment software.

Cray streaming directives (CSDs) (X1 only)
Nonadvisory directives that allow you to more closely control multistreaming for key loops.

Cray X1 series system
The Cray system that combines the single-processor performance and single-shared address space of Cray parallel vector processor (PVP) systems with the highly scalable microprocessor-based architecture that is used in Cray T3E systems. Cray X1 and Cray X1E systems utilize powerful vector processors, shared memory, and a modernized vector instruction set in a highly scalable configuration that provides the computational power required for advanced scientific and engineering applications.

CrayDoc
Cray's documentation system for accessing and searching Cray books, man pages, and glossary terms from a web browser.

CrayPat
For Cray X1 and X2 series systems, the primary high-level tool for identifying opportunities for optimization. CrayPat allows you to perform profiling, sampling, and tracing experiments on an instrumented application and to analyze the results of those experiments; no recompilation is needed to produce the instrumented program. In addition, the CrayPat tool provides access to all hardware performance counters.

data passing
Transferring data from one object to another; useful for programming single-program-multiple-data (SPMD) parallel computation. Its chief advantage over message passing is lower latency for data transfers, which leads to better scalability of parallel applications. Data passing can be achieved by using SHMEM library routines or by using co-arrays.

deferred implementation
The label used to introduce information about a feature that will not be implemented until a later release.

direct-access I/O
I/O operation where a peripheral device or a channel controls data transfer in and out of the computer.
The data transfers directly to or from storage and bypasses the processor.

dynamic thread adjustment
In OpenMP, the automatic adjustment of the number of threads between parallel regions. Also known as dynamic threads or the dynamic thread mechanism.

entry point
A location in a program or routine at which execution begins. A routine may have several entry points, each serving a different purpose. Linkage between program modules is performed when the linkage editor binds the external references of one group of modules to the entry points of another module.

environment variable
A variable that stores a string of characters for use by your shell and the processes that execute under the shell. Some environment variables are predefined by the shell, and others are defined by an application or user. Shell-level environment variables let you specify the search path that the shell uses to locate executable files, the shell prompt, and many other characteristics of the operation of your shell. Most environment variables are described in the ENVIRONMENT VARIABLES section of the man page for the affected command.

Etnus TotalView
A symbolic source-level debugger designed for debugging the multiple processes of parallel Fortran, C, or C++ programs.

explicit data conversion
The process by which the user performs calls to subroutines that convert native data to and from foreign data formats.

flexible file I/O (FFIO)
A method of I/O, sometimes called layered I/O, wherein each processing step requests one I/O layer or grouping of layers. A layer refers to the specific type of processing being done. In some cases, the name corresponds directly to the name of one layer. In other cases, however, specifying one layer invokes the routines used to pass the data through multiple layers.

flexible mode
One of two modes of execution for an application on UNICOS/mp systems; the other mode is accelerated mode.
Applications running in flexible mode may run on noncontiguous nodes; they perform in a less predictable amount of processor time than applications running in accelerated mode due to the exclusive use of source processor address translation. See also accelerated mode.

folding
A basic compiler optimization that converts operations on constants to simpler forms, as these examples show:

Operation to fold    Folded operation
1 + 2                3
5.0/3.0 + 1.7        3.366... (if -O fp1 (Fortran) or -h fp1 (C/C++) or greater is used)
sin( 1.3 )           0.96355818...
3 + n - 4            n - 1

formatted I/O
Data transfer with editing. Formatted I/O can be edit-directed, list-directed, or namelist I/O. If the format identifier is an asterisk, the I/O statement is a list-directed I/O statement. All other format identifiers indicate edit-directed I/O. Formatted I/O should be avoided when I/O performance is important.

gather/scatter
An operation that copies data between remote and local memory or within local memory. A gather is any software operation that copies a set of data that is nonsequential in a remote (or local) processor, usually storing it into a sequential (contiguous) area of local processor memory. A scatter copies data from a sequential (contiguous) area of local processor memory into nonsequential locations in a remote (or local) memory.

implicit data conversion
The process by which you declare that a particular file contains foreign data and/or record blocking and then request that the run-time library perform appropriate transformations on the data to make it useful to the program at I/O time.

implicit open
The opening of a file or a unit when the first reference to a unit number is an I/O statement other than OPEN, CLOSE, INQUIRE, BACKSPACE, ENDFILE, or REWIND.

invariant
A rule, such as the ordering of an ordered list or heap, that applies throughout the life of a data structure or procedure. Each change to the data structure must maintain the correctness of the invariant.
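The invariant entry above can be made concrete with a small sketch. It is written in Python purely for illustration and is not taken from the Cray manual; the helper names is_sorted and insert_sorted are hypothetical. The rule being maintained is the ascending order of a list, and every insertion re-checks it.

```python
import bisect


def is_sorted(values):
    """Return True when the list satisfies the ascending-order invariant."""
    return all(a <= b for a, b in zip(values, values[1:]))


def insert_sorted(values, item):
    """Insert item at the position that keeps the list in ascending order."""
    bisect.insort(values, item)
    # Every change to the data structure must leave the invariant intact.
    assert is_sorted(values)


ordered = [2, 5, 9]
insert_sorted(ordered, 7)
insert_sorted(ordered, 1)
print(ordered)  # [1, 2, 5, 7, 9]
```

Appending with ordered.append(7) instead would be a change that breaks the rule, which is the kind of update the definition rules out.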
kind
Data representation (for example, single precision, double precision). The kind of a type is referred to as a kind parameter or kind type parameter of the type. The kind type parameter KIND indicates the decimal range for the integer type, the decimal precision and exponent range for the real and complex types, and the machine representation method for the character and logical types.

layered I/O
See flexible file I/O (FFIO).

lexical extent
In OpenMP, statements that reside within a structured block. See also structured block.

list-directed I/O
I/O where the records consist of a sequence of values separated by value separators such as commas or spaces. A tab is treated as a space in list-directed input, except when it occurs in a character constant that is delimited by apostrophes or quotation marks.

lock
1. Any device or algorithm that is used to ensure that only one process will perform some action or use some resource at a time.
2. A synchronization mechanism that, by convention, forces some data to be accessed by tasks in a serial fashion. Locks have two states: locked and unlocked.
3. A facility that monitors critical regions of code.

loop collapse
An optimization that combines loop interchange and loop fusion to convert a loop nest into a single loop, with an iteration count that is the product of the iteration counts of the original loops.

loop fusion
An optimization that takes the bodies of loops with identical iteration counts and fuses them into a single loop with the same iteration count.

loop interchange
An optimization that changes the order of loops within a loop nest, to achieve stride minimization or eliminate data dependencies.

loop invariant
A value that does not change between iterations of a loop.

loop unrolling
An optimization that increases the step of a loop and duplicates the expressions within a loop to reflect the increase in the step.
This can improve instruction scheduling and reduce memory access time.

master thread
The thread that creates a team of threads when an OpenMP parallel region is entered.

Message Passing Interface (MPI)
A widely accepted standard for communication among nodes that run a parallel program on a distributed-memory system. MPI is a library of routines that can be called from Fortran, C, and C++ programs.

module file
A metafile that defines information specific to an application or collection of applications. (This term is not related to the module statement of the Fortran language; it is related to setting up the Cray system environment.) For example, to define the paths, command names, and other environment variables to use the Programming Environment for Cray systems, you use the module file PrgEnv, which contains the base information needed for application compilations. The module file mpt sets a number of environment variables needed for message passing and data passing application development.

multistreaming processor (MSP) (X1 only)
For UNICOS/mp systems, a basic programmable computational unit. Each MSP is analogous to a traditional processor and is composed of four single-streaming processors (SSPs) and E-cache that is shared by the SSPs. See also node.

multithreading
The concurrent use of multiple threads of control that operate within the same address space.

named pipe
A first-in, first-out file that allows communication between two unrelated processes running on the same host.

namelist I/O
I/O that allows you to group variables by specifying a namelist group name. On input, any namelist item within that list may appear in the input record with a value to be assigned. On output, the entire namelist is written.

NaN
An IEEE floating-point representation for the result of a numerical operation that cannot return a valid number value; that is, not a number, NaN.
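As an illustration of the NaN entry above, the following sketch produces a NaN and shows its two most visible properties. It uses Python rather than Fortran and assumes the usual case in which Python floats are IEEE 754 double-precision values; it says nothing about how the Cray compiler or hardware handles NaN.

```python
import math

inf = float("inf")

# Infinity minus infinity has no valid numeric answer, so IEEE arithmetic
# returns NaN instead of a number.
result = inf - inf

print(math.isnan(result))  # True
print(result == result)    # False: a NaN compares unequal even to itself
```

Because a NaN never compares equal to anything, a dedicated test such as math.isnan is the reliable way to detect one.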
node
For UNICOS/mp systems, the logical group of four multistreaming processors (MSPs), cache-coherent shared local memory, high-speed interconnections, and system I/O ports. A Cray X1 system has one node with 4 MSPs per compute module. A Cray X1E system has two nodes of 4 MSPs per node, providing a total of 8 MSPs on its compute module. Software controls how a node is used: as an OS node, application node, or support node. See also compute module.

node
In networking, a processing location. A node can be a computer (host) or some other device, such as a printer. Every node has a unique network address.

node flavor
For UNICOS/mp systems, software controls how a node is used. A node's software-assigned flavor dictates the kind of processes and threads that can use its resources. The three assignable node flavors are application, OS, and support. See also application node; OS node; support node; system node.

OpenMP
An industry-standard, portable model for shared memory parallel programming.

OS node
For UNICOS/mp systems, the node that provides kernel-level services, such as system calls, to all support nodes and application nodes. See also node; node flavor.

overindexing
The nonstandard practice of referencing an array with a subscript not contained between the declared lower and upper bounds of the corresponding dimension for that array. This practice sometimes, but not always, leads to referencing a storage location outside of the entire array.

page size
The unit of memory addressable through the Translation Lookaside Buffer (TLB). For a UNICOS/mp system, the base page size is 65,536 bytes, but larger page sizes (up to 4,294,967,296 bytes) are also available.

parallel processing
Processing in which multiple processors work on a single application simultaneously.

parallel region
See serial region.

partitioning
Configuring a UNICOS/mp system into logical systems (partitions).
Each partition is independently operated, booted, dumped, and so on without impact on other running partitions. Hardware and software failures in one partition do not affect other partitions.

piped I/O
I/O that uses named pipes; faster than normal I/O because it requires less memory than memory-resident files. See also named pipe.

pointer
A data item that consists of the address of a desired item.

Psched
The UNICOS/mp application placement scheduling tool. The psched command can provide job placement, load balancing, and gang scheduling for all applications placed on application nodes.

rank
The number of dimensions in a Fortran array. Rank is declared when the array is declared and cannot change.

reduction
The process of transforming an expression according to certain reduction rules. The most important forms are beta reduction (application of a lambda abstraction to one or more argument expressions) and delta reduction (application of a mathematical function to the required number of arguments). An evaluation strategy (or reduction strategy) determines which part of an expression to reduce first. There are many such strategies. Also called contraction.

reduction loop
A loop that contains at least one statement that reduces an array to a scalar value by doing a cumulative operation on many of the array elements. This involves including the result of the previous iteration in the expression of the current iteration.

scalar processing
A form of fine-grain serial processing whereby iterative operations are performed sequentially on the elements of an array, with each iteration producing one result.

scoping unit
Part of a program in which a name has a fixed meaning. A program unit or subprogram generally defines a scoping unit. Type definitions and procedure interface bodies also constitute scoping units. Scoping units do not overlap, although one scoping unit may contain another in the sense that it surrounds it.
If a scoping unit contains another scoping unit, the outer scoping unit is referred to as the host scoping unit of the inner scoping unit.

serial region
An area within a program in which only the master task is executing. Its opposite is a parallel region.

SHMEM
A library of optimized functions and subroutines that take advantage of shared memory to move data between the memories of processors. The routines can either be used by themselves or in conjunction with another programming style such as Message Passing Interface. SHMEM routines can be called from Fortran, C, and C++ programs.

shortloop
A loop that is vectorized but that has been determined by the compiler to have trips less than or equal to the maximum vector length. In this case, the compiler deletes the branch back to the top of the loop. If the shortloop directive is used or the trip count is constant, the top test for number of trips is deleted. A shortloop is more efficient than a conventional loop.

side effects
The result of modifying shared data or performing I/O by concurrent streams without the use of an appropriate synchronization mechanism. Modifying shared data (where multiple streams write to the same location or write/read the same location) without appropriate synchronization can cause unreliable data and race conditions. Performing I/O without appropriate synchronization can cause an I/O deadlock. Shared data, in this context, occurs when any object may be referenced by two or more single-streaming processors (X1 only). This includes globally visible objects (for example, COMMON, MODULE data), statically allocated objects (SAVE, C static), dummy arguments that refer to SHARED data, and objects in the SHARED heap.

single-streaming processor (SSP) (X1 only)
For UNICOS/mp systems, a basic programmable computational unit. See also node.

stack allocation
A method of allocating memory for variables used by a called routine during program execution.
Variables are reset for each invocation of a subprogram. Stack mode is required for multitasked code.

stride
The relationship between the layout of an array's elements in memory and the order in which those elements are accessed. A stride of 1 means that memory-adjacent array elements are accessed on successive iterations of an array-processing loop.

structured block
In Fortran OpenMP, a collection of one or more executable statements with a single point of entry at the top and a single point of exit at the bottom. Execution must always proceed with entry at the top of the block and exit at the bottom, with one exception: a structured block may contain a STOP statement. This statement has the well-defined behavior of terminating the entire program.

support node
For UNICOS/mp systems, the node that is used to run serial commands, such as shells, editors, and other user commands (ls, for example). See also node; node flavor.

symbol table
A table of symbolic names (for example, variables) used in a program to store their memory locations. The symbol table is part of the executable object generated by the compiler. Debuggers use it to help analyze the program.

synchronous I/O
I/O operation during which an executing program relinquishes control until the operation is complete. An operation is not complete until all data is moved.

system cache
A set of buffers in kernel memory used for I/O operations by the operating system. The system cache ensures that the actual I/O to the logical device is well formed, and it tries to remember recent data in order to reduce physical I/O requests. In many cases, however, it is desirable to bypass the system cache and to perform I/O directly between the user's memory and the logical device.
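To make the stride entry above concrete, here is a minimal sketch in Python. It assumes a 3 x 4 array stored row by row in one flat list (the layout C uses; Fortran stores arrays column by column, so the roles of the two loop orders are swapped there). The function names are hypothetical and the code only lists which memory offsets each loop order touches.

```python
ROWS, COLS = 3, 4  # illustrative array shape, not taken from the manual


def offsets_row_order():
    """Inner loop over columns: successive accesses are 1 element apart."""
    return [r * COLS + c for r in range(ROWS) for c in range(COLS)]


def offsets_column_order():
    """Inner loop over rows: successive accesses are COLS elements apart."""
    return [r * COLS + c for c in range(COLS) for r in range(ROWS)]


print(offsets_row_order()[:4])     # [0, 1, 2, 3]  -> stride 1
print(offsets_column_order()[:4])  # [0, 4, 8, 1]  -> stride 4 within a column
```

Both orders visit every element exactly once; only the distance between consecutive accesses differs, which is what loop interchange tries to minimize.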
system node
For UNICOS/mp systems, the node that is designated as both an OS node and a support node; this node is often called a system node; however, there is no node flavor of "system." See also node; node flavor.

system time
The amount of time that the operating system spends providing services to an application.

thread
The active entity of execution. A sequence of instructions together with machine context (processor registers) and a stack. On a parallel system, multiple threads can be executing parts of a program at the same time.

TKR
An acronym that represents attributes for argument association. It represents the data type, kind type parameter, and rank of the argument.

trigger
A command that a user logged into a Cray X1 series system uses to launch Programming Environment components residing on the CPES. Examples of trigger commands are ftn, CC, and pat_build.

type
A means for categorizing data. Each intrinsic and user-defined data type has four characteristics: a name, a set of values, a set of operators, and a means to represent constant values of the type in a program.

unblocked file structure
A file that contains undelimited records. Because it does not contain any record control words, it does not have record boundaries.

unformatted I/O
Transfer of binary data without editing between the current record and the entities specified by the I/O list. Exactly one record is read or written. The unit must be an external unit.

UNICOS/lc
The operating system for Cray X2 series systems.

UNICOS/mp
The operating system for Cray X1 series (Cray X1 and Cray X1E) systems.

unrolling
A single-processing-element optimization technique in which the statements within a loop are copied. For example, if a loop has two statements, unrolling might copy those statements four times, resulting in eight statements.
The loop control variable would be incremented for each copy, and the stride through the array would also be increased by the number of copies. This technique is often performed directly by the compiler, and the number of copies is usually between two and four.

vector
A series of values on which instructions operate; this can be an array or any subset of an array such as row, column, or diagonal. Applying arithmetic, logical, or memory operations to vectors is called vector processing. See also vector processing.

vector length
The number of elements in a vector.

vector processing
A form of instruction-level parallelism in which the vector registers are used to perform iterative operations in parallel on the elements of an array, with each iteration producing up to 64 simultaneous results. See also vector.

vector register
The register that serves as a source and destination for vector operations.

vectorization
The process, performed by the compiler, of analyzing code to determine whether it contains vectorizable expressions and then producing object code that uses the vector unit to perform vector processing.
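The unrolling entry above describes copying the loop body and increasing the step. The sketch below shows that transformation done by hand, in Python and for illustration only; a compiler applies it to generated code, and the function names and the unroll factor of 4 are assumptions. A short cleanup loop handles the iterations left over when the trip count is not a multiple of the factor.

```python
def sum_rolled(a):
    """Original loop: one element per iteration, step 1."""
    total = 0
    for i in range(len(a)):
        total += a[i]
    return total


def sum_unrolled_by_4(a):
    """Same computation with the body copied four times and a step of 4."""
    total = 0
    n = len(a)
    main_end = n - n % 4              # largest multiple of 4 not exceeding n
    for i in range(0, main_end, 4):   # loop step raised from 1 to 4
        total += a[i]
        total += a[i + 1]
        total += a[i + 2]
        total += a[i + 3]
    for i in range(main_end, n):      # cleanup loop for the leftover elements
        total += a[i]
    return total


data = list(range(10))
print(sum_rolled(data), sum_unrolled_by_4(data))  # 45 45
```

The unrolled version executes fewer loop-control tests per element, which is where the scheduling benefit mentioned in the loop unrolling entry comes from.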
398 S–3901–60Index # (null) directive, 161 -- option, 80 32 bit default types, 72 64 bit default types, 72 A a.out, 5, 15, 60 Advisory directives defined, 98 ALLOCATE statement, 24, 218 American National Standards Institute (ANSI), 1 ANSI, 1 aprun command, 172, 225 Assembly language file.s, 15 output, 5, 24 output file, 15 assign environment alternative file names, 278 assign command syntax, 273 basic usage, 272 buffer size defaults, 287 buffer size specification, 287 C/C++ interface, 271 changing from within a Fortran program, 276 defined, 271 foreign file format specification, 290 Fortran file truncation, 290 Fortran I/O, 277 library calling sequence, 276 library routines, 276 local assign mode, 292 memory resident files, 290 selecting file structure, 279 setting the FILENV variable, 292 system cache, 289 unbuffered I/O, 289 using FFIO in, 271 assign objects open processing, 272 ASSIGN statement, 238 Assignment, 191 Asterisk delimiter, 247 Asynchronous I/O, 266 AUTOMATIC attribute and statement, 189 B -b bin_file option, 17 -b bin_obj_file option, 16, 24, 75, 80 BACKSPACE statement, 281 Barriers, 218 bin file structure defined, 283 padding, 283 binary data streams, 302 Binary file, creating, 16 BIND(C) syntax, 210 Bitwise logical expressions, 194 Block Control Word, 284 BLOCKABLE, 93 BLOCKABLE directive, 133 blocked file structure defined, 284 using BUFFER IN/OUT, 284 using ENDFILE, 284 blocked layer defined, 299 BLOCKINGSIZE, 93 BLOCKINGSIZE directive, 133 Boolean data type introduction, 187 Bounds checking, 225 BOUNDS directive, 92, 130 BOZ constant, 189 Bracket reference, 219 Branching, 241 bufa layer, 304 S–3901–60 399Cray® Fortran Reference Manual defined, 299 specification, 313 BUFFER IN statement, 244 BUFFER OUT statement, 244 Buffer sizes, 287 Buffer specifications, 286 buffers bufa layer, 313 cachea layer, 316 memory-resident files, 327 named pipes, 377 sizes, 303 using binary stream layers, 303 write-behind and read-ahead, 286 BYTE data type, 230 Byte 
size scaling, 73–74 byte_pointer, 71, 74 C -C cifopts option, 17 -c option , 17, 60, 80 cache layer, 305 defined, 299 improving I/O performance with, 306 specification, 305, 315 Cache management, 38 CACHE_SHARED directive, 92, 97–98 cachea layer, 304 defined, 299 specification, 316 CAL, 24 CDIR$, 87 !CDIR$ directive, 91 Character constant, 189 CIF, 16, 18 CLONE directive, 92, 121 Co-array Fortran, 323 Co-array syntax, 79 Co-arrays co-dimension, 217–218 co-rank, 213 co-shape, 213 co-size, 213 local rank, 213 local shape, 213 local size, 213 LOG2_IMAGES, 219 NUM_IMAGES, 219 related publications, 212 REM_IMAGES, 219 SSP mode, 225 SYNC_ALL, 219 THIS_IMAGE, 219 COERCE_KIND directive, 92 COLLAPSE directive, 126 Column widths, 34 Command line options -Y option, 79 Common blocks, 191 COMMON statement, 191 Common-block report, 65 Compilation phases -Yphase,dirname, 79 Compiler Information File (CIF) See CIF CONCURRENT directive, 93, 136 Conditional compilation, 76 overview, 157 CONTAINS statement, 204 conversion methods, 369 COPY_ASSUMED_SHAPE directive, 92, 98 COS data conversion, 370 cos file structure defined, 284 using BUFFER IN/OUT, 284 using ENDFILE IN/OUT, 284 cos layer defined, 299 specification, 318 CPES, 11 Cray Apprentice2, 12 Cray C, 80 Cray C++, 80 400 S–3901–60Index Cray character pointer data representation, 260 Cray Performance Analyzer Too, 3 Cray pointers and scaling factors, 71, 73–74 Cray Programming Environment Server (CPES), 11 Cray streaming directives See CSDs CRAY_FTN_OPTIONS, 82 CRAY_PE_TARGET environment variable, 82 CrayPat, 3, 226 creating a user-defined FFIO layer, 337 CRITICAL directive, 150 Cross-compiler platforms, 5 CSD continuing long CSD statements, 144 long CSD statements, 144 CSD directive, 152 CSDs, 78, 143 chunk size, 147 compatibility, 143 compiler options, 155 dynamic memory allocation within, 155 incorrect results, 145 Nested, 153 ORDERED clause, 145 parallel regions, 144 placement, 153 PRIVATE clause, 145 SCHEDULE clause, 146 
shared data protection, 154 stand-alone, 153 D -d disable option, 18 -D identifier [=value] option, 26 Data global, 191 data item conversion absolute binary files, 369 explicit conversion, 369 implicit conversion, 369 Data passing, 210 DATA statement, 216, 234 Data type, 180 Boolean, 187 Cray pointer, 181 debugging using the event layer to monitor I/O activity, 319 Debugging support, 3, 27 DECODE statement, 243 default types, size of, 72 default64, 72 Defaults -d n, 22 -d Z, 25 -d0, 18 -da, 18 -dc, 18 -dd, 19 -dD, 19 -dE, 20 -dg, 20 -dh, 21 -dI, 21 -dj, 21 -dL, 22 -dm, 22 -do, 22 -dP, 23 -dQ, 23 -dR, 24 -ds, 24 -dS, 24 -dv, 24 -eB, 18 -eg, 20 -Ep, 23 -Eq, 23 -Ey, 25 -h msp, 31 -h nompmd, 30 -O 2, 37 -O fp2, 40 S–3901–60 401Cray® Fortran Reference Manual -O infinitevl, 44 -O ipa3, 44 -O modinline, 49 -O msp, 50 -O noaggress, 38 -O nointerchange, 51 -O nomsgs, 50 -O nonegmsgs, 51 -O nooverindex, 51 -O nopattern, 52 -O nozeroinc, 59 -O scalar2, 53 -O shortcircuit3, 54 -O stream2, 56 -O task1, 57 -O vector2, 59 O- cache0, 38 -s byte_pointer, 71 -s default32, 72 -s integer32, 72 -s real32, 72 #define directive, 159 Defined externals, 173 Descriptors noncharacter data, 248 !DIR$, 87 !DIR$ directive, 91 Directive conditional compilation, 158 Directives advisory, defined, 98 conditional, 161 continuing, 91 Cray streaming See CSDs disabling, 77 for local control, 130 for scalar optimization, 125 for storage, 132 for vectorization, 96 inlining, 121 interaction with -x dirlist option, 94 interaction with command line, 94 interaction with optimization options, 95 miscellaneous, 135 OpenMP Fortran API, 167 overview, 87 parallel, 144 range and placement, 92 Directories phase execution, 79 distributed I/O, 323 DO directive, 146 DOUBLE COMPLEX statement, 23, 231 See also STATIC attribute and statement Double precision, enabling/disabling, 23 Dynamic memory allocation, 263 E -e enable option, 18 #elif directive, 163 #else directive, 163 ENCODE statement, 242 END DO directive, 146 
END ORDERED directive, 151 END PARALLEL directive, 144 END PARALLEL DO directive, 149 END_CRITICAL intrinsic function, 222 #endif directive, 164 Enumeration, 187 Enumerator, 187 environment variables FILENV, 272 Environment variables, 81 EQUIVALENCE statement, 217 event layer defined, 299 log file, 320 specification, 319 examples named pipes, 377 piped I/O with no EOF detection, 378 user layer, 341 Exclusive disjunct expression, 192 402 S–3901–60Index Executable output file, 15 explain command, 5 Explicit kind values, 72 Expressions, 191 F -F option, 26 -f source_form option, 26 .f suffix, 26 .F suffix, 26 f77 layer defined, 299 specification, 321 .f90 suffix, 26 .F90 suffix, 26 .f90, .F90, .ftn, .FTN, 15 fd layer defined, 299 specification, 323 FFIO blocked layer, 299 bufa layer, 299 buffer size considerations, 303 cache layer, 299 cachea layer, 299 cachea library buffer, 289 common formats, 301 converting data files, 301 cos layer, 299 creating a user-defined layer, 337 data granularity, 312 defined, 271, 295 event layer, 299 f77 files, 303 f77 layer, 299 fd layer, 299 Fortran I/O forms, 297 global layer, 299 handling binary data, 302 handling multiple EOFs in text files, 301 I/O status fields, 340 ibm layer, 299 layer options, 300 library buffering, 289 list of supported layers, 299 modifying layer behavior, 300 mr layer, 290, 299 null layer, 299 reading and writing f77 files, 303 reading and writing fixed-length records, 303 reading and writing text files, 301 reading and writing unblocked files, 302 removing blocking, 304 selecting file structure, 279 site layer, 299 specifying layers, 299 supported operations, 313 syscall layer, 299 system layer, 298, 300 text files, 301 text layer, 300 unblocked files, 302 usage rules, 300 user layer, 300 using, 298 using sequential layers, 296 using the bufa layer, 304 using the cache layer, 305 using the cachea layer, 304 using the global layer, 305 using the mr layer, 305 using the syscall layer, 304 using with assign, 
271 vms layer, 300 FFIO and foreign data foreign conversion tips, 374 IEEE conversion, 372 FFIO layer reference bufa layer, 313 cache layer, 315 cachea layer, 316 cos layer, 318 event layer, 319 f77 layer, 321 S–3901–60 403Cray® Fortran Reference Manual fd layer, 323 global layer, 323 ibm layer, 324 layer definitions, 311 mr layer, 327 null layer, 330 site layer, 334 syscall layer, 331 system layer, 332 text layer, 332 user layer, 334 vms layer, 334 File suffixes for input files, 26 file.a, 15 file.cg, 15 file.f, 15 file.F, 15 file.f90, 15 file.F90, 15 file.ftn, 15 file.FTN, 15 file.i, 15 file.L, 15 file.lst, 15 file.M, 60 file.o, 5, 15, 17 file.opt, 15 file.s, 5, 15 file.T, 16–17, 68 FILENV environment variables, 272 files bin file structure, 283 blocked file structure, 284 data conversion, 301 default file structure, 279 enabling/suppressing truncation, 290 handling multiple EOFs in text files, 301 memory-resident, 290 reading and writing f77 files, 303 reading and writing fixed-length records, 303 reading and writing text files, 301 reading and writing unblocked files, 302 tuning connections, 277 undefined/unknown file structure, 283 Files COS blocked, 280 cos file structure, 284 F77 blocked, 280 foreign format specification, 290 Fortran access methods, 281 positioning, 198 selecting structure, 279 text, 280 text file structure, 283 unblocked, 280 unblocked file structure, 281 FIXED directive, 92, 132 Fixed source form, 26, 34, 80, 91 Fixed source form D lines, 180 FLUSH statement, 266 foreign file conversion choosing conversion methods, 369 conversion techniques, 370 COS conversions, 370 IBM, 371 IEEE, 372 implicit data item conversion, 365 VAX/VMS, 374 FORMAT_TYPE_CHECKING environment variable, 82 Formatted I/O and internal files, 242 Fortran co-arrays, 323 I/O forms, 297 mapping I/O requests to system calls, 298 Fortran 2003 standard, 1 FORTRAN 77 compatibility, 7 FORTRAN 77 standard, 1 Fortran 90 compatibility, 6 Fortran 95 standard, 1 Fortran 95/2003 
Explained, 8 404 S–3901–60Index Fortran 95/2003 for Scientists & Engineers, 8 Fortran lister, 3 See also lister FORTRAN_MODULE_PATH environment variable, 83 FREE directive, 92, 132 Free source form, 26, 80, 91 Free source form lines, 180 ftn, 3 command example, 4 command line and options, 15 ftn command, 225 .ftn suffix, 26 .FTN suffix, 26 ftnlx, 3, 64 interaction with the -r list_opt option, 64 FUSION directive, 137 Fusion, defined, 114 G -G debug_lvl option, 27 -g option, 27 global I/O, 323 global layer, 305 defined, 299 specification, 323 Global variables, 173 H -h ieee_nonstop, 29 -h keepfiles, 29 -h mpmd, 30 -h msp option, 31 -h nompmd, 30 HAND_TUNED directive, 100 Hollerith constant, 189 Hollerith constants, 235 I -I incldir option, 31 I/O editing, 201 formatted, 242 I/O processing log file, 320 overriding defaults, 298 specifying I/O class, 296 unblocked data transfer, 304 I/O processing steps specifying I/O class, 296 I/O specifiers, 225 IBM data conversion, 371 ibm layer defined, 299 specification, 324 ID directive, 93, 137 IEEE conversion, 372 #if directive, 162 IF statement, 240 #ifdef directive, 163 #ifndef directive, 163 IGNORE_RANK directive, 92 IGNORE_TKR directive, 92, 139 implicit data item conversion, 365 supported conversions, 367 IMPLICIT NONE statement, 21 #include directive, 158 INCLUDE lines, 31 Indirect logical IF, 241 INLINE directive, 92, 122 INLINEALWAYS directive, 92, 122 INLINENEVER directive, 92, 122 Inlining command line options, 44 directives, 121 main discussion, 44 Input list directed, 200 inputfile.suffix option, 80 INQUIRE statement, 279 INT intrinsic obsolete, 250 INTERCHANGE directive, 93, 125 Interface blocks, 204 S–3901–60 405Cray® Fortran Reference Manual International Organization for Standardization (ISO), 1 Intrinsic assignment, 196 operations, 193 operators, 192 Intrinsic procedures, 205, 219 ISO, 1 IVDEP directive, 93, 100 J -J option, 32 L -L ldir option, 32 -l libname option, 32 Language elements and source form, 179 
lexical tokens names, 179 operators, 179 layered I/O bufa layer, 313 cache layer, 315 cachea layer, 316 cos layer, 318 data model, 312 defined, 295 event layer, 319 f77 layer, 321 fd layer, 323 global layer, 323 ibm layer, 324 implementation strategy, 312 mr layer, 327 null layer, 330 site layer, 334 site-specific layers, 334 supported operations, 313 syscall layer, 331 system layer, 332 text layer, 332 user layer, 334 user-defined layers, 334 vms layer, 334 ld, 80 Library return status, 277 Library files, 32 libsci, 52 List file, 65 List-directed input, 200 Lister, 3 Listing files, 64 Listing, producing, 64 LISTIO_PRECISION environment variable, 83 Loader, 80 ld, 3 preferred method for invoking, 3 LOG2_IMAGES, 219 Logical editing, 199 Loop collapse, defined, 51 Loop fusion, defined, 114 Loop optimization FUSION, 137 LOOP_INFO, 108 NOFUSION, 137 NOUNROLL, 112 SAFE_ADDRESS, 105 SHORTLOOP, 107 SHORTLOOP128, 107 UNROLL, 112 LOOP_INFO directive, 108 .lst file, 65 See also list file M -m msg_lvl option, 33 -M msgs option, 34 Macros predefined, 164 _ADDR64, 165 cray, CRAY, _CRAY, 164 _CRAYIEEE, 165 406 S–3901–60Index __crayx1, 164 __crayx1e, 164 __crayx2, 164 _MAXVL, 165 __UNICOSMP, 164 unix, 164 __unix, 164 __unix__, 164 man pages asnctl(3f), 292 asnfile, 276 asnrm, 276 assign, 273, 276 assign(1), 269 assign(3f), 269 asunuit, 276 cp(1), 302 fdcp(1), 301 ffassign(3c), 271 ffassign(3f), 269 ffopen, 272 ffread(3c), 297 ffwrite(3c), 297 intro_ffio(1), 269 Maximum name length, 179 Memory allocation, 263 memory-resident files, 290 memory-resident layer, 327 Message Passing Interface (see also MPI), 211 Messages, 5 Messages, suppressing, 33–34 MODINLINE directive, 94, 123 Module file destination directory, 32 modulename.mod, 16 Modules, 13 MPI, 211, 226, 305, 323 MPMD, 30 mr layer, 305 defined, 299 example, 305 specification, 327 MSP mode, defined, 50 multiple end-of-file records in text files, 301 Multiple program, multiple data (MPMD), 30 Multiprocessing work quantum, 170 
Multiprocessing variables, 81
Multistreaming processor (MSP) directives, 117
Multistreaming processor, 56

N
-N col option, 34
N$PES-1, 225
NAME directive, 92, 140
Name length, maximum, 179
named pipes
  buffers, 377
  creating, 377
  defined, 377
  detecting EOF, 380
  differences from normal I/O, 377
  example, 377
  piped I/O example (no EOF detection), 378
  restrictions, 377
  specifying file structure for binary data, 378
Namelist processing, 201
Naming rules, 179
Nested loop termination, 241
NEXTSCALAR directive, 93, 101
NLSPATH environment variable, 84
NO_CACHE_ALLOC directive, 92
NOBLOCKING, 93
NOBLOCKING directive, 133
NOBOUNDS directive, 92, 130
NOCLONE directive, 92, 121
NOCOLLAPSE directive, 126
NOCSD directive, 152
NOFUSION directive, 137
NOINLINE directive, 92, 122
NOINTERCHANGE directive, 93, 125
NOMODINLINE directive, 94, 123
Nonstandard syntax, 6
NOPATTERN, 92
NOPATTERN directive, 102
NOSIDEEFFECTS directive, 92, 128
NOSTREAM directive, 92, 120
NOUNROLL directive, 93, 112
NOVECTOR directive, 92, 115
NPROC environment variable, 84
null layer
  defined, 299
  specification, 330
NUM_IMAGES, 219, 225
Numeric editing, 198

O
-O 0 option, 37
-O 1 option, 37
-O 2 option, 37
-O 3 option, 37
-O aggress option, 38
-O cachen, 38
-O ipa option, 44
-O ipafrom option, 44
-O modinline option, 49
-O msgs option, 50
-O msp option, 50
-O negmsgs option, 51
-O noaggress option, 38
-O nointerchange option, 51
-O nomodinline option, 49
-O nomsgs option, 50
-O nonegmsgs option, 51
-O nooverindex option, 51
-O nopattern option, 52
-O nozeroinc option, 59
-O opt [, opt] option, 95
-O opt [,opt] option, 35
-o out_file option, 60
-O overindex option, 51
-O pattern option, 52
-O scalar0 option, 53
-O scalar1 option, 53
-O scalar2 option, 53
-O scalar3 option, 53
-O shortcircuit option, 54
-O ssp option, 55
-O stream0 option, 56
-O stream1 option, 56
-O stream2 option, 56
-O stream3 option, 56
-O task0 option, 57
-O task1 option, 57
-O vector0 option, 59
-O vector1 option, 59
-O vector2 option, 59
-O vector3 option, 59
-O zeroinc option, 59
Obsolete features, 229
OPEN statement, 197
OpenMP, 323
  enabling compiler recognition of, 57
  memory considerations, 85, 169
OpenMP Fortran API, 167
Operators, 179
  intrinsic, 192
Optimization
  messages, 51
  options, 35
  scalar, 53
  streaming, 56
  tasking, 57
  vectorization, 59
  with debugging, 27
optimizing
  I/O performance, 303
  text file I/O, 301
  using the event layer to monitor I/O activity, 319
ORDERED directive, 151
Outmoded features, 229
Output file, 15
Overindexing, 51

P
-p module_site option, 60
PARALLEL directive, 144
PARALLEL DO directive, 149
Parallelism
  conditional, 171
pat(1), 3
PATTERN directive, 102
Pattern matching, 52
PAUSE statement, 237
Performance tool, 3
PERMUTATION directive, 93, 102
PIPELINE directive, 93, 115
Pointer arithmetic, 185
POINTER statement, 181
Pointers, 218
Predefined macros, 164
PREFERSTREAM directive, 93, 118
PREFERVECTOR directive, 93, 103
PREPROCESS directive, 141
Preprocessing
  file extensions, 26
  source preprocessing, 23, 25–26, 31, 76, 157
PROBABILITY directive, 104
Program units, 204
  block data, 204

Q
-Q path, 64

R
-r list_opt option, 64
-R runchk option, 68
READ statement, 225
read-ahead
  bufa layer, 313
  cachea layer, 316
  defined, 286
Record Control Word, 284
Recursion
  STATIC attribute, 231
Recursive functions, 204
REM_IMAGES, 219
removing record blocking, 304
RESETINLINE directive, 122
Run-time checking, 68

S
-s byte_pointer, 71, 74
-s default32, 72
-s size option, 71
-S source_file option, 24, 75
-s word_pointer, 73–74
SAFE_ADDRESS directive, 105
SAFE_CONDITIONAL directive, 106
Scalar optimization, 53
Scalar optimization directives, 125
Scaling factor, 71, 73–74
  See also Cray pointers and scaling factors
Shared memory (See also SHMEM), 210
Shell variables, 81
SHMEM, 210–211, 226, 305, 323
Short circuiting, 54
SHORTLOOP directive, 93, 107
Shortloop option, 22
SHORTLOOP128 directive, 93, 107
Single-program-multiple-data (also see SPMD), 210
site layer
  defined, 299
  specification, 334
site-specific FFIO layers, 337
Slash data initialization, 233
Source files, Fortran, 26
Source form, 180
Source forms, 26, 80
Source preprocessing
  See Preprocessing
Source Preprocessing, 157
Source preprocessing variable, defined, 159
SPMD, 210, 212
SSP mode
  universal library, 50, 56
SSP mode for co-arrays, 225
SSP mode, defined, 55
SSP_PRIVATE directive, 92, 118
STACK directive, 92, 135
Standards, 1
Star values, 72
START_CRITICAL intrinsic function, 222
STATIC attribute and statement, 231
STOP statement, 196
Storage, 261
Storage directives, 132
STREAM directive, 92, 120
Streaming, 56
Strong reference, 142
supported implicit data conversions, 367
SUPPRESS directive, 93, 129
SYMMETRIC directive, 92
SYNC directive, 150
SYNC_ALL, 219
SYNC_ALL intrinsic function, 220
SYNC_MEMORY intrinsic function, 222
SYNC_TEAM intrinsic function, 221
Synchronization, 218
syscall layer
  defined, 299, 304
  specification, 331
system calls in user-defined FFIO layers, 338
system layer
  defined, 298, 300
  implicit usage of, 332
  specification, 332
SYSTEM_MODULE directive, 92

T
-T option, 75
Tasking, 57
text file structure
  using BACKSPACE, 284
  using BUFFER IN/OUT, 284
text layer
  defined, 300
  specification, 332
THIS_IMAGE, 219
TL descriptor, 248
TMPDIR environment variable, 84
Token
  lexical, 179
TotalView, 3, 226
Trigger environment, 11
Triggers, 11
Two-branch arithmetic IF, 240
Type alias, 187
Typeless constant, 189

U
-U identifier [,identifier] ... option, 76
unblocked data transfer
  using I/O layers, 304
Unblocked file structure
  specifications, 282
#undef directive, 161
Universal library for SSP and MSP mode, 50, 56
UNIX FFIO special files, 377
UNROLL directive, 93, 112
user layer
  creating, 337
  defined, 300
  example, 341
  specification, 334
user-defined FFIO layers
  creating, 337
  I/O status fields, 340
  using system calls, 338

V
-v option, 76
-V option, 76
Variables
  STATIC attribute and values, 231
Variables, environment, 81
VAX/VMS
  explicit data item conversion, 365
  record conversion, 374
  transferring files, 364
VECTOR directive, 92, 115
Vector length, 44, 100–101
Vector pipelining, 115
Vectorization, 59
Vectorization directives, 96
Version, release, 76
VFUNCTION directive, 92, 116
vms layer
  defined, 300
  example, 300
  specification, 334

W
WEAK directive, 92, 141
Word size scaling, 71, 74
word_pointer, 73–74
Work quantum, 170
WRITE statement, 225
write-behind
  bufa layer, 313
  cachea layer, 316
  defined, 286

X
-x dirlist option, 77, 94
-X npes option, 78

Y
-Yphase,dirname, 79

Z
-Z option, 79

Cray® C and C++ Reference Manual
S–2179–60

© 1996-2000, 2002-2007 Cray Inc. All Rights Reserved. This manual or parts thereof may not be reproduced in any form unless permitted by contract or by written permission of Cray Inc.

U.S. GOVERNMENT RESTRICTED RIGHTS NOTICE

The Computer Software is delivered as "Commercial Computer Software" as defined in DFARS 48 CFR 252.227-7014. All Computer Software and Computer Software Documentation acquired by or for the U.S. Government is provided with Restricted Rights. Use, duplication or disclosure by the U.S. Government is subject to the restrictions described in FAR 48 CFR 52.227-14 or DFARS 48 CFR 252.227-7014, as applicable. Technical Data acquired by or for the U.S. Government, if any, is provided with Limited Rights. Use, duplication or disclosure by the U.S. Government is subject to the restrictions described in FAR 48 CFR 52.227-14 or DFARS 48 CFR 252.227-7013, as applicable.

Cray, LibSci, UNICOS and UNICOS/mk are federally registered trademarks and Active Manager, Cray Apprentice2, Cray C++ Compiling System, Cray Fortran Compiler, Cray SeaStar, Cray SeaStar2, Cray SHMEM, Cray Threadstorm, Cray X1, Cray X1E, Cray X2, Cray XD1, Cray XMT, Cray XT, Cray XT3, Cray XT4, CrayDoc, CRInform, Libsci, RapidArray, UNICOS/lc, and UNICOS/mp are trademarks of Cray Inc. Dinkumware and Dinkum are trademarks of Dinkumware, Ltd. Edison Design Group is a trademark of Edison Design Group, Inc. TotalView is a trademark of TotalView Technologies LLC. GNU is a trademark of The Free Software Foundation. O2 is a trademark of Silicon Graphics, Inc. SGI and Silicon Graphics are trademarks of Silicon Graphics, Inc. ISO is a trademark of International Organization for Standardization (Organisation Internationale de Normalisation). UNIX, the “X device,” X Window System, and X/Open are trademarks of The Open Group in the United States and other countries. All other trademarks are the property of their respective owners.

The UNICOS, UNICOS/mk, and UNICOS/mp operating systems are derived from UNIX System V. These operating systems are also based in part on the Fourth Berkeley Software Distribution (BSD) under license from The Regents of the University of California.

Portions of this document were copied by permission of OpenMP Architecture Review Board from OpenMP C and C++ Application Program Interface, Version 2.0, March 2002, Copyright © 1997-2002, OpenMP Architecture Review Board.

New Features

This manual was revised to include the following changes for the Programming Environment 6.0 Cray C and Cray C++ releases:

• The -h calchars option allows the use of the @ symbol on Cray X1 series systems only (see Section 2.9.3, page 24).
• The -h gen_private_callee option is not supported on Cray X2 systems (see Section 2.10.5, page 27).
• Documented Cray X2 cache levels (see Table 5, page 34).
• Added an option that directs the compiler to instrument the source code for gathering profile information (see Section 2.10.12, page 30).
• Added support of profile-guided optimization (see Section 2.10.11, page 30 and Section 3.7.10, page 99).
• On Cray X2 systems, -O values of 0, 1, 2, or 3 set that level of optimization for -h cachen options (see Table 5, page 34).
• The -h cpu=target_system option for Cray X2 systems is -h cpu=cray-x2 (see Section 2.22.2, page 58).
• The CRAY_PE_TARGET environment variable for Cray X2 systems is cray-x2 (see Section 2.24, page 65).
• Added support of the NPROC compile time environment variable for Cray X2 systems (see Section 2.24, page 65).
• Added support of the CRAYNV_STACK_SIZE run time environment variable for Cray X2 systems (see Section 2.25, page 67).
• Added support of the loop_info prefetch and loop_info noprefetch directives for Cray X2 systems. Preloading scalar data into cache can improve the frequency of cache hits and lower latency (see Section 3.7.3, page 92).
• Added support of vector atomic memory operations (AMOs) on Cray X2 systems (see Example 10, page 94).
• On Cray X2 systems, the maximum number of threads (aprun -d depth value) is 4 (see Section 5.5, page 132).
• For Cray X2 systems, the Cray C++ compiler uses the GNU standard C++ library, libstdc++.a (see Section 7.2, page 141).
• Added support of the __linux, __linux__, linux, and __gnu_linux__ macros for Cray X2 systems (see Section 10.2, page 159).
• Added support of the __LITTLE_ENDIAN__ and __LITTLE_ENDIAN macros for Cray X2 systems (see Section 10.3, page 160).
• Added support of the __crayx2 macro for Cray X2 systems (see Section 10.3, page 160).
• Added support of the __craynv macro for Cray X2 systems (see Section 10.3, page 160).
• The CRAY, _CRAY, and cray macros are not defined for Cray X2 systems (see Section 10.3, page 160).
• Added support of the _CRAYC macro for Cray X2 systems (see Section 10.4, page 161).
• The maximum hardware vector length for Cray X2 systems is 128 (see Section 10.3, page 160).
• Noted throughout the manual that multistreaming, MSPs and SSPs, and Cray Streaming Directives (CSDs) are not supported on Cray X2 systems.

Record of Revision

Version  Description
2.0  January 1996. Original Printing. This manual supports the C and C++ compilers contained in the Cray C++ Programming Environment release 2.0. On all Cray systems, the C++ compiler is Cray C++ 2.0. On Cray systems with IEEE floating-point hardware, the C compiler is Cray Standard C 5.0. On Cray systems without IEEE floating-point hardware, the C compiler is Cray Standard C 4.0.
3.0  May 1997. This rewrite supports the C and C++ compilers contained in the Cray C++ Programming Environment release 3.0, which is supported on all systems except the Cray T3D system. On all supported Cray systems, the C++ compiler is Cray C++ 3.0 and the C compiler is Cray C 6.0.
3.0.2  March 1998. This revision supports the C and C++ compilers contained in the Cray C++ Programming Environment release 3.0.2, which is supported on all systems except the Cray T3D system. On all supported Cray systems, the C++ compiler is Cray C++ 3.0.2 and the C compiler is Cray C 6.0.2.
3.1  August 1998. This revision supports the C and C++ compilers contained in the Cray C++ Programming Environment release 3.1, which is supported on all systems except the Cray T3D system. On all supported Cray systems, the C++ compiler is Cray C++ 3.1 and the C compiler is Cray C 6.1.
3.2  January 1999. This revision supports the C and C++ compilers contained in the Cray C++ Programming Environment release 3.2, which is supported on all systems except the Cray T3D system. On all supported Cray systems, the C++ compiler is Cray C++ 3.2 and the C compiler is Cray C 6.2.
3.3  July 1999. This revision supports the C and C++ compilers contained in the Cray C++ Programming Environment release 3.3, which is supported on the Cray SV1, Cray C90, Cray J90, and Cray T90 systems running UNICOS 10.0.0.5 and later, and Cray T3E systems running UNICOS/mk 2.0.4 and later. On all supported Cray systems, the C++ compiler is Cray C++ 3.3 and the C compiler is Cray C 6.3.
3.4  August 2000. This revision supports the Cray C 6.4 and Cray C++ 3.4 releases running on UNICOS and UNICOS/mk operating systems. It includes updates to revision 3.3.
3.4  October 2000. This revision supports the Cray C 6.4 and Cray C++ 3.4 releases running on UNICOS and UNICOS/mk operating systems. This revision supports a new inlining level, inline4.
3.6  June 2002. This revision supports the Cray Standard C 6.6 and Cray Standard C++ 3.6 releases running on UNICOS and UNICOS/mk operating systems.
4.1  August 20, 2002. Draft version to support Cray C 7.1 and Cray C++ 4.1 releases running on UNICOS/mp operating systems.
4.2  December 20, 2002. Draft version to support Cray C 7.2 and Cray C++ 4.2 releases running on UNICOS/mp operating systems.
4.3  March 31, 2003. Draft version to support Cray C 7.3 and Cray C++ 4.3 releases running on UNICOS/mp operating systems.
5.0  June 2003. Supports Cray C++ 5.0 and Cray C 8.0 releases running on UNICOS/mp 2.1 or later operating systems.
5.1  October 2003. Supports Cray C++ 5.1 and Cray C 8.1 releases running on UNICOS/mp 2.2 or later operating systems.
5.2  April 2004. Supports Cray C++ 5.2 and Cray C 8.2 releases running on UNICOS/mp 2.3 or later operating systems.
5.3  November 2004. Supports Cray C++ 5.3 and Cray C 8.3 releases running on UNICOS/mp 2.5 or later operating systems.
5.4  March 2005. Supports Cray C++ 5.4 and Cray C 8.4 releases running on UNICOS/mp 3.0 or later operating systems.
5.5  December 2005. Supports Cray C++ 5.5 and Cray C 8.5 releases running on UNICOS/mp 3.0 or later operating systems.
5.6  March 2007. Supports Cray C++ 5.6 and Cray C 8.6 releases running on Cray X1 series systems.
6.0  September 2007. Supports the Cray C and Cray C++ 6.0 release running on Cray X1 series and Cray X2 systems.

Contents

Preface ... xix
  Accessing Product Documentation ... xix
  Conventions ... xx
  Reader Comments ... xxi
  Cray User Group ... xxi
Introduction [1] ... 1
  The Trigger Environment ... 1
    Working in the Programming Environment ... 3
    Preparing the Trigger Environment ... 4
  General Compiler Description ... 4
    Cray C++ Compiler ... 4
    Cray C Compiler ... 5
  Related Publications ... 5
Compiler Commands [2] ... 7
  CC Command ... 8
  cc and c99 Commands ... 8
  c89 Command ... 9
  cpp Command ... 9
  Command Line Options ... 10
  Standard Language Conformance Options ... 13
    -h [no]c99 (cc, c99) ... 13
    -h [no]conform (CC, cc, c99), -h [no]stdc (cc, c99) ... 13
    -h cfront (CC) ... 14
    -h [no]parse_templates (CC) ... 14
    -h [no]dep_name (CC) ... 14
    -h [no]exceptions (CC) ... 15
    -h [no]anachronisms (CC) ... 15
    -h new_for_init (CC) ... 15
    -h [no]tolerant (cc, c99) ... 16
    -h [no]const_string_literals (CC) ... 16
    -h [no]gnu (CC, cc) ... 16
  Template Language Options ... 20
    -h simple_templates (CC) ... 20
    -h [no]autoinstantiate (CC) ... 20
    -h one_instantiation_per_object (CC) ... 20
    -h instantiation_dir=dirname (CC) ... 20
    -h instantiate=mode (CC) ... 21
    -h [no]implicitinclude (CC) ... 22
    -h remove_instantiation_flags (CC) ... 22
    -h prelink_local_copy (CC) ... 22
    -h prelink_copy_if_nonlocal (CC) ... 22
  Virtual Function Options ... 22
    -h forcevtbl (CC) ... 22
    -h suppressvtbl (CC) ... 23
  General Language Options ... 23
    -h keep=file (CC) ... 23
    -h restrict=args (CC, cc, c99) ... 23
    -h [no]calchars (CC, cc, c99) ... 24
    -h [no]signedshifts (CC, cc, c99) ... 25
  General Optimization Options ... 25
    -h [no]aggress (CC, cc, c99) ... 25
    -h display_opt ... 25
    -h [no]fusion (CC, cc, c99) ... 26
    -h gcpn ... 26
    -h gen_private_callee (CC, cc, c99) ... 27
    -h [no]intrinsics (CC, cc, c99) ... 27
    -h list=opt (CC, cc, c99) ... 28
    -h msp (CC, cc, c99) ... 29
    -h [no]overindex (CC, cc, c99) ... 29
    -h [no]pattern (CC, cc, c99) ... 30
    -h profile_data=pgo_opt (CC, cc, c99) ... 30
    -h profile_generate (CC, cc, c99) ... 30
    -h ssp (CC, cc, c99) ... 30
    -h [no]unroll (CC, cc, c99) ... 31
    -Olevel (CC, cc, c89, c99) ... 32
  Automatic Cache Management Options ... 33
    -h cachen (CC, cc, c99) ... 33
  Multistreaming Processor Optimization Options ... 34
    -h streamn (CC, cc, c99) ... 34
  Vector Optimization Options ... 35
    -h [no]infinitevl (CC, cc, c99) ... 35
    -h [no]ivdep (CC, cc, c99) ... 36
    -h vectorn (CC, cc, c99) ... 36
  Inlining Optimization Options ... 37
    -h clonen (CC, cc) ... 38
    -h ipan (CC, cc, c89, c99) ... 39
    -h ipafrom=source[:source] (CC, cc, c89, c99) ... 40
    Combined Inlining ... 41
  Scalar Optimization Options ... 41
    -h [no]interchange (CC, cc, c99) ... 41
    -h scalarn (CC, cc, c99) ... 41
    -h [no]zeroinc (CC, cc, c99) ... 42
  Math Options ... 42
    -h fpn (CC, cc, c99) ... 42
    -h ieee_nonstop ... 45
    -h matherror=method (CC, cc, c99) ... 45
  Debugging Options ... 45
    -Glevel (CC, cc, c99) and -g (CC, cc, c89, c99) ... 45
    -h [no]bounds (cc, c99) ... 46
    -h zero (CC, cc, c99) ... 46
    -h dir_check (CC, cc) ... 47
  Compiler Message Options ... 47
    -h msglevel_n (CC, cc, c99) ... 47
    -h [no]message=n[:n...] (CC, cc, c99) ... 48
    -h report=args (CC, cc, c99) ... 48
    -h [no]abort (CC, cc, c99) ... 49
    -h errorlimit[=n] (CC, cc, c99) ... 49
  Compilation Phase Options ... 49
    -E (CC, cc, c89, c99, cpp) ... 49
    -P (CC, cc, c99) ... 50
    -h feonly (CC, cc, c99) ... 50
    -S (CC, cc, c99) ... 50
    -c (CC, cc, c89, c99) ... 50
    -#, -##, and -### (CC, cc, c99, cpp) ... 50
    -Wphase,"opt..." (CC, cc, c99) ... 51
    -Yphase,dirname (CC, cc, c89, c99, cpp) ... 51
  Preprocessing Options ... 52
    -C (CC, cc, c99, cpp) ... 52
    -D macro[=def] (CC, cc, c89, c99, cpp) ... 52
    -h [no]pragma=name[:name...] (CC, cc, c99) ... 53
    -I incldir (CC, cc, c89, c99, cpp) ... 54
    -M (CC, cc, c99, cpp) ... 55
    -N (cpp) ... 55
    -nostdinc (CC, cc, c89, c99, cpp) ... 55
    -U macro (CC, cc, c89, c99, cpp) ... 55
  Loader Options ... 55
    -l libfile (CC, cc, c89, c99) ... 55
    -L libdir (CC, cc, c89, c99) ... 56
    -o outfile (CC, cc, c89, c99) ... 57
  Miscellaneous Options ... 57
    -h command (cc, c99) ... 57
    -h cpu=target_system (CC, cc, c99) ... 58
    -h decomp (CC, cc, c99) ... 59
    -h ident=name (CC, cc, c99) ... 61
    -h keepfiles (CC, cc, c89, c99) ... 61
    -h [no]mpmd (CC, cc) ... 61
    -h [no]omp (CC, cc) ... 61
    -h prototype_intrinsics (CC, cc, c99, cpp) ... 61
    -h taskn (CC, cc) ... 62
    -h [no]threadsafe (CC) ... 62
    -h upc (cc) ... 62
    -V (CC, cc, c99, cpp) ... 63
    -X npes (CC, cc, c99) ... 63
  Command Line Examples ... 64
    Example 1: CC -X8 -h instantiate=all myprog.C ... 64
    Example 2: CC -h conform -h noautoinstantiate myprog.C ... 64
    Example 3: CC -c -h ipa1 myprog.C subprog.C ... 64
    Example 4: CC -I. disc.C vend.C ... 64
    Example 5: cc -P -D DEBUG newprog.c ... 65
    Example 6: CC -c -h report=s mydata1.C ... 65
    Example 7: cc -h listing mydata3.c ... 65
    Example 8: CC -h ipa5,report=if myfile.C ... 65
  Compile Time ... 65
  Run Time ... 67
  OpenMP ... 71
    OMP_SCHEDULE ... 72
    OMP_NUM_THREADS ... 73
    OMP_DYNAMIC ... 73
    OMP_NESTED ... 74
    OMP_THREAD_STACK_SIZE ... 74
#pragma Directives [3] ... 75
  Protecting Directives ... 76
  Directives in Cray C++ ... 77
  Loop Directives ... 77
  Alternative Directive form: _Pragma ... 77
  General Directives ... 78
    [no]bounds Directive (Cray C Compiler) ... 78
    duplicate Directive (Cray C Compiler) ... 79
    message Directive ... 82
    no_cache_alloc Directive ... 82
    cache_shared Directive ... 83
    cache_exclusive Directive ... 84
    [no]opt Directive ... 84
    Probability Directives ... 85
    weak Directive ... 87
    vfunction Directive ... 88
    ident Directive ... 89
  Instantiation Directives ... 90
  Vectorization Directives ... 90
    hand_tuned Directive ... 90
    ivdep Directive ... 91
    loop_info Directive ... 92
      Example 9: Trip counts ... 94
      Example 10: Specifying AMOs ... 94
      Example 11: Using prefer_noamo clause ... 95
    nopattern Directive ... 95
    novector Directive ... 96
    novsearch Directive ... 97
    permutation Directive ... 97
    [no]pipeline Directive ... 98
    prefervector Directive ... 98
    pgo loop_info Directive ... 99
    safe_address Directive ... 99
    safe_conditional Directive ... 101
    shortloop and shortloop128 Directives ... 102
  Multistreaming Processor (MSP) Directives ... 103
    ssp_private Directive (cc, c99) ... 104
    nostream Directive ... 106
    preferstream Directive ... 106
  Scalar Directives ... 107
    concurrent Directive ... 107
    nointerchange Directive ... 108
    noreduction Directive ... 108
    suppress Directive ... 108
    [no]unroll Directive ... 109
      Example 12: Unrolling Outer Loops ... 110
      Example 13: Illegal Unrolling of Outer Loops ... 110
    [no]fusion Directive ... 111
  Inlining Directives ... 111
    inline_enable, inline_disable, and inline_reset Directives ... 112
      Example 14: Using the inline_enable, inline_disable, and inline_reset Directives ... 112
    inline_always and inline_never Directives ... 113
Cray Streaming Directives (CSDs) [4] ... 115
  CSD Parallel Regions ... 116
  parallel Directive ... 116
  for Directive ... 118
  parallel for Directive ... 121
  sync Directive ... 122
  critical Directive ... 122
  ordered Directive ... 123
  Nested CSDs within Cray Parallel Programming Models ... 124
  CSD Placement ... 125
  Protection of Shared Data ... 125
  Dynamic Memory Allocation for CSD Parallel Regions ... 126
  Compiler Options Affecting CSDs ... 127
OpenMP C and C++ API Directives [5] ... 129
  Deferred OpenMP Features ... 129
  Cray Implementation Differences ... 130
  OMP_THREAD_STACK_SIZE ... 131
  Compiler Options Affecting OpenMP ... 132
  OpenMP Program Execution ... 132
Cray Unified Parallel C (UPC) [6] ... 135
  Cray Specific UPC Functions ... 136
    Shared Memory Allocation Functions ... 136
      upc_all_free ... 136
      upc_local_free ... 136
    Pointer-to-shared Manipulation Functions ... 137
    Lock Functions ... 137
      upc_all_lock_free ... 137
      upc_global_lock_free ... 137
  Cray Implementation Differences ... 138
  Compiling and Executing UPC Code ... 138
    Example 15: UPC and THREADS defined dynamically ... 139
    Example 16: UPC and THREADS defined statically ... 139
Cray C++ Libraries [7] ... 141
  Unsupported Standard C++ Library Features ... 141
  Dinkum and GNU C++ Libraries ... 141
Cray C++ Template Instantiation [8] ... 143
  Simple Instantiation ... 144
  Prelinker Instantiation ... 145
  Instantiation Modes ... 148
  One Instantiation Per Object File ... 149
  Instantiation #pragma Directives ... 149
  Implicit Inclusion ... 151
Cray C Extensions [9] ... 153
  Complex Data Extensions ... 153
  fortran Keyword ... 154
  Hexadecimal Floating-point Constants ... 154
Predefined Macros [10] ... 157
  Macros Required by the C and C++ Standards ... 158
  Macros Based on the Host Machine ... 159
  Macros Based on the Target Machine ... 160
  Macros Based on the Compiler ... 161
  UPC Predefined Macros ... 162
Running C and C++ Applications [11] ... 163
  Launching a Single Non-MPI Application ... 163
  Launching a Single MPI Application ... 164
  Multiple Program, Multiple Data (MPMD) Launch ... 164
Debugging Cray C and C++ Code [12] ... 167
  TotalView Debugger ... 167
  Compiler Debugging Options ... 168
Interlanguage Communication [13] ... 169
  Calls between C and C++ Functions ... 169
  Calling Assembly Language Functions from a C or C++ Function ... 171
  Calling Fortran Functions and Subroutines from a C or C++ Function ... 172
    Requirements ... 172
    Argument Passing ... 173
    Array Storage ... 173
    Logical and Character Data ... 174
    Accessing Named Common from C and C++ ... 175
    Accessing Blank Common from C or C++ ... 177
    Cray C and Fortran Example ... 179
    Calling a Fortran Program from a Cray C++ Program ... 181
  Calling a C or C++ Function from a Fortran Program ... 182
    Example 17: Calling a C Function from a Fortran Program ... 183
Implementation-defined Behavior [14] ... 187
  Messages ... 187
  Environment ... 187
  Identifiers ... 188
  Types .
. . . . . . . . . . . . . . . . . . . . . . . . . . 188 Characters . . . . . . . . . . . . . . . . . . . . . . . . . . 189 Wide Characters . . . . . . . . . . . . . . . . . . . . . . . . 191 Integers . . . . . . . . . . . . . . . . . . . . . . . . . . 191 Arrays and Pointers . . . . . . . . . . . . . . . . . . . . . . . 192 Registers . . . . . . . . . . . . . . . . . . . . . . . . . . 192 Classes, Structures, Unions, Enumerations, and Bit Fields . . . . . . . . . . . 192 Qualifiers . . . . . . . . . . . . . . . . . . . . . . . . . . 193 Declarators . . . . . . . . . . . . . . . . . . . . . . . . . 193 Statements . . . . . . . . . . . . . . . . . . . . . . . . . . 193 xiv S–2179–60Contents Page Exceptions . . . . . . . . . . . . . . . . . . . . . . . . . . 194 System Function Calls . . . . . . . . . . . . . . . . . . . . . . 194 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . 194 Appendix A Possible Requirements for Non-C99 Code 195 Appendix B Libraries and Loader 197 Cray C and C++ Libraries Current Programming Environments . . . . . . . . . . 197 Loader . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197 Appendix C Compatibility with Older C++ Code 199 Use of Nonstandard Cray C++ Header Files . . . . . . . . . . . . . . . . 199 When to Update Your C++ Code . . . . . . . . . . . . . . . . . . . . 200 Use the Proper Header Files . . . . . . . . . . . . . . . . . . . . 200 Add Namespace Declarations . . . . . . . . . . . . . . . . . . . . 203 Reconcile Header Definition Differences . . . . . . . . . . . . . . . . . 204 Recompile All C++ Files . . . . . . . . . . . . . . . . . . . . . 205 Appendix D Cray C and C++ Dialects 207 C++ Language Conformance . . . . . . . . . . . . . . . . . . . . . 207 Unsupported and Supported C++ Language Features . . . . . . . . . . . . . 207 C++ Anachronisms Accepted . . . . . . . . . . . . . . . . . . . . . 211 Extensions Accepted in Normal C++ Mode . . . . . . . . . . . . . . . . . 
212 Extensions Accepted in C or C++ Mode . . . . . . . . . . . . . . . . . . 213 C++ Extensions Accepted in cfront Compatibility Mode . . . . . . . . . . . . 216 Appendix E Compiler Messages 223 Expanding Messages with the explain Command . . . . . . . . . . . . . . 223 Controlling the Use of Messages . . . . . . . . . . . . . . . . . . . . 223 Command Line Options . . . . . . . . . . . . . . . . . . . . . 224 Environment Options for Messages . . . . . . . . . . . . . . . . . . 224 ORIG_CMD_NAME Environment Variable . . . . . . . . . . . . . . . . . 225 S–2179–60 xvCray® C and C++ Reference Manual Page Message Severity . . . . . . . . . . . . . . . . . . . . . . . . 225 Common System Messages . . . . . . . . . . . . . . . . . . . . . 227 Appendix F Intrinsic Functions 229 Atomic Memory Operations . . . . . . . . . . . . . . . . . . . . . 230 BMM Operations . . . . . . . . . . . . . . . . . . . . . . . . 230 Bit Operations . . . . . . . . . . . . . . . . . . . . . . . . . 231 Function Operations . . . . . . . . . . . . . . . . . . . . . . . 232 Mask Operations . . . . . . . . . . . . . . . . . . . . . . . . 232 Memory Operations . . . . . . . . . . . . . . . . . . . . . . . 232 Miscellaneous Operations . . . . . . . . . . . . . . . . . . . . . . 232 Streaming Operations . . . . . . . . . . . . . . . . . . . . . . . 233 Glossary 235 Index 239 Tables Table 1. GCC C Language Extensions . . . . . . . . . . . . . . . . . . 16 Table 2. GCC C++ Language Extensions . . . . . . . . . . . . . . . . . 19 Table 3. Carriage Control Characters . . . . . . . . . . . . . . . . . . 28 Table 4. -h Option Descriptions . . . . . . . . . . . . . . . . . . . 32 Table 5. Cache Levels . . . . . . . . . . . . . . . . . . . . . . . 34 Table 6. Automatic Inlining Specifications . . . . . . . . . . . . . . . . 39 Table 7. Floating-point Optimization Levels . . . . . . . . . . . . . . . . 44 Table 8. -Glevel Definitions . . . . . . . . . . . . . . . . . . . . 46 Table 9. -Wphase Definitions . . 
. . . . . . . . . . . . . . . . . . 51 Table 10. -Yphase Definitions . . . . . . . . . . . . . . . . . . . . 52 Table 11. -h pragma Directive Processing . . . . . . . . . . . . . . . . 53 Table 12. Compiler-calculated Chunk Size . . . . . . . . . . . . . . . . 119 Table 13. Data Type Mapping . . . . . . . . . . . . . . . . . . . . 188 Table 14. Packed Characters . . . . . . . . . . . . . . . . . . . . . 190 xvi S–2179–60Contents Page Table 15. Unrecognizable Escape Sequences . . . . . . . . . . . . . . . . 190 Table 16. Run time Support Library Header Files . . . . . . . . . . . . . . 201 Table 17. Stream and Class Library Header Files . . . . . . . . . . . . . . 201 Table 18. Standard Template Library Header Files . . . . . . . . . . . . . . 202 S–2179–60 xviiPreface The information in this preface is common to Cray documentation provided with this software release. Accessing Product Documentation With each software release, Cray provides books and man pages, and in some cases, third-party documentation. These documents are provided in the following ways: CrayDoc The Cray documentation delivery system that allows you to quickly access and search Cray books, man pages, and in some cases, third-party documentation. Access this HTML and PDF documentation via CrayDoc at the following locations: • The local network location defined by your system administrator • The CrayDoc public website: docs.cray.com Man pages Access man pages by entering the man command followed by the name of the man page. For more information about man pages, see the man(1) man page by entering: % man man Third-party documentation Access third-party documentation not provided through CrayDoc according to the information provided with the product. 
Conventions

These conventions are used throughout Cray documentation:

Convention    Meaning
command       This fixed-space font denotes literal items, such as file names, pathnames, man page names, command names, and programming language elements.
variable      Italic typeface indicates an element that you will replace with a specific value. For instance, you may replace filename with the name datafile in your program. It also denotes a word or concept being defined.
user input    This bold, fixed-space font denotes literal items that the user enters in interactive sessions. Output is shown in nonbold, fixed-space font.
[ ]           Brackets enclose optional portions of a syntax representation for a command, library routine, system call, and so on.
...           Ellipses indicate that a preceding element can be repeated.
name(N)       Denotes man pages that provide system and programming reference information. Each man page is referred to by its name followed by a section number in parentheses. Enter:
              % man man
              to see the meaning of each section number for your particular system.

Reader Comments

Contact us with any comments that will help us to improve the accuracy and usability of this document. Be sure to include the title and number of the document with your comments. We value your comments and will respond to them promptly. Contact us in any of the following ways:

E-mail: docs@cray.com
Telephone (inside U.S., Canada): 1–800–950–2729 (Cray Customer Support Center)
Telephone (outside U.S., Canada): +1–715–726–4993 (Cray Customer Support Center)
Mail: Customer Documentation, Cray Inc., 1340 Mendota Heights Road, Mendota Heights, MN 55120–1128 USA

Cray User Group

The Cray User Group (CUG) is an independent, volunteer-organized international corporation of member organizations that own or use Cray Inc. computer systems.
CUG facilitates information exchange among users of Cray systems through technical papers, platform-specific e-mail lists, workshops, and conferences. CUG memberships are by site and include a significant percentage of Cray computer installations worldwide. For more information, contact your Cray site analyst or visit the CUG website at www.cug.org.

Introduction [1]

The Cray C++ Programming Environment contains both the Cray C and C++ compilers. The Cray C compiler conforms to the International Organization for Standardization (ISO) standard ISO/IEC 9899:1999 (C99). The Cray C++ compiler conforms to the ISO/IEC 14882:1998 standard, with some exceptions. The exceptions are noted in Appendix D, page 207.

Throughout this manual, the differences between the Cray C and C++ compilers are noted when appropriate. When there is no difference, the phrase the compiler refers to both compilers.

Note: This manual documents Cray C and C++ compiler features for Cray X1 series and Cray X2 systems. Features unique to one platform are so noted in the text.

1.1 The Trigger Environment

The user on the Cray X1 series system interacts with the system as if all elements of the Programming Environment are hosted on the Cray X1 series mainframe, including Programming Environment commands hosted on the Cray Programming Environment Server (CPES). CPES-hosted commands have corresponding commands on the Cray X1 series mainframe that have the same names. These are called triggers. Triggers are required only for the Programming Environment.

Understanding the trigger environment helps administrators and end users identify the part of the system in which a problem occurs when using the trigger environment.

Note: For Cray X2 systems, all elements of the Programming Environment are hosted either on a Cray XT series system login node or a cross-compiler machine. See the Getting Started on Cray X2 Systems manual for details.
When a user enters the name of a CPES-hosted command on the command line of the Cray X1 series mainframe, the corresponding trigger executes, which sets up an environment for the CPES-hosted command. This environment duplicates the portion of the current working environment on the Cray X1 series mainframe that relates to the Programming Environment. This allows the CPES-hosted commands to function properly.

To replicate the current working environment, the trigger captures the current working environment on the Cray X1 series system and copies the standard I/O as follows:

• Copies the standard input of the current working environment to the standard input of the CPES-hosted command
• Copies the standard output of the CPES-hosted command to standard output of the current working environment
• Copies the standard error of the CPES-hosted command to the standard error of the current working environment

All catchable interrupts, quit signals, and terminate signals propagate through the trigger to reach the CPES-hosted command. Upon termination of the CPES-hosted command, the trigger terminates and returns with the CPES-hosted command's return code.

Uncatchable signals have a short processing delay before the signal is passed to the CPES-hosted command. If you execute its trigger again before the CPES-hosted command has had time to process the signal, indeterminate behavior may occur.

Because the trigger has the same name, inputs, and outputs as the CPES-hosted command, user scripts, makefiles, and batch files can function without modification. That is, running a command in the trigger environment is very similar to running the command hosted on the Cray X1 series system.
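The trigger behavior described above (shared standard I/O, and the hosted command's return code handed back to the caller) can be illustrated with a small POSIX sketch. This is not Cray's trigger implementation: run_forwarded is a hypothetical helper name, and the sketch assumes only a POSIX system that provides fork, execvp, and waitpid. It does not model the cross-machine hand-off to the CPES or the signal forwarding.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper (not part of any Cray product): runs argv[0] with the
// caller's stdin, stdout, and stderr, then returns the command's exit code.
// Returns -1 if the process could not be created or did not exit normally.
int run_forwarded(char *const argv[]) {
    pid_t pid = fork();
    if (pid < 0) {
        return -1;                  // fork failed
    }
    if (pid == 0) {
        // Child: file descriptors 0, 1, and 2 are inherited unchanged,
        // so the command reads and writes the caller's standard I/O.
        execvp(argv[0], argv);
        _exit(127);                 // reached only if exec failed
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0) {
        return -1;
    }
    // Hand the command's own return code back to the caller.
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Because the wrapper has the same inputs, outputs, and return code as the command it runs, a script that calls the wrapper behaves as if it had called the command directly, which is the property the trigger environment relies on.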
The following commands have triggers:

• ar
• as
• c++filt
• c89
• c99
• cc
• ccp
• CC
• ftn
• ftnlx
• ftnsplit
• ld
• nm
• pat_build
• pat_help
• pat_report
• pat_remps
• remps

1.1.1 Working in the Programming Environment

To use the Programming Environment, you must work on a file system that is cross-mounted to the CPES. If you attempt to use the Programming Environment from a directory that is not cross-mounted to the CPES, you will receive the following message:

trigexecd: trigger command cannot access current directory.
[directory] is not properly cross-mounted on host [CPES]

The default files used by the Programming Environment are installed in the /opt/ctl file system. The default include file directory is /opt/ctl/include. All Programming Environment products are found in the /opt/ctl file system.

1.1.2 Preparing the Trigger Environment

To prepare the trigger environment for use, you must use the module command to load the PrgEnv module. This module loads all Programming Environment products and sets up the environment variables necessary to find the include files, libraries, and product paths on the CPES and the Cray X1 series system. Enter the following command on the command line to load the Programming Environment:

% module load PrgEnv

Loading the PrgEnv module causes all Programming Environment products to be loaded and available to the user. A user may swap an individual product in the product set, but should not unload any one product. To see the list of products loaded by the PrgEnv module, enter the following command:

% module list

If you have questions on setting up the programming environment, contact your system support staff.

1.2 General Compiler Description

Both the Cray C and C++ compilers are contained within the same Programming Environment. If you are compiling code written in C, use the cc, c89, or c99 command to compile source files.
If you are compiling code written in C++, use the CC command.

1.2.1 Cray C++ Compiler

The Cray C++ compiler consists of a preprocessor, a language parser, a prelinker, an optimizer, and a code generator. The Cray C++ compiler is invoked by a command called CC(1) in this manual, but it may be renamed at individual sites. The CC(1) command is described in Section 2.1, page 8 and on the CC(1) man page. Command line examples are shown in Section 2.23, page 64.

1.2.2 Cray C Compiler

The Cray C compiler consists of a preprocessor, a language parser, an optimizer, and a code generator. The Cray C compiler is invoked by a command called cc, c89, or c99 in this manual, but it may be renamed at individual sites. The cc and c99 commands are discussed in Section 2.2, page 8; the c89 command is described in Section 2.3, page 9. All are also discussed in the CC(1) man page. Command line examples are shown in Section 2.23, page 64.

Note: C code that was developed under other C compilers of the Cray Programming Environments and does not conform to the C99 standard may require modification to compile successfully with the c99 command. For more information, see Appendix A, page 195.
1.3 Related Publications

The following documents contain additional information that may be helpful:

• Getting Started on Cray X2 Systems
• CC(1) and aprun(1) man pages
• Optimizing Applications on Cray X1 Series Systems
• Optimizing Applications on Cray X2 Systems
• Cray C++ Tools Library Reference Manual, Rogue Wave document, Tools.h++ Introduction and Reference Manual, publication TPD-0005
• Cray C++ Mathpack Class Library Reference Manual by Thomas Keefer and Allan Vermeulen, publication TPD-0006
• LAPACK.h++ Introduction and Reference Manual, Version 1, by Allan Vermeulen, publication TPD-0010
• Using Cray Performance Analysis Tools

Compiler Commands [2]

This chapter describes the compiler commands and the environment variables necessary to execute the Cray C and C++ compilers. The following commands invoke the compilers:

• CC, which invokes the Cray C++ compiler.
• cc and c99, which invoke the Cray C compiler.
• c89, which invokes the Cray C compiler. This command is a subset of the cc command. It conforms to the POSIX standard (P1003.2, Draft 12).
• cpp, which invokes the C language preprocessor. By default, the CC, cc, c89, and c99 commands invoke the preprocessor automatically. The cpp command provides a way for you to invoke only the preprocessor component of the Cray C compiler.

A successful compilation creates an absolute binary file, named a.out by default, that reflects the contents of the source code and any referenced library functions. This binary file, a.out, can then be executed on the target system.
For example, the following command sequence compiles file mysource.c and executes the resulting executable program:

% cc mysource.c
% aprun ./a.out

With the use of appropriate options, compilation can be terminated to produce one of several intermediate translations, including relocatable object files (-c option), assembly source expansions (-S option), or the output of the preprocessor phase of the compiler (-P or -E option). In general, the intermediate files can be saved and later resubmitted to the CC, cc, c89, or c99 command, with other files or libraries included as necessary.

By default, the CC, cc, c89, and c99 commands automatically call the loader, which creates an executable file. If only one source file is specified, the object file is deleted. If more than one source file is specified, the object files are retained. The following command creates object files file1.o, file2.o, and file3.o, and the executable file a.out:

% cc file1.c file2.c file3.c

The following command creates the executable file a.out only:

% cc file.c

2.1 CC Command

The CC command invokes the Cray C++ compiler. The CC command accepts C++ source files that have the following suffixes:

.c .C .i .c++ .C++ .cc .cxx .Cxx .CXX .CC .cpp

The .i files are created when the preprocessing compiler command option (-P) is used. The CC command also accepts object files with the .o suffix; library files with the .a suffix; and assembler source files with the .s suffix.

The CC command format is as follows:

CC [-c] [-C] [-d string] [-D macro[=def]] [-E] [-g] [-G level] [-h arg] [-I incldir] [-l libfile] [-L libdir] [-M] [-nostdinc] [-o outfile] [-O level] [-P] [-s] [-S] [-U macro] [-V] [-Wphase,"opt..."] [-Xnpes] [-Yphase,dirname] [-#] [-##] [-###] files ...

For an explanation of the command line options, see Section 2.5, page 10.

2.2 cc and c99 Commands

The cc command invokes the Cray C compiler.
The cc and c99 commands accept C source files that have the .c or .i suffix; object files with the .o suffix; library files with the .a suffix; and assembler source files with the .s suffix.

The format of the cc and c99 commands is as follows:

cc or c99 [-c] [-C] [-d string] [-D macro[=def]] [-E] [-g] [-G level] [-h arg] [-I incldir] [-l libfile] [-L libdir] [-M] [-nostdinc] [-o outfile] [-O level] [-P] [-s] [-S] [-U macro] [-V] [-Wphase,"opt..."] [-Xnpes] [-Yphase,dirname] [-#] [-##] [-###] files ...

For an explanation of the command line options, see Section 2.5, page 10.

2.3 c89 Command

The c89 command invokes the Cray C compiler. This command is a subset of the cc command and conforms to the POSIX standard (P1003.2, Draft 12). The c89 command accepts C source files that have a .c or .i suffix; object files with the .o suffix; library files with the .a suffix; and assembler source files with the .s suffix.

The c89 command format is as follows:

c89 [-c] [-D macro[=def]] [-E] [-g] [-I incldir] [-l libfile] [-L libdir] [-o outfile] [-O level] [-s] [-U macro] [-Yphase,dirname] files ...

For an explanation of the command line options, see Section 2.5, page 10.

2.4 cpp Command

The cpp command explicitly invokes the preprocessor component of the Cray C compiler. Most cpp options are also available from the CC, cc, c89, and c99 commands.

The cpp command format is as follows:

cpp [-C] [-D macro[=def]] [-E] [-I incldir] [-M] [-N] [-nostdinc] [-P] [-U macro] [-V] [-Yphase,dirname] [-#] [-##] [-###] [infile] [outfile]

The infile and outfile files are, respectively, the input and output for the preprocessor. If you do not specify these arguments, input defaults to standard input (stdin) and output to standard output (stdout). Specifying a minus sign (-) for infile also indicates standard input.

For an explanation of the command line options, see Section 2.5, page 10.
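To illustrate the -D macro[=def] and -U macro options that appear in the command formats above, the following sketch shows source code that reacts to a macro supplied on the command line. BUFFER_WORDS and buffer_words are hypothetical names chosen for this illustration, and the fallback value 64 is likewise an assumption.

```cpp
// BUFFER_WORDS is a hypothetical macro. Supplying a definition on the
// command line (the -D macro[=def] form) replaces the fallback below;
// when no definition is in effect, the fallback applies.
#ifndef BUFFER_WORDS
#define BUFFER_WORDS 64
#endif

// Returns the value the preprocessor substituted for BUFFER_WORDS.
int buffer_words() {
    return BUFFER_WORDS;
}
```

Under the documented syntax, a command such as % CC -D BUFFER_WORDS=256 myfile.C would be expected to make buffer_words return 256, while compiling without the option leaves the fallback in place.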
2.5 Command Line Options

The following subsections describe options for the CC, cc, c89, c99, and cpp commands. These options are grouped according to the following functions:

• Language options:

  – The standard conformance options (Section 2.6, page 13):

    Section                    Option
    Section 2.6.1, page 13     -h [no]c99
    Section 2.6.2, page 13     -h [no]conform and -h [no]stdc
    Section 2.6.3, page 14     -h cfront
    Section 2.6.4, page 14     -h [no]parse_templates
    Section 2.6.5, page 14     -h [no]dep_name
    Section 2.6.6, page 15     -h [no]exceptions
    Section 2.6.7, page 15     -h [no]anachronisms
    Section 2.6.8, page 15     -h new_for_init
    Section 2.6.9, page 16     -h [no]tolerant
    Section 2.6.10, page 16    -h [no]const_string_literals
    Section 2.6.11, page 16    -h [no]gnu

  – The template options (Section 2.7, page 20):

    Section                    Option
    Section 2.7.1, page 20     -h simple_templates
    Section 2.7.2, page 20     -h [no]autoinstantiate
    Section 2.7.3, page 20     -h one_instantiation_per_object
    Section 2.7.4, page 20     -h instantiation_dir=dirname
    Section 2.7.5, page 21     -h instantiate=mode
    Section 2.7.6, page 22     -h [no]implicitinclude
    Section 2.7.7, page 22     -h remove_instantiation_flags
    Section 2.7.8, page 22     -h prelink_local_copy
    Section 2.7.9, page 22     -h prelink_copy_if_nonlocal

  – The virtual function options (Section 2.8, page 22): -h forcevtbl and -h suppressvtbl.

  – General language options (Section 2.9, page 23):

    Section                    Option
    Section 2.9.1, page 23     -h keep=file
    Section 2.9.2, page 23     -h restrict=args
    Section 2.9.3, page 24     -h [no]calchars
    Section 2.9.4, page 25     -h [no]signedshifts

• Optimization options:

  – General optimization options (Section 2.10, page 25)
  – Automatic cache management option (Section 2.11.1, page 33)
  – Multistreaming Processor (MSP) options (Cray X1 series systems only) (Section 2.12, page 34)
  – Vectorization options (Section 2.13, page 35)
  – Inlining options (Section 2.14, page 37)
  – Scalar optimization options (Section 2.15, page 41)

• Math options (Section 2.16, page 42)
• Debugging options (Section 2.17, page 45)
• Message control options (Section 2.18, page 47)
• Compilation phase control options (Section 2.19, page 49)
• Preprocessing options (Section 2.20, page 52)
• Loader options (Section 2.21, page 55)
• Miscellaneous options (Section 2.22, page 57)
• Command line examples (Section 2.23, page 64)
• Compile-time environment variables (Section 2.24, page 65)
• Run time environment variables (Section 2.25, page 67)
• OpenMP environment variables (Section 2.26, page 71)

Options other than those described in this manual are passed to the loader. For more information about the loader, see the ld(1) man page.

There are many options that start with -h. Multiple -h options can be specified using commas to separate the arguments. For example, the -h parse_templates and -h fp0 command line options can be specified as -h parse_templates,fp0.

If conflicting options are specified, the option specified last on the command line overrides the previously specified option. Exceptions to this rule are noted in the individual descriptions of the options.
The following examples illustrate the use of conflicting options:

• In this example, -h fp0 overrides -h fp1:

  % CC -h fp1,fp0 myfile.C

• In this example, -h vector2 overrides the earlier vector optimization level 3 implied by the -O3 option:

  % CC -O3 -h vector2 myfile.C

Most #pragma directives override corresponding command line options. Exceptions to this rule are noted in descriptions of options or #pragma directives.

2.6 Standard Language Conformance Options

This section describes standard conformance language options. Each subsection heading shows in parentheses the compiler with which the option can be used.

2.6.1 -h [no]c99 (cc, c99)

Defaults:
-h noc99 (cc)
-h c99 (c99)

This option enables or disables language features new to the C99 standard and the Cray C compiler, while providing support for features that were previously defined as Cray extensions. If the previous implementation of the Cray extension differed from the C99 standard, both implementations will be available when the -h c99 option is enabled. The -h c99 option is also required for C99 features not previously supported as extensions.

When -h noc99 is used, C99 language features such as VLAs and restricted pointers that were available as extensions prior to adoption of the C99 standard remain available to the user.

2.6.2 -h [no]conform (CC, cc, c99), -h [no]stdc (cc, c99)

Default: -h noconform, -h nostdc

The -h conform and -h stdc options specify strict conformance to the ISO C standard or the ISO C++ standard. The -h noconform and -h nostdc options specify partial conformance to the standard. The -h exceptions, -h dep_name, -h parse_templates, and -h const_string_literals options are enabled by the -h conform option in Cray C++.

Note: The c89 command does not accept the -h conform or -h stdc option. Strict conformance is enabled by default when the c89 command is issued.
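One of the options that -h conform enables in Cray C++ is -h dep_name, which looks up names in a template once when the template is parsed and again when it is instantiated. The following sketch is written against the ISO C++ lookup rules, not against any Cray-specific behavior, and shows why the two lookups can give different answers. The names pick, Tag, call_nondependent, and call_dependent are hypothetical.

```cpp
// The only pick overload visible when the templates below are parsed.
int pick(double) { return 1; }

// pick(0) does not depend on T, so under the ISO rules it is bound
// when the template is parsed: only pick(double) is a candidate.
template <class T>
int call_nondependent() {
    return pick(0);
}

// pick(value) depends on T, so the choice is completed at instantiation,
// when argument-dependent lookup can find overloads declared later.
template <class T>
int call_dependent(T value) {
    return pick(value);
}

struct Tag {};

// Declared after the templates: invisible to the non-dependent call.
int pick(int) { return 2; }
// Declared after the templates, but found for Tag at instantiation.
int pick(Tag) { return 3; }
```

With separate lookup in effect, call_nondependent<int>() uses pick(double) even though pick(int) would be a better match at the call site, while call_dependent(Tag()) reaches pick(Tag). A compiler that defers all lookup to instantiation would choose pick(int) in the first case, which is the kind of difference the option controls.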
2.6.3 -h cfront (CC)

The -h cfront option causes the Cray C++ compiler to accept or reject constructs that were accepted by previous cfront-based compilers (such as Cray C++ 1.0) but which are not accepted in the C++ standard. The -h anachronisms option is implied when -h cfront is specified.

2.6.4 -h [no]parse_templates (CC)

Default: -h noparse_templates

This option allows existing code that defines templates using previous versions of the Cray Standard Template Library (STL) (before Programming Environment 3.6) to compile successfully with the -h conform option. Consequently, this allows you to compile existing code without having to use the Cray C++ STL. To do this, use the noparse_templates option. Also, the compiler defaults to this mode when the -h dep_name option is used. To have the compiler verify that your code uses the Cray C++ STL properly, use the parse_templates option.

2.6.5 -h [no]dep_name (CC)

Default: -h nodep_name

This option enables or disables dependent name processing (that is, the separate lookup of names in templates when the template is parsed and when it is instantiated). The -h dep_name option cannot be used with the -h noparse_templates option.

2.6.6 -h [no]exceptions (CC)

Default: The default is -h exceptions; however, if the CRAYOLDCPPLIB environment variable is set to a nonzero value, the default is -h noexceptions.

The -h exceptions option enables support for exception handling. The -h noexceptions option causes the compiler to issue an error whenever an exception construct (a try block, a throw expression, or a throw specification on a function declaration) is encountered. -h exceptions is enabled by -h conform.

2.6.7 -h [no]anachronisms (CC)

Default: -h noanachronisms

The -h [no]anachronisms option enables or disables anachronisms in Cray C++. This option is overridden by -h conform.

2.6.8 -h new_for_init (CC)

The -h new_for_init option enables the new scoping rules for a declaration in a for-init statement.
Under the new (standard-conforming) rules, the entire for statement is wrapped in its own implicitly generated scope. -h new_for_init is implied by the -h conform option.

This is the result of the scoping rule:

    {
        . . .
        for (int i = 0; i < n; i++) {
            . . .
        }   // scope of i ends here for -h new_for_init
        . . .
    }       // scope of i ends here by default

2.6.9 -h [no]tolerant (cc, c99)

Default: -h notolerant

The -h tolerant option allows older, less standard C constructs to facilitate porting of code written for previous C compilers. Errors involving comparisons or assignments of pointers and integers become warnings. The compiler generates casts so that the types agree. With -h notolerant, the compiler is intolerant of the older constructs.

The use of the -h tolerant option causes the compiler to tolerate accessing an object with one type through a pointer to an entirely different type. For example, a pointer to long might be used to access an object declared with type double. Such references violate the C standard and should be eliminated, if possible. They can reduce the effectiveness of alias analysis and inhibit optimization.

2.6.10 -h [no]const_string_literals (CC)

Default: -h noconst_string_literals

The -h [no]const_string_literals option controls whether string literals are const (as required by the standard) or non-const (as was true in earlier versions of the C++ language).

2.6.11 -h [no]gnu (CC, cc)

Default: -h nognu

The -h gnu option enables the compiler to recognize the subset of the GCC version 3.3.2 extensions to C listed in Table 1. Table 2, page 19 lists the extensions that apply only to C++.

For detailed descriptions of the GCC C and C++ language extensions, see http://gcc.gnu.org/onlinedocs/.

Table 1. GCC C Language Extensions

GCC C Language Extension   Description
Typeof                     typeof: referring to the type of an expression
Lvalues                    Using ?:, the comma operator, and casts in lvalues
Conditionals               Omitting the middle operand of a ?: expression
Long Long                  Double-word integers: long long int
Complex                    Data types for complex numbers
Statement Exprs            Putting statements and declarations inside expressions
Zero Length                Zero-length arrays
Variable Length            Arrays whose length is computed at run time
Empty Structures           Structures with no members; applies to C but not C++
Variadic Macros            Macros with a variable number of arguments
Escaped Newlines           Slightly looser rules for escaped newlines
Multiline strings          String literals with embedded newlines
Initializers               Non-constant initializers
Compound Literals          Compound literals give structures, unions or arrays as values
Designated Inits           Labeling elements of initializers
Cast to Union              Casting to union type from any member of the union
Case Ranges                'case 1 ... 9' and such
Mixed Declarations         Mixing declarations and code
Attribute Syntax           Formal syntax for attributes
Function Prototypes        Prototype declarations and old-style definitions; applies to C but not C++
C++ Comments               C++ comments are recognized
Dollar Signs               Dollar sign is allowed in identifiers
Character Escapes          \e stands for the character <ESC>
Alignment                  Inquiring about the alignment of a type or variable
Inline                     Defining inline functions (as fast as macros)
Alternate Keywords         __const__, __asm__, etc., for header files
Incomplete Enums           enum foo;, with details to follow
Function Names             Printable strings which are the name of the current function
Return Address             Getting the return or frame address of a function
Unnamed Fields             Unnamed struct/union fields within structs/unions
Function Attributes        Declaring that functions have no side effects, or that they can never return. Supported attributes:
                           • nothrow
                           • format, format_arg
                           • deprecated
                           • used
                           • unused
                           • alias
                           • weak
Variable Attributes        Specifying attributes of variables. Supported attributes:
                           • alias
                           • deprecated
                           • unused
                           • used
                           • transparent_union
                           • weak
Type Attributes            Specifying attributes of types. Supported attributes:
                           • deprecated
                           • unused
                           • used
                           • transparent_union
Asm Labels                 Specifying the assembler name to use for a C symbol
Other Builtins             Other built-in functions:
                           • __builtin_types_compatible_p
                           • __builtin_choose_expr
                           • __builtin_constant_p
                           • __builtin_huge_val
                           • __builtin_huge_valf
                           • __builtin_huge_vall
                           • __builtin_inf
                           • __builtin_inff
                           • __builtin_infl
                           • __builtin_nan
                           • __builtin_nanf
                           • __builtin_nanl
                           • __builtin_nans
                           • __builtin_nansf
                           • __builtin_nansl

Special files such as /dev/null may be used as source files.

The supported subset of the GCC version 3.3.2 extensions to C++ is listed in Table 2.

Table 2. GCC C++ Language Extensions

GCC C++ Extensions                  Description
Min and Max                         C++ minimum and maximum operators
Restricted Pointers                 C99 restricted pointers and references
Backwards Compatibility             Compatibilities with earlier definitions of C++
Strong Using                        A using-directive with __attribute__ ((strong))
Explicit template specializations   Attributes may be used on explicit template specializations

2.7 Template Language Options

This section describes template language options. For more information about template instantiation, see Chapter 8, page 143. Each subsection heading shows in parentheses the compiler with which the option can be used.

2.7.1 -h simple_templates (CC)

The -h simple_templates option enables simple template instantiation by the Cray C++ compiler. For more information about template instantiation, see Chapter 8, page 143. The default is autoinstantiate.

2.7.2 -h [no]autoinstantiate (CC)

Default: -h autoinstantiate

The -h [no]autoinstantiate option enables or disables prelinker (automatic) instantiation of templates by the Cray C++ compiler. For more information about template instantiation, see Chapter 8, page 143.

2.7.3 -h one_instantiation_per_object (CC)

The -h one_instantiation_per_object option puts each template instantiation used in a compilation into a separate object file that has a .int.o extension. The primary object file will contain everything else that is not an instantiation. For the location of the object files, see the -h instantiation_dir option.

2.7.4 -h instantiation_dir=dirname (CC)

The -h instantiation_dir=dirname option specifies the instantiation directory that the -h one_instantiation_per_object option should use. If directory dirname does not exist, it will be created. The default directory is ./Template.dir.

2.7.5 -h instantiate=mode (CC)

Default: -h instantiate=none

Usually, during compilation of a source file, no template entities are instantiated (except those assigned to the file by automatic instantiation). However, the overall instantiation mode can be changed by using the -h instantiate=mode option, where mode is specified as none (the default), used, all, or local. To change the overall instantiation mode, specify one of the following for mode:

none    Default. Does not automatically create instantiations of any template entities. This is the most appropriate mode when prelinker (automatic) instantiation is enabled.

used    Instantiates only those template entities that were used in the compilation. This includes all static data members that have template definitions.

all     Instantiates all template functions declared or referenced in the compilation unit. For each fully instantiated template class, all of its member functions and static data members are instantiated regardless of whether they were used. Nonmember template functions are instantiated even if the only reference was a declaration.

local   Similar to instantiate=used except that the functions are given internal linkage. This mode provides a simple mechanism for those who are not familiar with templates. The compiler instantiates the functions used in each compilation unit as local functions, and the program links and runs correctly (barring problems due to multiple copies of local static variables). This mode may generate multiple copies of the instantiated functions and is not suitable for production use. This mode cannot be used in conjunction with prelinker (automatic) template instantiation. Automatic template instantiation is disabled by this mode.

If CC is given a single source file to compile and link, all instantiations are done in the single source file and, by default, the instantiate=used mode is used to suppress prelinker instantiation.

2.7.6 -h [no]implicitinclude (CC)

Default: -h implicitinclude

The -h [no]implicitinclude option enables or disables implicit inclusion of source files as a method of finding definitions of template entities to be instantiated.

2.7.7 -h remove_instantiation_flags (CC)

The -h remove_instantiation_flags option causes the prelinker to recompile all the source files to remove all instantiation flags.

2.7.8 -h prelink_local_copy (CC)

The -h prelink_local_copy option indicates that only local files (for example, files in the current directory) are candidates for assignment of instantiations.

2.7.9 -h prelink_copy_if_nonlocal (CC)

The -h prelink_copy_if_nonlocal option specifies that assignment of an instantiation to a nonlocal object file will result in the object file being recompiled in the current directory.

2.8 Virtual Function Options

This section describes virtual function options. Each subsection heading shows in parentheses the compiler with which the option can be used.

2.8.1 -h forcevtbl (CC)

Forces the definition of virtual function tables in cases where the heuristic methods used by the compiler to decide on definition of virtual function tables provide no guidance. The virtual function table for a class is defined in a compilation if the compilation contains a definition of the first noninline, nonpure virtual function of the class. For classes that contain no such function, the default behavior is to define the virtual function table (but to define it as a local static entity). The -h forcevtbl option differs from the default behavior in that it does not force the definition to be local.

2.8.2 -h suppressvtbl (CC)

Suppresses the definition of virtual function tables in cases where the heuristic methods used by the compiler to decide on definition of virtual function tables provide no guidance.

2.9 General Language Options

This section describes general language options. Each subsection heading shows in parentheses the compiler with which the option can be used.

2.9.1 -h keep=file (CC)

When the -h keep=file option is specified, the static constructor/destructor object (.o) file is retained as file. This option is useful when linking .o files on a system that does not have a C++ compiler. The use of this option requires that the main function be compiled by C++ and that the static constructor/destructor function be included in the link. With these precautions, mixed object files (files with .o suffixes) from C and C++ compilations can be linked into executables by using the loader command instead of the CC command.

2.9.2 -h restrict=args (CC, cc, c99)

The -h restrict=args option globally tells the compiler to treat certain classes of pointers as restricted pointers. You can use this option to enhance optimizations (this includes vectorization). Classes of affected pointers are determined by the value contained in args, as follows:

args   Description

a      All pointers to object and incomplete types are considered restricted pointers, regardless of where they appear in the source code. This includes pointers in class, struct, and union declarations, type casts, function prototypes, and so on.

       ! Caution: Do not specify restrict=a if, during execution of any function, an object is modified and that object is referenced through either two different pointers or through the declared name of the object and a pointer. Undefined behavior may result.

f      All function parameters that are pointers to objects or incomplete types can be treated as restricted pointers.

       ! Caution: Do not specify restrict=f if, during execution of any function, an object is modified and that object is referenced through either two different pointer function parameters or through the declared name of the object and a pointer function parameter. Undefined behavior may result.

t      All parameters that are this pointers can be treated as restricted pointers (Cray C++ only).

       ! Caution: Do not specify restrict=t if, during execution of any function, an object is modified and that object is referenced through the declared name of the object and a this pointer. Undefined behavior may result.

The args arguments tell the compiler to assume that, in the current compilation unit, each pointer (=a), each pointer that is a function parameter (=f), or each this pointer (=t) points to a unique object. This assumption eliminates those pointers as sources of potential aliasing, and may allow additional vectorization or other optimizations. These options cause only data dependencies from pointer aliasing to be ignored, rather than all data dependencies, so they can be used safely for more programs than the -h ivdep option.

! Caution: Like -h ivdep, the arguments make assertions about your program that, if incorrect, can introduce undefined behavior. You should not use -h restrict=a if, during the execution of any function, an object is modified and that object is referenced through either of the following:

• Two different pointers
• The declared name of the object and a pointer

The -h restrict=f and -h restrict=t options are subject to the analogous restriction, with "function parameter pointer" replacing "pointer."

2.9.3 -h [no]calchars (CC, cc, c99)

Default: -h nocalchars

The -h calchars option allows the use of the @ (Cray X1 series only) and $ characters in identifier names. This option is useful for porting codes in which identifiers include these characters. With -h nocalchars, these characters are not allowed in identifier names.

! Caution: Use this option with extreme care, because identifiers with these characters are within Cray X1 series UNICOS/mp or Cray X2 CNL name space and are included in many library identifiers, internal compiler labels, objects, and functions. You must prevent conflicts between any of these uses, current or future, and identifier declarations or references in your code; any such conflict is an error.

2.9.4 -h [no]signedshifts (CC, cc, c99)

Default: -h signedshifts

The -h [no]signedshifts option affects the result of the right shift operator. For the expression e1 >> e2 where e1 has a signed type, when -h signedshifts is in effect, the vacated bits are filled with the sign bit of e1. When -h nosignedshifts is in effect, the vacated bits are filled with zeros, identical to the behavior when e1 has an unsigned type. See also Section 14.2.5, page 191, about the effects of this option when shifting integers.

2.10 General Optimization Options

This section describes general optimization options. Each subsection heading shows in parentheses the compiler with which the option can be used.

2.10.1 -h [no]aggress (CC, cc, c99)

Default: -h noaggress

The -h aggress option provides greater opportunity to optimize loops that would otherwise be inhibited from optimization due to an internal compiler size limitation. -h noaggress leaves this size limitation in effect. With -h aggress, internal compiler tables are expanded to accommodate larger loop bodies. This option can increase compilation time and memory use.

2.10.2 -h display_opt

The -h display_opt option displays the current optimization settings for this compilation.

2.10.3 -h [no]fusion (CC, cc, c99)

Default: -h fusion

The -h [no]fusion option globally allows or disallows loop fusion. By default, the compiler attempts to fuse all loops, unless the -h nofusion option is specified.
Fusing loops generally increases single processor performance by reducing memory traffic and loop overhead. On rare occasions loop fusing may degrade performance.

Note: Loop fusion is disabled when the scalar level is set to 0.

For more information about loop fusion, see the Optimizing Applications on Cray X1 Series Systems manual or the Optimizing Applications on Cray X2 Systems manual.

2.10.4 -h gcpn

Default: -hgcp0

Enables global constant propagation, where n can be 0 (optimization is disabled) or 1 (optimization is enabled). Exposing constants at compile time can allow more aggressive and efficient optimization. The entire executable program must be presented to the compiler at once. Analysis will not occur if the main entry point to the program is not available.

This optimization examines the entire program for statically initialized variables that are not modified within the program. References to the variables are replaced with the constant present in the initializer. Replacement will not occur if there is a type mismatch between the constant and a variable being replaced.

If GCP analysis encounters a dead-end in the call graph, the compiler issues a message saying that the dead-end routine was not available for interprocedural analysis. A dead-end in the call graph causes the analysis to fail and no variables are replaced with constants.

If a routine is not present for analysis, the compiler assumes that arguments passed on a call to that routine are modified and that all global static data is modified as well. The user can specify that a routine has no side effects by declaring the routine "pure," in which case GCP analysis is not inhibited. The compiler consults its database of "pure" library routines when doing GCP analysis.

The user can invoke -hipafrom= in conjunction with the -hgcpn option, as shown in the following example:

    cc -hipafrom=ipa.c -hgcp1 t.c

When using the -hipafrom= command line option as shown above, the compiler searches only in ipa.c for routine definitions to use during interprocedural analysis. To have interprocedural analysis also search t.c for routines, invoke the compiler as follows:

    cc -hipafrom=t.c:ipa.c -hgcp1 t.c

Note: Only routines in t.c actually get linked into the executable. To link a routine into an executable, it must be input to the compile step.

Warning: If the user has duplicate definitions of a routine in the input to the compiler and in the input to -hipafrom=, it is the user's responsibility to ensure that the input is identical. If not, the behavior of the generated code is unpredictable.

2.10.5 -h gen_private_callee (CC, cc, c99)

Note: The -h gen_private_callee option is not supported on Cray X2 systems.

The -h gen_private_callee option is used when compiling source files containing routines that will be called from streamed regions, whether those streamed regions are created by CSD directives or by the use of the ssp_private or concurrent directives to cause autostreaming. For more information about the ssp_private directive, see Section 3.8.1, page 104. For more information about CSDs, see Chapter 4, page 115.

2.10.6 -h [no]intrinsics (CC, cc, c99)

Default: -h intrinsics

The -h intrinsics option allows the use of intrinsic hardware functions, which allow direct access to some hardware instructions or generate inline code for some functions. This option has no effect on specially-handled library functions. Intrinsic functions are described in Appendix F, page 229.

2.10.7 -h list=opt (CC, cc, c99)

The -h list=opt option allows the creation of a loopmark listing and controls its format. The listings are written to source_file_name_without_suffix.lst.
For additional information about loopmark listings, see the Optimizing Applications on Cray X1 Series Systems manual or the Optimizing Applications on Cray X2 Systems manual.

The values for opt are:

a   Use all list options; source_file_name_without_suffix.lst includes summary report, options report, and source listing.

b   Add page breaks to listing.

d   Produce decompilation output files. See Section 2.22.3, page 59. Provided for compatibility with the Fortran option set.

e   Expand include files.
    Note: Using this option may result in a very large listing file. All system include files are also expanded.

i   Intersperse optimization messages within the source listing rather than at the end.

m   Create loopmark listing; source_file_name_without_suffix.lst includes summary report and source listing.

p   Causes the compiler to insert carriage control characters into column one of each line in the listing. Use this option for line printers which require the carriage control characters to control the vertical position of each printed line. Table 3 shows the carriage control characters used.

    Table 3. Carriage Control Characters

    Control character   Action
    1                   New page
    Blank               Single spacing

s   Create a complete source listing (include files not expanded).

w   Create a wide listing rather than the default of 80 characters.

Using -h list=m creates a loopmark listing. The b, e, i, s, and w options provide additional listing features. Using -h list=a combines all options.

2.10.8 -h msp (CC, cc, c99)

Note: The -h msp option is not supported on Cray X2 systems.

Default: -h msp

The -h msp option causes the compiler to generate code and to select the appropriate libraries to create an executable that runs on one or more multistreaming processors (MSP mode). Any code, including code using Cray-supported distributed memory models, can use MSP mode.

Executables compiled for MSP mode can contain object files compiled with MSP or SSP mode.
That is, MSP and SSP object files can be specified during the load step as follows:

    cc -h msp -c ...    /* Produce MSP object files */
    cc -h ssp -c ...    /* Produce SSP object files */

    /* Link MSP and SSP object files */
    /* to create an executable to run on MSPs */
    cc sspA.o sspB.o msp.o ...

For more information about MSP mode, see the Optimizing Applications on Cray X1 Series Systems manual. For information about SSP mode, see Section 2.10.13, page 30.

2.10.9 -h [no]overindex (CC, cc, c99)

Default: -h nooverindex

The -h overindex option declares that there are array subscripts that index a dimension of an array that is outside the declared bounds of that array. The -h nooverindex option declares that there are no array subscripts that index a dimension of an array that is outside the declared bounds of that array.

2.10.10 -h [no]pattern (CC, cc, c99)

Default: -h pattern

The -h [no]pattern option globally enables or disables pattern matching. Pattern matching is on by default. For details on pattern matching, see the Optimizing Applications on Cray X1 Series Systems manual or the Optimizing Applications on Cray X2 Systems manual.

2.10.11 -h profile_data=pgo_opt (CC, cc, c99)

The -h profile_data=pgo_opt option tells the compiler how to treat #pragma pgo profile-guided optimization directives. There are two pgo_opt levels:

sample     Directs the compiler to treat #pragma pgo directives as actions that gather information from a sample program. This pgo_opt level prevents the compiler from performing unsafe optimizations with the data.

absolute   Directs the compiler to treat #pragma pgo directives as acting on the only data set that the program will ever use. This pgo_opt level can be used when program units are always called with the same arguments or when it is known that the data will not change from the experimental runs.

For information about the pgo loop_info directive, see Section 3.7.10, page 99.
For information about CrayPat and profile information, see the Using Cray Performance Analysis Tools guide.

2.10.12 -h profile_generate (CC, cc, c99)

The -h profile_generate option directs that the source code be instrumented for gathering profile information. The compiler inserts calls and data-gathering instructions to allow CrayPat to gather information about the loops in a compilation unit. If you use this option, you must run CrayPat on the resulting executable so the CrayPat data-gathering routines are linked in. For information about CrayPat and profile information, see the Using Cray Performance Analysis Tools guide.

2.10.13 -h ssp (CC, cc, c99)

Note: The -h ssp option is not supported on Cray X2 systems.

Default: -h msp

The -h ssp option causes the compiler to compile the code and select the appropriate libraries to create an executable that runs on one single-streaming processor (SSP mode). Any code, including code using Cray-supported distributed memory models, can use SSP mode.

Executables compiled for SSP mode can contain only object files compiled in SSP mode. When loading object files separately from the compile step, the SSP mode must be specified during the load step, as this example shows:

    /* Produce SSP object files */
    cc -h ssp -c ...

    /* Link SSP object files */
    /* to create an executable to run on a single SSP */
    cc -h ssp sspA.o sspB.o ...

Since SSP mode does not use multistreaming, the -h ssp option also changes the compiler's behavior in the same way as the -h stream0 option. This option then causes the compiler to ignore CSDs.

Note: Code explicitly compiled with the -h stream0 option can be linked with object files compiled with MSP or SSP mode. You can use this option to create a universal library that can be used in MSP or SSP mode.

For more information about SSP mode, see the Optimizing Applications on Cray X1 Series Systems manual. For information about MSP mode, see Section 2.10.8, page 29.

Note: For Cray X1 series systems, the -h ssp and -h command options both create executables that run on an SSP. The executable created via the -h ssp option runs on an application node. The executable created via the -h command option runs on the support node.

2.10.14 -h [no]unroll (CC, cc, c99)

Default: -h unroll

The -h [no]unroll option globally allows or disallows unrolling of loops. By default, the compiler attempts to unroll all loops, unless the -h nounroll option is specified, or the unroll0 or unroll1 pragma (Section 3.9.5, page 109) is specified for a loop. Loop unrolling generally increases single processor performance at the cost of increased compile time and code size.

For more information about loop unrolling, see the Optimizing Applications on Cray X1 Series Systems manual or the Optimizing Applications on Cray X2 Systems manual.

Note: Loop unrolling is disabled when the scalar level is set to 0.

2.10.15 -Olevel (CC, cc, c89, c99)

Default: Equivalent to the appropriate -h option

The -O level option specifies the optimization level for a group of compiler features. Specifying -O with no argument is the same as not specifying the -O option; this syntax is supported for compatibility with other vendors.

A value of 0, 1, 2, or 3 sets that level of optimization for each of the -h scalarn, -h streamn, and -h vectorn options. On Cray X2 systems, -O values of 0, 1, 2, or 3 set that level of optimization for -h cachen options. On Cray X1 series systems, the -h cache value is -h cache0 for all -O level values. For example, on Cray X1 series systems, -O2 is equivalent to the following:

    -h scalar2,stream2,vector2,cache0

Table 5 shows the equivalent level of automatic cache optimization for the -h option.

Optimization features specified by -O are equivalent to the -h options listed in Table 4.

Table 4. -h Option Descriptions

-h option     Description location
-h cachen     Section 2.11.1, page 33
-h streamn    Section 2.12.1, page 34
-h vectorn    Section 2.13.3, page 36
-h scalarn    Section 2.15.2, page 41

2.11 Automatic Cache Management Options

This section describes the automatic cache management options. Automatic cache management can be overridden by the use of the cache directives (no_cache_alloc, cache_shared, cache_exclusive, and loop_info).

2.11.1 -h cachen (CC, cc, c99)

Default: -h cache0

The -h cachen option specifies the levels of automatic cache management to perform. The default for Cray X2 systems is -h cache2.

The n argument can be:

0   No automatic cache management; all memory references are allocated to cache in an exclusive state. Cache directives are still honored. Characteristics include low compile time. This level is compatible with all scalar, vector, and, for Cray X1 series systems, stream optimization levels.

1   Conservative automatic cache management. Characteristics include moderate compile time. Symbols are placed in the cache when the possibility of cache reuse exists and the predicted cache footprint of the symbol in isolation is small enough to experience the reuse. This level requires at least -h vector1.

2   Moderately aggressive automatic cache management. Characteristics include moderate compile time. Symbols are placed in the cache when the possibility of cache reuse exists and the predicted state of the cache model is such that the symbol will experience the reuse. This level requires at least -h vector1.

3   Aggressive automatic cache management. Characteristics include potentially high compile time. Symbols are placed in the cache when the possibility of cache reuse exists and the allocation of the symbol to the cache is predicted to increase the number of cache hits. This level requires at least -h vector1.

Table 5. Cache Levels

-h option   Cray X1 Series Systems   Cray X2 Systems
-O0         -h cache0                -h cache0
-O1         -h cache0                -h cache1
-O2         -h cache0                -h cache2
-O3         -h cache0                -h cache3

2.12 Multistreaming Processor Optimization Options

Note: The multistreaming processor optimization options are not supported on Cray X2 systems.

This section describes the multistreaming processor (MSP) options. For information about MSP #pragma directives, see Section 3.8, page 103. For information about streaming intrinsics, see Appendix F, page 229. Each subsection heading shows in parentheses the compiler command with which the option can be used.

These options cannot be used in SSP mode, which is enabled with the -h ssp option.

2.12.1 -h streamn (CC, cc, c99)

Default: -h stream2

The -h streamn option specifies the level of automatic MSP optimizations to be performed. Generally, vectorized applications that execute on a one-processor system can expect to execute up to four times faster on a processor with multistreaming enabled.

The n argument can be:

n   Description

0   No automatic multistreaming optimizations are performed.

1   Conservative automatic multistreaming optimizations. This level is compatible with -h vector1, 2, and 3.

2   Moderate automatic multistreaming optimizations. Automatic multistreaming optimization is performed on loop nests and appropriate bit matrix multiplication (BMM) operations. This option also enables conditional streaming. Conditional streaming allows runtime selection between streamed and nonstreamed versions of a loop based on dependence conditions which cannot be evaluated until runtime. For details, see the Optimizing Applications on Cray X1 Series Systems manual. This level is compatible with -h vector2 and 3.

3   Aggressive automatic multistreaming optimizations. Automatic multistreaming optimization is performed as with stream2. This level is compatible with -h vector2 and 3.

2.13 Vector Optimization Options

This section describes vector optimization options. Each subsection heading shows in parentheses the compiler command with which the option can be used.

2.13.1 -h [no]infinitevl (CC, cc, c99)

Default: -h infinitevl

The -h infinitevl option tells the compiler to assume an infinite safe vector length for all #pragma _CRI ivdep directives. The -h noinfinitevl option tells the compiler to assume a safe vector length equal to the maximum supported vector length on the machine for all #pragma _CRI ivdep directives.

2.13.2 -h [no]ivdep (CC, cc, c99)

Default: -h noivdep

The -h ivdep option tells the compiler to ignore vector dependencies for all loops. This is useful for vectorizing loops that contain pointers. With -h noivdep, loop dependencies inhibit vectorization. To control loops individually, use the #pragma _CRI ivdep directive, as discussed in Section 3.7.2, page 91.

This option can also be used with "vectorization-like" optimizations found in Section 3.7, page 90.

! Caution: This option should be used with extreme caution because incorrect results can occur if there is a vector dependency within a loop. Combining this option with inlining is dangerous because inlining can introduce vector dependencies. This option severely constrains other loop optimizations and should be avoided if possible.

2.13.3 -h vectorn (CC, cc, c99)

Default: -h vector2

The -h vectorn option specifies the level of automatic vectorizing to be performed. Vectorization results in dramatic performance improvements with a small increase in object code size. Vectorization directives are unaffected by this option.

Argument n can be one of the following:

n   Description

0   No automatic vectorization. Characteristics include low compile time and small compile size. This option is compatible with all scalar optimization levels.

1   Specifies conservative vectorization. Characteristics include moderate compile time and size. No loop nests are restructured; only inner loops are vectorized. No vectorizations that might create false exceptions are performed. Results may differ slightly from results obtained when -h vector0 is specified because of vector reductions. The -h vector1 option is compatible with -h scalar1, -h scalar2, -h scalar3, or -h stream1.

2   Specifies moderate vectorization. Characteristics include moderate compile time and size. Loop nests are restructured. The -h vector2 option is compatible with -h scalar2 or -h scalar3 and with -h stream0, -h stream1, and -h stream2.

3   Specifies aggressive vectorization. Characteristics include potentially high compile time and size. Loop nests are restructured. Vectorizations that might create false exceptions in rare cases may be performed.

Vectorization directives are described in Section 3.7, page 90.

2.14 Inlining Optimization Options

Inlining is the process of replacing a user function call with the function definition itself. This saves call overhead and may allow better optimization of the inlined code. If all calls within a loop are inlined, the loop becomes a candidate for vectorization or, for Cray X1 series systems, streaming. Inlining may increase object code size.

Inlining is inhibited if:

• Arguments declared in a function differ in type from arguments in a function call.
• The number of arguments declared in a function differs from the number of arguments in a function call.
• A call site is within the range of a #pragma inline_disable directive. For a description of the inline_disable directive, see Section 3.10.1, page 112.
• A function being called is specified on a #pragma inline_never directive. For a description of the inline_never directive, see Section 3.10.2, page 113.
• The compiler determines that the routine is too big to inline. This is determined by an internal limit of the text (that is, the instruction segment in the executable) size of the routine. You can override this limit by inserting a #pragma inline_always directive. For a description of the inline_always directive, see Section 3.10.2, page 113.

The compiler supports the following inlining modes:

• Automatic inlining (see Section 2.14.2)
• Explicit inlining (see Section 2.14.3, page 40)
• Combined inlining (see Section 2.14.4, page 41)

2.14.1 -h clonen (CC, cc)

Default: -h clone0

The following command line options control procedural cloning:

• -h clone0, disable cloning (default)
• -h clone1, enable cloning

Cloning is the attempt to duplicate a procedure under certain conditions and replace dummy arguments with associated constant actual arguments throughout the cloned procedure. The compiler attempts to clone a procedure when a call site contains actual arguments that are scalar integer and/or scalar logical constants. When the constants are exposed to the optimizer, it can generate more efficient code.

Note: Do not specify the -h ipafrom= option when using the cloning option.

The cloning option works in conjunction with the -h ipan option where n is greater than or equal to 2. When a clone is made, dummy arguments are replaced with associated constant values throughout the routine.

When specifying the -hclone1 option, you must also specify one of the following inlining options on the command line: -h ipa2, -h ipa3, -h ipa4, or -h ipa5.

2.14.2 -h ipan (CC, cc, c89, c99)

Default: -h ipa3

The -h ipan option specifies automatic inlining. Automatic inlining allows the compiler to automatically select, depending on the inlining level n, which functions to inline. Each n is a different set of heuristics. When -h ipan is used alone, the candidates for expansion are all those functions that are present in the input file to the compile step.
If -h ipan is used in conjunction with -h ipafrom=, the candidates for expansion are all those functions present in source. Table 6 explains what is inlined at each level.

Table 6. Automatic Inlining Specifications

Inlining level   Description
0                All inlining is disabled. All inlining compiler directives are ignored. For more information about inlining directives, see Section 3.10, page 111.
1                Directive inlining. Inlining is attempted for call sites and routines that are under the control of an inlining pragma directive.
2                Call nest inlining. Inline a call nest to an arbitrary depth as long as the nest does not exceed some compiler-determined threshold. A call nest can be a leaf routine. The expansion of the call nest must yield straight-line code (such as code that contains no external calls) for any expansion to occur.
3                Constant actual argument inlining. This is the combination of levels 1 and 2 plus any call site that contains a constant argument. The default inlining level is -h ipa3.
4                Tiny routine inlining. This includes levels 1, 2, and 3, plus the inlining of very small routines regardless of where those routines fall in the call graph. The lower limit threshold is an internal compiler parameter.
5                Aggressive inlining. Inlining is attempted for every call site encountered. Cray does not recommend specifying this level.

2.14.3 -h ipafrom=source[:source] (CC, cc, c89, c99)

The -h ipafrom=source[:source] option specifies explicit inlining. The source arguments identify the files or directories that contain the functions to consider for inlining. Only those functions present in source are candidates for inlining. When a call is encountered to a function that resides in source, an attempt will be made to expand the function in place at that call site.

Note that blanks are not allowed on either side of the equal sign.

All inlining directives are recognized with explicit inlining.
For information about inlining directives, see Section 3.10, page 111.

The functions in source are not actually loaded with the final program. They are simply templates for the inliner. To have a function contained in source loaded with the program, you must include it in an input file to the compilation.

Use one or more of the following objects in the source argument.

Source                  Description
C or C++ source files   The functions in C or C++ source files are candidates for inline expansion and must contain error-free code. C files that are acceptable for inlining are files with the .c extension. C++ files that are acceptable for inlining are files that have one of the following extensions: .C, .c++, .C++, .cc, .cxx, .Cxx, .CXX, .CC, or .cpp.
dir                     A directory that contains any of the file types described in this table.

2.14.4 Combined Inlining

Combined inlining is a combination of automatic inlining and explicit inlining. It allows you to specify targets for inline expansion, while applying the selected level of inlining heuristics.

You invoke combined inlining by including both the -h ipan and the -h ipafrom=source[:source] options on the command line. The only candidates for expansion are those functions that reside in source. The rules that apply to deciding whether to inline are defined by the -h ipan setting.

2.15 Scalar Optimization Options

This section describes scalar optimization options. Each subsection heading shows in parentheses the compiler command with which the option can be used.

2.15.1 -h [no]interchange (CC, cc, c99)

Default: -h interchange

The -h interchange option allows the compiler to attempt to interchange all loops, a technique that is used to gain performance by having the compiler swap an inner loop with an outer loop. The compiler attempts the interchange only if the interchange will increase performance. Loop interchange is performed only at scalar optimization level 2 or higher.
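As an illustration of what interchanging a loop nest means, the following hand-written sketch (not compiler output) shows the same computation before and after the inner and outer loops are swapped. The names sum_before_interchange, sum_after_interchange, ROWS, and COLS are hypothetical, and whether the compiler would actually choose to interchange this particular nest is an assumption.

```c
#include <stddef.h>

#define ROWS 4   /* illustrative sizes only */
#define COLS 3

/* Original nest: the inner loop varies the row index i, so successive
   accesses to m[i][j] are COLS elements apart in memory. */
long sum_before_interchange(int m[ROWS][COLS])
{
    long total = 0;
    for (size_t j = 0; j < COLS; j++)
        for (size_t i = 0; i < ROWS; i++)
            total += m[i][j];
    return total;
}

/* Interchanged nest: the loops are swapped, so the inner loop now
   varies the column index j and walks memory with stride 1. */
long sum_after_interchange(int m[ROWS][COLS])
{
    long total = 0;
    for (size_t i = 0; i < ROWS; i++)
        for (size_t j = 0; j < COLS; j++)
            total += m[i][j];
    return total;
}
```

Both functions return the same value for any input; only the order in which the elements are visited differs, which is why the transformation is safe for this nest.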
The -h nointerchange option prevents the compiler from attempting to interchange any loops. To disable interchange of loops individually, use the #pragma _CRI nointerchange directive.

2.15.2 -h scalarn (CC, cc, c99)

Default: -h scalar2

The -h scalarn option specifies the level of automatic scalar optimization to be performed. Scalar optimization directives are unaffected by this option (see Section 3.9, page 107).

Use one of the following values for n:

0    No automatic scalar optimization. The -h matherror=errno and -h zeroinc options are implied by -h scalar0.
1    Conservative automatic scalar optimization. This level implies -h matherror=abort and -h nozeroinc.
2    Moderate automatic scalar optimization. The scalar optimizations specified by scalar1 are performed.
3    Aggressive automatic scalar optimization.

2.15.3 -h [no]zeroinc (CC, cc, c99)

Default: -h nozeroinc

The -h nozeroinc option improves run time performance by causing the compiler to assume that constant increment variables (CIVs) in loops are not incremented by expressions with a value of 0.

The -h zeroinc option causes the compiler to assume that some CIVs in loops might be incremented by 0 for each pass through the loop, preventing generation of optimized code. For example, in a loop with index i, the expression expr in the statement i += expr can evaluate to 0. This rarely happens in actual code. -h zeroinc is the safer and slower option. This option is affected by the -h scalarn option (see Section 2.15.2, page 41).

2.16 Math Options

This section describes compiler options pertaining to math functions. Each subsection heading shows in parentheses the compiler command with which the option can be used.

2.16.1 -h fpn (CC, cc, c99)

Default: -h fp2

The -h fp option allows you to control the level of floating-point optimizations.
The n argument controls the level of allowable optimization; 0 gives the compiler minimum freedom to optimize floating-point operations, while 3 gives it maximum freedom. The higher the level, the less closely the floating-point operations conform to the IEEE standard.

This option is useful for codes that use unstable algorithms but that are optimizable. It is also useful for applications that want aggressive floating-point optimizations that go beyond what the Fortran standard allows.

Generally, this is the behavior and usage for each -h fp level:

• The -h fp0 option causes your program's executable code to conform more closely to the IEEE floating-point standard than the default mode (-h fp2). When this level is specified, many identity optimizations are disabled, vectorization of floating-point reductions is disabled, executable code is slower than at higher floating-point optimization levels, and a scaled complex divide mechanism is enabled that increases the range of complex values that can be handled without producing an underflow.

  Note: Use the -h fp0 option only when your code pushes the limits of IEEE accuracy or requires strong IEEE standard conformance.

• The -h fp1 option performs various, generally safe, non-conforming IEEE optimizations, such as folding a == a to true, where a is a floating-point object. At this level, floating-point reassociation (see footnote 1) is greatly limited, which may affect the performance of your code. Use the -h fp1 option only when your code pushes the limits of IEEE accuracy or requires strong IEEE standard conformance.

• The -h fp2 option includes the optimizations of -h fp1.

• The -h fp3 option includes the optimizations of -h fp2. Use the -h fp3 option when performance is more critical than the level of IEEE standard conformance provided by -h fp2.

Footnote 1: For example, a+b+c is rearranged to b+a+c, where a, b, and c are floating-point variables.
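To see why limiting reassociation matters, the following minimal sketch shows that regrouping a floating-point sum can change its result. It uses a regrouping, a+(b+c) instead of (a+b)+c, chosen for illustration; it is not taken from the compiler, and the function names add_left_first and add_right_first are hypothetical.

```c
/* Adds with the grouping (a + b) + c. The volatile temporary keeps the
   compiler from regrouping the operations itself. */
double add_left_first(double a, double b, double c)
{
    volatile double t = a + b;
    return t + c;
}

/* Adds with the grouping a + (b + c). */
double add_right_first(double a, double b, double c)
{
    volatile double t = b + c;
    return a + t;
}
```

With IEEE double precision and the inputs 1e20, -1e20, and 1.0, add_left_first returns 1.0, while add_right_first returns 0.0, because 1.0 is too small to survive being added to -1e20. An optimizer that is free to regroup sums can therefore change results, which is the trade-off the -h fp levels control.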
Table 7 compares the various optimization levels of the -h fp option (levels 2 and 3 are usually the same). The table lists some of the optimizations performed; the compiler may perform other optimizations not listed.

Table 7. Floating-point Optimization Levels

Optimization type                                       0                    1                    2 (default)                                 3
Inline selected mathematical library functions          N/A                  N/A                  N/A                                         Accuracy is slightly reduced.
Complex divisions                                       Accurate and slower  Accurate and slower  Less accurate (less precision) and faster.  Less accurate (less precision) and faster.
Exponentiation rewrite                                  None                 None                 Maximum performance (footnote 2)            Maximum performance (footnotes 2, 3)
Strength reduction                                      Fast                 Fast                 Aggressive                                  Aggressive
Rewrite division as reciprocal equivalent (footnote 4)  None                 None                 Yes                                         Yes
Floating point reductions                               Slow                 Fast                 Fast                                        Fast
Safety                                                  Maximum              Moderate             Moderate                                    Low

Footnote 2: Rewriting values raised to a constant power into an algebraically equivalent series of multiplications and/or square roots.
Footnote 3: Rewriting exponentiations (a^b) not previously optimized into the algebraically equivalent form exp(b * ln(a)).
Footnote 4: For example, x/y is transformed to x * (1.0/y).

If multiple -h fp options are used, the compiler will use only the right-most option and will issue a message indicating such.

2.16.2 -h ieee_nonstop

Specifies that the IEEE-754 "nonstop" floating-point environment is used. This environment disables all traps (interrupts) on floating-point exceptions, enables recording of all floating-point exceptions in the floating-point status register, and rounds floating-point operations to nearest.

When this option is omitted, Invalid, Overflow, and Divide by zero exceptions will trap and be recorded; Underflow and Inexact exceptions will neither trap nor be recorded; and floating-point operations round to nearest.

For Cray X1 series systems, this option requires UNICOS/mp 2.5 release or later. For Cray X2 systems, this option requires UNICOS/lc 1.0.
2.16.3 -h matherror=method (CC, cc, c99)

Default: -h matherror=abort

The -h matherror=method option specifies the method of error processing used if a standard math function encounters an error. The method argument can have one of the following values:

method   Description
abort    If an error is detected, errno is not set. Instead a message is issued and the program aborts. An exception may be raised.
errno    If an error is detected, errno is set and the math function returns to the caller. This method is implied by the -h conform, -h scalar0, -O0, -Gn, and -g options.

2.17 Debugging Options

This section describes compiler options used for debugging. Each subsection heading shows in parentheses the compiler command with which the option can be used.

2.17.1 -Glevel (CC, cc, c99) and -g (CC, cc, c89, c99)

The -Glevel and -g options enable the generation of debugging information that is used by symbolic debuggers such as TotalView. These options allow debugging with breakpoints. Table 8 describes the values for the -G option.

Table 8. -Glevel Definitions

level   Optimization   Breakpoints allowed on
f       Full           Function entry and exit
p       Partial        Block boundaries
n       None           Every executable statement

Levels that restrict breakpoints to fewer locations (such as f, full) leave the compiler greater optimization opportunities. Debugging at any level may inhibit some optimization techniques, such as inlining.

The -g option is equivalent to -Gn. The -g option is included for compatibility with earlier versions of the compiler and many other UNIX systems; the -G option is the preferred specification. The -Gn and -g options disable all optimizations and imply -O0.

The debugging options take precedence over any conflicting options that appear on the command line. If more than one debugging option appears, the last one specified overrides the others.

Debugging is described in more detail in Chapter 12, page 167.
2.17.2 -h [no]bounds (cc, c99)

Default: -h nobounds

The -h bounds option provides checking of pointer and array references to ensure that they are within acceptable boundaries. -h nobounds disables these checks.

The pointer check verifies that the pointer is greater than 0 and less than the machine memory limit. The array check verifies that the subscript is greater than or equal to 0 and is less than the array size, if declared.

2.17.3 -h zero (CC, cc, c99)

The -h zero option causes stack-allocated memory to be initialized to all zeros.

2.17.4 -h dir_check (CC, cc)

Enables directive checking at runtime. Errors detected at compile time are reported during compilation and so are not reported at runtime. The following directives are checked: shortloop, shortloop128, and the loop_info clauses min_trips and max_trips. Violation of a runtime check results in an immediate fatal error diagnostic.

Warning: Optimization of enclosing and adjacent loops is degraded when runtime directive checking is enabled. This capability, though useful for debugging, is not recommended for production runs.

2.18 Compiler Message Options

This section describes compiler options that affect messages. Each subsection heading shows in parentheses the compiler command with which the option can be used.

2.18.1 -h msglevel_n (CC, cc, c99)

Default: -h msglevel_3

The -h msglevel_n option specifies the lowest level of severity of messages to be issued. Messages at the specified level and above are issued. Argument n can be one of the following options:

0    Comment
1    Note
2    Caution
3    Warning
4    Error

2.18.2 -h [no]message=n[:n...] (CC, cc, c99)

Default: Determined by -h msglevel_n

The -h [no]message=n[:n...] option enables or disables specified compiler messages. n is the number of a message to be enabled or disabled.
You can specify more than one message number; multiple numbers must be separated by a colon with no intervening spaces. For example, to disable messages CC-174 and CC-9, specify:

    -h nomessage=174:9

The -h [no]message=n option overrides -h msglevel_n for the specified messages. If n is not a valid message number, it is ignored. Any compiler message except ERROR, INTERNAL, and LIMIT messages can be disabled; attempts to disable these messages by using the -h nomessage=n option are ignored.

2.18.3 -h report=args (CC, cc, c99)

The -h report=args option generates report messages specified in args and lets you direct the specified messages to a file. args can be any combination of the following options:

i    Generates inlining optimization messages
m    Generates multistream optimization messages (Cray X1 series systems only)
s    Generates scalar optimization messages
v    Generates vector optimization messages
f    Writes specified messages to file file.V, where file is the source file specified on the command line. If the f option is not specified, messages are written to stderr.

No spaces are allowed around the equal sign (=) or any of the args codes. For example, the following command prints inlining and scalar optimization messages for myfile.c:

    % cc -h report=is myfile.c

2.18.4 -h [no]abort (CC, cc, c99)

Default: -h noabort

The -h [no]abort option controls whether a compilation aborts if an error is detected.

2.18.5 -h errorlimit[=n] (CC, cc, c99)

Default: -h errorlimit=100

The -h errorlimit[=n] option specifies the maximum number of error messages the compiler prints before it exits, where n is a positive integer. Specifying -h errorlimit=0 disables exiting on the basis of the number of errors. Specifying -h errorlimit with no qualifier is the same as setting n to 1.

2.19 Compilation Phase Options

This section describes compiler options that affect compilation phases.
Each subsection heading shows in parentheses the compiler command with which the option can be used.

2.19.1 -E (CC, cc, c89, c99, cpp)

If the -E option is specified on the command line (except for cpp), it executes only the preprocessor phase of the compiler. The -E and -P options are equivalent, except that -E directs output to stdout and inserts appropriate #line preprocessing directives. The -E option takes precedence over the -h feonly, -S, and -c options.

If the -E option is specified on the cpp command line, it inserts the appropriate #line directives in the preprocessed output. When both the -P and -E options are specified, the last one specified takes precedence.

2.19.2 -P (CC, cc, c99)

When the -P option is specified on the command line, it executes only the preprocessor phase of the compiler for each source file specified. The preprocessed output for each source file is written to a file whose name corresponds to the name of the source file, with the .i suffix substituted for the suffix of the source file. The -P option is similar to the -E option, except that #line directives are suppressed, and the preprocessed source does not go to stdout. This option takes precedence over -h feonly, -S, and -c.

When both the -P and -E options are specified, the last one specified takes precedence.

2.19.3 -h feonly (CC, cc, c99)

The -h feonly option limits the Cray C and C++ compilers to syntax checking. The optimizer and code generator are not executed. This option takes precedence over -S and -c.

2.19.4 -S (CC, cc, c99)

The -S option compiles the named C or C++ source files and leaves their assembly language output in the corresponding files suffixed with .s. If this option is used with -G or -g, debugging information is not generated. This option takes precedence over -c.

2.19.5 -c (CC, cc, c89, c99)

The -c option creates a relocatable object file for each named source file but does not link the object files.
The relocatable object file name corresponds to the name of the source file. The .o suffix is substituted for the suffix of the source file.

2.19.6 -#, -##, and -### (CC, cc, c99, cpp)

The -# option produces output indicating each phase of the compilation as it is executed. Each succeeding output line overwrites the previous line.

The -## option produces output indicating each phase of the compilation as it is executed.

The -### option is the same as -##, except the compilation phases are not executed.

2.19.7 -Wphase,"opt..." (CC, cc, c99)

The -Wphase option passes arguments directly to a phase of the compiling system. Table 9 shows the system phases that phase can indicate.

Table 9. -Wphase Definitions

phase      System phase   Command
p          Preprocessor   cpp
0 (zero)   Compiler       CC, cc, and c99
a          Assembler      as(1)
l          Loader         ld

Arguments to be passed to system phases can be entered in either of two styles. If spaces appear within a string to be passed, the string is enclosed in double quotes. When double quotes are not used, spaces cannot appear in the string. Commas can appear wherever spaces normally appear; an option and its argument can be either separated by a comma or not separated. If a comma is part of an argument, it must be preceded by the \ character. For example, any of the following command lines would send -e name and -s to the loader:

    % cc -Wl,"-e name -s" file.c
    % cc -Wl,-e,name,-s file.c
    % cc -Wl,"-ename",-s file.c

Because the preprocessor is built into the compiler, -Wp and -W0 are equivalent.

2.19.8 -Yphase,dirname (CC, cc, c89, c99, cpp)

The -Yphase,dirname option specifies a new directory (dirname) from which the designated phase should be executed. phase can be one or more of the values shown in Table 10.

Table 10.
-Yphase Definitions

phase      System phase   Command
p          Preprocessor   cpp
0 (zero)   Compiler       CC, cc, c89, c99, cpp
a          Assembler      as
l          Loader         ld

Because there is no separate preprocessor, -Yp and -Y0 are equivalent. If you are using the -Y option on the cpp command line, p is the only argument for phase that is allowed.

2.20 Preprocessing Options

This section describes compiler options that affect preprocessing. Each subsection heading shows in parentheses the compiler command with which the option can be used.

2.20.1 -C (CC, cc, c99, cpp)

The -C option retains all comments in the preprocessed source code, except those on preprocessor directive lines. By default, the preprocessor phase strips comments from the source code. This option is useful with cpp or in combination with the -P or -E option on the CC, cc, and c99 commands.

2.20.2 -D macro[=def] (CC, cc, c89, c99, cpp)

The -D macro[=def] option defines a macro named macro as if it were defined by a #define directive. If no =def argument is specified, macro is defined as 1.

Predefined macros also exist; these are described in Chapter 10, page 157. Any predefined macro except those required by the standard (see Section 10.1, page 158) can be redefined by the -D option. The -U option overrides the -D option when the same macro name is specified, regardless of the order of options on the command line.

2.20.3 -h [no]pragma=name[:name...] (CC, cc, c99)

Default: -h pragma

The -h [no]pragma=name[:name...] option enables or disables the processing of specified directives in the source code. name can be the name of a directive or a word shown in Table 11 to specify a group of directives. More than one name can be specified. Multiple names must be separated by a colon and have no intervening spaces.

Table 11.
-h pragma Directive Processing

name        Group                 Directives affected
all         All                   All directives
allinline   Inlining              inline_enable, inline_disable, inline_reset, inline_always, inline_never
allscalar   Scalar optimization   concurrent, nointerchange, noreduction, suppress, unroll/nounroll
allvector   Vectorization         ivdep, novector, loop_info, hand_tuned, nopattern, novsearch, permutation, pipeline/nopipeline, prefervector, safe_address, safe_conditional, shortloop, shortloop128

When using this option to enable or disable individual directives, note that some directives must occur in pairs. For these directives, you must disable both directives if you want to disable either; otherwise, the disabling of one of the directives may cause errors when the other directive is (or is not) present in the compilation unit.

2.20.4 -I incldir (CC, cc, c89, c99, cpp)

The -I incldir option specifies a directory for files named in #include directives when the #include file names do not have a specified path. Each directory specified must be specified by a separate -I option.

The order in which directories are searched for files named on #include directives is determined by enclosing the file name in either quotation marks ("") or angle brackets (< and >).

Directories for #include "file" are searched in the following order:

1. Directory of the input file.
2. Directories named in -I options, in command line order.
3. Site-specific and compiler release-specific include files directories.
4. Directory /usr/include.

Directories for #include <file> are searched in the following order:

1. Directories named in -I options, in command line order.
2. Site-specific and compiler release-specific include files directories.
3. Directory /usr/include.
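The two search orders listed above differ only in whether the directory of the input file is tried first. The following sketch expresses that rule as code; it is an illustration of the documented order, not part of any Cray tool. The function name include_search_order and the placeholder string standing in for the site-specific and release-specific directories are hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: fills out[] with the directories searched for an
   #include directive, in the documented order, and returns how many
   entries were written. quoted is nonzero for #include "file" and zero
   for #include <file>. */
size_t include_search_order(int quoted, const char *input_dir,
                            const char *const *i_dirs, size_t n_i_dirs,
                            const char **out, size_t out_cap)
{
    size_t n = 0;

    /* Only the quoted form starts with the directory of the input file. */
    if (quoted && n < out_cap)
        out[n++] = input_dir;

    /* Directories named in -I options, in command line order. */
    for (size_t k = 0; k < n_i_dirs && n < out_cap; k++)
        out[n++] = i_dirs[k];

    /* Placeholder for the site-specific and release-specific directories,
       whose actual paths are not given in this section. */
    if (n < out_cap)
        out[n++] = "(site and release include directories)";

    if (n < out_cap)
        out[n++] = "/usr/include";

    return n;
}
```

Called with quoted set, an input directory, and two -I directories, the helper yields five entries beginning with the input directory; with quoted clear it yields four entries beginning with the first -I directory.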
If the -I option specifies a directory name that does not begin with a slash (/), the directory is interpreted as relative to the current working directory and not relative to the directory of the input file (if different from the current working directory). For example:

    % cc -I. -I yourdir mydir/b.c

The preceding command line produces the following search order:

1. mydir (#include "file" only).
2. Current working directory, specified by -I.
3. yourdir (relative to the current working directory), specified by -I yourdir.
4. Site-specific and compiler release-specific include files directories.
5. Directory /usr/include.

2.20.5 -M (CC, cc, c99, cpp)

The -M option provides information about the recompilation dependencies that the source file has on #include files and other source files. This information is printed in the form expected by make. Such dependencies are introduced by the #include directive. The output is directed to stdout.

2.20.6 -N (cpp)

The -N option specified on the cpp command line enables the old style (referred to as K & R) preprocessing. If you have problems with preprocessing (especially non-C source code), use this option.

2.20.7 -nostdinc (CC, cc, c89, c99, cpp)

The -nostdinc option stops the preprocessor from searching for include files in the standard directories (/usr/include/CC and /usr/include).

2.20.8 -U macro (CC, cc, c89, c99, cpp)

The -U option removes any initial definition of macro. Any predefined macro except those required by the standard (see Section 10.1, page 158) can be undefined by the -U option. The -U option overrides the -D option when the same macro name is specified, regardless of the order of options on the command line.

Predefined macros are described in Chapter 10, page 157. Macros defined in the system headers are not predefined macros and are not affected by the -U option.

2.21 Loader Options

This section describes compiler options that affect loader tasks.
Each subsection heading shows in parentheses the compiler command with which the option can be used.

2.21.1 -l libfile (CC, cc, c89, c99)

The -l libfile option identifies a library file. To request more than one library file, specify multiple -l options.

The loader searches for libraries by prepending ldir/lib on the front of libfile and appending .a on the end of it, for each ldir that has been specified by using the -L option. It uses the first file it finds. See also the -L option (Section 2.21.2, page 56).

There is no search order dependency for libraries. Default libraries are shown in the following list:

    libC.a (Cray C++ only)
    libu.a
    libm.a
    libc.a
    libsma.a
    libf.a
    libfi.a
    libsci.a

If you specify personal libraries by using the -l command line option, as in the following example, those libraries are added to the top of the preceding list. (The -l option is passed to the loader.)

    cc -l mylib target.c

When the previous command line is issued, the loader looks for a library named libmylib.a (following the naming convention) and adds it to the top of the list of default libraries.

2.21.2 -L libdir (CC, cc, c89, c99)

The -L libdir option changes the -l option search algorithm to look for library files in directory libdir. To request more than one library directory, specify multiple -L options.

The loader searches for library files in the compiler release-specific directories.

Note: Multiple -L options are treated cumulatively as if all libdir arguments appeared on one -L option preceding all -l options. Therefore, do not attempt to load functions of the same name from different libraries through the use of alternating -L and -l options.

2.21.3 -o outfile (CC, cc, c89, c99)

The -o outfile option produces an absolute binary file named outfile. A file named a.out is produced by default.
When this option is used in conjunction with the -c option and a single C or C++ source file, a relocatable object file named outfile is produced.

2.22 Miscellaneous Options

This section describes compiler options that affect general tasks. Each subsection heading shows in parentheses the compiler command with which the option can be used.

2.22.1 -h command (cc, c99)

The -h command option allows you to create a command-mode executable. Command-mode executables run serially; they cannot be executed across multiple PEs.

If you want to disable vectorization, add the -h vector0 option to the compiler command line. The compiled commands will have less debugging information, unless you specify a debugging option. The debugging information does not slow execution time, but it does result in a larger executable that may take longer to load.

On Cray X1 series systems, command-mode executables run serially on a single-streaming processor (SSP) within a support node. They are launched without assistance from aprun. Commands created with this option cannot multistream.

On Cray X2 systems, command-mode executables run serially on compute nodes without being launched by aprun. However, because users do not log in to Cray X2 compute nodes directly, command-mode executables are of limited use. A command-mode executable can be executed by the system(3) library function called from within a program running on a compute node, or from a script launched by aprun to run on a compute node.

On Cray X2 systems, use the cc command, not the ld command, for linking.

On Cray X1 series systems, you should use the cc command for linking, because the required options and libraries are automatically specified and loaded for you.
If you decide to load the libraries manually, you must use the Cray X1 series systems loader command (ld) and specify on its command line the -command and -ssp options and the -L option with the path to the command mode libraries. The command mode libraries are found in the cmdlibs directory under the path defined by the CRAYLIBS_SV2 environment variable. These must also be linked:

• Start0.o
• libc library
• libm library
• libu library

The following sample command line illustrates compiling the code for a command named fierce:

    % cc -h command -h vector0 -o fierce fierce.c

Note: For Cray X1 series systems, the -h ssp and -h command options both create executables that run on an SSP. The executable created via the -h ssp option runs on an application node. The executable created via the -h command option runs on the support node.

2.22.2 -h cpu=target_system (CC, cc, c99)

The -h cpu=target_system option specifies the system on which the absolute binary file is to be executed.

Default: -h cpu=cray-x1

Use one of these values for target_system:

target_system   Description
cray-x1         Use this option (default) if the absolute binary file will be executed on a Cray X1 system
cray-x1e        Use this option if the absolute binary file will be executed on a Cray X1E system
cray-x2         Use this option if the absolute binary file will be executed on a Cray X2 system

Note: Currently, there are no differences in the code produced for the cray-x1 and cray-x1e targets. This option was created to allow us to support future changes in optimization or code generation based on our experience with the Cray X1E hardware. It is possible that compilations with the -h cpu=cray-x1e option will not be compatible with Cray X1 machines in the future.

2.22.3 -h decomp (CC, cc, c99)

The -h decomp option decompiles (translates) the intermediate representation of the compiler into listings that resemble the format of the source code.
This is performed twice, at different points during the optimization process, resulting in two output files. You can use these files to examine the restructuring and optimization changes made by the compiler, which can lead to insights about changes you can make to your C or C++ source to improve its performance. For more information about optimization, see the Optimizing Applications on Cray X1 Series Systems manual or the Optimizing Applications on Cray X2 Systems manual.

This action can also be invoked by specifying the -h list=d option (see Section 2.10.7, page 28).

The compiler produces two decompilation listing files, with these extensions, per source file specified on the command line: .opt and .cg. The compiler generates the .opt file after applying most high level loop nest transformations to the code. The code structure of this listing most resembles your source code and is readable by most users. In some cases, because of optimizations, the structure of the loops and conditionals will be significantly different than the structure in your source file.

The .cg file contains a much lower level of decompilation. It is still displayed in a C or C++ like format, but is quite close to what will be produced as assembly output. This version displays the intermediate text after all multistreaming translation (Cray X1 series systems only), vector translation, and other optimizations have been performed. An intimate knowledge of the hardware architecture of the system is helpful to understanding this listing.

The .opt and .cg files are intended as a tool for performance analysis and are not valid C or C++ functions. The format and contents of the files can be expected to change from release to release.

The following example shows the listings generated when the -h decomp option is applied.
Source code of example.c:

    void example( double a[restrict], double b[restrict], double c[restrict] )
    {
      int i;
      for ( i = 0; i < 100; i++ ) {
        a[i] = b[i] * c[i];
      }
    }

The following listing is of the example.opt file after loop optimizations are performed:

    2. void
    2. example( a, b, c )
    2. {
    6.    $Induc01_N4 = 0;
    6. #pragma ivdep
    6.    do {
    7.       a[$Induc01_N4] = b[$Induc01_N4] * c[$Induc01_N4];
    6.       $Induc01_N4++;
    6.    } while ( $Induc01_N4 < 100 );
    9.    return;
    9. }

The following listing is of the example.cg file after other optimizations are performed:

    2. void
    2. example( a, b, c )
    2. {
    7.    $SC_c_I0 = c;
    7.    $SC_b_I1 = b;
    7.    $SC_a_I2 = a;
    6.    /* === Begin Short Vector Loop === */;
    7.    0[(long) $SC_a_I2:100:1].L = 0[(long) $SC_b_I1:100:1].L *
    7.       0[(long) $SC_c_I0:100:1].L;
    6.    /* === End Short Vector Loop === */;
    9.    return;
    9. }

2.22.4 -h ident=name (CC, cc, c99)

Default: File name specified on the command line

The -h ident=name option changes the ident name to name. This name is used as the module name in the object file (.o suffix) and assembler file (.s suffix). Regardless of whether the ident name is specified or the default name is used, the following transformations are performed on the ident name:

• All . characters in the ident name are changed to $.
• If the ident name starts with a number, a $ is added to the beginning of the ident name.

2.22.5 -h keepfiles (CC, cc, c89, c99)

The -h keepfiles option prevents the removal of the object (.o) files after an executable is created. Normally, the compiler automatically removes these files after linking them to create an executable. Since the original object files are required to instrument a program for performance analysis, if you plan to use CrayPat to conduct performance analysis experiments, you can use this option to preserve the object files.
2.22.6 -h [no]mpmd (CC, cc)

Default: -h nompmd

Used when compiling co-array Fortran (CAF) programs for multiple program, multiple data (MPMD) launch. For details, see Section 11.3, page 164.

2.22.7 -h [no]omp (CC, cc)

Default: -h noomp

Enables or disables the C or C++ compiler recognition of OpenMP directives. For details, see Chapter 5, page 129.

2.22.8 -h prototype_intrinsics (CC, cc, c99, cpp)

Simulates the effect of including intrinsics.h at the beginning of a compilation. Use this option if the source code does not include the intrinsics.h statement and you cannot modify the code. This option is off by default. For details, see Appendix F, page 229.

2.22.9 -h taskn (CC, cc)

This option enables tasking in C or C++ applications that contain OpenMP directives.

Default: -h task0

    n   Description
    0   Disables tasking. OpenMP directives are ignored. Using this option can reduce compile time and the size of the executable. The -h task0 option is compatible with all vectorization and scalar optimization levels.
    1   The -h task1 option specifies user tasking, so OpenMP directives are recognized. No level for scalar optimization is enabled automatically. The -h task1 option is compatible with all vectorization and scalar optimization levels.

2.22.10 -h [no]threadsafe (CC)

Default: -h threadsafe

This option enables or disables the generation of threadsafe code. Code that is threadsafe can be used with pthreads and OpenMP. This option is not binary-compatible with code generated by Cray C++ 5.1 and earlier compilers. Users who need binary compatibility with previously compiled code can use -h nothreadsafe, which causes the compiler to be compatible with Cray C++ 5.1 and earlier compilers at the expense of not being threadsafe.

C++ code compiled with -h threadsafe (the default) cannot be linked with C++ code compiled with -h nothreadsafe or with code compiled with a Cray C++ 5.1 or earlier compiler.
2.22.11 -h upc (cc)

The -h upc option enables compilation of Unified Parallel C (UPC) code. UPC is a C language extension for parallel program development that allows you to explicitly specify parallel programming through language syntax rather than through library functions such as are used in MPI or SHMEM.

The Cray implementation of UPC is discussed in Chapter 6, page 135.

2.22.12 -V (CC, cc, c99, cpp)

The -V option displays compiler version information. If the command line specifies no source file, no compilation occurs. Version information consists of the product name, the version number, and the current date and time, as shown in the following example:

    % CC -V
    Cray C++ : Version 6.0.0.0.45 Thu Aug 09, 2007 15:21:45

2.22.13 -X npes (CC, cc, c99)

The -X npes option specifies the number of processing elements to use during execution. The value for npes ranges from 1 through 4096, inclusive.

Once set, the number of processing elements to use cannot be changed at load or run time. You must recompile the program with a different value for npes to change the number of processing elements.

If you use the ld command to manually load a program compiled with the -X option, you must specify the same value to the loader as was specified at compile time.

You can execute the compiled program without using the aprun command just by entering the name of the output file. If you use the aprun command and specify the number of processing elements on the aprun command line, you must specify the same number on the aprun command as was specified at compile time.

The _num_pes intrinsic function can be used in Cray X1 series and Cray X2 systems. The value returned by _num_pes is equal to the number of processing elements available to your program. The number of the first processing element is always 0, and the number of the last processing element is _num_pes() - 1.
When the -X npes option is specified at compile time, the _num_pes intrinsic function returns the value specified by the npes argument. The _num_pes intrinsic can be used only in either of the following situations:

• When the -X npes option is specified on the command line
• When the value of the expression containing the _num_pes intrinsic function is not known until run time (that is, it can only be used in run time expressions)

One of the many uses for the _num_pes intrinsic is illustrated in the following example, which declares a variable length array of size equal to the number of processing elements:

    int a[_num_pes()];

Using the _num_pes intrinsic in conjunction with the -X npes option allows the user to program the number of processing elements into code in places that do not accept run time values. Specifying the number of processing elements at compile time can also enhance compiler optimization.

2.23 Command Line Examples

The following examples illustrate a variety of command lines for the C and C++ compiler commands:

Example 1: CC -X8 -h instantiate=all myprog.C

This example compiles myprog.C, fixes the number of processing elements to 8, and instantiates all template entities declared or referenced in the compilation unit.

    % CC -X8 -h instantiate=all myprog.C

Example 2: CC -h conform -h noautoinstantiate myprog.C

This example compiles myprog.C. The -h conform option specifies strict conformance to the ISO C++ standard. No automatic instantiation of templates is performed.

    % CC -h conform -h noautoinstantiate myprog.C

Example 3: CC -c -h ipa1 myprog.C subprog.C

This example compiles input files myprog.C and subprog.C. The -c option tells the compiler to create object files myprog.o and subprog.o but not call the loader. Option -h ipa1 tells the compiler to inline function calls marked with the inline_always pragma.

    % CC -c -h ipa1 myprog.C subprog.C

Example 4: CC -I. disc.C vend.C

This example specifies that the compiler search the current working directory, represented by a period (.), for #include files before searching the default #include file locations.

    % CC -I. disc.C vend.C

Example 5: cc -P -D DEBUG newprog.c

This example specifies that source file newprog.c be preprocessed only. Compilation and linking are suppressed. In addition, the macro DEBUG is defined.

    % cc -P -D DEBUG newprog.c

Example 6: CC -c -h report=s mydata1.C

This example compiles mydata1.C, creates object file mydata1.o, and produces a scalar optimization report to stdout.

    % CC -c -h report=s mydata1.C

Example 7: cc -h listing mydata3.c

This example compiles mydata3.c and produces the executable file a.out. A 132-column pseudo assembly listing file is also produced in file mydata3.L.

    % cc -h listing mydata3.c

Example 8: CC -h ipa5,report=if myfile.C

This example compiles myfile.C and tells the compiler to attempt to aggressively inline calls to functions defined within myfile.C. An inlining report is directed to myfile.V.

    % CC -h ipa5,report=if myfile.C

2.24 Compile Time Environment Variables

The following environment variables are used during compilation.

Variable    Description

CRAYOLDCPPLIB

When set to a nonzero value, enables C++ code to use the following nonstandard Cray C++ header files:

• common.h
• complex.h
• fstream.h
• generic.h
• iomanip.h
• iostream.h
• stdiostream.h
• stream.h
• strstream.h
• vector.h

If you want to use the standard header files, your code may require modification to compile successfully. For more information, see Appendix C, page 199.

Note: Setting the CRAYOLDCPPLIB environment variable disables exception handling, unless you compile with the -h exceptions option.

CRAY_PE_TARGET

Specifies the target system to be applied to all compilations. Supported values are cray-x1, cray-x1e, and cray-x2. Use the -hcpu=target-system option to override this variable for individual compilations.
CRI_CC_OPTIONS
CRI_cc_OPTIONS
CRI_c89_OPTIONS
CRI_cpp_OPTIONS

Specifies command line options that are applied to all compilations. Options specified by this environment variable are added following the options specified directly on the command line. This is especially useful for adding options to compilations done with build tools.

LANG

Identifies your requirements for native language, local customs, and coded character set with regard to compiler messages.

MSG_FORMAT

Controls the format in which you receive compiler messages.

NLSPATH

Specifies the message system catalogs that should be used.

NPROC

Specifies the number of processes used for simultaneous compilations on Cray X1 series and Cray X2 systems. The default is 1. When more than one source file is specified on the command line, compilations may be multiprocessed by setting the environment variable NPROC to a value greater than 1. You can set NPROC to any value; however, large values can overload the system.

2.25 Run Time Environment Variables

The following environment variables are used during run time.

Variable    Description

CRAY_AUTO_APRUN_OPTIONS

(Not supported on Cray X2 systems) The CRAY_AUTO_APRUN_OPTIONS environment variable specifies options for the aprun command when the command is called automatically (auto aprun). The aprun command is called automatically when only the name of the program and, where applicable, associated program options are entered on the command line; the system then calls aprun to run the program.

The CRAY_AUTO_APRUN_OPTIONS environment variable does not specify options for the aprun command when you explicitly specify the command on the command line, nor does it specify options for your program.

When setting options for the aprun command in the CRAY_AUTO_APRUN_OPTIONS environment variable, surround the options within double quotes and separate each option with a space.
Do not include a space between an option and its associated value. For example:

    setenv CRAY_AUTO_APRUN_OPTIONS "-n10 -m16G"

If you execute a program compiled with a fixed number of processing elements (that is, the -X compiler option was specified at compile time) and the CRAY_AUTO_APRUN_OPTIONS also specifies the -n option, you must ensure that the values used for both options are the same. To do otherwise is an error.

X1_DYNAMIC_COMMON_SIZE

The X1_DYNAMIC_COMMON_SIZE environment variable sets the size of the dynamic COMMON block defined by the loader. For more information about dynamic COMMON blocks, see the -LD_LAYOUT:dynamic= option in the ld(1) man page and the Optimizing Applications on Cray X1 Series Systems manual.

X1_COMMON_STACK_SIZE
X1_PRIVATE_STACK_SIZE
X1_STACK_SIZE
X1_LOCAL_HEAP_SIZE
X1_SYMMETRIC_HEAP_SIZE
X1_HEAP_SIZE
X1_PRIVATE_STACK_GAP

The following environment variables allow you to change the default size of the application stacks or heaps, or consolidate the private stacks:

• X1_COMMON_STACK_SIZE changes the common stack size to the specified value.
• X1_PRIVATE_STACK_SIZE changes the private stack size to the specified value.
• X1_STACK_SIZE sets the size of the common and private stack to the specified value.
• X1_LOCAL_HEAP_SIZE changes the local heap size to the specified value.
• X1_SYMMETRIC_HEAP_SIZE changes the symmetric heap size to the specified value.
• X1_HEAP_SIZE changes the local and symmetric heap size to the specified value.
• X1_PRIVATE_STACK_GAP, when used with X1_PRIVATE_STACK_SIZE, consolidates the four private stacks within an MSP into one segment, which frees up nontext pages for application use. The specified value, in bytes, indicates the gap to separate each stack. This gap serves as a guard region in case any of the stacks overflow.

The default size of each application stack or heap is 1 GB.
The X1_STACK_SIZE and X1_HEAP_SIZE are termed general in that they set the values for multiple stacks or heaps, respectively. The other variables in this section are termed specific because they set the value for a particular stack or heap. A specific variable overrides a general variable if both are specified as follows:

• The X1_COMMON_STACK_SIZE variable overrides the X1_STACK_SIZE variable if both are specified.
• The X1_PRIVATE_STACK_SIZE variable overrides the X1_STACK_SIZE if both are specified.
• The X1_LOCAL_HEAP_SIZE variable overrides the X1_HEAP_SIZE variable if both are specified.
• The X1_SYMMETRIC_HEAP_SIZE overrides the X1_HEAP_SIZE variable if both are specified.

The value you specify for a variable sets the size of a stack or heap in bytes. This number can be expressed as a decimal number, an octal number with a leading zero, or a hexadecimal number with a leading "0x". If you specify a number smaller than the page size you gave to the aprun or mpirun command, the system silently enforces a single-page minimum size. If you do not use the aprun command or do not specify a page size for aprun, the minimum page size is set to 64 KB. For more information about page sizes, see the -p text:other option on the aprun(1) man page.

Using the X1_PRIVATE_STACK_GAP and X1_PRIVATE_STACK_SIZE together to consolidate the private stacks may help applications that have problems obtaining a sufficient number of large nontext pages via the aprun or mpirun commands. When the private stacks are consolidated, the pages that would have been used by the other private stacks are freed so they can be used by the application.

On Cray X1 series systems, each MSP used by an application uses four private stacks where each private stack occupies an integral number of pages, but if the application actually needs a private stack that is much smaller than the integral number of pages, space is wasted.
In some of these cases, consolidating all four private stacks into one segment will free up the wasted space so it can be used by the application.

For example, an application uses 256 MB pages, which means the size of each private stack is a multiple of 256 MB. If the application only needs 60 MB for each private stack, we can consolidate all four private stacks into a 256 MB page by setting X1_PRIVATE_STACK_SIZE to 0x3c00000 (60 MB) and X1_PRIVATE_STACK_GAP to 0x400000 (4 MB). This packs the four private stacks into one 256 MB page with a 4 MB guard region between the stacks. This saves three 256 MB physical pages on each MSP.

Warning: There is no protection against overflowing the private stacks; one private stack may corrupt another with unpredictable results if stack overflow occurs.

CRAYNV_STACK_SIZE

For Cray X2 systems, you can use this environment variable to change the initial stack size from its normal default of 8 MB. The purpose of this variable is to enable you to slightly reduce the cost of stack memory allocation by doing it all at once at the beginning of the program rather than piecemeal as the stack grows during execution. The stack limit is inherited from the limit in effect on the support node at the time of the application launch; limits less than 1 GB are not enforced. During execution, if your program attempts to exceed the stack size limit, the message stack overflow is printed and a segmentation fault occurs.

2.26 OpenMP Environment Variables

This section describes the OpenMP C and C++ API environment variables that control the execution of parallel code. The names of the environment variables must be in uppercase. The values assigned to them are case insensitive and may have leading and trailing white space. Modifications to the values after the program has started are ignored.

The environment variables are as follows:

• OMP_SCHEDULE sets the run time schedule type and chunk size.
• OMP_NUM_THREADS sets the number of threads to use during execution.
• OMP_DYNAMIC enables or disables dynamic adjustment of the number of threads.
• OMP_NESTED enables or disables nested parallelism.
• OMP_THREAD_STACK_SIZE is a Cray specific, nonstandard variable used to change the size of the thread stack from the default size of 16 MB to the specified size.

The examples in this section demonstrate how these variables might be set in UNIX C shell (csh) environments:

    setenv OMP_SCHEDULE "dynamic"

In Korn shell environments, the actions are similar, as follows:

    export OMP_SCHEDULE="dynamic"

2.26.1 OMP_SCHEDULE

OMP_SCHEDULE applies only to for and parallel for directives that have the schedule type runtime. The schedule type and chunk size for all such loops can be set at run time by setting this environment variable to any of the recognized schedule types and to an optional chunk_size.

For for and parallel for directives that have a schedule type other than runtime, OMP_SCHEDULE is ignored. The default value for this environment variable is implementation-defined. If the optional chunk_size is set, the value must be positive. If chunk_size is not set, a value of 1 is assumed, except in the case of a static schedule. For a static schedule, the default chunk size is set to the loop iteration space divided by the number of threads applied to the loop.

Example:

    setenv OMP_SCHEDULE "guided,4"
    setenv OMP_SCHEDULE "dynamic"

2.26.2 OMP_NUM_THREADS

The OMP_NUM_THREADS environment variable sets the default number of threads to use during execution, unless that number is explicitly changed by calling the omp_set_num_threads library routine (see the omp_threads(3) man page) or by an explicit num_threads clause on a parallel directive.

The value of the OMP_NUM_THREADS environment variable must be a positive integer. Its effect depends upon whether dynamic adjustment of the number of threads is enabled.
For information about the interaction between the OMP_NUM_THREADS environment variable and dynamic adjustment of threads, see Section 5.2, page 130.

If no value is specified for the OMP_NUM_THREADS environment variable, or if the value specified is not a positive integer, or if the value is greater than the maximum number of threads the system can support, the number of threads to use is implementation-defined.

Example:

    setenv OMP_NUM_THREADS 16

2.26.3 OMP_DYNAMIC

The OMP_DYNAMIC environment variable enables or disables dynamic adjustment of the number of threads available for execution of parallel regions unless dynamic adjustment is explicitly enabled or disabled by calling the omp_set_dynamic library routine (see the omp_threads(3) man page). Its value must be TRUE or FALSE. The default condition is FALSE.

If set to TRUE, the number of threads that are used for executing parallel regions may be adjusted by the run time environment to best utilize system resources. If set to FALSE, dynamic adjustment is disabled.

Example:

    setenv OMP_DYNAMIC TRUE

2.26.4 OMP_NESTED

The OMP_NESTED environment variable enables or disables nested parallelism unless nested parallelism is enabled or disabled by calling the omp_set_nested library routine (see the omp_nested(3) man page). If set to TRUE, nested parallelism is enabled; if it is set to FALSE, nested parallelism is disabled. The default value is FALSE.

Example:

    setenv OMP_NESTED TRUE

2.26.5 OMP_THREAD_STACK_SIZE

The OMP_THREAD_STACK_SIZE environment variable changes the size of the thread stack from the default size of 16 MB to the specified size. The size of the thread stack should be increased when thread-private variables may utilize more than 16 MB of memory.

The requested thread stack space is allocated from the local heap when the threads are created.
On Cray X1 series systems, the amount of space used by each thread for thread stacks depends on whether you are using MSP or SSP mode. In MSP mode, the memory used is five times the specified thread stack size because each SSP is assigned one thread stack and one thread stack is used as the MSP common stack. For SSP mode, the memory used equals the specified thread stack size.

On Cray X2 systems, the memory used equals the specified thread stack size.

This is the format for the OMP_THREAD_STACK_SIZE environment variable:

    OMP_THREAD_STACK_SIZE n

where n is a decimal number, an octal number with a leading zero, or a hexadecimal number with a leading "0x" specifying the amount of memory, in bytes, to allocate for a thread's stack.

For more information about memory on the Cray X1 series and Cray X2 systems, see the memory(7) man page.

Example:

    setenv OMP_THREAD_STACK_SIZE 18000000

#pragma Directives [3]

#pragma directives are used within the source program to request certain kinds of special processing. #pragma directives are part of the C and C++ languages, but the meaning of any #pragma directive is defined by the implementation. #pragma directives are expressed in the following form:

    #pragma [ _CRI] identifier [arguments]

The _CRI specification is optional and ensures that the compiler will issue a message concerning any directives that it does not recognize. Diagnostics are not generated for directives that do not contain the _CRI specification.

These directives are classified according to the following types:

• General (Section 3.5, page 78)
• Instantiation (Cray C++ only) (Section 3.6, page 90)
• Vectorization (Section 3.7, page 90)
• Multistreaming (Section 3.8, page 103) (Cray X1 series systems only)
• Scalar (Section 3.9, page 107)
• Inlining (Section 3.10, page 111)

Macro expansion occurs on the directive line after the directive name. That is, macro expansion is applied only to arguments.
Note: OpenMP #pragma directives are described in Chapter 5, page 129.

At the beginning of each section that describes a directive, information is included about the compilers that allow the use of the directive and the scope of the directive. Unless otherwise noted, the following default information applies to each directive:

Compiler: Cray C and Cray C++
Scope: Local and global

The scoping list may also indicate that a directive has a lexical block scope. A lexical block is the scope within which a directive is on or off and is bounded by the opening curly brace just before the directive was declared and the corresponding closing curly brace. Only applicable executable statements within the lexical block are affected as indicated by the directive. The lexical block does not include the statements contained within a procedure that is called from the lexical block.

This example code fragment shows the lexical block for the upc strict and upc relaxed directives:

    void Example(void)
    {
      #pragma _CRI upc strict
      // UPC strict state is on
      ...
      {
        ...
        // UPC strict state is still on
        #pragma _CRI upc relaxed
        // UPC strict state is now off
        ...
      }
      // UPC strict state is back on
      ...
    }

3.1 Protecting Directives

To ensure that your directives are interpreted only by the Cray C and C++ compilers, use the following coding technique in which directive is the name of the directive:

    #if _CRAYC
    #pragma _CRI directive
    #endif

This ensures that other compilers used to compile this code will not interpret the directive. Some compilers diagnose any directives that they do not recognize. The Cray C and C++ compilers diagnose directives that are not recognized only if the _CRI specification is used.

3.2 Directives in Cray C++

C++ prohibits referencing undeclared objects or functions. Objects and functions must be declared prior to using them in a #pragma directive. This is not always the case with C.
Some #pragma directives take function names as arguments (for example: #pragma _CRI weak, #pragma _CRI suppress, and #pragma _CRI inline_always name [,name ] ...). Member functions and qualified names are allowed for these directives.

3.3 Loop Directives

Many directives apply to groups. Unless otherwise noted, these directives must appear before a for, while, or do while loop. These directives may also appear before a label for if...goto loops. If a loop directive appears before a label that is not the top of an if...goto loop, it is ignored.

3.4 Alternative Directive form: _Pragma

Compiler directives can also be specified in the following form, which has the advantage that it can appear inside macro definitions:

    _Pragma("_CRI identifier");

This form has the same effect as using the #pragma form, except that everything that appeared on the line following the #pragma must now appear inside the double quotation marks and parentheses. The expression inside the parentheses must be a single string literal; it cannot be a macro that expands into a string literal. _Pragma is an extension to the C and C++ standards.

The following is an example using the #pragma form:

    #pragma _CRI ivdep

The following is the same example using the alternative form:

    _Pragma("_CRI ivdep");

In the following example, the loop automatically vectorizes wherever the macro is used:

    #define SEARCH(A, B, KEY, SIZE, RES) \
    { \
      int i; \
      _Pragma("_CRI ivdep"); \
      for (i = 0; i < (SIZE); i++) \
        if ( (A)[ (B)[i] ] == (KEY)) break; \
      (RES)=i; \
    }

Macros are expanded in the string literal argument for _Pragma in an identical fashion to the general specification of a #pragma directive.

3.5 General Directives

General directives specify compiler actions that are specific to the directive and have no similarities to the other types of directives. The following sections describe general directives.
3.5.1 [no]bounds Directive (Cray C Compiler)

The bounds directive specifies that pointer and array references are to be checked. The nobounds directive specifies that this checking is to be disabled.

When bounds checking is in effect, pointer references are checked to ensure that they are neither 0 nor greater than the machine memory limit. Array references are checked to ensure that the array subscript is not less than 0 or greater than or equal to the declared size of the array. Both directives may be used only within function bodies. They apply until the end of the function body or until another bounds/nobounds directive appears. They ignore block boundaries.

These directives have the following format:

    #pragma _CRI bounds
    #pragma _CRI nobounds

The following example illustrates the use of the bounds directive:

    int a[30];
    #pragma _CRI bounds
    void f(void)
    {
      int x;
      x = a[30];
      . . .
    }

3.5.2 duplicate Directive (Cray C Compiler)

Scope: Global

The duplicate directive lets you provide additional, externally visible names for specified functions. You can specify duplicate names for functions by using a directive with one of the following forms:

    #pragma _CRI duplicate actual as dupname...
    #pragma _CRI duplicate actual as (dupname...)

The actual argument is the name of the actual function to which duplicate names will be assigned. The dupname list contains the duplicate names that will be assigned to the actual function. The dupname list may be optionally parenthesized. The word as must appear as shown between the actual argument and the comma-separated list of dupname arguments.

The duplicate directive can appear anywhere in the source file and it must appear in global scope. The actual name specified on the directive line must be defined somewhere in the source as an externally accessible function; the actual function cannot have a static storage class.
The following example illustrates the use of the duplicate directive:

    #include
    extern void maxhits(void);
    #pragma _CRI duplicate maxhits as count, quantity /* OK */

    void maxhits(void)
    {
    #pragma _CRI duplicate maxhits as tempcount
    /* Error: #pragma _CRI duplicate can't appear in local scope */
    }

    double _Complex minhits;
    #pragma _CRI duplicate minhits as lower_limit
    /* Error: minhits is not declared as a function */

    extern void derivspeed(void);
    #pragma _CRI duplicate derivspeed as accel
    /* Error: derivspeed is not defined */

    static void endtime(void)
    {
    }
    #pragma _CRI duplicate endtime as limit
    /* Error: endtime is defined as a static function */

Because duplicate names are simply additional names for functions and are not functions themselves, they cannot be declared or defined anywhere in the compilation unit. To avoid aliasing problems, duplicate names may not be referenced anywhere within the source file, including appearances on other directives. In other words, duplicate names may only be referenced from outside the compilation unit in which they are defined.

The following example references duplicate names:

    void converter(void)
    {
      structured(void);
    }
    #pragma _CRI duplicate converter as factor, multiplier /* OK */

    void remainder(void)
    {
    }
    #pragma _CRI duplicate remainder as factor, structured
    /* Error: factor and structured are referenced in this file */

Duplicate names can be used to provide alternate external names for functions, as shown in the following examples.
main.c:

    extern void fctn(void), FCTN(void);
    main()
    {
      fctn();
      FCTN();
    }

fctn.c:

    #include <stdio.h>
    void fctn(void)
    {
      printf("Hello world\n");
    }
    #pragma _CRI duplicate fctn as FCTN

Files main.c and fctn.c are compiled and linked using the following command line:

    % cc main.c fctn.c

When the executable file a.out is run, the program generates the following output:

    Hello world
    Hello world

3.5.3 message Directive

The message directive directs the compiler to write the message defined by text to stderr as a warning message. Unlike the error directive, the compiler continues after processing a message directive. The format of this directive is as follows:

    #pragma _CRI message "text"

The following example illustrates the use of the message compiler directive:

    #define FLAG 1
    #ifdef FLAG
    #pragma _CRI message "FLAG is Set"
    #else
    #pragma _CRI message "FLAG is NOT Set"
    #endif

3.5.4 no_cache_alloc Directive

The no_cache_alloc directive is an advisory directive that specifies objects that should not be placed into the cache. Advisory directives are directives the compiler will honor if conditions permit it to. When this directive is honored, the performance of your code may be improved because the cache is not occupied by objects that have a lower cache hit rate. Theoretically, this makes room for objects that have a higher cache hit rate.

Here are some guidelines that will help you determine when to use this directive. This directive works only on objects that are vectorized. That is, other objects with low cache hit rates can still be placed into the cache. Also, you should use this directive for objects that should not be placed into the cache.

To use the directive, you must place it only in the specification part, before any executable statement.

The format of the no_cache_alloc directive is:

    #pragma _CRI no_cache_alloc base_name [,base_name] ...
base_name    The base name of the object that should not be placed into the cache. This can be the base name of any object, such as an array or scalar structure, without member or element references like C[10]. If you specify a pointer in the list, only the references, not the pointer itself, have the no cache allocate property.

This directive may be locally overridden by use of a loop_info #pragma directive. This directive overrides automatic cache management decisions (see -h cachen).

3.5.5 cache_shared Directive

Scope: Declaration

This directive asserts that all vector loads with the specified symbols as the base are to be made using cache-shared instructions. This is an advisory directive; if the compiler honors it, vector load misses cause the cache line to be allocated in a shared state, in anticipation of a subsequent load by a different processor. For vector store operations, this directive is not meaningful and is ignored. Scalar loads and stores are also unaffected. The compiler may override the directive if it determines the directive is not beneficial. The scope of this directive is the scope of the declaration of the specified symbol.

The format of the cache_shared directive is:

    #pragma _CRI cache_shared symbol [,symbol...]

symbol    A base symbol (an array or scalar structure, but not a member reference or array element). Examples of valid cache_shared symbols are A, B, C. Expressions such as B.E or C[10] cannot be used as cache_shared symbols.

This directive may be locally overridden by use of a loop_info #pragma directive. This directive overrides automatic cache management decisions (see -h cachen).

3.5.6 cache_exclusive Directive

The cache_exclusive directive asserts that all vector loads with the specified symbols as the base are to be made using cache-exclusive instructions.
This is an advisory directive; if the compiler honors it, any vector load that misses causes the cache line to be allocated in an exclusive state, in anticipation of a subsequent store. The cache_exclusive directive is meaningful for stores in that it allows the user to override a decision made by the automatic cache management. Scalar loads and stores are unaffected.

This directive may be locally overridden by the use of a #pragma loop_info directive. This directive overrides automatic cache management decisions (see -h cachen).

Place the directive in the specification part, before any executable statement.

The format of the cache_exclusive directive is:

    #pragma _CRI cache_exclusive base_name [,base_name]

base_name    The base name of the object that should be placed into the cache. This can be the base name of any object, such as an array or scalar structure, without member or element references like C[10]. If you specify a pointer in the list, only the references, not the pointer itself, have the cache exclusive property.

3.5.7 [no]opt Directive

Scope: Global

The noopt directive disables all automatic optimizations and causes optimization directives to be ignored in the source code that follows the directive. Disabling optimization removes various sources of potential confusion in debugging. The opt directive restores the state specified on the command line for automatic optimization and directive recognition. These directives have global scope and override related command line options.
The format of these directives is as follows:

    #pragma _CRI opt
    #pragma _CRI noopt

The following example illustrates the use of the opt and noopt compiler directives:

    #include <stdio.h>

    void sub1(void)
    {
        printf("In sub1, default optimization\n");
    }

    #pragma _CRI noopt
    void sub2(void)
    {
        printf("In sub2, optimization disabled\n");
    }

    #pragma _CRI opt
    void sub3(void)
    {
        printf("In sub3, optimization enabled\n");
    }

    int main(void)
    {
        printf("Start main\n");
        sub1();
        sub2();
        sub3();
        return 0;
    }

3.5.8 Probability Directives

The probability, probability_almost_always, and probability_almost_never directives specify information used by the IPA and optimizer to produce faster code sequences. The specified probability is a hint, rather than a statement of fact. You can also specify almost_never and almost_always by using the values 0.0 and 1.0, respectively.

These directives have the following format:

    #pragma probability
    #pragma probability_almost_always
    #pragma probability_almost_never

The probability directive is followed by an expression that evaluates to a floating point constant at compilation time, with a value from 0.0 through 1.0 inclusive.

These directives can appear anywhere executable code is legal. The directive applies to the block of code where it appears. The directive should not be applied to a conditional test directly; rather, it should be used to indicate the relative probability of a 'then' or 'else' branch being executed.

Example:

    if ( a[i] > b[i] ) {
    #pragma probability 0.3
        a[i] = b[i];
    }

This example states that the probability of entering the block of code with the assignment statement is 0.3, or 30%. This also means that a[i] is expected to be greater than b[i] 30% of the time. Note that the probability directive appears within the conditional block of code, rather than before it. This removes some of the ambiguity that has plagued other implementations that tie the directive directly to the conditional code.
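As a hedged illustration of placing probability hints on both branches of a conditional, the sketch below marks the 'then' branch as taken about 30% of the time and the 'else' branch about 70%. The function name clamp_to_limit, its parameters, and the 0.3/0.7 split are assumptions made for this example; they do not come from the manual. A compiler that does not recognize the directive ignores the pragma lines, so the function behaves the same with or without the hints.

```c
#include <stddef.h>

/* Hypothetical helper: clamps a[i] to b[i] and returns the number of
 * elements that were clamped. The 0.3 / 0.7 split is an assumption. */
size_t clamp_to_limit(double *a, const double *b, size_t n)
{
    size_t clamped = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        if (a[i] > b[i]) {
#pragma probability 0.3
            a[i] = b[i];      /* hinted as taken on about 30% of iterations */
            clamped++;
        } else {
#pragma probability 0.7
            continue;         /* hinted as taken on about 70% of iterations */
        }
    }
    return clamped;
}
```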
This information is used to guide inlining decisions, branch elimination optimizations, branch hint marking, and the choice of the optimal algorithmic approach to the vectorization of conditional code.

The following GCC-style intrinsic is also accepted when it appears in a conditional test:

    __builtin_expect( exp, c )

The following example:

    if ( __builtin_expect( a[i] > b[i], 0 ) ) {
        a[i] = b[i];
    }

is roughly equivalent to:

    if ( a[i] > b[i] ) {
    #pragma _CRI probability_almost_never
        a[i] = b[i];
    }

3.5.9 weak Directive

Scope: Global

The weak directive specifies an external identifier that may remain unresolved throughout the compilation. A weak external reference can be to a function or to a data object. A weak external does not increase the total memory requirements of your program.

Declaring an object as a weak external directs the loader to do one of these tasks:

• Link the object only if it is already linked (that is, if a strong reference exists); otherwise, leave it as an unsatisfied external. The loader does not display an unsatisfied external message if weak references are not resolved.
• If a strong reference is specified in the weak directive, resolve all weak references to it.

Note: The loader treats weak externals as unsatisfied externals, so they remain silently unresolved if no strong reference occurs during compilation. Thus, it is your responsibility to ensure that run time references to weak external names do not occur unless the loader (using some "strong" reference elsewhere) has actually loaded the entry point in question.

These are the forms of the weak directive:

    #pragma _CRI weak var
    #pragma _CRI weak sym1 = sym2

var     The name of an external
sym1    Defines an externally visible weak symbol
sym2    Defines an externally visible strong symbol defined in the current compilation.

The first form allows you to declare one or more weak references on one line.
The second form allows you to assign a strong reference to a weak reference. The weak directive must appear at global scope.

The attributes that weak externals must have depend on the form of the weak directive that you use:

• First form: weak externals must be declared, but not defined or initialized, in the source file.
• Second form: weak externals may be declared, but not defined or initialized, in the source file.
• Either form: weak externals cannot be declared with a static storage class.

The following example illustrates these restrictions:

    extern long x;
    #pragma _CRI weak x        /* x is a weak external data object */

    extern void f(void);
    #pragma _CRI weak f        /* f is a weak external function */

    extern void g(void);
    #pragma _CRI weak g=fun;   /* g is a weak external function
                                  with a strong reference to fun */

    long y = 4;
    #pragma _CRI weak y        /* ERROR - y is actually defined */

    static long z;
    #pragma _CRI weak z        /* ERROR - z is declared static */

    void fctn(void)
    {
    #pragma _CRI weak a        /* ERROR - directive must be at global scope */
    }

3.5.10 vfunction Directive

Scope: Global

The vfunction directive lists external functions that use the call-by-register calling sequence. Such functions can be vectorized but must be written in Cray Assembly Language. For more information, see the Cray Assembly Language (CAL) for Cray X1 Systems Reference Manual.

The format of this directive is as follows:

    #pragma _CRI vfunction func

The func variable specifies the name of the external function.

The following example illustrates the use of the vfunction compiler directive:

    extern double vf(double);
    #pragma _CRI vfunction vf

    void f3(int n)
    {
        int i;
        for (i = 0; i < n; i++) {   /* Vectorized */
            b[i] = vf(c[i]);
        }
    }

3.5.11 ident Directive

The ident directive directs the compiler to store the string indicated by text into the object (.o) file.
This can be used to place a source identification string into an object file.

The format of this directive is as follows:

    #pragma _CRI ident text

3.6 Instantiation Directives

The Cray C++ compiler recognizes three instantiation directives. Instantiation directives can be used to control the instantiation of specific template entities or sets of template entities. The following directives are described in detail in Section 8.5:

• #pragma _CRI instantiate
• #pragma _CRI do_not_instantiate
• #pragma _CRI can_instantiate

These directives behave as follows:

• The #pragma _CRI instantiate directive causes a specified entity to be instantiated.
• The #pragma _CRI do_not_instantiate directive suppresses the instantiation of a specified entity. It is typically used to suppress the instantiation of an entity for which a specific definition is supplied.
• The #pragma _CRI can_instantiate directive indicates that a specified entity can be instantiated in the current compilation, but need not be. It is used in conjunction with automatic instantiation to indicate potential sites for instantiation if the template entity is deemed to be required by the compiler.

For more information about template instantiation, see Chapter 8.

3.7 Vectorization Directives

Because vector operations cannot be expressed directly in Cray C and C++, the compilers must be capable of vectorization, which means transforming scalar operations into equivalent vector operations. The candidates for vectorization are operations in loops and assignments of structures. For more information, see the Optimizing Applications on Cray X1 Series Systems manual or the Optimizing Applications on Cray X2 Systems manual.

The subsections that follow describe the compiler directives used to control vectorization.

3.7.1 hand_tuned Directive

The hand_tuned directive applies to the next loop in the same manner as the concurrent and safe_address directives.
The format of this directive is:

    #pragma _CRI hand_tuned

This directive asserts that the code in the loop nest has been arranged by hand for maximum performance, and the compiler should restrict some of the more aggressive automatic expression rewrites. The compiler should still fully optimize, vectorize, and, for Cray X1 series systems, multistream the loop within the constraints of the directive.

Warning: Use of this directive may severely impede performance. Use it carefully and compare performance before and after.

3.7.2 ivdep Directive

Scope: Local

The ivdep directive tells the compiler to ignore vector dependencies for the loop immediately following the directive. Conditions other than vector dependencies can still inhibit vectorization; the loop vectorizes only if those conditions are also satisfied. This directive is useful for some loops that contain pointers and indirect addressing.

The format of this directive is as follows:

    #pragma _CRI ivdep safevl=vlen|infinitevl

vlen          Specifies a vector length in which no dependency will occur. vlen must be an integer between 1 and 1024 inclusive.
infinitevl    Specifies an infinite safe vector length. This option asserts that no data dependency will occur at any vector length.

The following example illustrates the use of the ivdep compiler directive:

    p = a;
    q = b;
    #pragma _CRI ivdep
    for (i = 0; i < n; i++) {   /* Vectorized */
        *p++ = *q++;
    }

On Cray X1 series or Cray X2 systems, the compiler by default assumes an infinite safe vector length; that is, any vector length can safely be used to vectorize the loop. You can use the -h noinfinitevl compiler option to change this behavior for all loops in the compilation unit.

Caution: Use the ivdep pragma with caution. Asserting a safe vector length that proves to be not safe can produce incorrect results.
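To make the safevl clause concrete, here is a minimal sketch of a loop with a known dependence distance. The function name shift_add, the macro DIST, and the distance of 8 are assumptions chosen for illustration. Because iteration i writes the element that iteration i + 8 later reads, any group of 8 or fewer consecutive iterations is free of dependencies, so asserting safevl=8 is consistent with the loop; a larger value would not be.

```c
#define DIST 8   /* assumed dependence distance; must match safevl below */

/* Hypothetical kernel: a must hold at least n + DIST elements.
 * Iteration i reads a[i] and writes a[i + DIST], so iterations fewer
 * than DIST apart never touch the same element of a. */
void shift_add(double *a, const double *b, int n)
{
    int i;

#pragma _CRI ivdep safevl=8
    for (i = 0; i < n; i++) {
        a[i + DIST] = a[i] + b[i];
    }
}
```

With n greater than DIST, later iterations consume values produced by earlier ones, which is why an unrestricted vector length is not safe for this loop.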
For more information, see the Optimizing Applications on Cray X1 Series Systems manual or the Optimizing Applications on Cray X2 Systems manual.

3.7.3 loop_info Directive

Scope: Local

The loop_info directive allows additional information to be specified about the behavior of a loop, including run time trip count and hints on cache allocation strategy.

In regard to trip count information, the loop_info directive is similar to the shortloop or shortloop128 directive but provides more information to the optimizer and can produce faster code sequences. loop_info is used immediately before a loop (for example, a for or while loop) to indicate minimum, maximum, or estimated trip count. The compiler will diagnose misuse at compile time (when able) or at run time when option -hdir_check is specified.

For cache allocation hints, the loop_info directive can be used to override default settings, supersede earlier cache_exclusive, no_cache_alloc, or cache_shared directives, or override automatic cache management decisions. The cache hints are local and apply only to the specified loop nest.

If your system hardware supports vector atomic memory operations (AMOs), the compiler automatically uses AMOs if doing so will improve performance. If you want the compiler to use AMOs when possible throughout a loop, specify the prefer_amo clause.

The format of this directive is:

    #pragma _CRI loop_info [min_trips(c)] [est_trips(c)] [max_trips(c)]
        [cache_ex( symbol [, symbol ...] )]
        [cache_sh( symbol [, symbol ...] )]
        [cache_na( symbol [, symbol ...] )]
        [prefer_amo] [prefer_noamo] [prefetch] [noprefetch]

c            An expression that evaluates to an integer constant at compilation time.
min_trips    Specifies guaranteed minimum number of trips.
est_trips    Specifies estimated or average number of trips.
max_trips    Specifies guaranteed maximum number of trips.
cache_ex        Specifies symbol is to receive the exclusive cache hint; this is the default if no hint is specified and the no_cache_alloc or cache_shared directives are not specified.
cache_sh        Specifies symbol is to receive the shared cache hint.
cache_na        Specifies symbol is to receive the non-allocating cache hint.
prefer_amo      Instructs the compiler to use vector atomic memory operations (AMOs). By default, the compiler automatically uses vector AMOs if doing so will improve performance.
prefer_noamo    Instructs the compiler to avoid all uses of vector atomic memory operations. By default, the compiler automatically uses vector AMOs if doing so will improve performance.
prefetch        (Cray X2 systems only) Instructs the compiler to generate prefetch instructions to preload scalar data into L1 cache. This can improve the frequency of cache hits and lower latency. If the scalar optimization level is zero (-Oscalar0), prefetch mode is always off. For -Oscalar1, prefetch mode is off, but you can turn it on via the loop_info prefetch directive. For -Oscalar2 or -Oscalar3, prefetch mode is on, but you can turn it off on a loop basis by using the loop_info noprefetch directive.
noprefetch      (Cray X2 systems only) Instructs the compiler to not generate prefetch code.
symbol          The base name of the object that is to receive the cache hint. This can be the base name of any object (such as an array or scalar structure) without member or element references like C[10]. If you specify a pointer in the list, only the references, not the pointer itself, receive the cache hint.
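The clauses described above can be combined on one directive. The following sketch, which is an illustration under stated assumptions rather than a recommended tuning, supplies an estimated trip count together with an exclusive hint for an array that is read and then stored and a shared hint for an array that is only read. The function name scale_add, the estimate of 256 trips, and the choice of hints are all hypothetical.

```c
/* Hypothetical kernel: y is loaded and then stored (exclusive hint),
 * x is only loaded (shared hint). est_trips(256) is an assumed average. */
void scale_add(double *restrict y, const double *restrict x,
               double s, int n)
{
    int i;

#pragma _CRI loop_info est_trips(256) cache_ex(y) cache_sh(x)
    for (i = 0; i < n; i++) {
        y[i] = y[i] + s * x[i];
    }
}
```

Unlike min_trips and max_trips, est_trips is described only as an estimate, so the loop above remains correct for any n; only the generated code sequence may differ.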
Example 9: Trip counts

In the following example, the minimum trip count is 1 and the maximum trip count is 1000:

    void loop_info( double *restrict a, double *restrict b, double s1, int n )
    {
        int i;
    #pragma _CRI loop_info min_trips(1) max_trips(1000), cache_na(b)
        for (i = 0; i < n; i++) {
            if ( a[i] != 0.0 ) {
                a[i] = a[i] + b[i]*s1;
            }
        }
    }

Example 10: Specifying AMOs

In the following example (test case p_amo), on architectures that support vector AMOs, the compiler does not use a vector AMO for the first loop, but uses AMOs for the second loop because the prefer_amo clause is specified. Vector AMOs are supported only on Cray X2 systems.

    void p_amo( long * restrict ia, long * restrict ib, int n )
    {
        int i;

        /* Compiler avoids vector AMOs in this case for most access patterns */
        for ( i = 0; i < n; i++ ) {
            ib[i]++;
        }

        /* Direct the compiler to use vector AMOs when possible */
    #pragma _CRI loop_info prefer_amo
        for ( i = 0; i < n; i++ ) {
            ib[i]++;
        }
    }

A message similar to the following is issued when messages are enabled:

    CC-6385 cc: VECTOR File = p_amo.c, Line = 12
      A vector atomic memory operation was used for this statement.

Example 11: Using prefer_noamo clause

In the following example (test case a_amo), the compiler uses a vector AMO for the update construct in the first loop. In the second loop, the compiler does not use vector AMOs because the prefer_noamo clause was specified.
    void a_amo( long * restrict a, long * restrict b, long * restrict c,
                long * restrict ia, long * restrict ib, int n )
    {
        int i;

        /* Compiler automatically uses a vector AMO */
        for ( i = 0; i < n; i++ ) {
            a[ia[i]] += c[i];
        }

        /* Instruct the compiler to avoid using a vector AMO */
    #pragma _CRI loop_info prefer_noamo
        for ( i = 0; i < n; i++ ) {
            b[ib[i]] += c[i];
        }
    }

The following messages may be issued for the two loop bodies:

    CC-6385 cc: VECTOR File = ac_amo.c, Line = 8
      A vector atomic memory operation was used for this statement.

    CC-6371 cc: VECTOR File = ac_amo.c, Line = 12
      A vectorized loop contains potential conflicts due to indirect
      addressing at line 13, causing less efficient code to be generated.

3.7.4 nopattern Directive

Scope: Local

The nopattern directive disables pattern matching for the loop immediately following the directive.

The format of this directive is as follows:

    #pragma _CRI nopattern

By default, the compiler detects coding patterns in source code sequences and replaces these sequences with calls to optimized library functions. In most cases, this replacement improves performance. There are cases, however, in which this substitution degrades performance. This can occur, for example, in loops with very low trip counts. In such a case, you can use the nopattern directive to disable pattern matching and cause the compiler to generate inline code.

In the following example, placing the nopattern directive in front of the outer loop of a nested loop turns off pattern matching for the matrix multiply that takes place inside the inner loop:

    double a[100][100], b[100][100], c[100][100];

    void nopat(int n)
    {
        int i, j, k;
    #pragma _CRI nopattern
        for (i = 0; i < n; ++i) {
            for (j = 0; j < n; ++j) {
                for (k = 0; k < n; ++k) {
                    c[i][j] += a[i][k] * b[k][j];
                }
            }
        }
    }

3.7.5 novector Directive

Scope: Local

The novector directive directs the compiler to not vectorize the loop that immediately follows the directive.
It overrides any other vectorization-related directives, as well as the -h vector and -h ivdep command line options.

The format of this directive is as follows:

    #pragma _CRI novector

The following example illustrates the use of the novector compiler directive:

    #pragma _CRI novector
    for (i = 0; i < h; i++) {   /* Loop not vectorized */
        a[i] = b[i] + c[i];
    }

3.7.6 novsearch Directive

Note: This directive is no longer recognized. Use the #pragma _CRI novector directive instead.

3.7.7 permutation Directive

The permutation directive specifies that an integer array has no repeated values. This directive is useful when the integer array is used as a subscript for another array (vector-valued subscript). This directive may improve code performance.

This directive has the following format:

    #pragma _CRI permutation symbol [, symbol ] ...

In a sequence of array accesses that read array element values from the specified symbols with no intervening accesses that modify the array element values, each of the accessed elements will have a distinct value.

When an array with a vector-valued subscript appears on the left side of the equal sign in a loop, many-to-one assignment is possible. Many-to-one assignment occurs if any repeated elements exist in the subscripting array. If it is known that the integer array is used merely to permute the elements of the subscripted array, it can often be determined that many-to-one assignment does not exist with that array reference.

Sometimes a vector-valued subscript is used as a means of indirect addressing because the elements of interest in an array are sparsely distributed; in this case, an integer array is used to select only the desired elements, and no repeated elements exist in the integer array, as in the following example:

    int *ipnt;
    #pragma _CRI permutation ipnt
    ...
    for ( i = 0; i < N; i++ ) {
        a[ipnt[i]] = b[i] + c[i];
    }

The permutation directive does not apply to the array a; rather, it applies to the pointer used to index into it, ipnt. By knowing that ipnt is a permutation, the compiler can safely generate an unordered scatter for the write to a.

3.7.8 [no]pipeline Directive

Software-based vector pipelining (software vector pipelining) provides additional optimization beyond the normal hardware-based vector pipelining. In software vector pipelining, the compiler analyzes all vector loops and automatically attempts to pipeline a loop if doing so can be expected to produce a significant performance gain. This optimization also performs any necessary loop unrolling.

In some cases the compiler either does not pipeline a loop that could be pipelined or pipelines a loop without producing performance gains. In these situations, you can use the pipeline or nopipeline directive to advise the compiler to pipeline or not pipeline the loop immediately following the directive.

Software vector pipelining is valid only for the innermost loop of a loop nest.

The pipeline and nopipeline directives are advisory only. While you can use the nopipeline directive to inhibit automatic pipelining, and you can use the pipeline directive to attempt to override the compiler's decision not to pipeline a loop, you cannot force the compiler to pipeline a loop that cannot be pipelined.

Loops that have been pipelined are so noted in loopmark listing messages.

The formats of the pipelining directives are as follows:

    #pragma _CRI pipeline
    #pragma _CRI nopipeline

For more information about software vector pipelining, see the Optimizing Applications on Cray X1 Series Systems manual or the Optimizing Applications on Cray X2 Systems manual.
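As a small sketch of how the two pipelining directives might be placed, the example below requests pipelining for one innermost loop and inhibits it for another. The function names axpy and clear_small and the judgment about which loop benefits are assumptions for illustration only; since both directives are advisory, the compiler remains free to decide otherwise.

```c
/* Hypothetical long-running kernel: ask the compiler to attempt
 * software vector pipelining of this innermost loop. */
void axpy(double *restrict y, const double *restrict x, double a, int n)
{
    int i;

#pragma _CRI pipeline
    for (i = 0; i < n; i++) {
        y[i] += a * x[i];
    }
}

/* Hypothetical short loop where pipelining is assumed not to pay off:
 * advise the compiler not to pipeline it. */
void clear_small(double *v, int n)
{
    int i;

#pragma _CRI nopipeline
    for (i = 0; i < n; i++) {
        v[i] = 0.0;
    }
}
```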
3.7.9 prefervector Directive

Scope: Local

The prefervector directive tells the compiler to vectorize the loop that immediately follows the directive if the loop nest contains more than one loop that can be vectorized. The directive states a vectorization preference and does not guarantee that the loop has no memory dependence hazard.

The format of this directive is as follows:

    #pragma _CRI prefervector

The following example illustrates the use of the prefervector directive:

    #pragma _CRI prefervector
    for (i = 0; i < n; i++) {
    #pragma _CRI ivdep
        for (j = 0; j < m; j++)
            a[i] += b[j][i];
    }

In the preceding example, both loops can be vectorized, but the directive directs the compiler to vectorize the outer for loop. Without the directive and without any knowledge of n and m, the compiler vectorizes the inner for loop. In this example, the outer for loop is vectorized even though the inner for loop has an ivdep directive.

3.7.10 pgo loop_info Directive

Scope: Local

The format of this directive is as follows:

    #pragma _CRI pgo loop_info

The pgo loop_info directive enables profile-guided optimizations by tagging loopmark information as having come from profiling. For information about the -h profile_data=pgo_opt compiler option, see Section 2.10.11. For information about CrayPat and profile information, see the Using Cray Performance Analysis Tools guide.

3.7.11 safe_address Directive

Scope: Local

The format of this directive is as follows:

    #pragma _CRI safe_address

The safe_address directive specifies that it is safe to speculatively execute memory references within all conditional branches of a loop. In other words, you know that these memory references can be safely executed in each iteration of the loop.

For most code, the safe_address directive can improve performance significantly by preloading vector expressions.
However, most loops do not require this directive to have preloading performed. The directive is required only when the safety of the operation cannot be determined or index expressions are very complicated.

The safe_address directive is an advisory directive. That is, the compiler may override the directive if it determines the directive is not beneficial.

If you do not use the directive on a loop and the compiler determines that the loop would benefit from the directive, it issues a message indicating such. The message is similar to this:

    CC-6375 cc: VECTOR File = ctest.c, Line = 6
      A loop would benefit from "#pragma safe_address".

If you use the directive on a loop and the compiler determines that the loop does not benefit from the directive, it issues a message that states the directive is superfluous and can be removed.

To see the messages, you must use the -hreport=v option.

Caution: Incorrect use of the directive can result in segmentation faults, bus errors, or excessive page faulting. However, it should not result in incorrect answers. Incorrect usage can result in very severe performance degradations or program aborts.

In the example below, the compiler will not preload vector expressions without the directive, because the value of j is unknown. However, if you know that references to b[j][i] are safe to evaluate for all iterations of the loop, regardless of the condition, you can use the safe_address directive for this loop as shown below:

    void x3( double a[restrict 1000], int j )
    {
        int i;
    #pragma _CRI safe_address
        for ( i = 0; i < 1000; i++ ) {
            if ( a[i] != 0.0 ) {
                b[j][i] = 0.0;
            }
        }
    }

With the directive, the compiler can load b[j][i] with a full vector mask, merge 0.0 where the condition is true, and store the resulting vector using a full mask.

3.7.12 safe_conditional Directive

The safe_conditional directive specifies that it is safe to execute all references and operations within all conditional branches of a loop.
In other words, you know that these memory references can be safely executed in each iteration of the loop. This directive specifies that memory and arithmetic operations are safe.

This directive applies to scalar and vector loop nests. For Cray X1 series systems, this directive also applies to multistreamed loop nests. It can improve performance by allowing the hoisting of invariant expressions from conditional code and by allowing prefetching of memory references.

The safe_conditional directive is an advisory directive. That is, the compiler may override the directive if it determines the directive is not beneficial.

Caution: Incorrect use of the directive can result in segmentation faults, bus errors, excessive page faulting, or arithmetic aborts. However, it should not result in incorrect answers. Incorrect usage can result in severe performance degradations or program aborts.

The safe_conditional directive has the following format:

    #pragma _CRI safe_conditional

In the following example, without the safe_conditional directive, the compiler cannot precompute the invariant expression s1*s2 because the values of s1 and s2 are unknown and the multiplication may cause an arithmetic trap if executed unconditionally. However, if you know that the condition is true at least once, then s1*s2 is safe to speculatively execute. The safe_conditional compiler directive can be used to imply the safety of the operation. With the directive, the compiler evaluates s1*s2 outside of the loop, rather than under control of the conditional code. In addition, all control flow is removed from the body of the vector loop, because s1*s2 no longer poses a safety risk.
    void safe_cond( double a[restrict 1000], double s1, double s2 )
    {
        int i;
    #pragma _CRI safe_conditional
        for (i = 0; i < 1000; i++) {
            if ( a[i] != 0.0 ) {
                a[i] = a[i] + s1*s2;
            }
        }
    }

3.7.13 shortloop and shortloop128 Directives

Scope: Local

The shortloop and shortloop128 directives improve performance of a vectorized loop by allowing the compiler to omit the run time test to determine whether it has been completed. The shortloop compiler directive identifies vector loops that execute with a maximum iteration count of 64 and a minimum iteration count of 1. The shortloop128 compiler directive identifies vector loops that execute with a maximum iteration count of 128 and a minimum iteration count of 1. If the iteration count is outside the range for the directive, results are unpredictable. The compiler will diagnose misuse at compile time (when able) or at run time when option -hdir_check is specified.

These directives are ignored if the loop trip count is known at compile time and is greater than the target machine's vector length. The maximum hardware vector length is 64.

The syntax of these directives is as follows:

    #pragma _CRI shortloop
    #pragma _CRI shortloop128

The following examples illustrate the use of the shortloop and shortloop128 directives:

    #pragma _CRI shortloop
    for (i = 0; i < n; i++) {   /* 1 <= n <= 64 */
        a[i] = b[i] + c[i];
    }

    #pragma _CRI shortloop128
    for (i = 0; i < n; i++) {   /* 1 <= n <= 128 */
        a[i] = b[i] + c[i];
    }

The shortloop and shortloop128 directives are exactly equivalent to #pragma _CRI loop_info min_trips(1) max_trips(64) and #pragma _CRI loop_info min_trips(1) max_trips(128), respectively. The loop_info pragma is the preferred form.

3.8 Multistreaming Processor (MSP) Directives

Note: The MSP directives are not supported on Cray X2 systems.

This section describes the multistreaming processor (MSP) optimization directives. MSP directives are advisory; the compiler is not obligated to honor them.
For information about MSP compiler options, see Section 2.12; for streaming intrinsics, see Appendix F. For details on Cray Streaming Directives, see Chapter 4.

The MSP directives work with the -h streamn command line option to determine whether parts of your program are optimized for the MSP. The level of multistreaming must be greater than 0 in order for these directives to be recognized. For more information about the -h streamn command line option, see Section 2.12.1.

The MSP #pragma directives are as follows:

• #pragma _CRI ssp_private (see Section 3.8.1)
• #pragma _CRI nostream (see Section 3.8.2)
• #pragma _CRI preferstream (see Section 3.8.3)

3.8.1 ssp_private Directive (cc, c99)

The ssp_private directive allows the compiler to multistream loops that contain function calls. By default, the compiler does not multistream loops containing function calls, because the function may cause side effects that interfere with correct parallel execution. The ssp_private directive asserts that the specified function is free of side effects that inhibit parallelism and that the specified function, and all functions it calls, will run on an SSP.

An implied condition for multistreaming a loop containing a call to a function specified with the ssp_private directive is that the loop body must not contain any data reference patterns that prevent parallelism. The compiler can disregard an ssp_private directive if it detects possible loop-carried dependencies that are not directly related to a call inside the loop.

Note: The ssp_private directive affects only whether or not loops are multistreamed. It has no effect on loops within CSD parallel regions.
When using the ssp_private directive, you must ensure that the function called within the body of the loop meets these criteria:

• The function does not modify an object in one iteration and reference this same data in another iteration of the multistreamed loop.
• The function does not reference data in one iteration that is defined in another iteration.
• If the function modifies data, the iterations cannot modify data at the same storage location, unless these variables are scoped as PRIVATE. Following the multistreamed loop, the content of private variables is undefined. The ssp_private directive does not force the master thread to execute the last iteration of the multistreamed loop.
• If the function uses shared data that can be written to and read, you must protect it with a guard (such as the CSD critical directive or the lock command) or have the SSPs access the data disjointedly (where access does not overlap).
• The function calls only other routines that are capable of being called privately.
• The function calls I/O properly.

Note: The preceding list assumes that you have a working knowledge of race conditions.

To use the ssp_private directive, you must place it in the specification part, before any executable statements. This is the syntax of the ssp_private directive:

    #pragma _CRI ssp_private PROC_NAME[,PROC_NAME] ...

PROC_NAME    The name of a function. Any number of ssp_private directives may be specified in a function.

If a function is specified with the ssp_private directive, the function retains this attribute throughout the entire program unit. Also, the ssp_private directive is considered a declarative directive and must be specified before the start of any executable statements.
The following example demonstrates use of the ssp_private pragma:

    /* Code in example.c */
    extern void poly_eval( float *y, float x, int m, float p[m] );
    #pragma _CRI ssp_private poly_eval
    void example(int n, int m, float x[n], float y[n], float p[]) {
        int i;
        for (i = 0; i < n; ++i) {
            poly_eval( &y[i], x[i], m, p );
        }
    }

    /* Code in poly_eval.c */
    void poly_eval( float *y, float x, int m, float p[] ) {
        float result = p[m];
        int i;
        /* Horner's rule: step i down from m-1 to 0 */
        for (i = m-1; i >= 0; --i) {
            result = x * result + p[i];
        }
        *y = result;
    }

Compile the code as follows:

    % cc -c example.c
    % cc -c -h gen_private_callee poly_eval.c
    % cc -o example example.o poly_eval.o

Now run the code:

    % aprun -L1 ./example

SSP private routines are appropriate for user-specified math support functions. Intrinsic math functions, such as COS, are effectively SSP private routines.

3.8.2 nostream Directive

Scope: Local

The #pragma _CRI nostream directive directs the compiler to not perform MSP optimizations on the loop that immediately follows the directive. It overrides any other MSP-related directives as well as the -h streamn command line option.

The format of this directive is as follows:

    #pragma _CRI nostream

The following example illustrates the use of the nostream directive:

    #pragma _CRI nostream
    for ( i = 0; i < n1; i++ ) {
        x[i] = y[i] + z[i];
    }

3.8.3 preferstream Directive

Scope: Local

The preferstream directive tells the compiler to multistream the following loop. It can be used when one of these conditions applies:

• The compiler issues a message saying there are too few iterations in the loop to make multistreaming worthwhile.
• The compiler multistreams a loop in a loop nest, and you want it to multistream a different eligible loop in the same nest.
The format of this directive is as follows:

    #pragma _CRI preferstream

The following example illustrates the use of the preferstream directive:

    for (j = 0; j < n2; j++ ) {
    #pragma _CRI preferstream
        for (i = 0; i < n1; i++ ) {
            a[j][i] = b[j][i] + c[j][i];
        }
    }

3.9 Scalar Directives

This section describes the scalar optimization directives, which control aspects of code generation, register storage, and other scalar operations.

3.9.1 concurrent Directive

Scope: Local

The concurrent directive indicates that no data dependence exists between array references in different iterations of the loop that follows the directive. This can be useful for vectorization and multistreaming optimizations.

The format of the concurrent directive is as follows:

    #pragma _CRI concurrent [safe_distance=n]

n    An integer that represents the number of additional consecutive loop iterations that can be executed in parallel without danger of data conflict. n must be an integral constant > 0.

The concurrent directive is ignored if the safe_distance clause is used and MSP optimizations, multistreaming, or vectorization is requested on the command line.

In the following example, the concurrent directive indicates that the relationship k>3 is true. The compiler will safely load all the array references x[i-k], x[i-k+1], x[i-k+2], and x[i-k+3] during loop iteration i.

    #pragma _CRI concurrent safe_distance=3
    for (i = k + 1; i < n; i++) {
        x[i] = a[i] + x[i-k];
    }

3.9.2 nointerchange Directive

Scope: Local

The nointerchange directive inhibits the compiler's ability to interchange the loop that follows the directive with another inner or outer loop.
The format of this directive is as follows:

    #pragma _CRI nointerchange

In the following example, the nointerchange directive prevents the iv loop from being interchanged by the compiler with either the jv loop or the kv loop:

    for (jv = 0; jv < 128; jv++) {
    #pragma _CRI nointerchange
        for (iv = 0; iv < m; iv++) {
            for (kv = 0; kv < n; kv++) {
                p1[iv][jv][kv] = pw[iv][jv][kv] * s;
            }
        }
    }

3.9.3 noreduction Directive

Note: This directive is no longer recognized. Use the #pragma _CRI novector directive instead.

3.9.4 suppress Directive

The suppress directive suppresses optimization in two ways, determined by its use with either global or local scope.

The global scope suppress directive specifies that all associated local variables are to be written to memory before a call to the specified function. This ensures that the value of the variables will always be current. The global suppress directive takes the following form:

    #pragma _CRI suppress func...

The local scope suppress directive stores current values of the specified variables in memory. If the directive lists no variables, all variables are stored to memory. This directive causes the values of these variables to be reloaded from memory at the first reference following the directive. The local suppress directive has the following format:

    #pragma _CRI suppress [var] ...

The net effect of the local suppress directive is similar to declaring the affected variables to be volatile, except that the volatile qualifier affects the entire program whereas the local suppress directive affects only the block of code in which it resides.

3.9.5 [no]unroll Directive

Scope: Local

The unroll directive allows the user to control unrolling for individual loops or to specify no unrolling of a loop. Loop unrolling can improve program performance by revealing cross-iteration memory optimization opportunities such as read-after-write and read-after-read.
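As a rough illustration of the read-after-read opportunity mentioned above, the following sketch unrolls a loop by hand. It is not compiler output; the function names pair_sums_rolled and pair_sums_unrolled2 are hypothetical and chosen only for this example. In the unrolled form, the value b[i + 1] is loaded once and reused by the second copy of the loop body.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Rolled form: every iteration loads both b[i] and b[i + 1].
std::vector<double> pair_sums_rolled(const std::vector<double>& b) {
    assert(!b.empty());
    std::vector<double> a(b.size() - 1);
    for (std::size_t i = 0; i + 1 < b.size(); ++i) {
        a[i] = b[i] + b[i + 1];
    }
    return a;
}

// Hand-unrolled by 2 (illustrative only): b[i + 1] is read once and
// reused by the second body copy, a read-after-read reuse.
std::vector<double> pair_sums_unrolled2(const std::vector<double>& b) {
    assert(!b.empty());
    const std::size_t n = b.size() - 1;   // number of results
    std::vector<double> a(n);
    std::size_t i = 0;
    for (; i + 1 < n; i += 2) {
        const double mid = b[i + 1];      // shared by both body copies
        a[i] = b[i] + mid;
        a[i + 1] = mid + b[i + 2];
    }
    for (; i < n; ++i) {                  // remainder when n is odd
        a[i] = b[i] + b[i + 1];
    }
    return a;
}
```

Both functions return the same results; the unrolled version simply makes the reuse explicit in the source.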
The effects of loop unrolling also include:

• Improved loop scheduling by increasing basic block size
• Reduced loop overhead
• Improved chances for cache hits

The format for this compiler directive is as follows:

    #pragma _CRI [no]unroll [n]

The nounroll directive disables loop unrolling for the next loop and does not accept the integer argument n. The nounroll directive is equivalent to the unroll 0 and unroll 1 directives.

The n argument applies only to the unroll directive and specifies no loop unrolling (n = 0 or 1) or the total number of loop body copies to be generated (2 <= n <= 63). If you do not specify a value for n, the compiler will determine the number of copies to generate based on the number of statements in the loop nest.

Note: The compiler cannot always safely unroll non-innermost loops due to data dependencies. In these cases, the directive is ignored (see Example 13, page 110).

The unroll directive can be used only on loops with iteration counts that can be calculated before entering the loop. If unroll is specified on a loop that is not the innermost loop in a loop nest, the inner loops must be nested perfectly. That is, each loop in the nest can contain only one loop, and only the innermost loop can contain work.
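The two eligibility conditions above can be sketched as follows. This is an illustration under stated assumptions, not compiler documentation: the names Grid, add_one_perfect, add_one_imperfect, and first_negative are hypothetical, and the directive appears only in a comment so that the sketch compiles with any C++ compiler.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

const std::size_t ROWS = 4;
const std::size_t COLS = 8;
using Grid = std::array<std::array<int, COLS>, ROWS>;

// Perfect nest: the outer loop contains only the inner loop, all work is
// in the innermost loop, and both trip counts are known before entry.
void add_one_perfect(const Grid& b, Grid& a) {
    // A "#pragma _CRI unroll 2" placed here would meet the stated conditions.
    for (std::size_t i = 0; i < ROWS; ++i) {
        for (std::size_t j = 0; j < COLS; ++j) {
            a[i][j] = b[i][j] + 1;
        }
    }
}

// Imperfect nest: the outer loop body does work outside the inner loop,
// so the inner loop is not perfectly nested.
void add_one_imperfect(const Grid& b, Grid& a, std::array<int, ROWS>& row_sum) {
    for (std::size_t i = 0; i < ROWS; ++i) {
        row_sum[i] = 0;                       // work outside the innermost loop
        for (std::size_t j = 0; j < COLS; ++j) {
            a[i][j] = b[i][j] + 1;
            row_sum[i] += a[i][j];
        }
    }
}

// Not countable: the trip count depends on the data, so it cannot be
// calculated before the loop is entered.
std::size_t first_negative(const int* v, std::size_t n) {
    assert(v != nullptr);
    std::size_t i = 0;
    while (i < n && v[i] >= 0) {
        ++i;
    }
    return i;
}
```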
Example 12: Unrolling Outer Loops

In the following example, assume that the outer loop of the following nest will be unrolled by 2:

    #pragma _CRI unroll 2
    for (i = 0; i < 10; i++) {
        for (j = 0; j < 100; j++) {
            a[i][j] = b[i][j] + 1;
        }
    }

With outer loop unrolling, the compiler produces the following nest, in which the two bodies of the inner loop are adjacent:

    for (i = 0; i < 10; i += 2) {
        for (j = 0; j < 100; j++) {
            a[i][j] = b[i][j] + 1;
        }
        for (j = 0; j < 100; j++) {
            a[i+1][j] = b[i+1][j] + 1;
        }
    }

The compiler then jams, or fuses, the inner two loop bodies, producing the following nest:

    for (i = 0; i < 10; i += 2) {
        for (j = 0; j < 100; j++) {
            a[i][j] = b[i][j] + 1;
            a[i+1][j] = b[i+1][j] + 1;
        }
    }

Example 13: Illegal Unrolling of Outer Loops

Outer loop unrolling is not always legal because the transformation can change the semantics of the original program. For example, unrolling the following loop nest on the outer loop would change the program semantics because of the dependency between a[i][...] and a[i+1][...]:

    /* directive will cause incorrect code due to dependencies! */
    #pragma _CRI unroll 2
    for (i = 0; i < 10; i++) {
        for (j = 1; j < 100; j++) {
            a[i][j] = a[i+1][j-1] + 1;
        }
    }

3.9.6 [no]fusion Directive

The nofusion directive instructs the compiler to not attempt loop fusion on the following loop even when the -h fusion option was specified on the compiler command line.

The fusion directive instructs the compiler to attempt loop fusion on the following loop unless -h nofusion was specified on the compiler command line.

3.10 Inlining Directives

Inlining replaces calls to user-defined functions with the code that represents the function. This can improve performance by saving the expense of the function call overhead. It also increases the possibility of additional code optimization. Inlining may increase object code size.
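To make the idea concrete, the following sketch shows a call site and a hand-inlined equivalent. It only illustrates the transformation described above; the function names are hypothetical, and an optimizing compiler performs the substitution on its own internal representation rather than on source text.

```cpp
#include <cassert>

// A small user-defined function that is a natural inlining candidate.
static int square(int v) {
    return v * v;
}

// Original form: one function call per loop iteration.
int sum_of_squares_call(int n) {
    assert(n >= 0);
    int total = 0;
    for (int i = 1; i <= n; ++i) {
        total += square(i);
    }
    return total;
}

// Hand-inlined form: the call is replaced by the function body, which
// removes the call overhead and exposes i * i to further optimization.
int sum_of_squares_inlined(int n) {
    assert(n >= 0);
    int total = 0;
    for (int i = 1; i <= n; ++i) {
        total += i * i;
    }
    return total;
}
```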
Inlining is invoked in the following ways:

• Automatic inlining is enabled by the -h ipan option alone, as described in Section 2.14, page 37.
• Explicit inlining is enabled by the -h ipafrom=source [:source ] option alone, as described in Section 2.14.3, page 40.
• Combined inlining is enabled by using both the -h ipan and -h ipafrom=source [:source ] options (see Section 2.14.4, page 41).

Inlining directives can only appear in local scope; that is, inside a function definition. Inlining directives always take precedence over the command line settings.

The -h report=i option writes messages identifying where functions are inlined or briefly explains why functions are not inlined.

3.10.1 inline_enable, inline_disable, and inline_reset Directives

The inline_enable directive tells the compiler to attempt to inline functions at call sites. It has the following format:

    #pragma _CRI inline_enable

The inline_disable directive tells the compiler to not inline functions at call sites. It has the following format:

    #pragma _CRI inline_disable

The inline_reset directive returns the inlining state to the state specified on the command line (-h ipan). It has the following format:

    #pragma _CRI inline_reset

The following example illustrates the use of these directives.
Example 14: Using the inline_enable, inline_disable, and inline_reset Directives

To compile the file displayed in this example, enter the following commands:

    % cc -hipa4 b.c
    % cat b.c
    void qux(int x) {
        void bar(void);
        int a = 1;
        x = a+a+a+a+a+a+a+a+a+a+a+a;
        bar();
    }

    void foo(void) {
        int j = 1;
    #pragma inline_enable /* enable inlining at all call sites here forward */
        qux(j);
        qux(j);
    #pragma inline_disable /* disable inlining at all call sites here forward */
        qux(j);
    #pragma inline_reset /* reset control to the command line -hipa4 */
        qux(j);
    }

3.10.2 inline_always and inline_never Directives

The inline_always directive specifies functions that the compiler should always attempt to inline. If the directive is placed in the definition of the function, inlining is attempted at every call site to name in the entire input file being compiled. If the directive is placed in a function other than the definition, inlining is attempted at every call site to name within the specific function containing the directive.

The format of the inline_always directive is as follows:

    #pragma _CRI inline_always name [,name ] ...

The inline_never directive specifies functions that are never to be inlined. If the directive is placed in the definition of the function, inlining is never attempted at any call site to name in the entire input file being compiled. If the directive is placed in a function other than the definition, inlining is never attempted at any call site to name within the specific function containing the directive.

The format of the inline_never directive is as follows:

    #pragma _CRI inline_never name [,name ] ...

The name argument is the name of a function.

Cray Streaming Directives (CSDs) [4]

Note: Cray Streaming Directives are not supported on Cray X2 systems.
The Cray streaming directives (CSDs) consist of six non-advisory directives that allow you to more closely control multistreaming for key loops in C and C++ programs. Non-advisory means that the compiler must honor these directives. The intention of these directives is not to create an additional parallel programming style or to demand a large effort in code development. They are meant to assist the compiler in multistreaming your program. On its own, the compiler should perform multistreaming correctly in most cases. However, if multistreaming for key loops is not occurring as you desire, use the CSDs to override the compiler.

CSDs are modeled after the OpenMP directives and are compatible with Pthreads and all distributed-memory parallel programming models on Cray X1 series systems. Multistreaming advisory directives (MSP directives) and CSDs cannot be mixed within the same block of code. For information about MSP directives, see Section 3.8, page 103.

Before reading the guidelines and other issues, you need an understanding of these CSD items:

• CSD parallel regions
• CSD parallel (defines a CSD parallel region)
• CSD for (multistreams a for loop)
• CSD parallel for (combines the CSD parallel and for directives into one directive)
• CSD sync (synchronizes all SSPs within an MSP)
• CSD critical (defines a critical section of code)
• CSD ordered (specifies that SSPs execute in order)

When you are familiar with the directives, these topics will be beneficial to you:

• Using CSDs with Cray programming models
• CSD placement
• Protection of shared data
• Dynamic memory allocation for CSD parallel regions
• Compiler options affecting CSDs

Note: For information about how to use the CSDs to optimize your code, see the Optimizing Applications on Cray X1 Series Systems manual.

4.1 CSD Parallel Regions

CSDs are applied to a block of code (for example a loop), which will be referred to as the CSD parallel region.
All CSDs must be used within this region. You must not branch into or out of the region.

Multiple CSD parallel regions can exist within a program; however, only one parallel region will be active at any given time. For example, if a parallel region calls a function containing a parallel region, the function will execute as if it did not contain a parallel region.

The CSD parallel region can contain loops and nonloop constructs, but only loops preceded by a for directive are partitioned. Parallel execution of other loops and nonloop constructs, such as initializing variables for the targeted loop, is performed redundantly on all SSPs. Functions called from the region will be executed redundantly, and loops within them can be partitioned with the for directive. Parallel execution of the function is independent on all SSPs, except for code blocks containing standalone CSDs. For more information, see Section 4.9, page 125.

4.2 parallel Directive

The parallel directive defines the CSD parallel region, tells the compiler to multistream the region, and specifies private data objects. All other CSDs must be used within the region. You cannot place the parallel directive in the middle of a construct.

This is the form of the parallel directive:

    #pragma _CRI csd parallel [private(list)] [ordered]
    {
        structured_block
    }
    /* End of CSD parallel region */

The private clause allows you to specify data objects that are private to each SSP within the CSD parallel region; that is, each SSP has its own copy of the object, and that copy is not shared with other SSPs. The main reason for making objects private is that updating them as shared objects within the CSD parallel region could cause incorrect updates because of race conditions on their addresses. The list argument specifies a comma-separated list of objects to make private. By default, the variables used for loop indexing are assumed to be private.
Variables declared in the inner scope of a parallel region are implicitly private. Other variables, unless specified in the private clause, are assumed to be shared.

You may need to take special steps when using private variables. If a data object existed before the parallel region is entered and the object is made private, the object may not have the same contents inside the region as it did outside the region. The same is true when exiting the parallel region: the object may not have the same contents outside the region as it did within the region. Therefore, if you want a private object to keep the same value when transitioning into and out of the parallel region, copy its value to a protected shared object so that you can copy it back into the private object later.

The ordered clause is needed if the parallel region contains, outside of the loops within the region, any call to a function that contains a CSD ordered directive. That is, if only the loops contain calls to functions that contain the CSD ordered directive, the clause is not needed. If the clause is used and there are no called functions containing a CSD ordered directive, the results produced by the code encapsulated by the directive will be correct, but performance of that code will be slightly degraded. If the ordered clause is missing and there is a called function containing a CSD ordered directive, your results will be incorrect.

The following example shows when the ordered clause is needed:

    #pragma _CRI csd parallel ordered
    {
        fun(); /* fun contains ordered directive */
        for_loop_block
        . . .
    }

The end of the CSD parallel region has an implicit barrier synchronization. The implicit barrier protects an SSP from prematurely accessing shared data.

Note: At the point of the parallel directive, all SSPs are enabled; they are disabled at the end of the CSD parallel region.
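The advice above about carrying a private object's value into and out of a parallel region can be sketched with a purely sequential emulation. This is not CSD code and makes several assumptions: an ordinary loop stands in for the SSPs, NUM_SSPS is set to 4 as an assumed SSP count per MSP, and all names (RegionResult, run_region, shared_scale) are hypothetical. The point is only the copy-through-a-shared-object pattern.

```cpp
#include <array>
#include <cassert>

const int NUM_SSPS = 4;   // assumption: four SSPs cooperate in the region

struct RegionResult {
    std::array<double, NUM_SSPS> partial;  // one result per emulated SSP
    double scale_after;                    // value restored after the region
};

// Sequential emulation of a region that treats "scale" as private.
RegionResult run_region(double scale) {
    assert(scale != 0.0);
    const double shared_scale = scale;     // saved in a shared object before entry
    RegionResult r{};
    for (int ssp = 0; ssp < NUM_SSPS; ++ssp) {   // stands in for the parallel region
        double private_scale = shared_scale;     // copy in: a private copy has no
                                                 // guaranteed contents on entry
        private_scale *= (ssp + 1);              // each SSP changes only its own copy
        r.partial[ssp] = private_scale;
    }
    r.scale_after = shared_scale;          // copy back: restore the pre-region value
    return r;
}
```

In real CSD code the shared object would also need protection, for example with the CSD critical directive, if any SSP writes to it.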
This example shows how to use the parallel directive:

    #pragma _CRI csd parallel private(jx)
    {
        x = 2 * PI; /* This line is computed on all SSPs */
        for(i=1; i

    void upc_all_free(shared void *ptr);

upc_all_free deallocates memory allocated by the upc_all_alloc function.

6.1.1.2 upc_local_free

The synopsis is:

    #include
    void upc_local_free(shared void *ptr);

The upc_local_free function deallocates shared memory allocated by a call to either upc_alloc or upc_local_alloc. If the ptr argument does not point to memory that was allocated by either upc_alloc or upc_local_alloc, or points to memory that was already deallocated, the behavior of the function is undefined. If the ptr argument is NULL, no action occurs. Note that program termination does not imply that shared data allocated dynamically is freed.

6.1.2 Pointer-to-shared Manipulation Functions

The following sections describe the pointer-to-shared manipulation functions.

6.1.3 Lock Functions

The following sections describe the Cray specific lock functions.

6.1.3.1 upc_all_lock_free

The synopsis is:

    #include
    void upc_all_lock_free(upc_lock_t *ptr);

upc_all_lock_free frees a lock allocated by the upc_all_lock_alloc function.

6.1.3.2 upc_global_lock_free

The synopsis is:

    #include
    void upc_global_lock_free(upc_lock_t *ptr);

upc_global_lock_free frees a lock allocated by the upc_global_lock_alloc function.

The upc_global_lock_free function frees all resources associated with lock ptr, which was allocated by upc_global_lock_alloc. The upc_global_lock_free function will free ptr whether it is unlocked or locked by any thread. After ptr is freed, passing it to any locking functions in any thread will cause undefined behavior.

Only the thread that allocated lock ptr should free it. Be cautious when freeing the lock, because there is no implied synchronization with other threads.
If the ptr argument is a NULL pointer, the function does nothing. If ptr was not allocated by the upc_global_lock_alloc function or if it was freed earlier, the behavior of upc_global_lock_free will be undefined.

6.2 Cray Implementation Differences

There is a false sharing hazard when referencing shared char and short integers on Cray X1 series or Cray X2 systems. If two PEs store a char or short to the same 32-bit word in memory without synchronization, incorrect results can occur. It is possible for one PE's store to be lost. This is because these stores are implemented by reading the entire 32-bit word, inserting the char or short value, and writing the entire word back to memory.

The following output is a result of two PEs writing two different characters into the same word in memory without synchronization:

                        Register    Memory
    Initial Value                   0x0000
    PE 0 Reads          0x0000      0x0000
    PE 1 Reads          0x0000      0x0000
    PE 0 Inserts 3      0x3000      0x0000
    PE 1 Inserts 7      0x0700      0x0000
    PE 0 Writes         0x3000      0x3000
    PE 1 Writes         0x0700      0x0700

Notice that the value stored by PE 0 has been lost. The final value intended was 0x3700. This situation is referred to as false sharing. It is the result of supporting data types that are smaller than the smallest type that can be individually read or written by the hardware. UPC programmers must take care when storing to shared char and short data that this situation does not occur.

6.3 Compiling and Executing UPC Code

To compile UPC code, you must load the programming environment module (PrgEnv) and specify the -h upc option on the cc, c89, or c99 command line.

The -X npes option can optionally be used to define the number of threads to use and statically set the value of the THREADS constant.
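Returning briefly to the false sharing hazard of Section 6.2, the lost store can be reproduced with a small sequential emulation of the read, insert, and write steps. This is a sketch under assumptions, not UPC code: the function names are hypothetical, the byte positions and values are chosen only to resemble the interleaving described there, and ordinary local variables stand in for each PE's register.

```cpp
#include <cassert>
#include <cstdint>

// One read-modify-write step: return "word" with the byte at byte_index
// (0 = least significant) replaced by "value".
std::uint32_t insert_byte(std::uint32_t word, unsigned byte_index, std::uint8_t value) {
    assert(byte_index < 4);
    const unsigned shift = 8u * byte_index;
    const std::uint32_t mask = ~(std::uint32_t{0xFF} << shift);
    return (word & mask) | (std::uint32_t{value} << shift);
}

// Unsynchronized interleaving: both PEs read the word before either writes.
std::uint32_t racy_stores(std::uint32_t memory) {
    const std::uint32_t reg0 = memory;                        // PE 0 reads
    const std::uint32_t reg1 = memory;                        // PE 1 reads
    const std::uint32_t new0 = insert_byte(reg0, 3, 0x30);    // PE 0 inserts
    const std::uint32_t new1 = insert_byte(reg1, 2, 0x07);    // PE 1 inserts
    memory = new0;                                            // PE 0 writes
    memory = new1;                                            // PE 1 writes; PE 0's byte is lost
    return memory;
}

// Synchronized version: each PE completes its read-modify-write in turn.
std::uint32_t synchronized_stores(std::uint32_t memory) {
    memory = insert_byte(memory, 3, 0x30);                    // PE 0, whole sequence
    memory = insert_byte(memory, 2, 0x07);                    // PE 1, whole sequence
    return memory;
}
```

Starting from a zero word, racy_stores keeps only the second PE's byte, while synchronized_stores keeps both.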
Example 15: UPC and THREADS defined dynamically

The following example enables UPC and allows the THREADS symbol to be defined dynamically for the examp1 application:

    % cc -h upc -o multupc examp1.c

Example 16: UPC and THREADS defined statically

The following example enables UPC and statically defines the THREADS symbol as 15 for the examp1 application:

    % cc -h upc -X15 -o multupc examp1.c

For Cray X1 series systems, the processing elements specified by npes are either MSPs or SSPs. To run programs on SSPs, you must specify the -h ssp compiler option. The default is to run on MSPs. For Cray X2 systems, the processing elements specified by npes are compute node processors. For more information about using UPC in SSP mode, see Section 2.10.13, page 30.

After compiling the UPC code, use the aprun command to run the program when the code contains only UPC code or a mixture of UPC with SHMEM or CAF code. If the code has a mixture of UPC and MPI code, use the mpirun command to run the program. If you use the -X npes compiler option, you must specify the same number of threads in the aprun command.

Note: For more information about improving UPC code performance, see the Optimizing Applications on Cray X1 Series Systems manual or the Optimizing Applications on Cray X2 Systems manual.

Cray C++ Libraries [7]

The Cray C++ compiler, together with the Dinkum C++ Libraries, supports the C++ 98 standard (ISO/IEC FDIS 14882) and continues to support existing Cray extensions. Most of the standard C++ features are supported, except for the few mentioned in Section 7.1. The Dinkum C++ Library is described in Section 7.2.

For information about C++ language conformance and exceptions, see Appendix D, page 207.
7.1 Unsupported Standard C++ Library Features

The Cray C++ compiler supports the C++ standard except for wide characters and multiple locales as follows:

• String classes using basic string class templates with wide character types or that use the wstring standard template class
• I/O streams using wide character objects
• File-based streams using file streams with wide character types (wfilebuf, wifstream, wofstream, and wfstream)
• Multiple localization libraries; Cray C++ supports only one locale

Note: The C++ standard provides a standard naming convention for library routines. Therefore, classes or routines that use wide characters are named appropriately. For example, the fscanf and sprintf functions do not use wide characters, but the fwscanf and swprintf functions do.

7.2 Dinkum and GNU C++ Libraries

For Cray X1 series systems, the Cray C++ compiler uses the Dinkum C++ libraries, which support standard C++. The Dinkum C++ Library documentation is provided in HTML (see http://www.dinkumware.com/). You can also find other references to tutorials and advanced user materials for the standard C++ library in the preface of this document.

For Cray X2 systems, the Cray C++ compiler uses the GNU standard C++ library, libstdc++.a. For details, see http://gcc.gnu.org/libstdc++/index.html.

Cray C++ Template Instantiation [8]

A template describes a class or function that is a model for a family of related classes or functions. The act of generating a class or function from a template is called template instantiation. For example, a template can be created for a stack class, and then a stack of integers, a stack of floats, and a stack of some user-defined type can be used. In source code, these might be written as Stack<int>, Stack<float>, and Stack<X>. From a single source description of the template for a stack, the compiler can create instantiations of the template for each of the types required.
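A minimal sketch of such a stack template follows. It is an illustration only, not code from the Cray or Dinkum libraries: the class layout, the Point type, and the function use_three_instantiations are all hypothetical. Each distinct type argument causes the compiler to generate a separate class from the one template description.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One source description of a stack, parameterized on the element type.
template <class T>
class Stack {
public:
    void push(const T& value) { items_.push_back(value); }

    T pop() {
        assert(!items_.empty());     // popping an empty stack is a caller error
        T top = items_.back();
        items_.pop_back();
        return top;
    }

    std::size_t size() const { return items_.size(); }

private:
    std::vector<T> items_;
};

// Hypothetical user-defined type used as a template argument.
struct Point {
    int x;
    int y;
};

// Using three different type arguments makes the compiler instantiate
// three classes from the single template.
std::size_t use_three_instantiations() {
    Stack<int> ints;
    Stack<float> floats;
    Stack<Point> points;
    ints.push(1);
    floats.push(2.5f);
    points.push(Point{3, 4});
    return ints.size() + floats.size() + points.size();
}
```

Note that only the member functions actually called (push and size here) need to be instantiated for each type; pop is instantiated only where it is used.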
The instantiation of a class template is always done as soon as it is needed during a compilation. However, the instantiations of template functions, member functions of template classes, and static data members of template classes (template entities) are not necessarily done immediately, for the following reasons:

• The preferred end result is one copy of each instantiated entity across all object files in a program. This applies to entities with external linkage.
• A specialization of a template entity is allowed. For example, a specific version of Stack<int>, or of just Stack<int>::push, could be written to replace the template-generated version and to provide a more efficient representation for a particular data type.
• If a template function is not referenced, it should not be compiled because such functions could contain semantic errors that would prevent compilation. Therefore, a reference to a template class should not automatically instantiate all the member functions of that class.

The goal of an instantiation mode is to provide trouble-free instantiation. The programmer should be able to compile source files to object code, link them, and run the resulting program without questioning how the necessary instantiations are done. In practice, this is difficult for a compiler to do, and different compilers use different instantiation schemes with different strengths and weaknesses.

The Cray C++ compiler requires a normal, top-level, explicitly compiled source file that contains the definition of both the template entity and of any types required for the particular instantiation. This requirement is met in one of the following ways:

• Each .h file that declares a template entity also contains either the definition of the entity or includes another file containing the definition.
• When the compiler identifies a template declaration in a .h file and discovers a need to instantiate that entity, implicit inclusion gives the compiler permission to search for an associated definition file having the same base name and a different suffix and implicitly include that file at the end of the compilation (see Section 8.6, page 151).
• The programmer makes sure that the files that define template entities also have the definitions of all the available types and adds code or directives in those files to request instantiation of those entities.

The Cray C++ compiler provides two instantiation mechanisms: simple instantiation and prelinker instantiation. These mechanisms perform template instantiation and provide command line options and #pragma directives that give the programmer more explicit control over instantiation.

8.1 Simple Instantiation

The goal of the simple instantiation mode is to provide a method of instantiating templates without the need to create and manage intermediate (*.ti and *.ii) files. The Cray C++ compiler accomplishes simple instantiation as follows:

1. When the source files of a program are compiled using the -h simple_templates option, each of the *.o files contains a copy of all of the template instantiations it uses.
2. When the object files are linked together, the resulting executable file contains multiple copies of the template function.

Unlike in prelinker instantiation, no *.ti or *.ii files are created. The programmer is not required to manage the naming and location of the intermediate files. The simple template instantiation process creates slightly larger object files and a slightly larger executable file than is the case for prelinker instantiation.

For example, suppose you have three C++ source files, x.C, y.C, and z.C. The source files reference a template sortall that sorts int, float, and char array elements:

    template <class X> void sortall(X a[])
    {
        ...
        code to sort int, float, char elements
        ...
    }

Entering the following command produces object files x.o, y.o, and z.o:

    CC -c -h simple_templates x.C y.C z.C

Each *.o file has three copies of sortall, one for ints, one for floats, and one for chars. Then, entering the following command links the files and any needed library routines, creating a.out:

    CC x.o y.o z.o

Because the -h simple_templates option enables the -h instantiate=used option, all needed template entities are instantiated.

The programmer can use the #pragma do_not_instantiate directive in programs compiled using the -h simple_templates option. For more information, see Section 3.6, page 90.

8.2 Prelinker Instantiation

In prelinker mode, automatic instantiation is accomplished by the Cray C++ compiler as follows:

1. If the compiler is responsible for doing all instantiations automatically, it can only do so for the entire program. That is, the compiler cannot make decisions about instantiation of template entities until all source files of the complete program have been read.

2. The first time the source files of a program are compiled, no template entities are instantiated. However, the generated object files contain information about things that could have been instantiated in each compilation. For any source file that makes use of a template instantiation, an associated .ti file is created, if one does not already exist (for example, the compilation of abc.C results in the creation of abc.ti).

3. When the object files are linked together, a program called the prelinker is run. It examines the object files, looking for references and definitions of template entities and for any additional information about entities that could be instantiated.

   Caution: The prelinker examines the object files in a library (.a) file but, because it does not modify them, is not able to assign template instantiations to them.

4.
If the prelinker finds a reference to a template entity for which there is no definition in the set of object files, it looks for a file that indicates that it could instantiate that template entity. Upon discovery of such a file, it assigns the instantiation to that file. The set of instantiations assigned to a given file (for example, abc.C) is recorded in an associated file that has a .ii suffix (for example, abc.ii).
5. The prelinker then executes the compiler again to recompile each file for which the .ii file was changed.
6. During compilation, the compiler obeys the instantiation requests contained in the associated .ii file and produces a new object file that contains the requested template entities and the other things that were already in the object file.
7. The prelinker repeats steps 3 through 5 until there are no more instantiations to be adjusted.
8. The object files are linked together.

Once the program has been linked correctly, the .ii files contain a complete set of instantiation assignments. If source files are recompiled, the compiler consults the .ii files and does the indicated instantiations as it does the normal compilations. That means that, except in cases where the set of required instantiations changes, the prelink step from then on will find that all the necessary instantiations are present in the object files and no instantiation assignment adjustments need be done. This is true even if the entire program is recompiled.

Because the .ii file contains information about how to recompile when instantiating, it is important that the .o and .ii files are not moved between the first compilation and linkage.

The prelinker cannot instantiate into and from library files (.a), so if a library is to be shared by many applications its templates should be expanded.
You may find that creating a directory of objects with corresponding .ii files and the use of -h prelink_copy_if_nonlocal (see Section 2.7.9, page 22) will work as if you created a library (.a) that is shared.

The -h prelink_local_copy option indicates that only local files (for example, files in the current directory) are candidates for assignment of instantiations. This option is useful when you are sharing some common relocatables but do not want them updated. Another way to ensure that shared .o files are not updated is to use the -h remove_instantiation_flags option when compiling the shared .o files. This also makes smaller resulting shared .o files.

An easy way to create a library that instantiates all references of templates within the library is to create an empty main function and link it with the library, as shown in the following example. The prelinker will instantiate those template references that are within the library to one of the relocatables without generating duplicates. The empty dummy_main.o file is removed prior to creating the .a file.

    % CC a.C b.C c.C dummy_main.C
    % ar cr mylib.a a.o b.o c.o

Another alternative to creating a library that instantiates all references of templates is to use the -h one_instantiation_per_object option. This option directs the prelinker to instantiate each template referenced within a library in its own object file. The following example shows how to use the option:

    % CC -h one_instantiation_per_object a.C b.C c.C dummy_main.C
    % ar cr mylib.a a.o b.o c.o myInstantiationsDir/*.int.o

For more information about this alternative see Section 8.4, page 149 and Section 2.7.3, page 20.

Prelinker instantiation can coexist with partial explicit control of instantiation by the programmer through the use of #pragma directives or the -h instantiate=mode option.

Prelinker instantiation mode can be disabled by issuing the -h noautoinstantiate command line option.
If prelinker instantiation is disabled, the information about template entities that could be instantiated in a file is not included in the object file.

8.3 Instantiation Modes

Normally, during compilation of a source file, no template entities are instantiated (except those assigned to the file by prelinker instantiation). However, the overall instantiation mode can be changed by issuing the -h instantiate=mode command line option. The mode argument can be specified as follows:

mode     Description

none     Do not automatically create instantiations of any template entities. This is the most appropriate mode when prelinker instantiation is enabled. This is the default instantiation mode.

used     Instantiate those template entities that were used in the compilation. This includes all static data members that have template definitions.

all      Instantiate all template entities declared or referenced in the compilation unit. For each fully instantiated template class, all of its member functions and static data members are instantiated, regardless of whether they were used. Nonmember template functions are instantiated even if the only reference was a declaration.

local    Similar to used mode, except that the functions are given internal linkage. This mode provides a simple mechanism for those who are not familiar with templates. The compiler instantiates the functions used in each compilation unit as local functions, and the program links and runs correctly (barring problems due to multiple copies of local static variables). This mode may generate multiple copies of the instantiated functions and is not suitable for production use. This mode cannot be used in conjunction with prelinker template instantiation. Prelinker instantiation is disabled by this mode.
In the case where the CC(1) command is given a single source file to compile and link, all instantiations are done in the single source file and, by default, the used mode is used and prelinker instantiation is suppressed.

8.4 One Instantiation Per Object File

You can direct the prelinker to instantiate each template referenced in the source into its own object file. This method is preferred over other template instantiation object file generation options because:

• The user of a library pulls in only the instantiations that are needed.
• Multiple libraries with the same template can link. If each instantiation is not placed in its own object file, linking a library with another library that also contains the same instantiations will generate warnings on some platforms.

Use the -h one_instantiation_per_object option to generate one object file per instantiation. For more information about this option, see Section 2.7.3, page 20.

8.5 Instantiation #pragma Directives

Instantiation #pragma directives can be used in source code to control the instantiation of specific template entities or sets of template entities. There are three instantiation #pragma directives:

• The #pragma _CRI instantiate directive causes a specified entity to be instantiated.
• The #pragma _CRI do_not_instantiate directive suppresses the instantiation of a specified entity. It is typically used to suppress the instantiation of an entity for which a specific definition is supplied.
• The #pragma _CRI can_instantiate directive indicates that a specified entity can be instantiated in the current compilation, but need not be. It is used in conjunction with prelinker instantiation to indicate potential sites for instantiation if the template entity is deemed to be required by the compiler.

The argument to the #pragma _CRI instantiate directive can be any of the following:

• A template class name.
  For example: A<int>
• A template class declaration. For example: class A<int>
• A member function name. For example: A<int>::f
• A static data member name. For example: A<int>::i
• A static data declaration. For example: int A<int>::i
• A member function declaration. For example: void A<int>::f(int, char)
• A template function declaration. For example: char* f(int, float)

A #pragma directive in which the argument is a template class name (for example, A<int> or class A<int>) is equivalent to repeating the directive for each member function and static data member declared in the class. When instantiating an entire class, a given member function or static data member may be excluded using the #pragma _CRI do_not_instantiate directive. For example:

    #pragma _CRI instantiate A<int>
    #pragma _CRI do_not_instantiate A<int>::f

The template definition of a template entity must be present in the compilation for an instantiation to occur. If an instantiation is explicitly requested by use of the #pragma _CRI instantiate directive and no template definition is available or a specific definition is provided, an error is issued.

The following example illustrates the use of the #pragma _CRI instantiate directive:

    template <class T> void f1(T);   // No body provided
    template <class T> void g1(T);   // No body provided
    void f1(int) {}                  // Specific definition

    int main()
    {
        int i;
        double d;
        f1(i);
        f1(d);
        g1(i);
        g1(d);
    }

    #pragma _CRI instantiate void f1(int)   // error-specific definition
    #pragma _CRI instantiate void g1(int)   // error-no body provided

In the preceding example, f1(double) and g1(double) are not instantiated because no bodies are supplied, but no errors will be produced during the compilation. If no bodies are supplied at link time, a linker error is issued.

A member function name (such as A<int>::f) can be used as a #pragma directive argument only if it refers to a single, user-defined member function (that is, not an overloaded function).
Compiler-generated functions are not considered, so a name can refer to a user-defined constructor even if a compiler-generated copy constructor of the same name exists. Overloaded member functions can be instantiated by providing the complete member function declaration, as in the following example:

    #pragma _CRI instantiate char* A<int>::f(int, char*)

The argument to an instantiation directive cannot be a compiler-generated function, an inline function, or a pure virtual function.

8.6 Implicit Inclusion

The implicit inclusion feature implies that if the compiler needs a definition to instantiate a template entity declared in a .h file, it can implicitly include the corresponding .C file to get the source code for the definition. For example, if a template entity ABC::f is declared in file xyz.h, and an instantiation of ABC::f is required in a compilation, but no definition of ABC::f appears in the source code processed by the compilation, the compiler will search for the xyz.C file and, if it exists, process it as if it were included at the end of the main source file.

To find the template definition file for a given template entity, the Cray C++ compiler must know the full path name to the file in which the template was declared and whether the file was included using the system include syntax (such as #include <file.h>). This information is not available for preprocessed source code containing #line directives. Consequently, the Cray C++ compiler does not attempt implicit inclusion for source code that contains #line directives.

The set of definition-file suffixes that are tried by default is .c, .C, .cpp, .CPP, .cxx, .CXX, and .cc.

Implicit inclusion works well with prelinker instantiation; however, they are independent. They can be enabled or disabled independently, and implicit inclusion is still useful without prelinker instantiation.
Cray C Extensions [9]

The Cray C compiler supports the following Cray extensions to the C standard:

• Complex data extensions (Section 9.1, page 153)
• fortran keyword (Section 9.2, page 154)
• Hexadecimal floating-point constants (Section 9.3, page 154)

A program that uses one or more extensions does not strictly conform to the standard. These extensions are not available in strict conformance mode.

9.1 Complex Data Extensions

Cray C extends the complex data facilities defined by standard C with these extensions:

• Imaginary constants
• Incrementing or decrementing _Complex data

The Cray C compiler supports the Cray imaginary constant extension, which is defined in the <complex.h> header file. This imaginary constant has the following form:

    Ri

R is either a floating constant or an integer constant; no space or other character can appear between R and i. If you are compiling in strict conformance mode (-h conform), the Cray imaginary constants are not available.

The following example illustrates imaginary constants:

    #include <complex.h>
    double complex z1 = 1.2 + 3.4i;
    double complex z2 = 5i;

The other extension to the complex data facility allows the prefix and postfix increment and decrement operators to be applied to the _Complex data type. The operations affect only the real portion of a complex number.

9.2 fortran Keyword

In extended mode, the identifier fortran is treated as a keyword. It specifies a storage class that can be used to declare a Fortran-coded external function. The use of the fortran keyword when declaring a function causes the compiler to verify that the arguments used in each call to the function are passed by address; any arguments that are not addresses are converted to addresses.

As in any function declaration, an optional type-specifier declares the type returned, if any. Type int is the default; type void can be used if no value is returned (by a Fortran subroutine).
The fortran storage class causes conversion of lowercase function names to uppercase, and, if the function name ends with an underscore character, the trailing underscore character is stripped from the function name. (Stripping the trailing underscore character is in keeping with UNIX practice.)

Functions specified with a fortran storage class must not be declared elsewhere in the file with a static storage class.

Note: The fortran keyword is not allowed in Cray C++.

An example using the fortran keyword is shown in Section 13.3.7, page 179.

9.3 Hexadecimal Floating-point Constants

The Cray C compiler supports the standard hexadecimal floating constant notations and the Cray hexadecimal floating constant notation. The standard hexadecimal floating constants are portable and have sizes that are dependent upon the hardware. The remainder of this section discusses the Cray hexadecimal floating constant.

The Cray hexadecimal floating constant feature is not portable, because identical hexadecimal floating constants can have different meanings on different systems. It can be used whenever traditional floating-point constants are allowed.

The hexadecimal constant has the usual syntax: 0x (or 0X) followed by hexadecimal characters. The optional floating suffix has the same form as for normal floating constants: f or F (for float), l or L (for long double), optionally followed by an i (imaginary).

The constant must represent the same number of bits as its type, which is determined by the suffix (or the default of double). The constant's bit length is four times the number of hexadecimal digits, including leading zeros.

The following example illustrates hexadecimal constant representation:

    0x7f7fffff.f            32-bit float
    0x0123456789012345.     64-bit double

The value of a hexadecimal floating constant is interpreted as a value in the specified floating type.
This uses an unsigned integral type of the same size as the floating type, regardless of whether an object can be explicitly declared with such a type. No conversion or range checking is performed. The resulting floating value is defined in the same way as the result of accessing a member of floating type in a union after a value has been stored in a different member of integral type.

The following example illustrates a hexadecimal floating-point constant representation that uses the Cray floating-point format:

    #include <stdio.h>

    int main(void)
    {
        float f1, f2;
        double g1, g2;

        f1 = 0x3ec00000.f;
        f2 = 0x3fc00000.f;
        g1 = 0x40fa400100000000.;
        g2 = 0x40fa400200000000.;

        printf("f1 = %8.8g\n", f1);
        printf("f2 = %8.8g\n", f2);
        printf("g1 = %16.16g\n", g1);
        printf("g2 = %16.16g\n", g2);
        return 1;
    }

This is the output for the previous example:

    f1 = 0.375
    f2 = 1.5
    g1 = 107520.0625
    g2 = 107520.125

Predefined Macros [10]

Predefined macros can be divided into the following categories:

• Macros required by the C and C++ standards (Section 10.1, page 158)
• Macros based on the host machine (Section 10.2, page 159)
• Macros based on the target machine (Section 10.3, page 160)
• Macros based on the compiler (Section 10.4, page 161)
• UPC macros (Section 10.5, page 162)

Predefined macros provide information about the compilation environment. In this chapter, only those macros that begin with the underscore (_) character are defined when running in strict-conformance mode.

Note: Any of the predefined macros except those required by the standard (see Section 10.1, page 158) can be undefined by using the -U command line option; they can also be redefined by using the -D command line option.

A large set of macros is also defined in the standard header files.

10.1 Macros Required by the C and C++ Standards

The following macros are required by the C and C++ standards:

Macro          Description
__TIME__       Time of translation of the source file.
__DATE__       Date of translation of the source file.
__LINE__       Line number of the current line in your source file.
__FILE__       Name of the source file being compiled.
__STDC__       Defined as the decimal constant 1 if compilation is in strict conformance mode; defined as the decimal constant 2 if the compilation is in extended mode. This macro is defined for Cray C and C++ compilations.
__cplusplus    Defined as 1 when compiling Cray C++ code and undefined when compiling Cray C code. The __cplusplus macro is required by the ISO C++ standard, but not the ISO C standard.

10.2 Macros Based on the Host Machine

The following macros provide information about the environment running on the host machine:

Macro            Description
__unix           Defined as 1 if the machine uses the UNIX OS.
unix             Defined as 1 if the machine uses the UNIX OS. This macro is not defined in strict-conformance mode.
_UNICOSMP        Defined as 1 if the operating system is UNICOS/mp. This macro is not defined in strict-conformance mode.
__linux          Defined as 1 on Cray X2 systems.
__linux__        Defined as 1 on Cray X2 systems.
linux            Defined as 1 on Cray X2 systems. This macro is not defined in strict-conformance mode.
__gnu_linux__    Defined as 1 on Cray X2 systems.

10.3 Macros Based on the Target Machine

The following macros provide information about the characteristics of the target machine:

Macro      Description
_ADDR64    Defined as 1 if the targeted CPU has 64-bit address registers; if the targeted CPU does not have 64-bit address registers, the macro is not defined.
__sv       Defined as 1 on all Cray X1 series systems.
__sv2      Defined as 1 and indicates that the current system is a Cray X1 series system.
_CRAY      Defined as 1 on Cray X1 series systems.
           Not defined on Cray X2 systems.
_CRAYIEEE            Defined as 1 if the targeted CPU type uses IEEE floating-point format.
_CRAYSV2             Defined as 1 and indicates that the current system is a Cray X1 series system.
__crayx1             Defined as 1 and indicates that the current system is a Cray X1 series system.
__LITTLE_ENDIAN__    Defined as 1. Cray X2 systems use little endian byte ordering.
__LITTLE_ENDIAN      Defined as 1. Cray X2 systems use little endian byte ordering.
_MAXVL               Defined as the maximum hardware vector length, which is 64 on Cray X1 series systems and 128 on Cray X2 systems.
cray                 Defined as 1 on Cray X1 series systems. This macro is not defined in strict-conformance mode. Not defined on Cray X2 systems.
__crayx2             Defined as 1 on Cray X2 systems.
__craynv             Defined as 1 on Cray X1 series and Cray X2 systems.
CRAY                 Defined as 1 on Cray X1 series systems. This macro is not defined in strict-conformance mode. Not defined on Cray X2 systems.

10.4 Macros Based on the Compiler

The following macros provide information about compiler features:

Macro              Description
_RELEASE           Defined as the major release level of the compiler.
_RELEASE_MINOR     Defined as the minor release level of the compiler.
_RELEASE_STRING    Defined as a string that describes the version of the compiler.
_CRAYC             Defined as 1 to identify the Cray C and C++ compilers on the Cray X1 series and Cray X2 systems.

10.5 UPC Predefined Macros

The following macros provide information about UPC functions:

Macro                      Description
__UPC__                    The integer constant 1, indicating a conforming implementation.
__UPC_DYNAMIC_THREADS__    The integer constant 1 in the dynamic THREADS translation environment.
__UPC_STATIC_THREADS__     The integer constant 1 in the static THREADS translation environment.
Running C and C++ Applications [11]

Cray X1 series and Cray X2 systems provide the following options for launching applications:

• Launching a single non-MPI application
• Launching a single MPI application
• Launching multiple interrelated applications

11.1 Launching a Single Non-MPI Application

Cray X1 series systems provide two methods of launching single, non-MPI applications. You can use the aprun command or the auto aprun method. The auto aprun method is not supported on Cray X2 systems; use the aprun command to launch applications.

To launch an application via aprun, you enter the name of the executable and any other desired command line options. For more information, see the aprun(1) man page.

For example, if you want to compile and run programs prog1, prog2, and prog3 as application trio, you would enter the following command sequence:

    % CC -c prog1.C prog2.C prog3.C
    % CC -o trio prog1.o prog2.o prog3.o
    % aprun ./trio

On a Cray X1 series system, you could use the auto aprun feature to perform the same functions:

    % CC -c prog1.C prog2.C prog3.C
    % CC -o trio prog1.o prog2.o prog3.o
    % ./trio

The CRAY_AUTO_APRUN_OPTIONS environment variable for Cray X1 series systems specifies options for the aprun command when the command is called automatically. For more information, see Section 2.25, page 67.

11.2 Launching a Single MPI Application

The process for launching a single MPI application is the same as for non-MPI applications except that you use the mpirun command instead of aprun. The aprun(1) man page also describes mpirun options.
For example, if you want to compile and run programs mpiprog1, mpiprog2, and mpiprog3 as application mpitrio, you would enter the following command sequence:

    % CC -c mpiprog1.C mpiprog2.C mpiprog3.C
    % CC -o mpitrio mpiprog1.o mpiprog2.o mpiprog3.o
    % mpirun ./mpitrio

11.3 Multiple Program, Multiple Data (MPMD) Launch

You can launch multiple interrelated applications with a single aprun or mpirun command. The applications must have the following characteristics:

• The applications can use MPI, SHMEM, or CAF to perform application-to-application communications. Using UPC for application-to-application communication is not supported.
• Within each application, the supported programming models are MPI, SHMEM, CAF, pthreads, and OpenMP.
• All applications must be of the same mode; that is, they must all be MSP-mode applications or all SSP-mode applications.
• If one or more of the applications in an MPMD job use a shared memory model (OpenMP or pthreads) and need a depth greater than the default of 1, then all of the applications will have the depth specified by the aprun or mpirun -d option, whether they need it or not.

To launch multiple applications with one command, you use aprun or mpirun. For example, suppose you have created three MPI applications which contain CAF statements:

    % CC -o multiabc a.o b.o c.o
    % CC -o multijkl j.o k.o l.o
    % CC -o multixyz x.o y.o z.o

and the number of processing elements required are 128 for multiabc, 16 for multijkl, and 4 for multixyz.

To launch all three applications simultaneously, you would enter:

    % mpirun -np 128 multiabc : -np 16 multijkl : -np 4 multixyz

Debugging Cray C and C++ Code [12]

The TotalView symbolic debugger is available to help you debug C and C++ codes (see Etnus TotalView Users Guide).
In addition, the Cray C and C++ compilers provide the following features to help you in debugging codes:

• The -G and -g compiler options provide symbol information about your source code for use by the TotalView debugger. For more information about these compiler options, see Section 2.17.1, page 45.
• The -h [no]bounds option and the #pragma _CRI [no]bounds directive let you check pointer and array references. The -h [no]bounds option is described in Section 2.17.2, page 46. The #pragma _CRI [no]bounds directive is described in Section 3.5.1, page 78.
• The #pragma _CRI message directive lets you add warning messages to sections of code where you suspect problems. The #pragma _CRI message directive is described in Section 3.5.3, page 82.
• The #pragma _CRI [no]opt directive lets you selectively isolate portions of your code to optimize, or to toggle optimization on and off in selected portions of your code. The #pragma _CRI [no]opt directive is described in Section 3.5.7, page 84.

12.1 TotalView Debugger

Some of the functions available in the TotalView debugger allow you to perform the following actions:

• Set and clear breakpoints, which can be conditional, at both the source code level and the assembly code level
• Examine core files
• Step through a program, including across function calls
• Reattach to the executable file after editing and recompiling
• Edit values of variables and memory locations
• Evaluate code fragments

12.2 Compiler Debugging Options

To use the TotalView debugger in debugging your code, you must first compile your code using one of the debugging options (-g or -G). These options are specified as follows:

• -Gf If you specify the -Gf debugging option, the TotalView debugger allows you to set breakpoints at function entry and exit and at labels.
• -Gp If you specify the -Gp debugging option, the TotalView debugger allows you to set breakpoints at function entry and exit, labels, and at places where execution control flow changes (for example, loops, switch, and if...else statements).
• -Gn or -g If you specify the -Gn or -g debugging option, the TotalView debugger allows you to set breakpoints at function entry and exit, labels, and executable statements. These options force all compiler optimizations to be disabled as if you had specified -O0.

Users of the Cray C and C++ compilers do not have to sacrifice run time performance to debug codes. Many compiler optimizations are inhibited by breakpoints generated for debugging. By specifying a higher debugging level, fewer breakpoints are generated and better optimization occurs.

However, consider the following cases in which optimization is affected by the -Gp and -Gf debugging options:

• Vectorization can be inhibited if a label exists within the vectorizable loop.
• Vectorization can be inhibited if the loop contains a nested block and the -Gp option is specified.
• When the -Gp option is specified, setting a breakpoint at the first statement in a vectorized loop allows you to stop and display at each vector iteration. However, setting a breakpoint at the first statement in an unrolled loop may not allow you to stop at each vector iteration.

Interlanguage Communication [13]

In some situations, it is necessary or advantageous to make calls to assembly or Fortran functions from C or C++ programs. This chapter describes how to make such calls. It also discusses calls to C and C++ functions from Fortran and assembly language. For additional information about interlanguage communication, see Interlanguage Programming Conventions. The calling sequence is described in detail on the callseq(3) man page.

The C and C++ compilers provide a mechanism for declaring external functions that are written in other languages.
This allows you to write portions of an application in C, C++, Fortran, or assembly language. This can be useful in cases where the other languages provide performance advantages or utilities that are not available in C or C++.

This chapter describes how to call assembly language and Fortran programs from a C or C++ program. It also discusses the issues related to calling C or C++ programs from other languages.

13.1 Calls between C and C++ Functions

The following requirements must be considered when making calls between functions written in C and C++:

• In Cray C++, the extern "C" linkage is required when declaring an external function that is written in Cray C or when declaring a Cray C++ function that is to be called from Cray C. Normally the compiler will mangle function names to encode information about the function's prototype in the external name. This prevents direct access to these function names from a C function. The extern "C" linkage specification prevents the compiler from performing name mangling.
• The program must be linked using the CC command.
• The program's main routine must be C or C++ code compiled with the CC command.

Objects can be shared between C and C++. There are some Cray C++ objects that are not accessible to Cray C functions (such as classes). The following object types can be shared directly:

• Integral and floating types.
• Structures and unions that are declared identically in C and C++. In order for structures and unions to be shared, they must be declared with identical members in the identical order.
• Arrays and pointers to the above types.

In the following example, a Cray C function (C_add_func) is called by the Cray C++ main program:

    #include <iostream>
    using namespace std;

    extern "C" int C_add_func(int, int);
    int global_int = 123;

    int main()
    {
        int res, i;

        cout << "Start C++ main" << endl;

        /* Call C function to add two integers and return result. */
        cout << "Call C C_add_func" << endl;
        res = C_add_func(10, 20);
        cout << "Result of C_add_func = " << res << endl;
        cout << "End C++ main" << endl;
    }

The Cray C function (C_add_func) is as follows:

    #include <stdio.h>

    extern int global_int;

    int C_add_func(int p1, int p2)
    {
        printf("\tStart C function C_add_func.\n");
        printf("\t\tp1 = %d\n", p1);
        printf("\t\tp2 = %d\n", p2);
        printf("\t\tglobal_int = %d\n", global_int);
        return p1 + p2;
    }

The output from the execution of the calling sequence illustrated in the preceding example is as follows:

    Start C++ main
    Call C C_add_func
            Start C function C_add_func.
                    p1 = 10
                    p2 = 20
                    global_int = 123
    Result of C_add_func = 30
    End C++ main

13.2 Calling Assembly Language Functions from a C or C++ Function

You can sometimes avoid bottlenecks in programs by rewriting parts of the program in assembly language, maximizing performance by selecting instructions to reduce machine cycles. When writing assembly language functions that will be called by C or C++ functions, use the standard UNICOS/mp or UNICOS/lc program linkage macros. When using these macros, you do not need to know the specific registers used by the C or C++ program or by the calling sequence of the assembly coded routine.

In Cray C++, use extern "C" to declare the assembly language function.

The ALLOC, DEFA, DEFS, ENTER, EXIT, and MXCALLEN macros can be used to define the calling list, A and S register use, temporary storage, and entry and exit points.

13.3 Calling Fortran Functions and Subroutines from a C or C++ Function

This subsection describes the following aspects of calling Fortran from C or C++: requirements and guidelines, argument passing, array storage, logical and character data, accessing named common, and accessing blank common.
13.3.1 Requirements

Keep the following points in mind when calling Fortran functions from C or C++:

• Fortran uses the call-by-address convention. C and C++ use the call-by-value convention, which means that only pointers should be passed to Fortran subprograms. For more information, see Section 13.3.2, page 173.
• Fortran arrays are in column-major order. C and C++ arrays are in row-major order. The ordering determines which dimension the first value in an array element subscript refers to. For more information, see Section 13.3.3, page 173.
• Single-dimension arrays of signed 32-bit integers and single-dimension arrays of 32-bit floating-point numbers are the only aggregates that can be passed as parameters without changing the arrays.
• Fortran character pointers and character pointers from Cray C and C++ are incompatible. For more information, see Section 13.3.4, page 174.
• Fortran logical values and the Boolean values from C and C++ are not fully compatible. For more information, see Section 13.3.4, page 174.
• External C and C++ variables are stored in common blocks of the same name, making them readily accessible from Fortran programs if the C or C++ variable is in uppercase.
• When declaring Fortran functions or objects in C or C++, the name must be specified in all uppercase letters, digits, or underscore characters and consist of 31 or fewer characters.
• In Cray C, Fortran functions can be declared using the fortran keyword (see Section 9.2, page 154). The fortran keyword is not available in Cray C++. Instead, Fortran functions must be declared by specifying extern "C".

13.3.2 Argument Passing

Because Fortran subroutines expect arguments to be passed by pointers rather than by value, C and C++ functions that call Fortran subroutines must pass pointers rather than values.

All argument passing in Cray C is strictly by value. To prepare for a function call between two Cray C functions, a copy is made of each actual argument. A function can change the values of its formal parameters, but these changes cannot affect the values of the actual arguments. It is possible, however, to pass a pointer. (All array arguments are passed by this method.) This capability is analogous to the Fortran method of passing arguments.

In addition to passing by value, Cray C++ also provides passing by reference.

13.3.3 Array Storage

C and C++ arrays are stored in memory in row-major order. Fortran arrays are stored in memory in column-major order. For example, the C or C++ array declaration int A[3][2] is stored in memory as:

    A[0][0]  A[0][1]
    A[1][0]  A[1][1]
    A[2][0]  A[2][1]

The previously defined array is viewed linearly in memory as:

    A[0][0]  A[0][1]  A[1][0]  A[1][1]  A[2][0]  A[2][1]

The Fortran array declaration INTEGER A(3,2) is stored in memory as:

    A(1,1)  A(2,1)  A(3,1)
    A(1,2)  A(2,2)  A(3,2)

The previously defined array is viewed linearly in memory as:

    A(1,1)  A(2,1)  A(3,1)  A(1,2)  A(2,2)  A(3,2)

When an array is shared between Cray C, C++, and Fortran, its dimensions are declared and referenced in C and C++ in the opposite order from that in which they are declared and referenced in Fortran. Arrays are zero-based in C and C++ and are one-based in Fortran, so in C and C++ you should subtract 1 from the array subscripts that you would normally use in Fortran.

For example, using the Fortran declaration of array A in the preceding example, the equivalent declaration in C or C++ is:

    int a[2][3];

The following list shows how to access elements of the array from Fortran and from C or C++:

    Fortran    C or C++
    A(1,1)     a[0][0]
    A(2,1)     a[0][1]
    A(3,1)     a[0][2]
    A(1,2)     a[1][0]
    A(2,2)     a[1][1]
    A(3,2)     a[1][2]

13.3.4 Logical and Character Data

Logical and character data need special treatment for calls between C or C++ and Fortran.
Fortran has a character descriptor that is incompatible with a character pointer in C and C++. The techniques used to represent logical (Boolean) values also differ between Cray C, C++, and Fortran.

Mechanisms you can use to convert one type to the other are provided by the fortran.h header file and the conversion macros shown in the following list:

    Macro    Description
    _btol    Conversion utility that converts a 0 to a Fortran logical .FALSE. and a nonzero value to a Fortran logical .TRUE.
    _ltob    Conversion utility that converts a Fortran logical .FALSE. to a 0 and a Fortran logical .TRUE. to a 1.

13.3.5 Accessing Named Common from C and C++

The following example demonstrates how external C and C++ variables are accessible in Fortran named common blocks. It shows a C or C++ function calling a Fortran subprogram, the associated Fortran subprogram, and the associated input and output.

In this example, the C or C++ structure ST is accessed in the Fortran subprogram as common block ST. The name of the structure and the Fortran common block must match. Note that this requires that the structure name be uppercase. The C and C++ structure member names and the Fortran common block member names do not have to match, as is shown in this example.
The following Cray C main program calls the Fortran subprogram FCTN:

    #include <stdio.h>

    struct
    {
        int i;
        double a[10];
        long double d;
    } ST;

    main()
    {
        int i;

        /* initialize struct ST */
        ST.i = 12345;
        for (i = 0; i < 10; i++)
            ST.a[i] = i;
        ST.d = 1234567890.1234567890L;

        /* print out the members of struct ST */
        printf("In C: ST.i = %d, ST.d = %20.10Lf\n", ST.i, ST.d);
        printf("In C: ST.a = ");
        for (i = 0; i < 10; i++)
            printf("%4.1f", ST.a[i]);
        printf("\n\n");

        /* call the fortran function */
        FCTN();
    }

The following example is the Fortran subprogram FCTN called by the previous Cray C main program:

    C *********** Fortran subprogram (f.f): ***********
          SUBROUTINE FCTN
          COMMON /ST/STI, STA(10), STD
          INTEGER STI
          REAL STA
          DOUBLE PRECISION STD
          INTEGER I

          WRITE(6,100) STI, STD
    100   FORMAT ('IN FORTRAN: STI = ', I5, ', STD = ', D25.20)
          WRITE(6,200) (STA(I), I = 1,10)
    200   FORMAT ('IN FORTRAN: STA =', 10F4.1)
          END

The previous Cray C and Fortran examples are executed by the following commands, and they produce the output shown:

    % cc -c c.c
    % ftn -c f.f
    % ftn c.o f.o
    % ./a.out
    In C: ST.i = 12345, ST.d = 1234567890.1234567890
    In C: ST.a = 0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0

    IN FORTRAN: STI = 12345, STD = .12345678901234567889D+10
    IN FORTRAN: STA = 0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0

13.3.6 Accessing Blank Common from C or C++

Fortran includes the concept of a common block. A common block is an area of memory that can be referenced by any program unit in a program. A named common block is declared with a name, followed by the names of the variables or arrays stored in the block. A blank common block, sometimes referred to as blank common, is declared in the same way, but without a name.

There is no way to access blank common from C or C++ similar to accessing a named common block.
However, you can write a simple Fortran function to return the address of the first word in blank common to the C or C++ program and then use that as a pointer value to access blank common.

The following example shows how Fortran blank common can be accessed using C or C++ source code:

    #include <stdio.h>

    struct st
    {
        float a;
        float b[10];
    } *ST;

    #ifdef __cplusplus
    extern "C" struct st *MYCOMMON(void);
    extern "C" void FCTN(void);
    #else
    fortran struct st *MYCOMMON(void);
    fortran void FCTN(void);
    #endif

    main()
    {
        int i;

        ST = MYCOMMON();
        ST->a = 1.0;
        for (i = 0; i < 10; i++)
            ST->b[i] = i+2;

        printf("\n In C and C++\n");
        printf(" a = %5.1f\n", ST->a);
        printf(" b = ");
        for (i = 0; i < 10; i++)
            printf("%5.1f ", ST->b[i]);
        printf("\n\n");

        FCTN();
    }

This Fortran source code accesses blank common and is accessed from the C or C++ source code in the preceding example:

          SUBROUTINE FCTN
          COMMON // STA,STB(10)
          PRINT *, "IN FORTRAN"
          PRINT *, " STA = ",STA
          PRINT *, " STB = ",STB
          STOP
          END

          FUNCTION MYCOMMON()
          COMMON // A
          MYCOMMON = LOC(A)
          RETURN
          END

This is the output of the previous C or C++ source code:

    a = 1.0
    b = 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0

This is the output of the previous Fortran source code:

    STA = 1.
    STB = 2., 3., 4., 5., 6., 7., 8., 9., 10., 11.

13.3.7 Cray C and Fortran Example

Here is an example of a Cray C function that calls a Fortran subprogram. The Fortran subprogram example follows the Cray C function example, and the input and output from this sequence follow the Fortran subprogram example.

Note: This example assumes that the Cray Fortran function is compiled with the -s default32 option enabled. The examples will not work if the -s default64 option is enabled.

    /* C program (main.c): */
    #include <stdio.h>
    #include <string.h>
    #include <fortran.h>

    /* Declare prototype of the Fortran function. Note the last */
    /* argument passes the length of the first argument.        */
    fortran double FTNFCTN (char *, int *, int);

    double FLOAT1 = 1.6;
    double FLOAT2;          /* Initialized in FTNFCTN */

    main()
    {
        int clogical, ftnlogical, cstringlen;
        double rtnval;
        char *cstring = "C Character String";

        /* Convert clogical to its Fortran equivalent */
        clogical = 1;
        ftnlogical = _btol(clogical);

        /* Print values of variables before call to Fortran function */
        printf(" In main: FLOAT1 = %g; FLOAT2 = %g\n", FLOAT1, FLOAT2);
        printf(" Calling FTNFCTN with arguments:\n");
        printf(" string = \"%s\"; logical = %d\n\n", cstring, clogical);

        cstringlen = strlen(cstring);
        rtnval = FTNFCTN(cstring, &ftnlogical, cstringlen);

        /* Convert ftnlogical to its C equivalent */
        clogical = _ltob(&ftnlogical);

        /* Print values of variables after call to Fortran function */
        printf(" Back in main: FTNFCTN returned %g\n", rtnval);
        printf(" and changed the two arguments:\n");
        printf(" string = \"%.*s\"; logical = %d\n",
               cstringlen, cstring, clogical);
    }

    C Fortran subprogram (ftnfctn.f):
          FUNCTION FTNFCTN(STR, LOG)
          REAL FTNFCTN
          CHARACTER*(*) STR
          LOGICAL LOG
          COMMON /FLOAT1/FLOAT1
          COMMON /FLOAT2/FLOAT2
          REAL FLOAT1, FLOAT2
          DATA FLOAT2/2.4/      ! FLOAT1 INITIALIZED IN MAIN

    C PRINT CURRENT STATE OF VARIABLES
          PRINT*, ' IN FTNFCTN: FLOAT1 = ', FLOAT1,
         1        ';FLOAT2 = ', FLOAT2
          PRINT*, ' ARGUMENTS: STR = "', STR, '"; LOG = ', LOG

    C CHANGE THE VALUES FOR STR(ING) AND LOG(ICAL)
          STR = 'New Fortran String'
          LOG = .FALSE.

          FTNFCTN = 123.4
          PRINT*, ' RETURNING FROM FTNFCTN WITH ', FTNFCTN
          PRINT*
          RETURN
          END

The previous Cray C function and Fortran subprogram are executed by the following commands and produce the following output:

    % cc -c main.c
    % ftn -c ftnfctn.f
    % ftn main.o ftnfctn.o
    % ./a.out
     In main: FLOAT1 = 1.6; FLOAT2 = 2.4
     Calling FTNFCTN with arguments:
     string = "C Character String"; logical = 1

     IN FTNFCTN: FLOAT1 = 1.6; FLOAT2 = 2.4
     ARGUMENTS: STR = "C Character String"; LOG = T
     RETURNING FROM FTNFCTN WITH 123.4

     Back in main: FTNFCTN returned 123.4
     and changed the two arguments:
     string = "New Fortran String"; logical = 0

13.3.8 Calling a Fortran Program from a Cray C++ Program

The following example illustrates how a Fortran program can be called from a Cray C++ program:

    #include <iostream.h>

    extern "C" int FORTRAN_ADD_INTS(int *arg1, int &arg2);

    main()
    {
        int num1, num2, res;

        cout << "Start C++ main" << endl << endl;

        //Call FORTRAN function to add two integers and return result.
        //Note that the second argument is a reference parameter so
        //it is not necessary to take the address of the
        //variable num2.
        num1 = 10;
        num2 = 20;
        cout << "Before Call to FORTRAN_ADD_INTS" << endl;
        res = FORTRAN_ADD_INTS(&num1, num2);
        cout << "Result of FORTRAN Add = " << res << endl << endl;
        cout << "End C++ main" << endl;
    }

The Fortran program that is called from the Cray C++ main function in the preceding example is as follows:

          INTEGER FUNCTION FORTRAN_ADD_INTS(Arg1, Arg2)
          INTEGER Arg1, Arg2
          PRINT *," FORTRAN_ADD_INTS, Arg1,Arg2 = ", Arg1, Arg2
          FORTRAN_ADD_INTS = Arg1 + Arg2
          END

The output from the execution of the preceding example is as follows:

    Start C++ main

    Before Call to FORTRAN_ADD_INTS
     FORTRAN_ADD_INTS, Arg1,Arg2 = 10, 20
    Result of FORTRAN Add = 30

    End C++ main

13.4 Calling a C or C++ Function from a Fortran Program

A C or C++ function can be called from a Fortran program.
One of two methods can be used to call C functions from Fortran: the C interoperability feature provided by the Fortran 2000 facility or the method documented in this section. C interoperability provides a standard portable interoperability mechanism for Fortran and C programs. For more information about C interoperability, see the Cray Fortran Reference Manual. If you are using the method documented in this section to call C functions from Fortran, keep in mind the information in Section 13.3, page 172.

When calling a Cray C++ function from a Fortran program, observe the following rules:

• The Cray C++ function must be declared with extern "C" linkage.
• The program must be linked with the CC(1) command.
• The program's main routine must be C or C++ code compiled with the CC command.

The example that follows illustrates a Fortran program, main.f, that calls a Cray C function, cfctn.c. The Cray C function being called, the commands required, and the associated input and output are also included.

Note: This example assumes that the Cray Fortran program is compiled with the -s default32 option enabled. The examples will not work if the -s default64 option is enabled.

Example 17: Calling a C Function from a Fortran Program

Fortran program main.f source code:

    C Fortran program (main.f):
          PROGRAM MAIN
          REAL CFCTN
          COMMON /FLOAT1/FLOAT1
          COMMON /FLOAT2/FLOAT2
          REAL FLOAT1, FLOAT2
          DATA FLOAT1/1.6/      ! FLOAT2 INITIALIZED IN cfctn.c
          LOGICAL LOG
          CHARACTER*24 STR
          REAL RTNVAL

    C INITIALIZE VARIABLES STR(ING) AND LOG(ICAL)
          STR = 'Fortran Character String'
          LOG = .TRUE.

    C PRINT VALUES OF VARIABLES BEFORE CALL TO C FUNCTION
          PRINT*, 'In main.f: FLOAT1 = ', FLOAT1,
         1        '; FLOAT2 = ', FLOAT2
          PRINT*, 'Calling cfctn.c with these arguments: '
          PRINT*, 'LOG = ', LOG
          PRINT*, 'STR = ', STR

          RTNVAL = CFCTN(STR, LOG)

    C PRINT VALUES OF VARIABLES AFTER CALL TO C FUNCTION
          PRINT*, 'Back in main.f:: cfctn.c returned ', RTNVAL
          PRINT*, 'and changed the two arguments to: '
          PRINT*, 'LOG = ', LOG
          PRINT*, 'STR = ', STR
          END PROGRAM

Compile main.f, creating main.o:

    % ftn -c main.f

C function cfctn.c source code:

    /* C function (cfctn.c) */
    #include <fortran.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    float FLOAT1;           /* Initialized in MAIN */
    float FLOAT2 = 2.4;

    /* The slen argument passes the length of string in str */
    float cfctn_(char * str, int *log, int slen)
    {
        int clog;
        float rtnval;
        char *cstring;

        /* Copy the Fortran string into a null-terminated C string */
        cstring = malloc(slen+1);
        strncpy(cstring, str, slen);
        cstring[slen] = '\0';

        /* Convert log passed from Fortran MAIN */
        /* into its C equivalent */
        clog = _ltob(log);

        /* Print the current state of the variables */
        printf(" In CFCTN: FLOAT1 = %.1f; FLOAT2 = %.1f\n",
               FLOAT1, FLOAT2);
        printf(" Arguments: str = '%s'; log = %d\n",
               cstring, clog);

        /* Change the values for str and log; the literal is */
        /* blank-padded to the 24-character Fortran length   */
        strncpy(str, "C Character String      ", 24);
        *log = 0;

        rtnval = 123.4;
        printf(" Returning from CFCTN with %.1f\n\n", rtnval);
        return(rtnval);
    }

Compile cfctn.c, creating cfctn.o:

    % cc -c cfctn.c

Link main.o and cfctn.o, creating executable interlang1:

    % ftn -o interlang1 main.o cfctn.o

Run program interlang1:

    % ./interlang1

Program output:

     In main.f: FLOAT1 = 1.60000002 ; FLOAT2 = 2.4000001
     Calling cfctn.c with these arguments:
     LOG = T
     STR = Fortran Character String
     In CFCTN: FLOAT1 = 1.6; FLOAT2 = 2.4
     Arguments: str = 'Fortran Character String'; log = 1
     Returning from CFCTN with 123.4

     Back in main.f:: cfctn.c returned 123.400002
     and changed the two arguments to:
     LOG = F
     STR = C Character String

Implementation-defined Behavior [14]

This chapter describes compiler behavior that is defined by the implementation according to the C and/or C++ standards. The standards require that the behavior of each particular implementation be documented.

The C and C++ standards define implementation-defined behavior as behavior, for a correct program construct and correct data, that depends on the characteristics of the implementation. The behavior of the Cray C and C++ compilers for these cases is summarized in this chapter.

14.1 Messages

All diagnostic messages issued by the compilers are reported through the UNICOS/mp or UNICOS/lc message system. For information about messages issued by the compilers and for information about the UNICOS/mp or UNICOS/lc message system, see Appendix E, page 223.

14.2 Environment

When argc and argv are used as parameters to the main function, the array members argv[0] through argv[argc-1] contain pointers to strings that are set by the command shell. The shell sets these arguments to the list of words on the command line used to invoke the program (the argument list). For further information about how the words in the argument list are formed, refer to the documentation on the shell in which you are running. For information about UNICOS/mp or UNICOS/lc shells, see the sh(1) or csh(1) man page.

A third parameter, char **envp, provides access to environment variables. The value of the parameter is a pointer to the first element of an array of null-terminated strings that matches the output of the env(1) command. The array of pointers is terminated by a null pointer.

The compiler does not distinguish between interactive devices and other, noninteractive devices. The library, however, may determine that stdin, stdout, and stderr (cin, cout, and cerr in Cray C++) refer to interactive devices and buffer them accordingly.
14.2.1 Identifiers

The identifier (as defined by the standards) is merely a sequence of letters and digits. Specific uses of identifiers are called names.

The Cray C compiler treats the first 255 characters of a name as significant, regardless of whether it is an internal or external name. The case of names, including external names, is significant. In Cray C++, all characters of a name are significant.

14.2.2 Types

Table 13 summarizes Cray C and C++ types and the characteristics of each type. Representation is the number of bits used to represent an object of that type. Memory is the number of storage bits that an object of that type occupies.

In the Cray C and C++ compilers, size, in the context of the sizeof operator, refers to the size allocated to store the operand in memory; it does not refer to representation, as specified in Table 13. Thus, the sizeof operator will return a size that is equal to the value in the Memory column of Table 13 divided by 8 (the number of bits in a byte).

Table 13. Data Type Mapping

    Type                   Representation Size and Memory Storage Size (bits)
    bool (C++)             8
    _Bool (C)              8
    char                   8
    wchar_t                32
    short                  16
    int                    32
    long                   64
    long long              64
    float                  32
    double                 64
    long double            128
    float complex          64 (each part is 32 bits)
    double complex         128 (each part is 64 bits)
    long double complex    256 (each part is 128 bits)
    Pointers               64

On Cray X1 series systems, variables with 8-bit char or 16-bit short data types are fully vectorizable when used in one of the following operations within a vector context:

• Reads of 8-bit chars and 16-bit shorts
• Writes to 8-bit chars and 16-bit shorts, except arrays
• Use of 8- and 16-bit variables as targets in a reduction loop.
For example, c is a 16-bit object in this program fragment:

    int i;
    short c;
    int a[100];

    c=0;
    for (i=0;i<100;i++) {
        c = c + a[i];
    }

On Cray X2 systems, variables with 8-bit char and 16-bit short data types are fully vectorizable but are less efficient than 32- or 64-bit data types.

Cray discourages the use of 8-bit chars and 16-bit shorts in contexts other than those listed above because of performance penalties.

14.2.3 Characters

The full 8-bit ASCII code set can be used in source files. Characters not in the character set defined in the standard are permitted only within character constants, string literals, and comments. The -h [no]calchars option allows the use of the @ sign (Cray X1 series only) and the $ sign in identifier names. For more information about the -h [no]calchars option, see Section 2.9.3, page 24.

A character consists of 8 bits. Up to 8 characters can be packed into a 64-bit word. A plain char type (that is, one that is declared without a signed or unsigned keyword) is treated as an unsigned type on Cray X1 series systems and as a signed type on Cray X2 systems.

Character constants and string literals can contain any characters defined in the 8-bit ASCII code set. The characters are represented in their full 8-bit form. A character constant can contain up to 8 characters. The integer value of a character constant is the value of the characters packed into a word from left to right, with the result right-justified, as shown in the following table:

Table 14. Packed Characters

    Character constant    Integer value
    'a'                   0x61
    'ab'                  0x6162

In a character constant or string literal, if an escape sequence is not recognized, the \ character that initiates the escape sequence is ignored, as shown in the following table:

Table 15. Unrecognizable Escape Sequences

    Character constant    Integer value    Explanation
    '\a'                  0x7              Recognized as the ASCII BEL character
    '\8'                  0x38             Not recognized; ASCII value for 8
    '\['                  0x5b             Not recognized; ASCII value for [
    '\c'                  0x63             Not recognized; ASCII value for c

14.2.4 Wide Characters

Wide characters are treated as signed 64-bit integer types. Wide character constants cannot contain more than one multibyte character. Multibyte characters in wide character constants and wide string literals are converted to wide characters in the compiler by calling the mbtowc(3) function. The current locale in effect at the time of compilation determines the method by which mbtowc(3) converts multibyte characters to wide characters, and the shift states required for the encoding of multibyte characters in the source code. If a wide character, as converted from a multibyte character or as specified by an escape sequence, cannot be represented in the extended execution character set, it is truncated.

14.2.5 Integers

All integral values are represented in a two's complement format. For representation and memory storage requirements for integral types, see Table 13, page 188.

When an integer is converted to a shorter signed integer, and the value cannot be represented, the result is the truncated representation treated as a signed quantity. When an unsigned integer is converted to a signed integer of equal length, and the value cannot be represented, the result is the original representation treated as a signed quantity.

The bitwise operators (unary operator ~ and binary operators <<, >>, &, ^, and |) operate on signed integers in the same manner in which they operate on unsigned integers. The result of E1 >> E2, where E1 is a negative-valued signed integral value, is E1 right-shifted E2 bit positions; vacated bits are filled with 1s. This behavior can be modified by using the -h nosignedshifts option (see Section 2.9.4, page 25).
Bits higher than the sixth bit are not ignored. Values higher than 31 cause the result to be 0 or all 1s for right shifts.

The result of the / operator is the largest integer less than or equal to the algebraic quotient when either operand is negative and the result is a nonnegative value. If the result is a negative value, it is the smallest integer greater than or equal to the algebraic quotient. The / operator behaves the same way in C and C++ as in Fortran.

The sign of the result of the percent (%) operator is the sign of the first operand.

Integer overflow is ignored. Because some integer arithmetic uses the floating-point instructions, floating-point overflow can occur during integer operations. Division by 0 and all floating-point exceptions, if not detected as an error by the compiler, can cause a run time abort.

14.2.6 Arrays and Pointers

An unsigned int value can hold the maximum size of an array. The type size_t is defined to be a typedef name for unsigned long in the headers malloc.h, stddef.h, stdio.h, stdlib.h, string.h, and time.h. If more than one of these headers is included, only the first defines size_t.

A type int can hold the difference between two pointers to elements of the same array. The type ptrdiff_t is defined to be a typedef name for long in the header stddef.h.

If a pointer type's value is cast to a signed or unsigned long int, and then cast back to the original type's value, the two pointer values will compare equal.

Pointers on UNICOS/mp and UNICOS/lc systems are byte pointers. Byte pointers use the same internal representation as integers; a byte pointer counts the number of bytes from the first address.

A pointer can be explicitly converted to any integral type large enough to hold it. The result will have the same bit pattern as the original pointer. Similarly, any value of integral type can be explicitly converted to a pointer.
The resulting pointer will have the same bit pattern as the original integral type.

14.2.7 Registers

Use of the register storage class in the declaration of an object has no effect on whether the object is placed in a register. The compiler performs register assignment aggressively; that is, it automatically attempts to place as many variables as possible into registers.

14.2.8 Classes, Structures, Unions, Enumerations, and Bit Fields

Accessing a member of a union by using a member of a different type results in an attempt to interpret, without conversion, the representation of the value of the member as the representation of a value in the different type.

Members of a class or structure are packed into words from left to right. Padding is appended to a member to correctly align the following member, if necessary. Member alignment is based on the size of the member:

• For a member bit field of any size, alignment is any bit position that allows the member to fit entirely within a 64-bit word.
• For a member with a size less than 64 bits, alignment is the same as the size. For example, a char has a size and alignment of 8 bits; a float has a size and alignment of 32 bits.
• For a member with a size equal to or greater than 64 bits, alignment is 64 bits.
• For a member with array type, alignment is equal to the alignment of the element type.

A plain int type bit field is treated as a signed int bit field.

The values of an enumeration type are represented in the type signed int in C; they are a separate type in C++.

14.2.9 Qualifiers

When an object that has volatile-qualified type is accessed, it is simply a reference to the value of the object. If the value is not used, the reference need not result in a load of the value from memory.

14.2.10 Declarators

A maximum of 12 pointer, array, and/or function declarators are allowed to modify an arithmetic, structure, or union type.
14.2.11 Statements

The compiler has no fixed limit on the maximum number of case values allowed in a switch statement.

The Cray C++ compiler parses asm statements for correct syntax, but otherwise ignores them.

14.2.12 Exceptions

In Cray C++, when an exception is thrown, the memory for the temporary copy of the exception being thrown is allocated on the stack and a pointer to the allocated space is returned.

14.2.13 System Function Calls

For a description of the form of the unsuccessful termination status that is returned from a call to exit(3), see the exit(3) man page.

14.3 Preprocessing

The value of a single-character constant in a constant expression that controls conditional inclusion matches the value of the same character in the execution character set. No such character constant has a negative value. For example, 'a' has the same value in the two contexts:

    #if 'a' == 97
    if ('a' == 97)

The -I option and the method for locating included source files are described in Section 2.20.4, page 54.

The source file character sequence in a #include directive must be a valid UNICOS/mp or UNICOS/lc file name or path name. A #include directive may specify a file name by means of a macro, provided the macro expands into a source file character sequence delimited by double quotes or < and > delimiters, as follows:

    #define myheader "./myheader.h"
    #include myheader

    #define STDIO <stdio.h>
    #include STDIO

The macros __DATE__ and __TIME__ contain the date and time of the beginning of translation. For more information, refer to the description of the predefined macros in Chapter 10, page 157.

The #pragma directives are described in Chapter 3, page 75.
Possible Requirements for Non-C99 Code [A]

In order to use C code, developed under previous C compilers of the Cray C++ Programming Environment, with the c99 command, your code may require one or more of the following modifications:

• Include necessary header files for complete function prototyping.
• Add return statements to all non-void functions.
• Ensure that all strings in any macro that begins with an underscore are literals. These macros cannot contain other types of strings.
• Follow C99 conventions.

Previous Cray C compilers did not require you to explicitly include header files in many situations because they allowed functions to be implicitly declared. In C99, functions cannot be implicitly declared.

Libraries and Loader [B]

This appendix describes the libraries that are available with the Cray C and C++ compilers and the loader (ld).

B.1 Cray C and C++ Libraries Current Programming Environments

Libraries that support Cray C and C++ are automatically available when you use the CC, cc, c89, or c99 command to compile your programs. These commands automatically issue the appropriate directives to load the program with the appropriate functions. If your program strictly conforms to the C or C++ standards, you do not need to know library names and locations. If your program requires other libraries or if you want direct control over the loading process, more knowledge of the loader and libraries is necessary.

The Standard Template Library (STL) is a C++ library of container classes, algorithms, and iterators; it provides many of the basic algorithms and data structures of computer science. The STL is a generic library, meaning that its components are heavily parameterized: almost every component in the STL is a template. Be sure you have a complete understanding of templates and how they work before using them.
B.2 Loader

When you use the cc, CC, c89, or c99 command to invoke the compiler, and the program compiles without errors, the loader is called. Specifying the -c option on the command line produces relocatable object files (*.o) without calling the loader. These relocatable object files can then be used as input to the loader command by specifying the file names on the appropriate loader command line.

For example, the following command line compiles a file called target.c and produces the relocatable object file called target.o in your current working directory:

    cc -c target.c

You can then use file target.o as input to the loader or save the file to use with other relocatable object files to compile and create a linked executable file (a.out by default).

Because of the special code needed to handle templates, constructors, destructors, and other C++ language features, object files generated by using the CC command should be linked using the CC command. To link C++ object files using the loader command (ld), the -h keep=files option (see Section 2.9.1, page 23) must be specified on the command line when compiling source files.

Note: On Cray X2 systems, use the compiler command, not ld, to link files.

The ld command can be accessed by using one of the following methods:

• You can access the loader directly by using the ld command.
• You can let the cc, CC, c89, or c99 command choose the loader. This method has the following advantages:
  – You do not need to know the loader command line interface.
  – You do not need to worry about the details of which libraries to load, or the order in which to load them.
  – When using CC, you need not worry about template instantiation requirements or about loading the compiler-generated static constructors and destructors.

You can control the operation of the loader with the ld command line options. For more information, see the ld(1) man page.
Compatibility with Older C++ Code [C]

The Standard C++ Library. C++ code developed under the C++ Programming Environment 4.2 release or earlier can still be used with Programming Environment release 6.0. If your code uses nonstandard Cray C++ header files, you can continue to use your code without modification by using the CRAYOLDCPPLIB environment variable. Another way to use your pre-4.x code with the current Programming Environment release is to make changes to your existing code. The following sections explain how to use either of these methods.

Note: Other changes to your existing C++ code may be required because of differences between the Cray SV1 or Cray T3E systems and the Cray X1 series systems. For more information, see the Cray X1 User Environment Differences manual.

C.1 Use of Nonstandard Cray C++ Header Files

The Cray C++ Programming Environment release continues to support some of the nonstandard Cray C++ header files. This allows pre-5.0 code that uses these header files to be compiled without modification. These header files are available in the Standard C++ Library at the same location as they were in previous releases.

Here are the Cray nonstandard header files that can be used in Programming Environment 5.6:

• common.h
• complex.h
• fstream.h
• generic.h
• iomanip.h
• iostream.h
• stdiostream.h
• stream.h
• strstream.h
• vector.h

The nonstandard header files can be used when you set the CRAYOLDCPPLIB environment variable to a nonzero value. How to set the variable depends on the shell you are using. If you are using ksh or sh, set the variable as this example shows:

    % export CRAYOLDCPPLIB=1

If you are using csh, set the variable as this example shows:

    % setenv CRAYOLDCPPLIB 1

C.2 When to Update Your C++ Code

You are not required to modify your existing C++ code in order to compile it with the Cray C++ compiler version 6.0, unless you wish to use the Standard C++ Library.
One reason for migrating your code to the Standard C++ Library is that the nonstandard Cray C++ header files of Programming Environment 3.5 may not be supported by future versions of the Cray C++ compiler.

Another reason for migrating is that your C++ code may already contain support for the Standard C++ Library. Often, third-party code contains a configuration script that tests the features of the compiler and system before building a makefile. This script can determine whether the C++ compiler supports the Standard C++ Library.

You can use the following steps to migrate your C++ code:

1. Use the proper header files
2. Add namespace declarations
3. Reconcile header definition differences
4. Recompile all C++ files

C.2.1 Use the Proper Header Files

The first step in migrating your C++ code to use the Standard C++ Library is to ensure that it uses the correct Standard C++ Library header files. The following tables show each header file used by the C++ library version 3.5 and its likely corresponding header file in the current Standard C++ Library.

The older header files do not always map directly to the new files. For example, most of the definitions of the Cray C++ version 3.5 STL alloc.h header file are contained in the Standard C++ Library header files memory and xmemory. Anomalies such as this are noted in the tables.

The tables divide the header files into three groups:

• Run time support library header files
• Stream and class library header files
• Standard Template Library header files

The older header files used by the run time support library originated from Edison Design Group and perform functions such as exception handling and memory allocation and deallocation. Table 16 shows the old and new header files.

Table 16. Run time Support Library Header Files

    Cray C++ 3.5 header file    Standard C++ Library header file
    exception.h                 exception
    new.h                       new
    stdexcept.h                 stdexcept
    typeinfo.h                  typeinfo

The header files in the stream and class library originate from AT&T and define the I/O stream classes along with the string, complex, and vector classes. Table 17 shows the old and new header files.

Table 17. Stream and Class Library Header Files

    Cray C++ 3.5 header file    Standard C++ Library header file
    common.h                    No equivalent header file
    complex.h                   complex
    fstream.h                   fstream
    iomanip.h                   iomanip
    iostream.h                  iostream
    stdiostream.h               iosfwd
    stream.h                    Not available
    strstream.h                 strstream
    vector.h                    vector

Note: The use of any of the stream and class library header files from Cray C++ Programming Environment 3.5 requires that you set the CRAYOLDCPPLIB environment variable. For more information, see Section C.1, page 199.

Table 18 shows the old and new Standard Template Library (STL) header files.

Note: The older STL originated from Silicon Graphics Inc.

Table 18. Standard Template Library Header Files

    Cray C++ 3.5 header file    Standard C++ header file
    algo.h                      algorithm
    algobase.h                  algorithm
    alloc.h                     memory
    bvector.h                   vector
    defalloc.h (1)              Not available
    deque.h                     deque
    function.h                  functional
    hash_map.h                  hash_map
    hash_set.h                  hash_set
    hashtable.h                 xhash
    heap.h                      algorithm
    iterator.h                  iterator
    list.h                      list
    map.h                       map
    mstring.h                   string
    multimap.h                  map
    multiset.h                  set
    pair.h                      pair
    pthread_alloc.h             No equivalent header file
    rope.h                      rope
    ropeimpl.h                  rope
    set.h                       set
    slist.h                     slist
    stack.h                     stack
    stl_config.h                The Standard C++ Library does not need the STL configuration file.
    tempbuf.h                   memory
    tree.h                      xtree
    vector.h                    vector

    (1) This header file was deprecated in the Cray C++ Programming Environment 3.5 release.

C.2.2 Add Namespace Declarations

The second step in migrating to the Standard C++ Library is adding namespace declarations. Most classes of the Standard C++ Library are declared under the std namespace, so this usually requires that you add this statement to the existing code: using namespace std;

For example, the following program returns an error when it is compiled with previous versions of the Standard C++ Library:

    % cat hello.C
    #include <iostream>
    main() { cout << "hello world\n"; }
    % CC hello.C
    CC-20 CC: ERROR File = hello.C, line = 2
      The identifier "cout" is undefined.

      main() { cout <<"hello world\n" ; }
               ^

    Total errors detected in hello.C: 1

When you add using namespace std; to the example program, it compiles without error:

    % cat hello.C
    #include <iostream>
    using namespace std;
    main() { cout << "hello world\n"; }
    % CC hello.C
    % ./a.out
    hello world

C.2.3 Reconcile Header Definition Differences

The most difficult part of migrating to the Standard C++ Library is reconciling the differences between the definitions in the Cray C++ version 3.5 header files and those in the Standard C++ Library header files. For example, the definitions of the complex class differ. In Cray C++ version 3.5, the complex class has real and imaginary components of type double. The Standard C++ Library defines the complex class as a template class, where the user defines the data type of the real and imaginary components.
For example, here is a program written with the Cray C++ version 3.5 header files:

    % cat complex.C
    #include <complex.h>
    #include <iostream.h>
    main() {
        complex C(1.0, 2.0);
        cout << "C = " << C << endl;
    }
    % env CRAYOLDCPPLIB=1 CC complex.C
    % ./a.out
    C = ( 1, 2)

An equivalent program that uses the Standard C++ Library appears as:

    % cat complex.C
    #include <complex>
    #include <iostream>
    using namespace std;
    main() {
        complex<double> C(1.0, 2.0);
        cout << "C = " << C << endl;
    }
    % CC complex.C
    % ./a.out
    C = (1,2)

C.2.4 Recompile All C++ Files

Finally, when all of the source files that use the Standard C++ Library header files can be built, you must recompile all C++ source files that belong to the program using only the Standard C++ Library.

Cray C and C++ Dialects [D]

This appendix details the features of the C and C++ languages that are accepted by the Cray C and C++ compilers, including certain language dialects and anachronisms. Users should be aware of these details, especially users who are porting codes from other environments.

D.1 C++ Language Conformance

The Cray C++ compiler accepts the C++ language as defined by the ISO/IEC 14882:1998 standard, with the exceptions listed in Section D.1.1, page 207.

The Cray C++ compiler also has a cfront compatibility mode, which duplicates a number of features and bugs of cfront. Complete compatibility is not guaranteed or intended. The mode allows programmers who have used cfront features to continue to compile their existing code (see Section 3.5, page 78). Command line options are also available to enable and disable anachronisms (see Section D.2, page 211) and strict standard-conformance checking (see Section D.3, page 212, and Section D.4, page 213). The command line options are described in Chapter 2, page 7.

D.1.1 Unsupported and Supported C++ Language Features

The export keyword for templates is not supported.
It is defined in the ISO/IEC 14882:1998 standard, but is not in traditional C++ (as defined in The Annotated C++ Reference Manual (ARM), by Ellis and Stroustrup, Addison Wesley, 1990).

The following features, which are in the ISO/IEC 14882:1998 standard but not in traditional C++, are supported:

• The dependent statement of an if, while, do-while, or for is considered to be a scope, and the restriction on having such a dependent statement be a declaration is removed.
• The expression tested in an if, while, do-while, or for, as the first operand of a ? operator, or as an operand of the &&, ||, or ! operators may have a pointer-to-member type or a class type that can be converted to a pointer-to-member type in addition to the scalar cases permitted by the ARM.
• Qualified names are allowed in elaborated type specifiers.
• A global-scope qualifier is allowed in member references of the form x.::A::B and p->::A::B.
• The precedence of the third operand of the ? operator is changed.
• If control reaches the end of the main() routine, and the main() routine has an integral return type, it is treated as if a return 0; statement was executed.
• Pointers to arrays with unknown bounds as parameter types are diagnosed as errors.
• A functional-notation cast of the form A() can be used even if A is a class without a (nontrivial) constructor. The temporary that is created gets the same default initialization to zero as a static object of the class type.
• A cast can be used to select one out of a set of overloaded functions when taking the address of a function.
• Template friend declarations and definitions are permitted in class definitions and class template definitions.
• Type template parameters are permitted to have default arguments.
• Function templates may have nontype template parameters.
• A reference to const volatile cannot be bound to an rvalue.
• Qualification conversions such as conversion from T** to T const * const * are allowed.
• Digraphs are recognized.
• Operator keywords (for example, and, bitand) are recognized.
• Static data member declarations can be used to declare member constants.
• wchar_t is recognized as a keyword and a distinct type.
• bool is recognized.
• RTTI (run time type identification), including dynamic_cast and the typeid operator, is implemented.
• Declarations in tested conditions (within if, switch, for, and while statements) are supported.
• Array new and delete are implemented.
• New-style casts (static_cast, reinterpret_cast, and const_cast) are implemented.
• Definition of a nested class outside its enclosing class is allowed.
• mutable is accepted on nonstatic data member declarations.
• Namespaces are implemented, including using declarations and directives. Access declarations are broadened to match the corresponding using declarations.
• Explicit instantiation of templates is implemented.
• The typename keyword is recognized.
• explicit is accepted to declare nonconverting constructors.
• The scope of a variable declared in the for-init-statement of a for loop is the scope of the loop (not the surrounding scope).
• Member templates are implemented.
• The new specialization syntax (using template <>) is implemented.
• Cv qualifiers are retained on rvalues (in particular, on function return values).
• The distinction between trivial and nontrivial constructors has been implemented, as has the distinction between plain old data (POD) types and non-PODs with trivial constructors.
• The linkage specification is treated as part of the function type (affecting function overloading and implicit conversions).
• A typedef name can be used in an explicit destructor call.
• Placement delete is supported.
• An array allocated via a placement new can be deallocated via delete.
• enum types are considered to be nonintegral types.
• Partial specialization of class templates is implemented.
• Partial ordering of function templates is implemented.
• Function declarations that match a function template are regarded as independent functions, not as “guiding declarations” that are instances of the template.
• It is possible to overload operators using functions that take enum types and no class types.
• Explicit specification of function template arguments is supported.
• Unnamed template parameters are supported.
• The new lookup rules for member references of the form x.A::B and p->A::B are supported.
• The notation ::template (and ->template, and so on) is supported.
• In a reference of the form f()->g(), with g a static member function, f() is evaluated. Likewise for a similar reference to a static data member. The ARM specifies that the left operand is not evaluated in such cases.
• enum types can contain values larger than can be contained in an int.
• Default arguments of function templates and member functions of class templates are instantiated only when the default argument is used in a call.
• String literals and wide string literals have const type.
• Class name injection is implemented.
• Argument-dependent (Koenig) lookup of function names is implemented.
• Class and function names declared only in unqualified friend declarations are not visible except for functions found by argument-dependent lookup.
• A void expression can be specified on a return statement in a void function.
• reinterpret_cast allows casting a pointer to a member of one class to a pointer to a member of another class even when the classes are unrelated.
• Two-phase name binding in templates as described in the Working Paper is implemented.
• Putting a try/catch around the initializers and body of a constructor is implemented.
• Template template parameters are implemented.
• Universal character set escapes (for example, \uabcd) are implemented.
• extern inline functions are supported.
• Covariant return types on overriding virtual functions are supported.

D.2 C++ Anachronisms Accepted

C++ anachronisms are enabled by using the -h anachronisms command line option (see Section 2.6.7, page 15). When anachronisms are enabled, the following anachronisms are accepted:

• overload is allowed in function declarations. It is accepted and ignored.
• Definitions are not required for static data members that can be initialized by using the default initialization. The anachronism does not apply to static data members of template classes; they must always be defined.
• The number of elements in an array can be specified in an array delete operation. The value is ignored.
• A single operator++() and operator--() function can be used to overload both prefix and postfix operations.
• The base class name can be omitted in a base class initializer if there is only one immediate base class.
• Assignment to the this pointer in constructors and destructors is allowed. This is only allowed if anachronisms are enabled and the "assignment to this" configuration parameter is enabled.
• A bound function pointer (a pointer to a member function for a given object) can be cast to a pointer to a function.
• A nested class name may be used as a nonnested class name if no other class of that name has been declared. The anachronism is not applied to template classes.
• A reference to a non-const type may be initialized from a value of a different type. A temporary is created, it is initialized from the (converted) initial value, and the reference is set to the temporary.
• A reference to a non-const class type may be initialized from an rvalue of the class type or a derived class thereof. No (additional) temporary is used.
• A function with old-style parameter declarations is allowed and can participate in function overloading as though it were prototyped.
Default argument promotion is not applied to parameter types of such functions when checking for compatibility; therefore, the following statements declare the overloading of two functions named f:

    int f(int);
    int f(x) char x; { return x; }

Note: In C, this code is legal, but has a different meaning. A tentative declaration of f is followed by its definition.

D.3 Extensions Accepted in Normal C++ Mode

The following C++ extensions are accepted (except when strict standard conformance mode is enabled, in which case a warning or caution message may be issued):

• A friend declaration for a class can omit the class keyword, as shown in the following example:

    class B;
    class A {
        friend B;   // Should be "friend class B"
    };

• Constants of scalar type can be defined within classes, as shown in the following example:

    class A {
        const int size=10;
        int a[size];
    };

• In the declaration of a class member, a qualified name can be used, as shown in the following example:

    struct A {
        int A::f();   // Should be int f();
    };

• An assignment operator declared in a derived class with a parameter type matching one of its base classes is treated as a “default” assignment operator; that is, such a declaration blocks the implicit generation of a copy assignment operator. This is cfront behavior that is known to be relied upon in at least one widely used library. Here is an example:

    struct A { };
    struct B : public A {
        B& operator=(A&);
    };

By default, as well as in cfront compatibility mode, there will be no implicit declaration of B::operator=(const B&), whereas in strict-ANSI mode, B::operator=(A&) is not a copy assignment operator and B::operator=(const B&) is implicitly declared.

• Implicit type conversion between a pointer to an extern "C" function and a pointer to an extern "C++" function is permitted.
The following is an example:

    extern "C" void f();   // f's type has extern "C" linkage
    void (*pf)()           // pf points to an extern "C++" function
        = &f;              // error unless implicit conversion allowed

• The ? operator, for which the second and third operands are string literals or wide string literals, can be implicitly converted to one of the following:

    char *
    wchar_t *

In C++, string literals are const. There is a deprecated implicit conversion that allows conversion of a string literal to char *, dropping the const. That conversion, however, applies only to simple string literals. Allowing it for the result of a ? operation is an extension:

    char *p = x ? "abc" : "def";

D.4 Extensions Accepted in C or C++ Mode

The following extensions are accepted in C or C++ mode except when strict standard conformance mode is enabled, in which case a warning or caution message may be issued.

• The special lint comments /*ARGSUSED*/, /*VARARGS*/ (with or without a count of nonvarying arguments), and /*NOTREACHED*/ are recognized.
• A translation unit (input file) can contain no declarations.
• Comment text can appear at the ends of preprocessing directives.
• Bit fields can have base types that are enum or integral types in addition to int and unsigned int. This corresponds to A.6.5.8 in the ANSI Common Extensions appendix.
• enum tags can be incomplete as long as the tag name is defined and resolved by specifying the brace-enclosed list later.
• An extra comma is allowed at the end of an enum list.
• The final semicolon preceding the closing of a struct or union type specifier can be omitted.
• A label definition can be immediately followed by a right brace ( } ). (Normally, a statement must follow a label definition.)
• An empty declaration (a semicolon preceded by nothing) is allowed.
• An initializer expression that is a single value and is used to initialize an entire static array, struct, or union does not need to be enclosed in braces.
ANSI C requires braces.
• In an initializer, a pointer constant value can be cast to an integral type if the integral type is large enough to contain it.
• The address of a variable with register storage class may be taken.
• In an integral constant expression, an integer constant can be cast to a pointer type and then back to an integral type.
• In duplicate size and sign specifiers (for example, short short or unsigned unsigned) the redundancy is ignored.
• Benign redeclarations of typedef names are allowed. That is, a typedef name can be redeclared in the same scope with the same type.
• Dollar sign ($) and, for Cray X1 series systems, at sign (@) characters can be accepted in identifiers by using the -h calchars command line option. This is not allowed by default.
• Numbers are scanned according to the syntax for numbers rather than the pp-number syntax. Thus, 0x123e+1 is scanned as three tokens instead of one token that is not valid. If the -h conform option is specified, the pp-number syntax is used.
• Assignment and pointer differences are allowed between pointers to types that are interchangeable but not identical, for example, unsigned char * and char *. This includes pointers to integral types of the same size (for example, int * and long *). Assignment of a string constant to a pointer to any kind of character is allowed without a warning.
• Assignment of pointer types is allowed in cases where the destination type has added type qualifiers that are not at the top level (for example, int ** to const int **). Comparisons and pointer difference of such pairs of pointer types are also allowed.
• In operations on pointers, a pointer to void is always implicitly converted to another type if necessary, and a null pointer constant is always implicitly converted to a null pointer of the right type if necessary. In ANSI C, these are allowed by some operators, and not by others (generally, where it does not make sense).
• Pointers to different function types may be assigned or compared for equality (==) or inequality (!=) without an explicit type cast. This extension is not allowed in C++ mode.
• A pointer to void can be implicitly converted to or from a pointer to a function type.
• External entities declared in other scopes are visible:

    void f1(void) { extern void f(); }
    void f2() { f(); /* Using out of scope declaration */ }

• In C mode, end-of-line comments (//) are supported.
• A non-lvalue array expression is converted to a pointer to the first element of the array when it is subscripted or similarly used.
• The fortran keyword. For more information, see Section 9.2, page 154.
• Cray hexadecimal floating point constants. For more information, see Section 9.3, page 154.

D.5 C++ Extensions Accepted in cfront Compatibility Mode

The cfront compatibility mode is enabled by the -h cfront command-line option. The following extensions are accepted in cfront compatibility mode:

• Type qualifiers on the this parameter are dropped in contexts such as in the following example:

    struct A {
        void f() const;
    };
    void (A::*fp)() = &A::f;

This is a safe operation. A pointer to a const function can be put into a pointer to non-const, because a call using the pointer is permitted to modify the object and the function pointed to will not modify the object. The opposite assignment would not be safe.

• Conversion operators that specify a conversion to void are allowed.
• A nonstandard friend declaration can introduce a new type. A friend declaration that omits the elaborated type specifier is allowed in default mode; however, in cfront mode the declaration can also introduce a new type name. An example follows:

    struct A {
        friend B;
    };

• The third operand of the ? operator is a conditional expression instead of an assignment expression.
• A reference to a pointer type may be initialized from a pointer value without use of a temporary even when the reference pointer type has additional type qualifiers above those present in the pointer value. For example:

    int *p;
    const int *&r = p;   // No temporary used

• A reference can be initialized to NULL.
• Because cfront does not check the accessibility of types, access errors for types are issued as warnings instead of errors.
• When matching arguments of an overloaded function, a const variable with a value of 0 is not considered to be a null pointer constant. In general, in overload resolution, a null pointer constant must be spelled “0” to be considered a null pointer constant (for example, '\0' is not considered a null pointer constant).
• An alternate form of declaring pointer-to-member-function variables is supported, as shown in the following example:

    struct A {
        void f(int);
        static void sf(int);
        typedef void A::T3(int);   // nonstd typedef decl
        typedef void T2(int);      // std typedef
    };
    typedef void A::T(int);        // nonstd typedef decl
    T* pmf = &A::f;                // nonstd ptr-to-member decl
    A::T2* pf = A::sf;             // std ptr to static mem decl
    A::T3* pmf2 = &A::f;           // nonstd ptr-to-member decl

In this example, T is construed to name a function type for a nonstatic member function of class A that takes an int argument and returns void; the use of such types is restricted to nonstandard pointer-to-member declarations. The declarations of T and pmf in combination are equivalent to the following single standard pointer-to-member declaration:

    void (A::* pmf)(int) = &A::f;

A nonstandard pointer-to-member declaration that appears outside of a class declaration, such as the declaration of T, is normally not valid and would cause an error to be issued. However, for declarations that appear within a class declaration, such as A::T3, this feature changes the meaning of a valid declaration.
cfront version 2.1 accepts declarations, such as T, even when A is an incomplete type; so this case is also accepted.

• Protected member access checking is not done when the address of a protected member is taken. For example:

    class B { protected: int i; };
    class D : public B { void mf(); };
    void D::mf() {
        int B::* pmi1 = &B::i;   // error, OK in cfront mode
        int D::* pmi2 = &D::i;   // OK
    }

Note: Protected member access checking for other operations (that is, everything except taking a pointer-to-member address) is done normally.

• The destructor of a derived class can implicitly call the private destructor of a base class. In default mode, this is an error but in cfront mode it is reduced to a warning. For example:

    class A {
        ~A();
    };
    class B : public A {
        ~B();
    };
    B::~B(){}   // Error except in cfront mode

• When disambiguation requires deciding whether something is a parameter declaration or an argument expression, the pattern type-name-or-keyword (identifier ...) is treated as an argument. For example:

    class A { A(); };
    double d;
    A x(int(d));
    A(x2);

By default, int(d) is interpreted as a parameter declaration (with redundant parentheses), and so x is a function; but in cfront compatibility mode int(d) is an argument and x is a variable.

The declaration A(x2) is also misinterpreted by cfront. It should be interpreted as the declaration of an object named x2, but in cfront mode it is interpreted as a function style cast of x2 to the type A.

Similarly, the following declaration declares a function named xyz that takes a parameter of type function taking no arguments and returning an int. In cfront mode, this is interpreted as a declaration of an object that is initialized with the value int(), which evaluates to 0.

    int xyz(int());

• A named bit field can have a size of 0. The declaration is treated as though no name had been declared.
• Plain bit fields (such as bit fields declared with a type of int) are always signed.
• The name given in an elaborated type specifier can be a typedef name that is the synonym for a class name. For example:

    typedef class A T;
    class T *pa;   // No error in cfront mode

• No warning is issued on duplicate size and sign specifiers, as shown in the following example:

    short short int i;   // No warning in cfront mode

• Virtual function table pointer-update code is not generated in destructors for base classes of classes without virtual functions, even if the base class virtual functions might be overridden in a further derived class. For example:

    struct A {
        virtual void f() {}
        A() {}
        ~A() {}
    };
    struct B : public A {
        B() {}
        ~B() {f();}   // Should call A::f according to ARM 12.7
    };
    struct C : public B {
        void f() {}
    } c;

In cfront compatibility mode, B::~B calls C::f.

• An extra comma is allowed after the last argument in an argument list. For example:

    f(1, 2, );

• A constant pointer-to-member function can be cast to a pointer-to-function, as in the following example. A warning is issued.

    struct A {int f();};
    main () {
        int (*p)();
        p = (int (*)())A::f;   // Okay, with warning
    }

• Arguments of class types that allow bitwise copy construction but also have destructors are passed by value like C structures, and the destructor is not called on the copy. In normal mode, the class object is copied into a temporary, the address of the temporary is passed as the argument, and the destructor is called on the temporary after the call returns. Because the argument is passed by value instead of by address, code like this compiled in cfront mode is not calling-sequence compatible with the same code compiled in normal mode. In practice, this is not much of a problem, since classes that allow bitwise copying usually do not have destructors.
• A union member may be declared to have the type of a class for which the user has defined an assignment operator (as long as the class has no constructor or destructor). A warning is issued.
• When an unnamed class appears in a typedef declaration, the typedef name may appear as the class name in an elaborated type specifier. For example:

    typedef struct { int i, j; } S;
    struct S x;   // No error in cfront mode

• Two member functions may be declared with the same parameter types when one is static and the other is nonstatic with a function qualifier. For example:

    class A {
        void f(int) const;
        static void f(int);   // No error in cfront mode
    };

• The scope of a variable declared in the for-init-statement is the scope to which the for statement belongs. For example:

    int f(int i) {
        for (int j = 0; j < i; ++j) { /* ... */ }
        return j;   // No error in cfront mode
    }

• Function types differing only in that one is declared extern "C" and the other extern "C++" can be treated as identical:

    typedef void (*PF)();
    extern "C" typedef void (*PCF)();
    void f(PF);
    void f(PCF);

By contrast, in standard C++, PF and PCF are different and incompatible types; PF is a pointer to an extern "C++" function whereas PCF is a pointer to an extern "C" function; and the two declarations of f create an overload set.

• Functions declared inline have internal linkage.
• enum types are regarded as integral types.
• An uninitialized const object of non-POD class type is allowed even if its default constructor is implicitly declared, as in the following example:

    struct A {
        virtual void f();
        int i;
    };
    const A a;

• A function parameter type is allowed to involve a pointer or reference to array of unknown bounds.
• If the user declares an operator= function in a class, but not one that can serve as the default operator=, and bitwise assignment could be done on the class, a default operator= is not generated.
  Only the user-written operator= functions are considered for assignments, so bitwise assignment is not done.

Compiler Messages [E]

This appendix describes how to use the message system to control and use messages issued by the compiler. Explanatory texts for messages can be displayed online through the use of the explain command.

E.1 Expanding Messages with the explain Command

You can use the explain command to display an explanation of any message issued by the compiler. The command takes the message number, including its prefix, as an argument. The prefix for Cray C and C++ is CC.

In the following sample dialog, the cc(1) command invokes the compiler on source file bug.c. Message CC-24 is displayed. The explain command displays the expanded explanation for this message.

    % cc bug.c
    CC-24 cc: ERROR File = bug.c, Line = 1
      An invalid octal constant is used.

      int i = 018;
              ^

    1 error detected in the compilation of "bug.c".
    % explain CC-24
    An invalid octal constant is used.

    Each digit of an octal constant must be between 0 and 7, inclusive.
    One or more digits in the octal constant on the indicated line are
    outside of this range. To avoid issuing an error for each erroneous
    digit, the constant will be treated as a decimal constant.

    Change each digit in the octal constant to be within the valid range.

E.2 Controlling the Use of Messages

This section summarizes the command line options that affect the issuing of messages from the compiler.

E.2.1 Command Line Options

Option                    Description
-h errorlimit[=n]         Specifies the maximum number of error messages the compiler prints before it exits.
-h [no]message=n[:...]    Enables or disables the specified compiler messages, overriding -h msglevel.
-h msglevel_n             Specifies the lowest severity level of messages to be issued.
-h report=args            Generates optimization report messages.
E.2.2 Environment Options for Messages

The following environment variables are used by the message system.

Variable      Description
NLSPATH       Specifies the default value of the message system search path environment variable.
LANG          Identifies your requirements for native language, local customs, and coded character set with regard to the message system.
MSG_FORMAT    Controls the format in which you receive error messages.

E.2.3 ORIG_CMD_NAME Environment Variable

You can override the command name printed in the message. If the environment variable ORIG_CMD_NAME is set, the value of ORIG_CMD_NAME is used as the command name in the message. This functionality is provided for use with shell scripts that invoke the compiler. By setting ORIG_CMD_NAME to the name of the script, any message printed by the compiler appears as though it was generated by the script. For example, the following C shell script is named newcc:

    #
    setenv ORIG_CMD_NAME `basename $0`
    cc $*

A message generated by invoking newcc resembles the following:

    CC-8 newcc: ERROR File = x.c, Line = 1
      A new-line character appears inside a string literal.

Because the environment variable ORIG_CMD_NAME is set to newcc, this appears as the command name instead of cc(1) in this message.

Caution: The ORIG_CMD_NAME environment variable is not part of the message system. It is supported by the Cray C and C++ compilers as an aid to programmers. Other products, such as the Fortran compiler and the loader, may support this variable. However, you should not rely on support for this variable in any other product.

You must be careful when setting the environment variable ORIG_CMD_NAME. If you set ORIG_CMD_NAME inadvertently, the compiler may generate messages with an incorrect command name. This may be particularly confusing if, for example, ORIG_CMD_NAME is set to newcc when the Fortran compiler prints a message. The Fortran message will look as though it came from newcc.
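The override described in this section amounts to a simple lookup: if ORIG_CMD_NAME is set, its value replaces the tool's own name in the message header. The following C++ sketch illustrates that logic only. It is not the compiler's implementation; the function names choose_command_name and message_header are hypothetical, the treatment of an empty value as "unset" is an assumption, and the header layout simply mimics the sample message shown above.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper (not a Cray API): choose the command name to print.
// An unset or empty override leaves the tool's own name in place
// (treating an empty value as unset is an assumption of this sketch).
constexpr const char* choose_command_name(const char* override_name,
                                          const char* default_name) {
    return (override_name != nullptr && override_name[0] != '\0')
               ? override_name
               : default_name;
}

// Hypothetical helper: build a header line shaped like the sample messages
// in this appendix, reading ORIG_CMD_NAME from the environment.
std::string message_header(const std::string& message_id,
                           const char* default_name,
                           const std::string& severity,
                           const std::string& file, int line) {
    const char* name =
        choose_command_name(std::getenv("ORIG_CMD_NAME"), default_name);
    return message_id + " " + name + ": " + severity + " File = " + file +
           ", Line = " + std::to_string(line);
}
```

With ORIG_CMD_NAME set to newcc in the environment, message_header("CC-8", "cc", "ERROR", "x.c", 1) produces a header of the same shape as the sample above, with newcc in place of cc. Because every tool that honors the variable performs the same substitution, a stale setting affects all of them, which is the hazard the caution describes.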
E.3 Message Severity

Each message issued by the compiler falls into one of the following categories of messages, depending on the severity of the error condition encountered or the type of information being reported.

Category        Meaning
COMMENT         Inefficient programming practices.
NOTE            Unusual programming style or the use of outmoded statements.
CAUTION         Possible user error. Cautions are issued when the compiler detects a condition that may cause the program to abort or behave unpredictably.
WARNING         Probable user error. Indicates that the program will probably abort or behave unpredictably.
ERROR           Fatal error; that is, a serious error in the source code. No binary output is produced.
INTERNAL        Problems in the compilation process. Please report internal errors immediately to the system support staff, so a Software Problem Report (SPR) can be filed.
LIMIT           Compiler limits have been exceeded. Normally you can modify the source code or environment to avoid these errors. If limit errors cannot be resolved by such modifications, please report these errors to the system support staff, so that an SPR can be filed.
INFO            Useful additional information about the compiled program.
INLINE          Information about inline code expansion performed on the compiled code.
SCALAR          Information about scalar optimizations performed on the compiled code.
VECTOR          Information about vectorization optimizations performed on the compiled code.
STREAM          Information about the MSP optimizations performed on the compiled code (Cray X1 series systems only).
OPTIMIZATION    Information about general optimizations.

E.4 Common System Messages

The errors in the following list can occur during the execution of a user program. The operating system detects them and issues the appropriate message. These errors are not detected by the compiler and are not unique to C and C++ programs; they may occur in any application program written in any language.
• Operand Range Error
  An operand range error occurs when a program attempts to load or store in an area of memory that is not part of the user's area. This usually occurs when an invalid pointer is dereferenced.

• Program Range Error
  A program range error occurs when a program attempts to jump into an area of memory that is not part of the user's area. This may occur, for example, when a function in the program mistakenly overwrites the internal program stack. When this happens, the address of the function from which the function was called is lost. When the function attempts to return to the calling function, it jumps elsewhere instead.

• Error Exit
  An error exit occurs when a program attempts to execute an invalid instruction. This error usually occurs when the program's code area has been mistakenly overwritten with words of data (for example, when the program stores in a location pointed to by an invalid pointer).

Intrinsic Functions [F]

The C and C++ intrinsic functions either allow for direct access to some hardware instructions or result in generation of inline code to perform some specialized functions. These intrinsic functions are processed completely by the compiler. In many cases, the generated code is one or two instructions. These are called functions because they are invoked with the syntax of function calls.

To get access to the intrinsic functions, the Cray C++ compiler requires that either the intrinsics.h file be included or that the intrinsic functions that you want to call be explicitly declared. If the source code does not include intrinsics.h and you cannot modify the code, you can use the -h prototype_intrinsics option instead. If you explicitly declare an intrinsic function, the declaration must agree with the documentation or the compiler treats the call as a call to a normal function, not the intrinsic function.
The -h nointrinsics command line option causes the compiler to treat these calls as regular function calls and not as intrinsic function calls.

The types of the arguments to intrinsic functions are checked by the compiler, and if any of the arguments do not have the correct type, a warning message is issued and the call is treated as a normal call to an external function. If your intention was to call an external function with the same name as an intrinsic function, you should change the external function name. The names used for the Cray C intrinsic functions are in the name space reserved for the implementation.

Note: Several of these intrinsic functions have both a vector and a scalar version. If a vector version of an intrinsic function exists and the intrinsic is called within a vectorized loop, the compiler uses the vector version of the intrinsic. For details on whether an intrinsic has a vector version, refer to the appropriate intrinsic function man page.

The following sections group the C and C++ intrinsics according to function and provide a brief description of each intrinsic in that group. For more information, see the corresponding man page.

F.1 Atomic Memory Operations

The following intrinsic functions perform various atomic memory operations:

Note: In this discussion, an object is an entity that is referred to by a pointer. A value is an actual number, bit mask, etc. that is not referred to by a pointer.

Intrinsic         Description
_amo_aadd         Adds a value to an object that is referred to by a pointer and stores the results in the object; fully vectorizable on Cray X2 systems.
_amo_aax          Adds a value to an object that is referred to by a pointer, performs an XOR on the result with a third value, and stores the results in the object.
_amo_afadd        Adds a value to an object that is referred to by a pointer and stores the result in the object; fully vectorizable on Cray X2 systems. The intrinsic returns the original value of the object.
_amo_afax         Adds a value to an object that is referred to by a pointer, XORs the result with a second value, and stores the result in the object. The intrinsic returns the original value of the object.
_amo_acswap       (Compare and swap) Compares an object that is referenced by a pointer against a value. If equal, a specified value is stored in the object. The intrinsic returns the original value of the object.

F.2 BMM Operations

The following intrinsic functions perform operations on the bit matrix multiply (BMM) unit:

_mclr             Logically undefines the BMM unit.
_mld              Loads the BMM functional unit with a matrix vector in transposed form.
_mldmor           Combines the load and inclusive OR BMM functions.
_mldmx            Combines the load and multiply functions.
_mmor             Performs an inclusive OR bit matrix multiply.
_mmx              Performs a bit matrix multiply.
_mtilt            Inverts a bit matrix.
_mul              Unloads the BMM unit.

F.3 Bit Operations

The following intrinsic functions copy, count, or shift bits or compute the parity bit:

_dshiftl          Moves the leftmost n bits of an integer into the right side of another integer and returns that integer.
_dshiftr          Moves the rightmost n bits of an integer into the left side of another integer and returns that integer.
_pbit             Copies the rightmost bit of a word to the nth bit, from the right, of another word.
_pbits            Copies the rightmost m bits of a word to another word beginning at bit n.
_poppar           Computes the parity bit for a variable.
_popcnt, _popcnt32, _popcnt64
                  Counts the number of set bits in 32-bit and 64-bit integer words.
_leadz, _leadz32, _leadz64
                  Counts the number of leading 0 bits in 32-bit and 64-bit integer words.
_gbit             Returns the value of the nth bit from the right.
_gbits            Returns a value consisting of m bits extracted from a variable, beginning at the nth bit from the right.
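To make the bit-operation descriptions above concrete, the following C++ sketch emulates a few of them with portable shifts and masks. These are not the Cray intrinsics themselves: the emu_ names are hypothetical, and the 64-bit word size, the argument order, the numbering of bits from 0 at the right, and the result for a zero input to the leading-zero count are all assumptions made for illustration. The intrinsic man pages remain the authority on the real signatures and edge cases.

```cpp
#include <cstdint>

// Portable emulations for illustration only; the emu_ names are
// hypothetical and are not declared in intrinsics.h.

// Number of set bits in a 64-bit word (compare _popcnt64).
// x & (x - 1) clears the lowest set bit, so the recursion depth
// equals the number of set bits.
constexpr int emu_popcnt64(std::uint64_t x) {
    return x == 0 ? 0 : 1 + emu_popcnt64(x & (x - 1));
}

// Parity of a 64-bit word: 1 if the number of set bits is odd
// (compare _poppar).
constexpr int emu_poppar64(std::uint64_t x) {
    return emu_popcnt64(x) & 1;
}

// Number of leading zero bits in a 64-bit word (compare _leadz64).
// A zero input is reported as 64 leading zeros (assumption).
constexpr int emu_leadz64(std::uint64_t x) {
    return x == 0 ? 64
                  : ((x >> 63) != 0 ? 0 : 1 + emu_leadz64(x << 1));
}

// Right-justified mask with the m low bits set, valid for 0 <= m <= 64.
constexpr std::uint64_t emu_low_mask(unsigned m) {
    return m >= 64 ? ~std::uint64_t{0} : ((std::uint64_t{1} << m) - 1);
}

// Value of bit n counted from the right, bit 0 being the rightmost
// (assumed numbering; compare _gbit). Requires n < 64.
constexpr std::uint64_t emu_gbit(std::uint64_t x, unsigned n) {
    return (x >> n) & 1u;
}

// m bits of x starting at bit n from the right and extending leftward
// (assumed direction; compare _gbits). Requires n < 64.
constexpr std::uint64_t emu_gbits(std::uint64_t x, unsigned m, unsigned n) {
    return (x >> n) & emu_low_mask(m);
}

// Compile-time checks of the emulations.
static_assert(emu_popcnt64(0xF0F0u) == 8, "eight bits set");
static_assert(emu_poppar64(0x7u) == 1, "three set bits give odd parity");
static_assert(emu_leadz64(1u) == 63, "only the lowest bit set");
static_assert(emu_gbit(0x4u, 2) == 1, "bit 2 is set");
static_assert(emu_gbits(0xABCDu, 8, 4) == 0xBCu, "bits 4 through 11");
```

The functions are written as single-expression constexpr functions so that the static_assert lines verify them at compile time. An actual intrinsic would typically compile to one or two hardware instructions rather than the recursion used here.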
F.4 Function Operations

The following intrinsic functions return information about function arguments:

_argcount         Returns the number of arguments explicitly passed to a function, excluding any "hidden" arguments added by the compiler.
_numargs          Returns the total number of words in the argument list passed to the function, including any "hidden" arguments added by the compiler.

F.5 Mask Operations

The following intrinsic functions create bit masks:

_mask             Creates a left-justified or right-justified bit mask with all bits set to 1.
_maskl            Returns a left-justified bit mask with i bits set to 1.
_maskr            Returns a right-justified bit mask with i bits set to 1.

F.6 Memory Operations

The following intrinsic function assures that memory references synchronize memory:

_gsync            Performs global synchronization of all memory.

F.7 Miscellaneous Operations

The following intrinsic functions perform various functions:

_int_mult_upper   Multiplies integers and returns the uppermost bits. For more information, see the int_mult_upper(3i) man page.
_ranf             Computes a pseudo-random floating-point number ranging from 0.0 through 1.0.
_rtc              Returns a real-time clock value expressed in clock ticks.

F.8 Streaming Operations

Note: Streaming operations are not supported on Cray X2 systems.

The following intrinsic functions return streaming information:

__sspid           Indicates which SSP is being used by the code. This intrinsic applies to MSP-mode applications, not SSP-mode applications.
__streaming       Indicates whether the code is capable of multistreaming.

Glossary

application node

For UNICOS/mp systems, a node that is used to run user applications. Application nodes are best suited for executing parallel applications and are managed by the strong application placement scheduling and gang scheduling mechanism Psched. See also node; node flavor.
cache line

A division of cache. Each cache line can hold multiple data items. For Cray X1 systems, a cache line is 32 bytes, which is the maximum size of a hardware message.

co-array

A syntactic extension to Fortran that offers a method for programming data passing; a data object that is identically allocated on each image and can be directly referenced syntactically by any other image.

compute module

For a Cray X1 series mainframe, the physical, configurable, scalable building block. Each compute module contains either one node with 4 MCMs/4 MSPs (Cray X1 modules) or two nodes with 4 MCMs/8 MSPs (Cray X1E modules). Sometimes referred to as a node module. See also node.

compute node

A node that runs user applications. For Cray XT series systems, compute nodes run the Catamount or CNL compute node operating system. For Cray X2 systems, compute nodes run CNL. System services cannot run on compute nodes. See also node; service node.

Cray streaming directives (CSDs)

Nonadvisory directives that allow you to more closely control multistreaming for key loops.

Cray X1 series system

The Cray system that combines the single-processor performance and single-shared address space of Cray parallel vector processor (PVP) systems with the highly scalable microprocessor-based architecture that is used in Cray T3E systems. Cray X1 and Cray X1E systems utilize powerful vector processors, shared memory, and a modernized vector instruction set in a highly scalable configuration that provides the computational power required for advanced scientific and engineering applications.

Cray X2 system

A Cray system that uses Cray X2 compute nodes for user application processing and Cray XT series service nodes for login, network, I/O, and boot functions.

CrayDoc

Cray's documentation system for accessing and searching Cray books, man pages, and glossary terms from a web browser.
deferred implementation

The label used to introduce information about a feature that will not be implemented until a later release.

loopmark listing

A listing that is generated by invoking the Cray Fortran Compiler with the -rm option. The loopmark listing displays what optimizations were performed by the compiler and tells you which loops were vectorized, streamed, unrolled, interchanged, and so on.

multistreaming processor (MSP)

For UNICOS/mp systems, a basic programmable computational unit. Each MSP is analogous to a traditional processor and is composed of four single-streaming processors (SSPs) and E-cache that is shared by the SSPs. See also node.

node

For UNICOS/mp systems, the logical group of four multistreaming processors (MSPs), cache-coherent shared local memory, high-speed interconnections, and system I/O ports. A Cray X1 system has one node with 4 MSPs per compute module. A Cray X1E system has two nodes of 4 MSPs per node, providing a total of 8 MSPs on its compute module. Software controls how a node is used: as an OS node, application node, or support node. See also compute module.

node flavor

For UNICOS/mp systems, software controls how a node is used. A node's software-assigned flavor dictates the kind of processes and threads that can use its resources. The three assignable node flavors are application, OS, and support. See also application node; OS node; support node; system node.

OS node

For UNICOS/mp systems, the node that provides kernel-level services, such as system calls, to all support nodes and application nodes. See also node; node flavor.

page size

The unit of memory addressable through the Translation Lookaside Buffer (TLB). For a UNICOS/mp system, the base page size is 65,536 bytes, but larger page sizes (up to 4,294,967,296 bytes) are also available.

service node

A Cray XT series node that performs support functions for applications and system services. Service nodes run SUSE LINUX and perform specialized functions. There are four types of service nodes: login, IO, network, and system.

single-streaming processor (SSP)

For UNICOS/mp systems, a basic programmable computational unit. See also node.

Software Problem Report (SPR)

A Cray customer service form and process that tracks software problems from first report to resolution. SPR resolution results either from a written reply, the release of software containing the fix to the problem, or the implementation of the requested design change.

support node

For UNICOS/mp systems, the node that is used to run serial commands, such as shells, editors, and other user commands (ls, for example). See also node; node flavor.

system node

For UNICOS/mp systems, the node that is designated as both an OS node and a support node; this node is often called a system node; however, there is no node flavor of "system." See also node; node flavor.

trigger

A command that a user logged into a Cray X1 series system uses to launch Programming Environment components residing on the CPES. Examples of trigger commands are ftn, CC, and pat_build.

UNICOS/lc

The operating system for Cray XT series and Cray X2 systems.

UNICOS/mp

The operating system for Cray X1 series (Cray X1 and Cray X1E) systems.
238 S–2179–60Index -#, 50 -##, 50 -###, 50 A Advisory directives defined, 82 _amo_aadd, 230 _amo_aax, 230 _amo_acswap, 230 _amo_afadd, 230 _amo_afax, 230 Anachronisms C++, 211 aprun, 163 _argcount, 232 Argument passing, 173 Arithmetic See math Array storage, 173 Arrays, 192 dependencies, 107 asm statements, 193 Assembly language functions, 171 output, 50 Assembly source expansions, 7 Auto aprun (see CRAY_AUTO_APRUN_OPTIONS), 68 B Bit fields, 192 Blank common block, 177 bounds directive, 78 btol conversion utility, 174 C -c, 197 C extensions, 153 See also Cray C extensions C interoperability, 182 C libraries, 197 -c option, 50 -C option, 52 C++ libraries, 141 templates, 143 Cache management automatic cache management options, 33 -h cachen, 33 Calls, 169 can_instantiate directive, 90, 149 Cfront, 216 compatibility mode, 207 compilers, 14 option, 14 Character data, 174 Character set, 189 Characters wide, 191 CIV See Constant increment variables Classes, 192 Command line options -# option, 50 -## option, 50 -### option, 50 -c option, 7, 50 -C option, 52 compiler version, 63 conflicting with directives, 13 conflicting with other options, 13 -D macro[=def], 52 defaults, 10 -E option, 7, 49 examples, 64 -g option, 45, 167–168 S–2179–60 239Cray® C and C++ Reference Manual -G option, 45, 167–168 -h [no] conform, 14 -h [no]abort, 49 -h [no]aggress, 25 -h [no]anachronisms, 15 -h [no]autoinstantiate, 20 -h [no]bounds, 46, 167 -h [no]c99, 13 -h [no]calchars, 24 -h [no]exceptions, 15 –h [no]fusion, 26 -h [no]implicitinclude, 22 -h [no]interchange, 41 -h [no]intrinsics, 27 -h [no]ivdep, 36 -h [no]message=n, 48 -h [no]overindex, 29 -h [no]pattern, 30 -h [no]pragma=name[:name...], 53 -h [no]signedshifts, 25 -h [no]tolerant, 16 –h [no]unroll, 31 -h [no]zeroinc, 42 -h anachronisms, 211 -h cfront, 14, 216 -h display_opt, 25 -h errorlimit[=n], 49 -h feonly, 50 -h forcevtbl, 22 -h ident=name, 61 -h ieee_nonstop, 45 -h instantiate=mode, 21 -h instantiation_dir, 20 -h keep=file, 23 -h 
matherror=method, 45 -h msglevel_n, 47 -h new_for_init, 15 -h one_instantiation_per_object, 20 -h options errorlimit, 223 -h prelink_local_copy, 22 -h remove_instantiation_flags, 22 -h report=args, 48 -h restrict=args, 23 -h scalarn, 41 -h simple_templates, 20 -h suppressvtbl, 23 -h vectorn, 36 -h zero, 46 -hgcpn, 26 -I option, 54 -L libdir option, 56 -l libfile option, 55 -M option, 55 macro definition, 52 -N option, 55 -nostdinc option, 55 -O level, 32 -o option, 57 -P option, 7, 50 prelink_copy_if_nonlocal, 22 preprocessor options, 49 remove macro definition, 55 -S option, 7, 50 -U macro option, 55 -V option, 63 -W option, 51 -Y option, 51 Command mode -h command, 57 Commands c89, 4, 7 files, 9 format, 9 c99, 4 files, 8 format, 8 cc, 4, 7 files, 8 format, 8 CC, 4, 7 files, 8 format, 8 compiler, 7 240 S–2179–60Index cpp, 7 format, 9 ld, 23 options, 10 Comments preprocessed, 52 Common block, 177 Common blocks, dynamic, 68 Common system messages, 227 Compilation phases -#, 50 -##, 50 -###, 50 -c option, 50 -E option, 49 -h feonly, 50 -P option, 50 -S option, 50 -Wphase,"opt...", 51 -Yphase,dirname, 51 Compiler Cray C, 5 Cray C++, 4 Compiler messages, 223 _Complex incrementing or decrementing, 153 concurrent directive, 107 Conformance C++, 207 Constant increment variables (CIVs), 42 Constructs accepted and rejected, 14 old, 16 Conversion utility _btol, 174 _ltob, 175 Cray C Compiler, 5 Cray C extensions, 153, 213 See also extensions Imaginary constants, 153 incrementing or decrementing _Complex data, 153 _Pragma, 77 Cray C++ Compiler, 4 Cray streaming directives See CSDs Cray X1 series system, 58 Cray X2 system, 58 CRAY_AUTO_APRUN_OPTIONS, 68 CRAY_PE_TARGET, 66 CRAYOLDCPPLIB, 66 CRAYOLDCPPLIB environment variable, 15 CRI_c89_OPTIONS, 67 CRI_cc_OPTIONS, 67 CRI_CC_OPTIONS, 66–67 CRI_cpp_OPTIONS, 67 critical directive, 122 CSDs, 115 chunk size, optimal, 119 chunk_size, 119 chunks, defined, 119 compatibility, 115 critical, 122 CSD parallel region, defined, 116 for, 118 
functions called from parallel regions, 116 functions in, 116 options to enable, compiler, 127 ordered, 123 parallel, 116 parallel directive, 125 parallel directives, 116 parallel for, 121 parallel region, 116 parallel regions, multiple, 116 placement of, 125 private data, precautions for, 117 stand-alone CSD directives defined, 125 sync, 122 D -D macro[=def], 52 Data types, 188 logical data, 174 S–2179–60 241Cray® C and C++ Reference Manual mapping (table), 188 __DATE__ , 194 Debugging, 45 features, 167 -G level , 45 -g option, 45 -h [no]bounds, 46 -h zero, 46 options, 168 Declarators, 193 Declared bounds, 29 Decompiling -h decomp, 59 Defaults -h fp2, 42 Dialects, 207 Directives advisory, defined, 82 C++, 77 conflicts with options, 13 #define, 52 diagnostic messages, 76 disabling, 53 general, 78 #include, 54–55 inlining, 111 instantiation, 90 loop, 77 macro expansion, 75 MSP, 106 examples, 106 pragma OpenMP, 129 #pragma, 75 alternative form, 77 arguments to instantiate, 150 can_instantiate, 90, 149 concurrent, 107 critical, 122 do_not_instantiate, 90, 149 duplicate, 79 for, 118 format, 75 ident, 89 in C++, 77 instantiate, 90, 149 ivdep, 91 loop_info, 92 message, 82, 167 [no]bounds, 78 [no]bounds directive, 167 no_cache_alloc, 82 [no]opt, 84, 167 nointerchange, 108 nopattern, 95 noreduction, 108 nostream, 106 [nounroll], 109 novector, 96–97, 108 novsearch, 97 ordered, 123 parallel, 116 parallel for, 121 permutation, 97 preferstream, 106 prefervector, 98 safe_address, 99 shortloop, 102 shortloop128, 102 ssp_private, 104 suppress, 108 sync, 122 [unroll], 109 usage, 75 vfunction, 88 weak, 87 preprocessing, 194 protecting, 76 scalar, 107 vectorization, 90 Directories #include files, 54–55 library files, 55–56 phase execution, 51 242 S–2179–60Index do_not_instantiate directive, 90, 149 _dshiftl, 231 _dshiftr, 231 duplicate directive, 79 Dynamic common blocks, 68 E -E option, 49 Enumerations, 192 Environment, 187 Environment variable CRAYOLDCPPLIB, 15 Error Exit, 227 
Error messages, 223 Examples command line, 64 Exception construct, 15 Exception handling, 15 Exceptions, 194 explain, 223 Extensions C++ mode, 212 Cfront compatibility mode, 216 Cray C, 153 _Pragma, 77 #pragma directives, 75 extern "C" keyword, 169 External functions declaring, 169 F Features C++, 207 Cfront compatibility, 207 Files a.out, 7 constructor/destructor, 23 default library, 55 dependencies, 55 .ii file, 146 intrinsics.h, 229 library directory, 56 linking, 23 output, 57 personal libraries, 56 Floating-point constants, 154 overflow, 191 for directive, 118 Fortran common block, 177 fortran keyword, 154 Freeing up memory, 70 friend declaration, 216 Functions, 229 mbtowc, 191 G -G level , 45 -g option, 167–168 -G option, 167–168 _gbit, 231 _gbits, 231 GCC language extensions C and C++, 16 C++ only, 19 General command functions -h ident=name, 61 -V option, 63 Global constant propagation, 26 gnu GCC language extensions, 16 _gsync, 232 H -h [no]conform, 14 –h [no]fusion, 26 -h [no]implicitinclude, 22 -h [no]message=n[:...], 224 -h [no]message=n[:n...], 48 -h [no]mpmd, 61 -h [no]pragma=name[:name...], 53 -h [no]unroll, 31 -h abort, 49 -h aggress, 25 S–2179–60 243Cray® C and C++ Reference Manual -h anachronisms, 15, 211 -h autoinstantiate, 20 -h bounds, 46, 167 -h c99, 13 -h cachen, 33 -h calchars, 24 -h cfront, 14 -h command, 57 -h conform, 14 -h const_string_literals, 16 -h cpu=target_system, 58 -h decomp, 59 -h display_opt, 25 -h errorlimit, 223 -h errorlimit[=n], 49, 224 -h exceptions, 15 -h feonly, 50 -h forcevtbl, 22 -h gen_private_callee, 27 -h gnu, 16 -h ident=name, 61 -h ieee_nonstop option, 45 -h implicitinclude, 22 -h infinitevl, 35 -h instantiate=mode, 21 -h instantiation_dir, 20 –h interchange, 41 -h intrinsics, 27 -h ipafrom=source[:source], 40 -h ipan, 39 -h ivdep, 36 -h keep=file, 23 -h list, 28 -h matherror=method, 45 -h mpmd, 61 -h msglevel_n, 47, 224 -h msp, 29 -h new_for_init, 15 -h noabort, 49 -h noaggress, 25 -h noanachronisms, 15 -h 
noautoinstantiate, 20 -h nobounds, 46, 167 -h noc99, 13 -h nocalchars, 24 -h noconst_string_literals, 16 -h noexceptions, 15 -h nognu, 16 -h noinfinitevl, 35 –h nointerchange, 41 -h nointrinsics, 27, 229 -h noivdep, 36 -h noomp, 61 -h nooverindex, 29 -h nopattern, 30 -h nosignedshifts, 25 -h notolerant, 16 -h nozeroincn, 42 -h omp, 61 -h one_instantiation_per_object, 20 -h overindex, 29 -h pattern, 30 -h prelink_copy_if_nonlocal, 22 -h prelink_local_copy, 22 -h prototype intrinsics, 61 -h prototype_intrinsics, 229 -h remove_instantiation_flags, 22 -h report=args, 48, 224 -h restrict=args, 23 -h scalarn, 41 -h signedshifts, 25 -h simple_templates, 20 -h stream, 34 -h streamn, 103 -h suppressvtbl, 23 -h taskn, 62 -h tolerant, 16 -h upc, 62 -h vectorn, 36 -h zero, 46 -h zeroincn, 42 Hardware intrinsic functions, 27 Hexadecimal floating constant, 154 244 S–2179–60Index -hgcpn, 26 I -I incldir, 54 ident directive, 89 Identifier names allowable, 24 Identifiers, 188 Imaginary constants, 153 Implementation-defined behavior, 187 Implicit inclusion, 22, 151 inline_always directive, 113 inline_disable directive, 112 inline_enable directive, 112 inline_never directive, 113 inline_reset directive, 112 Inlining directives, 111 Inlining options, 37 instantiate directive, 90, 149 Instantiation directives, 90, 149 directory for template instantiation object files, 20 enable or disable automatic, 20 local files, 22 modes, 21, 148 nonlocal object file recompiled, 22 one per object file, 20, 147, 149 prelinker, 143 remove flags, 22 simple, 20, 144 template, 143 _int_mult_upper, 232 Integers overflow, 191 representation, 191 Interchange loops, 41 Interlanguage communication, 169 argument passing, 173 array storage, 173 assembly language functions, 171 blank common block, 177 calling a C and C++ function from Fortran, 182 calling a C program from C++, 169 calling a Fortran program from C++, 181 calling Fortran routines, 172 logical and character data, 174 Intermediate translations, 7 
Intrinsic functions, 26 argument types, 229 summary, 229 Intrinsics, 27 intrinsics.h, 229 ivdep directive, 91 K K & R preprocessing, 55 Keywords extern "C", 169 fortran, 154 L -L libdir, 56 -l libfile, 55 LANG, 67, 224 Language general -h [no]calchars, 24 -h keep=file, 23 -h restrict=args, 23 standard conformance -h [no] conform, 14 -h [no]anachronisms, 15 -h [no]c99, 13 -h [no]exceptions, 15 -h [no]tolerant, 16 -h cfront, 14 -h new_for_init, 15 templates -h [no]autoinstantiate, 20 -h [no]implicitinclude, 22 -h instantiate=mode, 21 -h instantiation_dir, 20 -h one_instantiation_per_object, 20 S–2179–60 245Cray® C and C++ Reference Manual -h prelink_copy_if_nonlocal, 22 -h prelink_local_copy, 22 -h remove_instantiation_flags, 22 -h simple_templates, 20 virtual functions -h forcevtbl, 22 -h suppressvtbl, 22 Launching applications, 163 ld, 7 _leadz, 231 Lexical block, defined, 76 Libraries default, 56 Standard C, 197 Library, Standard Template, 197 Limits, 187 Linking files, 23 Loader default, 197 -L libdir, 56 -l libfile, 55 ld, 7 -o outfile, 57 Logical data, 174 Loop directives, 77 fusion, 110 no unrolling, 109 unrolling, 109 Loop optimization –h [no]unroll, 31 safe_address, 99 loop_info directive, 92 Loopmark listings, 28 _ltob conversion utility, 175 M -M option, 55 Macros, 171 expansion in directives, 75 removing definition, 55 Macros, predefined, 157 _ADDR64, 160 __cplusplus, 158 cray, 161 CRAY, 161 _CRAY, 160 _CRAYC, 161 _CRAYIEEE, 160 __craynv, 161 _CRAYSV2, 160 __crayx1, 160 __crayx2, 161 __DATE__, 158 __FILE__, 158 __gnu_linux__, 159 __LINE__, 158 linux, 159 __linux, 159 __linux__, 159 __LITTLE_ENDIAN, 160 __LITTLE_ENDIAN__, 160 _MAXVL, 161 _RELEASE, 161 _RELEASE_MINOR, 161 _RELEASE_STRING, 161 __STDC__, 158 __sv, 160 __sv2, 160 __TIME__, 158 _UNICOSMP, 159 unix, 159 _unix, 159 __UPC__, 162 __UPC_DYNAMIC_THREADS__, 162 __UPC_STATIC_THREADS__, 162 _mask, 232 _maskl, 232 _maskr, 232 Math -h matherror=method, 45 mbtowc, 191 _mclr, 230 246 S–2179–60Index Memory, 
Cray Performance Analysis Tools 5.3 Release Overview and Installation Guide
S–2474–53

© 2011 Cray Inc. All Rights Reserved. This document or parts thereof may not be reproduced in any form unless permitted by contract or by written permission of Cray Inc.

U.S. GOVERNMENT RESTRICTED RIGHTS NOTICE

The Computer Software is delivered as "Commercial Computer Software" as defined in DFARS 48 CFR 252.227-7014. All Computer Software and Computer Software Documentation acquired by or for the U.S. Government is provided with Restricted Rights. Use, duplication or disclosure by the U.S.
Government is subject to the restrictions described in FAR 48 CFR 52.227-14 or DFARS 48 CFR 252.227-7014, as applicable. Technical Data acquired by or for the U.S. Government, if any, is provided with Limited Rights. Use, duplication or disclosure by the U.S. Government is subject to the restrictions described in FAR 48 CFR 52.227-14 or DFARS 48 CFR 252.227-7013, as applicable. Cray, LibSci, and PathScale are federally registered trademarks and Active Manager, Cray Apprentice2, Cray Apprentice2 Desktop, Cray C++ Compiling System, Cray CX, Cray CX1, Cray CX1-iWS, Cray CX1-LC, Cray CX1000, Cray CX1000-C, Cray CX1000-G, Cray CX1000-S, Cray CX1000-SC, Cray CX1000-SM, Cray CX1000-HN, Cray Fortran Compiler, Cray Linux Environment, Cray SHMEM, Cray X1, Cray X1E, Cray X2, Cray XD1, Cray XE, Cray XEm, Cray XE5, Cray XE5m, Cray XE6, Cray XE6m, Cray XK6, Cray XMT, Cray XR1, Cray XT, Cray XTm, Cray XT3, Cray XT4, Cray XT5, Cray XT5 h , Cray XT5m, Cray XT6, Cray XT6m, CrayDoc, CrayPort, CRInform, ECOphlex, Gemini, Libsci, NodeKARE, RapidArray, SeaStar, SeaStar2, SeaStar2+, Sonexion, The Way to Better Science, Threadstorm, uRiKA, and UNICOS/lc are trademarks of Cray Inc. AMD and AMD Opteron are trademarks of Advanced Micro Devices, Inc. CUDA is a trademark of NVIDIA Corporation. FlexNet is a trademark of Flexera Software. GNU is a trademark of The Free Software Foundation. Intel is a trademark of Intel Corporation in the United States and/or other countries. Linux is a trademark of Linus Torvalds. Lustre is a trademark of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. PGI is a trademark of The Portland Group Compiler Technology, STMicroelectronics, Inc. SUSE is a trademark of Novell, Inc. UNIX, the “X device,” X Window System, and X/Open are trademarks of The Open Group in the United States and other countries. Windows is a trademark of Microsoft Corporation. 
All other trademarks are the property of their respective owners.

Contents

Part I: Release Overview

Introduction [1]
1.1 Emphasis for the 5.3 Release

Software Enhancements [2]
2.1 Improved Cray XK Support
2.2 Automatic MPI Rank-Order Analysis
2.3 Run Time Library Changes
2.4 Online Help
2.5 Bug Reports Addressed Since the Last Release

Compatibilities and Differences [3]
3.1 Users Must Recompile Applications
3.2 Data File Compatibility
3.3 FlexNet License Server Update Required

Documentation [4]
4.1 Accessing Product Documentation
4.2 Cray-developed Books Provided with This Release
4.3 Additional Documentation Resources
4.4 New or Changed Cray Man Pages

Release Package [5]
5.1 Hardware and Software Requirements
5.2 Contents of the Release Package
5.3 Licensing

Part II: Installation Guide

Installation [6]
6.1 Installing the Performance Analysis Tools on Cray Systems
6.2 Activating the FlexNet License Key
6.3 Installing the Cray Performance Analysis Tools on Standalone Linux Systems
6.3.1 System Requirements
6.3.2 Installation Procedure
6.4 Installing Cray Apprentice2 on Microsoft Windows Systems
6.5 Using CrayPat and Cray Apprentice2
6.5.1 Loading Modules
6.5.2 Using CrayPat and Cray Apprentice2

Procedures
Procedure 1. Installing the Performance Tools rpm files
Procedure 2. Adding a new key to a license file
Procedure 3. Installing from a tar file
Procedure 4. Installing on Microsoft Windows

Tables
Table 1. Books Provided with This Release
Table 2. Additional Documentation Resources
Table 3. Commonly Used Module Arguments
Table 4. CrayPat Man Pages

Part I: Release Overview

Introduction [1]

This document provides an overview of the 5.3.0 release of CrayPat and Cray Apprentice2 for systems running the Cray Linux Environment (CLE) operating system, including Cray XE and Cray XK systems. This document also provides instructions for installing these products on your system.

1.1 Emphasis for the 5.3 Release

This release provides the following key enhancements:
• Improved support for Cray XK systems with Graphics Processing Unit (GPU) accelerators
• Automatic MPI rank-order analysis and rank-order placement file generation
• Updated entry points for improved tracing of SHMEM, Chapel, and PGAS code
• New Cray Apprentice2 client for Microsoft Windows

Software Enhancements [2]

This chapter outlines the enhancements provided with this release.
For a full list of the changes implemented in pat_build, the run time library (RTL), and pat_report, load the perftools module and use the module help perftools command. For compatibility issues and differences that you may encounter when installing or using this release, see Chapter 3, Compatibilities and Differences.

2.1 Improved Cray XK Support

GPU statistics are now available in full trace mode, in addition to the default run time summary mode. Full trace is enabled by setting the PAT_RT_SUMMARY environment variable to 0.

Note: Enabling full trace can produce enormous data files. Users should consider reading the section titled "Controlling Data File Size" in Using Cray Performance Analysis Tools.

In addition, new reporting options have been added to improve the understanding of GPU statistics. For more information, see Using Cray Performance Analysis Tools or the pat_report(1) man page.

2.2 Automatic MPI Rank-Order Analysis

By default, MPI program ranks are placed on compute node cores sequentially in SMP style, as described in the intro_mpi(3) man page. You can use the MPICH_RANK_REORDER_METHOD environment variable to override this default placement, and in some cases achieve significant improvements in performance.

With this release, the Cray Performance Analysis Tools provide two ways to help optimize MPI rank order. If you already understand your program's patterns of communications well enough to specify an optimized rank order without further assistance, you can use the grid_order utility to manually generate a rank order list that can be used as an input to the MPICH_RANK_REORDER_METHOD environment variable.

Alternatively, you can use CrayPat to analyze MPI sent-message data, detect a grid topology, report rank-order optimization information, and automatically generate a recommended rank-order placement file for use with MPICH_RANK_REORDER_METHOD.
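As a rough illustration of the two environment settings discussed in sections 2.1 and 2.2, the following sketch sets them in a shell session before an application is launched. The PAT_RT_SUMMARY value comes from this guide. The value 3 for MPICH_RANK_REORDER_METHOD and the placement file name MPICH_RANK_ORDER are assumptions drawn from common Cray MPI usage and are not stated in this guide; check the intro_mpi(3) man page on your system before relying on them.

```shell
#!/bin/sh
# Sketch only: environment settings discussed in sections 2.1 and 2.2.

# Full trace mode (section 2.1): 0 turns off the default run time summary.
# Full trace can produce very large data files.
export PAT_RT_SUMMARY=0

# Assumption (not stated in this guide): the value 3 asks the MPI library to
# read a custom placement from a file named MPICH_RANK_ORDER in the current
# working directory.
export MPICH_RANK_REORDER_METHOD=3

# Point out a missing placement file before the job is launched.
if [ ! -f MPICH_RANK_ORDER ]; then
    echo "note: no MPICH_RANK_ORDER file found in $(pwd)"
fi

echo "PAT_RT_SUMMARY=$PAT_RT_SUMMARY"
echo "MPICH_RANK_REORDER_METHOD=$MPICH_RANK_REORDER_METHOD"
```

Setting the variables in the same shell or batch script that launches the application keeps the tracing and placement choices visible alongside the launch command.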
For more information, see Using Cray Performance Analysis Tools.

2.3 Run Time Library Changes

As a result of changes implemented to improve support for Cray XK systems, the following run time environment variables have been added to the run time library.

PAT_RT_ACCPC
PAT_RT_ACCPC_FILE
PAT_RT_ACCPC_FILE_GROUP
PAT_RT_ACC_FORCE_SYNC

The following obsolete run time environment variables are removed in this release.

PAT_RT_TRACE_ARCHIVE
PAT_RT_TRACE_LOOPS
PAT_RT_DOFORK
PAT_RT_OMP_SYNC_TRIES

For more information about run time environment variables, see the intro_craypat(1) man page or Using Cray Performance Analysis Tools.

2.4 Online Help

CrayPat includes an extensive online help system, which features many examples and the answers to many frequently asked questions. To access the help system, enter this command:

    > pat_help

The pat_help command accepts options. For example, to jump directly into the FAQ, enter this command:

    > pat_help FAQ

Once the help system is launched, navigation is by one-key commands (e.g., / to return to the top-level menu) and text menus. It is not necessary to enter entire words to make a selection from a text menu; only the significant letters are required. For example, to select "Building Applications" from the FAQ menu, it is sufficient to enter Buil. Help system usage is documented further in the pat_help(1) man page.

Likewise, Cray Apprentice2 features an online help system as well as numerous pop-ups and tool-tips that are displayed by hovering the cursor over an area of interest on a chart or graph. To access the online help system, click the Help button, or right-click on any report tab and then select Panel Help from the pop-up menu. Feel free to experiment with the Cray Apprentice2 user interface and to left- or right-click on any area that looks like it might be interesting.
Because Cray Apprentice2 does not write any data files, you cannot corrupt, truncate, or otherwise damage your original .ap2 data file using Cray Apprentice2.

2.5 Bug Reports Addressed Since the Last Release

The following bug reports are addressed in this release.

776439 craypat trace instrumented VASP5.2 aborts if built with pat_build -g lapack
777545 CrayPat issues "Expected but did not find event tag"
771619 IO data is not passed from xf to ap2
774673 No write statistics show in pat_report from perftools 5.2.x
776614 perftools/5.2.3: pat_report -Ompi_rank_order no longer works
778603 The pat_report option, -s filter_input=..., has no effect.
779131 pat_report -d cum_sa,cum_sa% does not work

Compatibilities and Differences [3]

This chapter describes compatibility issues and functionality changes to be aware of when using this software after upgrading from an earlier release.

3.1 Users Must Recompile Applications

Because of changes made in this release, users must recompile and re-instrument applications that were compiled, linked, and instrumented using earlier CrayPat modules.

Caution: Failing to recompile and re-instrument an application results in undefined behavior.

3.2 Data File Compatibility

Data file compatibility is not maintained between versions. Programs instrumented using earlier versions of CrayPat must be recompiled, relinked, and reinstrumented using the current version of CrayPat. Likewise, .xf and .ap2 data files created using earlier versions of CrayPat cannot be read using the release 5.3.0 version of pat_report, nor can data files created using release 5.3.0 be read using earlier versions of pat_report or Cray Apprentice2.
Note: .ap2 data files created using earlier versions of pat_report can be read using Cray Apprentice2 release 5.3.0, but cannot take advantage of the improved infrastructure and therefore may appear to load slowly.

If you have upgraded to release 5.3.0 from an earlier version of CrayPat, the earlier version likely remains on your system in the /opt/cray/modulefiles/perftools directory. (This may vary depending on your site's software administration and default version policies.) To revert to the earlier version, you must explicitly unload the newer version and load the earlier version. For example, assuming the older version is 5.2.0:

    > module unload perftools
    > module load perftools/5.2.0

3.3 FlexNet License Server Update Required

Use of Cray performance analysis tools requires a FlexNet license. For information on activating the software license, see Section 6.2, Activating the FlexNet License Key.

Because of changes made in this release, sites must obtain and install a new FlexNet license for this product. This product will not work with older versions of the FlexNet license.

Documentation [4]

This chapter describes the documentation that supports the Performance Tools 5.3.0 release.

4.1 Accessing Product Documentation

With each software release, Cray provides books and man pages, and in some cases, third-party documentation. These documents are provided in the following ways:

CrayPort
CrayPort is the external Cray website for registered users that offers documentation for each product. CrayPort has a portal page for each product that contains links to all of the documents associated with that product. CrayPort enables you to quickly access and search Cray books, man pages, and in some cases, third-party documentation.
You access CrayPort by using the following URL: http://crayport.cray.com

CrayDoc
CrayDoc is the Cray documentation delivery system. CrayDoc enables you to quickly access and search Cray books, man pages, and in some cases, third-party documentation. Access the HTML and PDF documentation via CrayDoc at the following locations.
• The local network location defined by your system administrator
• The CrayDoc public website: http://docs.cray.com

Man pages
Man pages are textual help files available from the command line on Cray machines. To access man pages, enter the man command followed by the name of the man page. For more information about man pages, see the man(1) man page by entering:

    % man man

Third-party documentation
Third-party documentation that is not provided through CrayPort or CrayDoc is included with the third-party product.

4.2 Cray-developed Books Provided with This Release

The books provided with this release are listed in Table 1, which also notes whether each book was updated. Books are provided in HTML and PDF formats.

Table 1. Books Provided with This Release

Book Title | Number | Updated
Cray Performance Analysis Tools Release Overview and Installation Guide (this document) | S–2474–53 | Yes
Using Cray Performance Analysis Tools | S–2376–53 | Yes
CrayDoc Installation and Administration Guide | S–2340–411 | No

4.3 Additional Documentation Resources

Table 2 lists additional resources for obtaining documentation not included with this release package.

Table 2. Additional Documentation Resources

Product | Documentation Source
GNU compilers | Documentation for the GNU C and Fortran compilers is available at http://gcc.gnu.org/onlinedocs/
Lustre | Additional Lustre documentation is available at http://wiki.lustre.org/index.php/Lustre_Documentation
PAPI | PAPI documentation is available at http://icl.cs.utk.edu/papi/
RPM | RPM documentation is available at http://www.rpm.org

4.4 New or Changed Cray Man Pages

The following Cray man pages are new or updated with this release:
• intro_craypat(1)
• pat_build(1)
• pat_report(1)
• pat_help(1)
• grid_order(1)
• accpc(5)
• nwpc(5)
• papi_counters(5)
• app2(1)

Release Package [5]

5.1 Hardware and Software Requirements

The Performance Analysis Tools 5.3.0 release requires the following environment in order to run.
• Cray Linux Environment (CLE) 3.1 or later
• Microsoft Windows 7 (optional: for Cray Apprentice2 only)
• Cray xt-asyncpe 3.2 or later
• System Management Workstation 2.0 or later
• CRMS 2.0 or later
• At least one of the following compilers:
  – Cray CCE 8.0 or later
    Note: Required for Cray XK systems
  – GCC 4.5.x
    Note: GCC 4.6.x does not currently support user function tracing in CrayPat
  – Intel 12.0 or later
  – PGI 10.9 or later

5.2 Contents of the Release Package

The release package includes:
• CrayPat 5.3.0
• Cray Apprentice2 5.3.0
• PAPI 4.2.0
• CUDA 4.0.17 (for Cray XK systems only)
• CUDATOOLS 4.0.17 (for Cray XK systems only)
• CrayDoc software suite and the documentation, described in Chapter 4, Documentation
• A printed copy of this release overview

5.3 Licensing

The Cray Performance Measurement and Analysis Tools product uses the software license agreement for Cray software. Upgrades to this product are provided only when a software support agreement for this Cray software product is in place.
For more information about licensing and pricing, contact your Cray sales representative, or send e-mail to crayinfo@cray.com. To request FlexNet license manager keys for Cray Compiling Environment and Cray Performance Measurement and Analysis Tools releases, contact license_keys@cray.com.

Part II: Installation Guide

Installation [6]

The Cray Performance Analysis Tools package is distributed on CD-ROM. It is also available as downloadable rpm files. The instructions in this chapter assume that you are working with a CD-ROM.

Note: Cray Inc. no longer provides the DWARF and ELF libraries, which previously were included in the toolsup rpm. You can download the latest version of DWARF from http://reality.sgiweb.org/davea/dwarf.html and the latest version of ELF from http://www.mr511.de/software/english.html.

6.1 Installing the Performance Analysis Tools on Cray Systems

You must have root permissions in order to install this software on Cray systems.

Procedure 1. Installing the Performance Tools rpm files

1. Log on to the SMW as root.

    % ssh root@smw

2. Load and mount the distribution media, if necessary.

    smw:~# mount /dev/cdrom /media/cdrom

3. Create a temporary directory on the boot node for the installation files, if one does not already exist.

    smw:~# ssh boot mkdir /tmp/install.perftools

4. Copy the installation files from the distribution media to the boot node.

    smw:~# scp -pr /media/cdrom/perftools-version.x86_64.rpm \
        boot:/tmp/install.perftools
    smw:~# scp -pr /media/cdrom/perftools-clients-version.x86_64.rpm \
        boot:/tmp/install.perftools

5. Unmount and remove the distribution media.

    smw:~# umount /media/cdrom

6. Log in to the boot node as root.

    smw:~# ssh root@boot

7. Change to your temporary directory.

    boot001:~# cd /tmp/install.perftools

8. Create a target directory on the shared root and copy the installation files from your temporary directory to the shared root.

    boot001/:/tmp/install.perftools # mkdir -p /rr/current/software/install.perftools
    boot001/:/tmp/install.perftools # cp -p perftools-version.x86_64.rpm \
        /rr/current/software/install.perftools
    boot001/:/tmp/install.perftools # cp -p perftools-clients-version.x86_64.rpm \
        /rr/current/software/install.perftools

9. Open an xtopview session.

    boot001/:/tmp/install.perftools # xtopview

10. Change to the temporary directory you created on the shared root.

    default/:/# cd /software/install.perftools

11. (Optional) If you want the versions you are about to install to become the new default versions, set the environment variable.

    default/:/software/install.perftools # export CRAY_INSTALL_DEFAULT=1

If you do not set this environment variable, any previously installed default version remains the default version, and your users will need to load a specific module in order to select the newly installed version.

12. Use the rpm command to install the files.

Note: When running rpm from within xtopview, the rpm utility issues a warning that it cannot find /rr/current. This warning may safely be ignored.

To install the performance analysis tools for use on a Cray system, use these commands:

    default/:/software/install.perftools # rpm -ivh --oldpackage perftools-version.x86_64.rpm
    default/:/software/install.perftools # rpm -ivh --oldpackage \
        perftools-clients-version.x86_64.rpm

13. (Optional) After RPM file installation is complete, if you set the CRAY_INSTALL_DEFAULT environment variable earlier, unset it now:

    default/:/software/install.perftools # unset CRAY_INSTALL_DEFAULT

14. Exit from the xtopview session:

    default/:/software/install.perftools # exit

15. Log out of the boot node:

    boot001/:/tmp/install.perftools # exit
    logout
    Connection to boot closed.
    smw:~#

16. Log out of the SMW.

    smw:~# exit
    logout
    %

6.2 Activating the FlexNet License Key

Note: Cray Performance Analysis Tools release 5.3.0 requires a new FlexNet software license key for all installations.
Even if you are upgrading from an earlier version of the Cray Performance Analysis Tools, you must obtain and install a new license key.

To activate your software license, insert the FlexNet software license key information provided by Cray into a FlexNet license file on your system. The FlexNet license file contains data that determines whether a licensed software product is allowed to run. The license file contains the following information:
• The FlexNet software license key for your Cray Inc. product
• Initial installation instructions
• Update instructions
• License manager utilities
• Technical Support information

Cray Inc. recommends that you name your license file /opt/cray/perftools/perftools.lic. These instructions assume that the FlexNet license manager is already running, that your license file is located in the directory /opt/cray/perftools, and that the file is named perftools.lic.

The FlexNet license manager should already be installed on your system. If it is not, follow the installation instructions in Appendix A, Installing FlexNet, in Cray Compiling Environment Release Overview and Installation Guide (S–5212).

Procedure 2. Adding a new key to a license file

1. Log in to your license server as admin or superuser.

2. Locate your existing license file, if any.

    # ls /opt/cray/perftools

If the directory does not exist, create it.

    # mkdir -p /opt/cray/perftools

3. In /opt/cray/perftools, create the plain text file perftools.lic. Copy the FlexNet license key you received from Cray (typically in an email message) to perftools.lic.

4. Set the file access permissions to 644.

    # chmod 644 /opt/cray/perftools/perftools.lic

5. Update your FlexNet license server to use the new key. Verify that the license server is running.

    # lmstat

If the server is not running, follow the installation instructions in Appendix A, Installing FlexNet, in Cray Compiling Environment Release Overview and Installation Guide (S–5212). Assuming the server is running, re-read the license file.

    # lmreread

Your license is now ready to use.

6.3 Installing the Cray Performance Analysis Tools on Standalone Linux Systems

The Cray Performance Analysis Tools can also be installed on many common Linux desktop systems. Follow the instructions in this section to install the desktop version.

Note: In this release, only Cray Apprentice2 and some selected pat_report functions are supported on standalone Linux systems. You are still required to instrument your programs using pat_build, execute your programs using aprun, and perform the initial conversion of .xf data to .ap2 report files using pat_report, on a Cray system.

6.3.1 System Requirements

Note: Because of the high degree of variability in common Linux desktop installations, you may be required to install or update other libraries and utilities in order to address dependencies.

Before installing the desktop version, verify that the following requirements are met.
• You have root permissions on the Linux system.
• The Linux system must use at least one 64-bit, x86-based processor (AMD Opteron, Intel Pentium 4, or equivalent).
• The Linux system must have at least 1 GB of RAM. More is preferable.
• The Linux system must have at least 70 GB of total disk space and at least 3 GB of free disk space. More is preferable.
• The Linux system must run SUSE Linux Enterprise Desktop (SLED) 11 or later.
  Note: The desktop version of the Cray Performance Analysis Tools may be usable on other Linux systems. However, behavior on other versions of Linux is untested at this time.
• Modules 3.1.6 or later must be installed.
  If you do not already have Modules installed on your desktop system, you can install it from the file cray-modules-3.1.6-14.x86_64.rpm, which is distributed with the Cray XT operating system.
• The /opt file system must exist and be mounted in the root of the file system.
• The /tmp file system must have sufficient space to hold the temporary files created during installation.
• Root must have write permissions into /opt.

6.3.2 Installation Procedure

Procedure 3. Installing from a tar file

1. Log on to the Linux system as root.

2. Copy the perftools-clients-.tar file to a working directory.

3. (Optional) If you want the application you are about to install to become the default version, set the environment variable:

    $ export CRAY_INSTALL_DEFAULT=1

4. Use the tar command to extract the distribution file:

    $ tar -xvf perftools-clients-version.tar

5. Change to the installation directory:

    $ cd perftools-clients-version

6. Execute the installation script:

    $ ./Install

The end-user license agreement displays. You must type Yes to accept the license agreement in order to continue installation.

7. (Optional) After the installation script completes, if you set the CRAY_INSTALL_DEFAULT environment variable earlier, unset it now.

8. Resume normal operations.

6.4 Installing Cray Apprentice2 on Microsoft Windows Systems

This release includes a version of Cray Apprentice2 that can be installed and used on Microsoft Windows systems.

Note: The Windows version works on Windows 7 only. It is not supported on earlier versions of the Microsoft Windows operating system.

To install this version of Cray Apprentice2 on Windows, follow these steps.

Procedure 4. Installing on Microsoft Windows

1. Locate the installer file on your distribution media. The Windows installer file is named Apprentice2Installer_version.exe.

2. Copy the installer to the Windows system on which you want to use it.

3. Double-click on the installer file to begin the installation.

4. Follow the on-screen prompts to complete the installation process.

After Cray Apprentice2 is installed on your Windows system, you can launch it either by double-clicking on the Cray Apprentice2 desktop icon, or by double-clicking on an .ap2 file.

6.5 Using CrayPat and Cray Apprentice2

Assuming your site has the correct licenses, use the module command to load the tools. Man pages are included in the associated Module files and become available only after the Module file is loaded.

6.5.1 Loading Modules

The module command can accept a number of arguments. The arguments most commonly used are listed in Table 3.

Table 3. Commonly Used Module Arguments

Argument | Description
list | View the list of modules that are currently loaded
avail | View the list of modules currently available to be loaded
load | Load a module file
swap | Swap a currently loaded module for another module
unload | Unload a currently loaded module file without swapping it for another module
use | Use a different set of module files
help | Release notes and module command usage information

To use the Cray Performance Analysis Tools, load the perftools module:

    > module load perftools

6.5.2 Using CrayPat and Cray Apprentice2

CrayPat and Cray Apprentice2 are described in Using Cray Performance Analysis Tools, as well as in man pages and online help systems. The three essential command line commands are:

pat_build
Instrument your program for data collection.

pat_report
After your program has completed execution, post-process the resulting data files for text reports and further analysis.

app2 [Cray_system_name]
Launch the Cray Apprentice2 client to conduct in-depth graphical analysis of the processed data files.

Note: If you are running the Cray Apprentice2 client on a standalone Linux system, also specify the Cray_system_name where the .ap2 files you wish to open reside.
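The three commands above are typically used in sequence. The following is a minimal sketch of that sequence wrapped in a shell function, not a definitive procedure. The program name my_app is hypothetical, the +pat suffix on the instrumented executable and the use of the .xf file as the pat_report argument are assumptions based on common CrayPat usage, and aprun launch options are omitted. The perftools module is assumed to be loaded already, as described in section 6.5.1.

```shell
#!/bin/sh
# Sketch of a typical CrayPat session on a Cray system.
# Assumes the perftools module is already loaded (module load perftools).
craypat_workflow() {
    app=$1    # hypothetical program name, e.g. my_app

    # Stop early if the tools are not on the PATH.
    if ! command -v pat_build >/dev/null 2>&1; then
        echo "pat_build not found; load the perftools module first"
        return 1
    fi

    # Instrument the program. Assumption: the instrumented executable is
    # written next to the original with a +pat suffix.
    pat_build "$app" || return 1

    # Run the instrumented program with aprun, which writes .xf data.
    aprun "./${app}+pat" || return 1

    # Post-process the .xf data into a text report and .ap2 files.
    pat_report ./*.xf

    # Optional graphical analysis of the processed data:
    # app2 ./*.ap2
}
```

Calling craypat_workflow my_app on a system without the tools installed prints the reminder and returns a non-zero status, so the function is safe to keep in a shared profile.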
These commands and their options are discussed in the following man pages.

Table 4. CrayPat Man Pages

Man Page           Description
intro_craypat(1)   A quick introduction to CrayPat usage and detailed information about runtime environment variables that affect the kind, quality, and quantity of information captured during program execution
pat_build(1)       Detailed information about preparing your programs for performance analysis experiments
pat_report(1)      Detailed information about the reports that can be generated from performance analysis data after it has been captured
app2(1)            A quick introduction to the Cray Apprentice2 graphical data analysis tool

CrayPat also includes pat_help(1), an extensive online help system and tutorial that includes many practical examples of CrayPat usage as well as the answers to many frequently asked questions.

The most common cause of confusion when getting started with CrayPat is losing track of where you are in the Cray system. For example, some commands can be run on either service or compute nodes, while others return valid results only when run on compute nodes. To further complicate matters, some commands can be run only on specific types of nodes, and then only if launched from a mount-point on a Lustre file system and launched using the correct utility. If a CrayPat or PAPI command does not seem to return the expected results, always verify that you have the correct modules loaded for the system you are using and that you are running the command from the correct location in the file system.

Cray XMT™ System Overview
S–2466–20

© 2007–2009, 2011 Cray Inc. All Rights Reserved. This document or parts thereof may not be reproduced in any form unless permitted by contract or by written permission of Cray Inc.

Copyright (c) 2008, 2011 Cray Inc. All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. * Neither the name Cray Inc. nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. Your use of this Cray XMT release constitutes your acceptance of the License terms and conditions. U.S. GOVERNMENT RESTRICTED RIGHTS NOTICE The Computer Software is delivered as "Commercial Computer Software" as defined in DFARS 48 CFR 252.227-7014. All Computer Software and Computer Software Documentation acquired by or for the U.S. Government is provided with Restricted Rights. Use, duplication or disclosure by the U.S. Government is subject to the restrictions described in FAR 48 CFR 52.227-14 or DFARS 48 CFR 252.227-7014, as applicable. Technical Data acquired by or for the U.S. 
Government, if any, is provided with Limited Rights. Use, duplication or disclosure by the U.S. Government is subject to the restrictions described in FAR 48 CFR 52.227-14 or DFARS 48 CFR 252.227-7013, as applicable.

Cray, LibSci, and PathScale are federally registered trademarks and Active Manager, Cray Apprentice2, Cray Apprentice2 Desktop, Cray C++ Compiling System, Cray CX, Cray CX1, Cray CX1-iWS, Cray CX1-LC, Cray CX1000, Cray CX1000-C, Cray CX1000-G, Cray CX1000-S, Cray CX1000-SC, Cray CX1000-SM, Cray CX1000-HN, Cray Fortran Compiler, Cray Linux Environment, Cray SHMEM, Cray X1, Cray X1E, Cray X2, Cray XD1, Cray XE, Cray XEm, Cray XE5, Cray XE5m, Cray XE6, Cray XE6m, Cray XMT, Cray XR1, Cray XT, Cray XTm, Cray XT3, Cray XT4, Cray XT5, Cray XT5h, Cray XT5m, Cray XT6, Cray XT6m, CrayDoc, CrayPort, CRInform, ECOphlex, Gemini, Libsci, NodeKARE, RapidArray, SeaStar, SeaStar2, SeaStar2+, The Way to Better Science, Threadstorm, and UNICOS/lc are trademarks of Cray Inc.

AMD is a trademark of Advanced Micro Devices, Inc. Linux is a trademark of Linus Torvalds. NFS is a trademark of Sun Microsystems, Inc. in the United States and other countries. UNIX, the “X device,” X Window System, and X/Open are trademarks of The Open Group in the United States and other countries. All other trademarks are the property of their respective owners.

RECORD OF REVISION

S–2466–20 Published May 2011 Supports software release 2.0 GA running on Cray XMT Series compute nodes and Cray XT service nodes. This release uses CLE version 3.1UP02 and System Management Workstation (SMW) version 5.1UP03.

1.4 Published December 2009 Supports release 1.4 running on Cray XMT compute nodes and CLE 2.2.UP01 on Cray XT service nodes. This release uses the System Management Workstation (SMW) version 4.0.UP02.

1.3 Published March 2009 Supports release 1.3 running on Cray XMT compute nodes and on Cray XT 2.1.5HD service nodes. This release uses the System Management Workstation (SMW) version 3.1.09.

1.2 Published August 2008 Supports release 1.2 running on Cray XMT compute nodes and on Cray XT 2.0.49 service nodes. This release uses the System Management Workstation (SMW) version 3.1.04.

1.1 Published March 2008 Supports limited availability (LA) release 1.1.01 running on Cray XMT compute nodes and on Cray XT 2.0 service nodes.

1.0 Published July 2007 Supports the 1.0 limited availability (LA) release of the Cray XMT.

Contents  Page

Introduction [1]  9
1.1 Cray XMT Features  9
1.2 Related Publications  10
1.2.1 Publications for Application Developers  10
1.2.2 Publications for System Administrators  11
Hardware Overview [2]  13
2.1 Basic Hardware Components  13
2.1.1 Threadstorm 4.0 Processor  13
2.1.2 DIMM Memory  14
2.1.3 Cray SeaStar2 Chip  14
2.1.4 System Interconnection Network  16
2.1.5 RAID Disk Storage Subsystems  16
2.2 Nodes  16
2.2.1 Compute Nodes  17
2.2.2 Service Nodes  17
2.3 Blades, Chassis, and Cabinets  18
2.3.1 Blades  18
2.3.2 Chassis and Cabinets  19
Software Overview [3]  21
3.1 Cray SeaStar High-speed Network Communication Interfaces  22
3.2 Cray Linux Environment (CLE) Operating System  24
3.2.1 SUSE LINUX Operating System  24
3.2.2 MTK Operating System  24
3.3 File Systems  25
3.3.1 Lustre File System  25
3.3.2 Random Access Memory File System  26
3.3.3 Network File System  27
3.4 User Environment  27
3.5 System Administration  27
3.5.1 System Management Workstation  28
3.5.2 Shared-root File System  28
3.5.3 Configuration and Source Files  28
3.5.4 System Monitoring  29
3.5.5 System Log  29
Application Development [4]  31
4.1 User Runtime Library  31
4.2 Lightweight User Communication Library (LUC) API  31
4.3 Compiling Programs  32
4.3.1 Compiler Commands  32
4.4 Running Applications  32
4.5 Debugging Applications  32
4.6 Monitoring Applications  33
4.7 Measuring Performance  33
4.7.1 Cray Apprentice2  33
4.7.2 Canal  34
4.7.3 Tview  34
4.7.4 Bprof  34
4.7.5 pproc  35
4.7.6 ap2view  35
4.7.7 Tprof  35
Cray Hardware Supervisory System (HSS) [5]  37
5.1 HSS Hardware  37
5.1.1 HSS Network  38
5.1.2 System Management Workstation  38
5.1.3 Hardware Controllers  39
5.2 HSS Software  39
5.2.1 Software Monitors  39
5.2.2 HSS Administrator Interfaces  39
5.3 HSS Actions  40
5.3.1 System Startup and Shutdown  40
5.3.2 Event Probing  40
5.3.3 Event Logging  41
5.3.4 Event Handling  41
Glossary  43

Figures

Figure 1. Threadstorm Processor Architecture  14
Figure 2. Cray SeaStar2 Chip  15
Figure 3. Cray XMT Hardware System Architecture  17
Figure 4. Chassis and Cabinet (front view)  19
Figure 5. Software Stack for Service and Compute Nodes  21
Figure 6. LUC Software Stack  23
Figure 7. Lustre Architecture on Cray XMT  26
Figure 8. HSS Components  38

Introduction [1]

This document provides an overview of the second generation of Cray XMT Series systems. The software portion of this document applies also to first generation Cray XMT systems running version 2.0 of the Cray XMT system software. For an overview of the first generation Cray XMT hardware, please refer to an earlier version of this document.
The intended audience is application developers and system administrators. Familiarity with the concepts of high-performance computing and the architecture of parallel processing systems is assumed.

1.1 Cray XMT Features

Cray XMT Series supercomputers are scalable, massively multithreaded platforms with a globally shared memory architecture. The second generation Cray XMT system is based on the Cray XT5 infrastructure and uses the Cray massively parallel processing (MPP) system design. The difference is that the Cray XMT compute blades use Threadstorm processors, which are designed to perform multithreaded operations.

The second generation Cray XMT platform has the following features:

• Performs large-scale data analysis.
• Uses Cray Threadstorm 4.0 processors. Each processor is directly connected to a dedicated Cray SeaStar2 interconnect chip, which gives the network high bandwidth and low latency.
• Scales from 16 to 512 processors providing over half a million threads, using 16 terabytes of system memory.
  – Uses nodes as the most basic scalable unit. There are two types of nodes. Service nodes provide support functions, such as managing the user's environment, handling I/O, and booting the system. Compute nodes run user applications.
  – Uses a global memory model. Applications have access to memory on any compute processor on the machine.
  – Uses the system interconnection network to connect compute and service nodes to maintain high communication rates as the number of nodes increases. Contains SeaStar2 chips connected in a full 3-D torus network.
• Contains separately dedicated compute, service, and I/O nodes.
  – Service nodes have AMD Opteron processors and can be configured for I/O, login, network, or system functions.
  – Compute nodes have Threadstorm processors.
• Runs the Cray Linux Environment (CLE) operating system, which distributes a multithreaded kernel (MTK) to the compute blades and standard Linux to the service and I/O blades. This enables the compute nodes to focus on the application without being hampered by system administrative functions.
• Includes a development environment that provides compilers, libraries, parallel programming models, debuggers, and performance measurement tools.

1.2 Related Publications

The Cray XMT system runs with a combination of proprietary, third-party, and open-source products, as documented in the following publications.

1.2.1 Publications for Application Developers

For information about the Cray XMT Programming Environment, see the following Cray guides.

• Cray XMT Programming Model
• Cray XMT Programming Environment User's Guide
• Cray XMT Performance Tools User's Guide
• Cray XMT Debugger Reference Guide
• Cray XMT man pages

1.2.2 Publications for System Administrators

The following publications are available for system administrators.

• Installing and Configuring Cray XMT System Software
• Installing and Configuring the Cray XMT System Management Workstation
• Cray XMT System Management
• Managing System Software for Cray XE and Cray XT Systems
• Cray XT System Overview
• Cray XT System Software Release Overview
• Cray XMT man pages
• System Management Workstation (SMW) man pages for Cray XT and Cray XMT

Hardware Overview [2]

2.1 Basic Hardware Components

The second generation Cray XMT platform includes the following hardware components:

• Threadstorm 4.0 processors on compute nodes and Opteron processors on service nodes
• Dual inline memory modules (DIMMs)
• Cray SeaStar2 chips
• System interconnection network
• RAID disk storage subsystems

2.1.1 Threadstorm 4.0 Processor

The second generation Cray XMT platform uses Threadstorm 4.0 processors on compute nodes.
Threadstorm processors feature:

• Multithreaded processors that support parallel operations.
• Ability to perform remote memory access.
• 128 streams on each processor with 31 general purpose 64-bit registers, 8 target registers, and a status word that includes the program counter. A stream is the hardware used to execute a single thread.
• 16 protection domains on each processor which provide address spaces. Each running stream belongs to one protection domain.
• Three functional units to support operations: the M unit, which issues a memory operation; the A unit, which executes an arithmetic operation; and the C unit, which executes a control or simple arithmetic operation. The Threadstorm ISA is a large instruction word (LIW) architecture in which each instruction can specify up to three operations, one for each functional unit.

The Threadstorm architecture includes the following elements:

• Instruction execution logic.
• Double data rate (DDR2) memory controller and data cache.
• HyperTransport (HT) logic and physical interface.
• A switch that connects these three components.

Figure 1. Threadstorm Processor Architecture
[Block diagram. Labeled blocks: Instruction Issue, Instruction Cache, A-Pipe, M-Pipe, C-Pipe, Register File, AMO Logic, Data Buffer, Switch, HT Interface, HT PHY, HyperTransport (HT), DDR Memory Interface, Memory Controller, DDR Controller, DDR PHY. Clock labels of 200 MHz, 300 MHz, 500 MHz, and 800 MHz appear in the figure.]

2.1.2 DIMM Memory

The Cray XMT supports double data rate dual inline memory modules (DIMMs). The second generation Cray XMT includes eight 4-GB or 8-GB DDR2 DIMMs for a maximum physical memory of 64 GB per node. The minimum amount of memory for service nodes is 2 GB. The Cray XMT uses Error-Correcting Code (ECC) memory protection technology.

2.1.3 Cray SeaStar2 Chip

The Cray XMT systems use Cray SeaStar2 chips. The Cray SeaStar2 application-specific integrated circuit (ASIC) chip is the system's message processor.
SeaStar2 offloads communications functions from the Threadstorm processor. A SeaStar2 chip contains:

• A HyperTransport Link, which connects SeaStar2 to the Threadstorm processor.
• A 3-D router that connects the chip to the system interconnection network using six high-speed serial links.
• A Remote Memory Access (RMA) block that converts Threadstorm remote memory references to network transactions and back.
• Two Direct Memory Access (DMA) engines, one for sending and the other for receiving, that manage the movement of data to and from node memory. The DMA engines are controlled by an embedded PowerPC processor (described in the next bullet).
• An embedded PowerPC processor to support the network interconnect. The processor programs the DMA engines and assists with other network-level processing needs, such as supporting the Portals message-passing layer of the Cray XT.
• A Portals message passing interface, which provides a data path from an application to memory. Portions of the interface are implemented in Cray SeaStar2 firmware, which transfers data directly to and from user memory without operating system intervention.
• A link to a blade control processor (also known as an L0 controller). Blade control processors are used for booting, monitoring, and maintenance. For more information, see Hardware Controllers on page 39.

Figure 2. Cray SeaStar2 Chip
[Block diagram of the Cray SeaStar chip. Labeled blocks: Router, RMA, HyperTransport Link, RAM, Processor, DMA Engine, Link to L0 Controller.]

2.1.4 System Interconnection Network

The system interconnection network is the communications center of the second generation Cray XMT system. The network consists of Cray SeaStar2 router links and the cables that connect the compute and service nodes. RMA requests and I/O data are transferred over the network. The network uses a Cray proprietary protocol to provide fast access to globally shared memory and Fast I/O (FIO) data transfers.
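
The SeaStar2 router described in Section 2.1.3 attaches each node to the 3-D torus through six links, one in each direction along each of the three axes. As an illustration only, the following Python sketch computes the six torus neighbors of a node and the minimal hop count between two nodes. The (x, y, z) coordinate scheme, the function names torus_neighbors and torus_hops, and the 4 x 4 x 4 example dimensions are assumptions made for this sketch. They are not part of any Cray interface, and this overview does not describe the routing that the system actually uses.

```python
from typing import List, Tuple

# Hypothetical (x, y, z) position of a node in the torus; not a Cray data type.
Coord = Tuple[int, int, int]


def torus_neighbors(node: Coord, dims: Coord) -> List[Coord]:
    """Return the six neighbors of `node` in a 3-D torus of size `dims`.

    Each axis contributes two neighbors (one step down, one step up), and the
    modulo arithmetic models the wraparound links that close the torus.
    Assumes every dimension is at least 3 so that the six neighbors are distinct.
    """
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            coord = list(node)
            coord[axis] = (coord[axis] + step) % dims[axis]
            neighbors.append(tuple(coord))
    return neighbors


def torus_hops(a: Coord, b: Coord, dims: Coord) -> int:
    """Return the minimal number of link traversals between nodes a and b.

    Along each axis the shorter of the two ways around the ring is taken.
    """
    return sum(min((x - y) % d, (y - x) % d) for x, y, d in zip(a, b, dims))


if __name__ == "__main__":
    dims = (4, 4, 4)  # example size only
    print(torus_neighbors((0, 0, 0), dims))
    print(torus_hops((0, 0, 0), (3, 0, 0), dims))
```

In this model, the wraparound links mean that no two nodes are more than half a ring apart along any axis.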

2.1.5 RAID Disk Storage Subsystems

Cray XMT systems use two types of RAID subsystems for data storage. The System RAID stores the boot image and system files. The Data RAID is configured as a Lustre file system that is accessible from the service nodes. The Lustre file system is not directly accessible from the compute nodes.

Data on the System RAID is globally accessible to the service partition. It is not accessible to the compute partition.

2.2 Nodes

In Cray XMT systems, processing components combine to form a node. There are two types of nodes: compute nodes and service nodes. Each node is a logical grouping of a processor, memory, and a data routing resource.

The following diagram shows a conceptual view of the 3-D torus network topology (torus links are not shown) for compute and service nodes. The service nodes connect to the compute nodes through the system interconnection network.

Figure 3. Cray XMT Hardware System Architecture
[Conceptual diagram. Compute partition: MTK (BSD). Service partition: Linux OS on specialized Linux nodes (Login PEs, IO Server PEs, Network Server PEs, FS Metadata Server PEs, Database Server PEs). Other labels: Compute, Service & IO, PCI-X, Network, 10 GigE, Fiber Channel, RAID Controllers.]

2.2.1 Compute Nodes

Compute nodes run application programs. Each second generation Cray XMT compute node consists of a Threadstorm 4.0 processor, DIMM memory, and a Cray SeaStar2 chip. All compute nodes in a logical system use the same processor type.

2.2.2 Service Nodes

Service nodes handle support functions such as user login, I/O, and network management. Each service node contains an Opteron processor, DIMM memory, and a SeaStar2 chip. In addition, each service node may be configured with one or two PCI-X or PCIe Ethernet Network Interface Cards (NIC).

Cray XMT systems include several types of service nodes, defined by the function they perform.

• Login nodes:
  A login node may have one or two PCI-X or PCIe cards that connect to your network. PCI-X and PCIe cards are supported for both Gigabit Ethernet and 10-Gigabit Ethernet, as well as Fibre Channel. Login nodes that do not have a NIC are accessed by first logging into a login node that has a connection to the network. All login nodes are Lustre clients, and mount the Lustre file system from the disk nodes. You can also use login nodes to run snapshot file system workers. These workers assist in moving data from the Threadstorm memory to the Lustre file system.
• Network service nodes: Each network service node has one or two network interface cards that are connected to the network.
• I/O nodes: Each I/O node uses one or two Fibre Channel cards to connect to RAID storage.
• Boot nodes: Each system requires one boot node. A boot node contains one Fibre Channel card and one Gigabit Ethernet card. The Fibre Channel card connects to the RAID subsystem, and the PCI-X or PCIe card connects to the SMW (see Chapter 5, Cray Hardware Supervisory System (HSS) on page 37 for further information).
• SDB nodes: Each SDB node contains a Fibre Channel card to connect to the SDB file system. The SDB node manages the state of the Cray XT system.

2.3 Blades, Chassis, and Cabinets

This section describes the main physical components of the Cray XMT system and their configurations.

2.3.1 Blades

A compute blade consists of four compute nodes, voltage regulator modules, and an L0 controller. Each compute blade within a logical machine is populated with Threadstorm processors of the same type and speed and memory chips of the same speed. The L0 controller is a Hardware Supervisory System (HSS) component; for more information about HSS hardware, see Chapter 5, Cray Hardware Supervisory System (HSS) on page 37.

A service blade consists of two service nodes, voltage regulator modules, up to two PCI-X or PCIe cards per node, and an L0 controller.
A service blade has four SeaStar2 chips to allow for a common board design and to simplify the interconnect configurations. Several different PCI-X or PCIe cards are available to provide Fibre Channel, GigE, and 10 GigE interfaces to external devices.

2.3.2 Chassis and Cabinets

Each cabinet contains three vertically stacked chassis, and each chassis contains eight vertically mounted blades. A cabinet can contain compute blades, service blades, or a combination of compute and service blades. A single variable-speed blower in the base of the cabinet cools the components.

Customer-provided three-phase power is supplied to the cabinet Power Distribution Unit (PDU). The PDU routes power to the cabinet's power supplies, which distribute 48 VDC to each of the chassis in the cabinet.

All cabinets have redundant power supplies. The PDU, power supplies, and the cabinet control processor (L1 controller) are located at the rear of the cabinet.

Figure 4. Chassis and Cabinet (front view)
[Front view of a cabinet. Labels: Fan; Chassis 2, Chassis 1, Chassis 0; Compute or service blade; Slot 0 through Slot 7 in each chassis. Each blade is labeled COMPUTE MODULE, HOT SWAP, L0, and CONSOLE 9600:8N1.]
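
The packaging figures given in this chapter can be combined into a quick capacity estimate. The following Python sketch is illustrative arithmetic only. The constants are taken from the text (three chassis per cabinet, eight blades per chassis, four compute nodes per compute blade, 128 streams per Threadstorm processor, and a maximum of 64 GB of memory per node). The function name cabinet_capacity and the assumption that every compute node is populated with the maximum memory are hypothetical and are not drawn from any Cray tool.

```python
# Constants taken from this chapter.
CHASSIS_PER_CABINET = 3
BLADES_PER_CHASSIS = 8
NODES_PER_COMPUTE_BLADE = 4
STREAMS_PER_PROCESSOR = 128   # one Threadstorm processor per compute node
MAX_MEMORY_GB_PER_NODE = 64   # assumes eight 8-GB DIMMs on every node


def cabinet_capacity(compute_blades: int) -> dict:
    """Estimate the compute resources of one cabinet.

    `compute_blades` is the number of blade slots holding compute blades;
    the remaining slots are assumed to hold service blades or to be empty.
    """
    max_blades = CHASSIS_PER_CABINET * BLADES_PER_CHASSIS
    if not 0 <= compute_blades <= max_blades:
        raise ValueError(f"a cabinet holds between 0 and {max_blades} blades")
    nodes = compute_blades * NODES_PER_COMPUTE_BLADE
    return {
        "compute_nodes": nodes,
        "streams": nodes * STREAMS_PER_PROCESSOR,
        "max_memory_gb": nodes * MAX_MEMORY_GB_PER_NODE,
    }


if __name__ == "__main__":
    # A cabinet filled entirely with compute blades.
    print(cabinet_capacity(24))
```

Under these assumptions, a cabinet filled entirely with compute blades holds 96 compute nodes.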

Software Overview [3]

Cray XMT Series systems run a combination of Cray-developed software, third-party software, and open-source software. The software is optimized for applications that have fine-grain synchronization requirements, large processor counts, and significant communication requirements.

This chapter provides an overview of the Cray XMT operating system, the random access memory file system (RAMFS), the application development environment, and system administration tools.

The software stack for Cray XMT is shown in the following figure. The stack on the left applies to the service nodes, and the stack on the right applies to compute nodes.

Figure 5. Software Stack for Service and Compute Nodes
[Two stacks joined by the SeaStar High-speed Network. Service node stack, top to bottom: gcc/g++ compiled application; libluc, pthreads; Linux; SeaStar driver; Opteron; SeaStar. Compute node stack, top to bottom: XMT compiled application; libluc, librt; MTK; SeaStar driver; Threadstorm; SeaStar.]

3.1 Cray SeaStar High-speed Network Communication Interfaces

The Cray SeaStar high-speed network supports three communication interfaces:

• The Lightweight User Communication Library (LUC) (user level) supports communication between service and compute nodes.
• The Portals API (kernel level) supports communication between the service nodes and is used by Lustre clients and servers for file system data transfers.
  The compute nodes implement a subset of the Portals API to support Fast I/O (FIO) and LUC communication between the compute nodes and service nodes.
• The FIO API (kernel level) supports LUC on compute nodes. FIO is layered over the MTK Portals SeaStar driver.

Figure 6 describes the LUC software protocol stack. The column on the left represents the software stack on the Linux service nodes. The column on the right represents the software stack on the MTK compute nodes. GDBS is the sample application in this figure. The application is linked with the LUC library. The LUC library presents a uniform application interface for both the service nodes and compute nodes, and abstracts the details of the actual implementation.

Figure 6. LUC Software Stack
[Two columns that exchange Portals messages over the HSN. Linux service node column: GDBS Application (Linux); LUC Library (Using Portals); Portals API; Portals Driver; XT3 Portals SeaStar Firmware. MTK compute node column: GDBS Application (MTK); LUC Library (Using Fast I/O API); Fast I/O API; Multiplexing Driver; MTK Accelerated SeaStar Firmware. Layer labels: User, Kernel, Device. Legend: the GDBS application is the primary user of Fast I/O services; new functionality; existing functionality on the Cray XMT service nodes.]

On service nodes, the LUC library is implemented using the Linux Portals library and system call interface. The standard Cray XT Portals driver is used with no modifications. The Portals driver interacts with the Portals firmware that executes on the Cray SeaStar chip associated with the service node. The firmware is responsible for sending and receiving Portals messages over the Cray SeaStar chip network. As with the Portals driver, there are no modifications to the Portals firmware.

On compute nodes, the LUC library is implemented using the FIO system call interface. The FIO API is optimized for large data transfers over the Cray SeaStar chip and provides a customized interface for the LUC library.
The system call layer is the layer above the FIO multiplexor driver. The multiplexor driver is responsible for multiplexing the operations from multiple FIO streams through a single Cray SeaStar chip. The multiplexor interacts with the firmware executing on the Cray SeaStar chip. The firmware is customized for FIO, and supports a subset of the Portals interface. Limiting the Portals interface reduces the complexity of the MTK driver and greatly simplifies the driver implementation, while providing the necessary operations for supporting fast and efficient transfers of large data packets.

3.2 Cray Linux Environment (CLE) Operating System

The base operating system for the Cray XMT Series is CLE. CLE is a distributed system of service node and compute node components. The service nodes run the SUSE LINUX operating system and the compute nodes run the MTK operating system.

3.2.1 SUSE LINUX Operating System

Service nodes perform the functions needed to support users, administrators, and applications running on compute nodes. Each service node runs a full-featured version of SUSE LINUX. The operating systems on each service node run independently from the others. Above the operating system level are specialized daemons and applications that perform functions unique to each service node. There are five basic types of service nodes: login, network, I/O, boot, and service database (SDB) service nodes; see Service Nodes on page 17 for further information.

3.2.2 MTK Operating System

The MTK is a single instance of an operating system that runs as a monolithic operating system across all compute nodes on the Cray XMT Series system. This differs from other Cray systems where the operating system runs independently on each compute node. The system calls for this operating system are based on the Berkeley Software Distribution of UNIX (BSD) and include Cray extensions.
24 S–2466–20Software Overview [3] The MTK includes the following features: • Support for standard UNIX system operations such as fork/exec, signals, and so on. • Scheduling for protection domains rather than for threads. The User Runtime Library provides thread management. • Memory management that provides a global virtually contiguous data address space. • Ability to share memory between processors. • Native support for the RAMFS and network file system (NFS) client. 3.3 File Systems Cray XMT Series systems use three types of file systems: • The Lustre file system is used for data storage and is only accessible from the service nodes. • The RAMFS is used for the root file system for the compute node operating system and has no corresponding physical storage. • The NFS is used to transfer user applications from the service nodes to the compute node and it is accessible from both types of nodes. 3.3.1 Lustre File System The Lustre file system is hosted by the I/O service nodes. Lustre is a high-performance, highly scalable, POSIX-compliant parallel shared file system. Lustre is based on Linux and uses the Portals lightweight message passing API and an object-oriented architecture for storing and retrieving data. Cray XMT compute nodes access the Lustre file system directly by means of fsworkers, or indirectly by means of server and client applications that pass the data between the compute and service nodes. Use the LUC API to make remote procedure calls between the compute and service nodes and to allocate nearby memory buffers to store data. Lustre separates file metadata from data objects. Each instance of a Lustre file system consists of one or more Object Storage Servers (OSSs) and a single active Metadata Server (MDS). Each OSS hosts one or more Object Storage Targets (OSTs). Lustre OSTs are backed by RAID storage. Applications store data on OSTs, and files can be striped across multiple OSTs. S–2466–20 25Cray XMT™ System Overview Figure 7. 
Lustre Architecture on Cray XMT

[Figure labels: Metadata Server and Object Storage Servers, each hosted on a Service I/O Node; Lustre File System; High-speed Network]

3.3.2 Random Access Memory File System

The Cray XMT compute partition has access to the random access memory file system (RAMFS). The RAMFS is allocated in memory and serves as the root file system for the compute nodes. It has no disk storage associated with it because it is used primarily for system utilities and shared libraries. RAMFS is loaded into Threadstorm memory at boot time and is accessed using standard I/O calls such as read and write.

Although RAMFS appears to be a normal file system, when the system goes down, all files in RAMFS are lost because it has no disk backup. If you create files in RAMFS that you want to retain across system boots, you must create a copy of the files in the NFS available on the service nodes.

In general, do not create large files in RAMFS. It is not a high-speed file system, and it is limited in size.

3.3.3 Network File System

The network file system (NFS) is the only file system on the Cray XMT that is directly accessible by both the service and compute nodes. The MTK operating system on Threadstorm compute nodes provides an NFS file system client that can directly access files on an NFS server. The NFS is the primary means of transferring user applications from the service node to the compute node. By default, one login node exports all user directories to the compute node over NFS. User applications may read and write files in the user's home directory using standard system calls. NFS is not a high-speed file system, but may be used to provide a persistent storage space.

3.4 User Environment

The user environment is similar to the environment on a typical Linux workstation. Users log on to a Cray XMT login node. Before working on the system, the user must do the following:

1. Set up a secure shell.
   The Cray XMT system uses ssh and ssh-enabled applications for secure, password-free remote access to login nodes. Before using ssh commands, the user must generate an RSA authentication key.

2. Load the appropriate modules.

   The Cray XMT system uses the Modules utility to support multiple versions of software, such as compilers, and to create integrated software packages. As new versions of the supported software become available, they are added automatically to the Programming Environment, and earlier versions are retained to support legacy applications. By specifying the module to load, the user can choose the default or another version of one or more Programming Environment tools. For details, see Cray XMT Programming Environment User's Guide and the module(1) and modulefile(4) man pages.

3.5 System Administration

The system administration environment provides the tools that administrators use to manage system functions, view and modify the system state, and maintain system configuration files. System administration components are a combination of Cray XMT Series system hardware, SUSE LINUX, and the Cray XT and Cray XMT utilities and resources.

Note: For information about standard SUSE LINUX administration, see http://www.tldp.org or http://www.suse.com.

Many of the components used for system administration are also used for system monitoring and management (such as powering up and down and monitoring the health of hardware components). For details, see Cray XMT System Management.

3.5.1 System Management Workstation

The System Management Workstation (SMW) provides a single-point interface to an administrator's environment. The SMW provides a terminal window from which the administrator performs tasks like adding user accounts, changing passwords, and monitoring applications. The SMW accesses system components through the administration/HSS network; it does not use the system interconnection network.
3.5.2 Shared-root File System

The Cray XMT Series service nodes have a shared-root file system where the root directory is shared read-only on the service nodes. All nodes have the same default directory structure. However, the /etc directory is specially mounted on each service node as a node-specific directory of symbolic links. The administrator can change the symbolic links in the /etc directory by the process of specialization, which changes the symbolic link to point to a non-default version of a file. The administrator can specialize files for individual nodes or for a class (type) of nodes.

The administrator's interface includes commands to view file layout from a specified node, determine the type of specialization, and create a directory structure for a new node or class based on an existing node or class. For details, see the Managing System Software for Cray XE and Cray XT Systems manual and Cray XMT System Management.

3.5.3 Configuration and Source Files

The administrator uses the boot node to view files, maintain configuration files, and manage the processes of executing programs. Boot nodes connect to the SMW and are accessible through a login shell.

The xtopview utility runs on login nodes and allows the administrator to view files as they would appear on a specified node. The xtopview utility also maintains a database of files to monitor as well as file state information such as checksum and modification dates. Messages about file changes are saved through a Revision Control System (RCS) utility.

3.5.4 System Monitoring

There are three tools available in Cray XMT that you can use to monitor activity on the compute nodes.

• mtatop: Displays the processes and statistics for CPUs currently running on the Cray XMT for all available processors. Information is displayed in textual form on the command line. See the mtatop(1) man page and Cray XMT System Management.
• Dashboard2: Displays the processes and statistics for CPUs currently running on the Cray XMT for all available processors. Information is displayed using the Dashboard2 graphical user interface. Information is displayed in two views: Resources, which shows all the CPU activity on the system as a whole for a particular type of activity, and Processors, which shows all activity on each processor. See the dash(1) man page and Cray XMT System Management.
• xmtconsole: Displays all output from the Cray XMT compute node, accepts input from the command line, and passes it to the MTK. The xmtconsole is useful for monitoring warning messages from the system. See the xmtconsole(8) man page and Cray XMT System Management.

3.5.5 System Log

After the system is booted, console messages are sent to the system log and are written to the boot RAID system. System log messages generated by service node kernels and daemons are gathered by syslog daemons running on all service nodes. Kernel errors and panic messages are sent directly to the SMW.

Application Development [4]

The Cray XMT application development software is the set of software products and services that programmers use to build and run applications on Cray XMT compute nodes.

4.1 User Runtime Library

The user runtime library (librt) contains runtime software support for future variables, synchronization, scheduling, event logging, compiler-generated parallelism, and debugging. The Cray XMT compilers link the user runtime library into your application at compile time. For a list of the functions contained in the user runtime library, see the runtime(3) man page.

4.2 Lightweight User Communication Library (LUC) API

The Cray XMT Programming Environment contains a user-level library for LUC, libluc.a, that uses a C++ interface. The Linux client program and the Cray XMT both use the same sources and interfaces offered by the LUC library.
The LUC implementation uses a client/server remote procedure call (RPC) paradigm where communication occurs between endpoints that sit on both the client and server. When the client has a package of data that it needs to deliver to the server, it sends a message to the server endpoint over the high-speed network. On the client side, the LUC library allocates nearby memory in the user buffer for data storage. On the server side, the LUC library allocates nearby memory for the transfer of data, and then either copies the data to distributed memory or leaves it in nearby memory. Data is transferred over Cray SeaStar chips using Cray Fast I/O (FIO).

When creating a LUC application, you implement the endpoints as objects that are instantiated as either server-only, client-only, or both server and client objects that have a corresponding set of methods. The client and server can be created on either the compute-node or service-node side.

For more information about programming for LUC, see the Cray XMT Programming Model and the Cray XMT Programming Environment User's Guide.

4.3 Compiling Programs

The Cray XMT Programming Environment includes C and C++ compilers. The compilers translate C and C++ programs into Cray XMT object files.

The command used to invoke a compiler is called a compilation driver; it can be used to apply options at the compilation unit level. C or C++ pragmas apply options to selected portions of code or alter the effects of command-line options.

The mta-pe module contains the XMT programming environment, which compiles for the compute node. This module is loaded by default when you log in. The mta-linux-lib module contains the Linux version of the LUC library. For other service node applications, use the standard Linux libraries.

4.3.1 Compiler Commands

The following compiler commands are available:

Compiler    Command
C           cc
C++         c++

For further information, see the Cray XMT Programming Environment User's Guide.
4.4 Running Applications

Applications are launched by using the mtarun command. The mtarun command connects to a daemon (mtarund) on the compute nodes. With this connection, your user environment changes so that your directories on the service node are shared on the compute nodes. The application that you specify in the program_executable option is then launched to run on the compute nodes.

After the application is launched, mtarun serves as the front end for the program, supplying it with the following services:

• Forwarding of standard I/O, such as stdin, stdout, stderr
• Signal forwarding for all sending signals
• Termination management, which redirects errors that caused the application to terminate or kills the application if mtarun terminates

For more information, see the mtarun(1) man page.

4.5 Debugging Applications

You use the mdb debugger to debug your applications. The mdb debugger is based on the Free Software Foundation GDB debugger (version 3.5).

For more information, see the Cray XMT Debugger Reference Guide and the mdb(1) man page.

4.6 Monitoring Applications

The mtatop command displays the processes and CPU statistics currently running on the machine for all available processes on Threadstorm compute nodes. When run with the p option, mtatop shows process information for a specified process ID. The following information is provided for this option:

• The name and ID of the process
• The current state of the process, such as running, sleeping, and so on
• The scheduling priority of the process
• The number of CPUs that the process is using
• The amount of memory that the process is using
• The amount of CPU time accumulated by the process

This command can also be run using the c option to monitor CPU usage.

For more information, see the mtatop(1) man page.

4.7 Measuring Performance

The Cray XMT Series system provides tools for the collection, display, and analysis of performance data.
You can use the resultant analyses to optimize your application. For more information about the tools described in this section, see Cray XMT Performance Tools User's Guide.

4.7.1 Cray Apprentice2

Cray Apprentice2 is an interactive X Window System tool for displaying performance analysis data captured during program execution. It allows you to view traces that are generated when you compile with the -trace option, and it provides GUI versions of the canal and bprof tools. Note that Cray Apprentice2 is not a debugger, nor is it a simulator.

Cray Apprentice2 identifies many potential performance problem areas, including the following conditions:

• Load imbalance
• Excessive serialization
• Excessive communication
• Network contention
• Poor use of the memory hierarchy
• Poor functional unit use

Cray Apprentice2 has the following capabilities:

• Post-execution performance analysis tool that provides information about a program by examining data files that were created during program execution.
• Displays many types of performance data contingent on the data that was captured during execution.
• Reports time statistics for all processing elements and for individual routines.
• Shows total execution time, synchronization time, execution time for a subroutine, communication time, and the number of calls to a subroutine.

4.7.2 Canal

Canal is a compiler analysis tool. It uses information captured during compilation to produce an annotated source code listing showing the optimizations performed automatically by the compiler. You use the Canal listing to identify and correct code that the compiler cannot optimize. See the canal(1) man page for more information.

4.7.3 Tview

The tview command displays a trace file in one of three formats: XML, Apprentice2, or compressed (ZIP). It uses information captured during program execution to produce graphical displays showing performance metrics over time.
Use the tview graphs to identify when a program is running slowly. See the tview(1) man page for more information.

4.7.4 Bprof

Bprof is a block profiling tool. It uses information captured during program execution to identify which functions are performing what amounts of work. When used with Tview, this can help you to identify the functions which consume the most time while producing the least work. See the bprof(1) man page for more information.

4.7.5 pproc

Pproc is a post-processing data conversion tool. It converts the data generated by Canal, Tview, and Bprof into a format that can be displayed within Cray Apprentice2. See the pproc(1) man page for more information.

4.7.6 ap2view

Ap2view is an XML data file viewer. It displays data from .ap2 files that are created when you run pproc. See the ap2view(1) man page for more information.

4.7.7 Tprof

Tprof is a trace profiling tool. It provides a simple profile of the functions and parallel regions in the code, based on traces.

Cray Hardware Supervisory System (HSS) [5]

The Cray Hardware Supervisory System (HSS) is an integrated, independent system of hardware and software that monitors Cray XMT Series system components, manages hardware and software failures, controls startup and shutdown processes, manages the system interconnection network, and displays the system state to the administrator. The HSS interfaces with all major hardware and software components of the system.

Because the HSS is a completely separate system with its own processors and network, the services that it provides do not take resources from running applications. In addition, if a component fails, the HSS continues to provide fault identification and recovery services and enables the functioning parts of the system to continue operating.

For detailed information about the HSS, see the Managing System Software for Cray XE and Cray XT Systems manual.
5.1 HSS Hardware

The hardware components of HSS are the HSS network, the SMW, the blade control processors (L0 controllers), and the cabinet control processors (L1 controllers). HSS hardware monitors compute and service node components, operating system heartbeats, power supplies, cooling fans, voltage regulators, and RAID systems.

Figure 8. HSS Components

[Figure labels: System Management Workstation; HSS Network; three cabinets, each with one L1 controller and three blades, each blade with an L0 controller]

5.1.1 HSS Network

The HSS network is an Ethernet connection between the SMW and the components that the HSS monitors. The network's function is to provide an efficient means of collecting status from and broadcasting messages to system components. The HSS network is separate from the system interconnection network.

Traffic on the HSS network is normally low, with occasional peaks of activity when major events occur. Even during peak activity, the level of traffic is well within the capacity of the network. There is a baseline level of traffic to and from the hardware controllers. All other traffic is driven by events, either those due to hardware or software failures or those initiated by the administrator. The highest level of network traffic occurs during the initial booting of the entire system as console messages from the booting images are transmitted onto the network.

5.1.2 System Management Workstation

The System Management Workstation (SMW) is the administrator's single-point interface for booting, monitoring, and managing Cray XMT system components. The SMW consists of a server and a display device. Multiple administrators can use the SMW, locally or remotely over an internal LAN or WAN.

Note: The SMW is also used to perform system administration functions (see System Administration on page 27).
5.1.3 Hardware Controllers

At the lowest level of the HSS are the L0 and L1 controllers that monitor the hardware and software of the components on which they reside.

Every compute blade and service blade has a blade control processor (L0 controller) that monitors the components on the blade, checking status registers of the AMD Opteron processors, the Control Status Registers (CSRs) of the Cray SeaStar chip, and the voltage regulation modules (VRMs). The L0 controllers also monitor board temperatures and the CLE heartbeat.

Each cabinet has a cabinet control processor (L1 controller) that communicates with the L0 controllers within the cabinet and monitors the power supplies and the temperature of the air cooling the blades. Each L1 controller also routes messages between the L0 controllers in its cabinet and the SMW.

5.2 HSS Software

The HSS software consists of software monitors; the administrator's HSS interfaces; and event probes, loggers, and handlers. This section describes the software monitors and administrator interfaces. For a description of event probes, loggers, and handlers, see HSS Actions on page 40.

5.2.1 Software Monitors

The System Environment Data Collections (SEDC) HSS manager monitors the system health and records the environmental data (such as temperature) and the status of hardware components (such as power supplies, processors, and fans). SEDC can be set to run at all times or only when a client is listening. By default, SEDC is configured to scan the system hardware components automatically.

Resiliency communication agents (RCAs) run on the first compute node to boot. Each RCA generates a periodic heartbeat message, enabling HSS to know when an RCA has failed. Failure of an RCA heartbeat is interpreted as a failure of the CLE operating system on that node.

5.2.2 HSS Administrator Interfaces

The HSS provides both a command-line and a graphical interface.
The xtcli command is the command line interface for managing the Cray XT system from the SMW. The xtgui command launches the graphical interface. In general, the administrator can perform any xtcli function with xtgui except boot.

The SMW is used to monitor data, view status reports, and execute system control functions. If any component of the system detects an error, it sends a message to the SMW. The message is logged and displayed for the administrator. HSS policy decisions determine how the fault is handled. The SMW logs all information it receives from the system to the SMW disk to ensure the information is not lost due to component failures.

5.3 HSS Actions

The HSS manages the startup and shutdown processes and event probing, logging, and handling.

The HSS collects data about the system (event probing and logging) that is then used to determine which components have failed and in what manner. After determining that a component has failed, the HSS initiates some actions (event handling) in response to detected failures that, if left unattended, could cause worse failures. The HSS also initiates actions to prevent failed components from interfering with the operations of other components.

5.3.1 System Startup and Shutdown

The administrator starts a Cray XMT Series system by powering up the system and booting the software on the service nodes and compute nodes. Booting the system sets up the system interconnection network. Starting the operating system brings up the RCA. A script, set up by the administrator, shuts the system down.

For logical machines, the administrator can boot, run diagnostics, run user applications, and power down without interfering with other logical machines as long as the HSS is running on the SMW and the machines have separate file systems.

For details about the startup and shutdown processes, see Managing System Software for Cray XE and Cray XT Systems and the xtcli(8) man page.
5.3.2 Event Probing

The HSS probes are the primary means of monitoring hardware and software components of a Cray XT system. The HSS probes that are hosted on the SMW collect data from HSS probes running on the L0 and L1 controllers and RCA daemons running on the compute nodes. In addition to dynamic probing, the HSS provides an offline diagnostic suite that probes all HSS-controlled components.

5.3.3 Event Logging

The event logger preserves data that the administrator uses to determine the reason for reduced system availability. It runs on the SMW and logs all status and event data generated by:

• HSS probes
• Processes communicating through RCA daemons on compute and service nodes
• Other HSS processes running on L0 and L1 controllers

Event messages are time stamped and logged. If a compute or service blade fails, the HSS notifies the administrator.

5.3.4 Event Handling

The event handler evaluates messages from HSS probes and determines what to do about them. The HSS is designed to prevent single-point failures of either hardware or system software from interrupting the system. Examples of single-point failures that are handled by the HSS system are:

• Compute node failure. If a compute node fails, the entire compute node partition goes down and needs to be rebooted.
• Power supply failure. Power supplies have an N+1 configuration for each chassis in a cabinet; failure of an individual power supply does not interrupt a compute node.

In addition, the HSS transmits failure events over the HSS network to those components that have subscribed for the particular event, so that each component can make a local decision about how to deal with the fault. For example, both the L0 and L1 controllers contain code to react to critical faults without administrator intervention.
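The interplay of heartbeat probing and event subscription described in this chapter can be illustrated with a small model. The following Python sketch is not HSS code and does not reflect any Cray interface; the class names (EventBus, HeartbeatMonitor), the event name node_failure, the node names, and the timeout value are all hypothetical, chosen only to show the idea. A monitor records the last heartbeat time for each node, treats a node whose heartbeat is older than the timeout as failed, and publishes one failure event to every subscriber so that each can react locally.

```python
from collections import defaultdict


class EventBus:
    """Routes each published event to the handlers subscribed to its type."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def publish(self, event_type, source):
        # Deliver the event to every subscriber; return how many were notified.
        handlers = self._handlers.get(event_type, [])
        for handler in handlers:
            handler(source)
        return len(handlers)


class HeartbeatMonitor:
    """Flags a node as failed when its heartbeat is older than `timeout`."""

    def __init__(self, bus, timeout):
        self.bus = bus
        self.timeout = timeout
        self.last_seen = {}   # node name -> time of last heartbeat
        self.failed = set()   # nodes already reported as failed

    def heartbeat(self, node, now):
        self.last_seen[node] = now
        self.failed.discard(node)

    def check(self, now):
        # Collect nodes that have newly exceeded the timeout, then publish
        # one failure event per node. Returns the newly failed nodes.
        newly_failed = []
        for node, seen in self.last_seen.items():
            if now - seen > self.timeout and node not in self.failed:
                self.failed.add(node)
                newly_failed.append(node)
        for node in newly_failed:
            self.bus.publish("node_failure", node)
        return newly_failed


# Hypothetical usage: two subscribers stand in for components that each
# make their own local decision about the same failure event.
bus = EventBus()
reactions = []
bus.subscribe("node_failure", lambda node: reactions.append(("blade", node)))
bus.subscribe("node_failure", lambda node: reactions.append(("cabinet", node)))

monitor = HeartbeatMonitor(bus, timeout=3.0)
monitor.heartbeat("node-a", now=0.0)
monitor.heartbeat("node-b", now=4.0)
monitor.check(now=5.0)   # only node-a has been silent longer than the timeout
print(reactions)
```

Timestamps are passed in explicitly rather than read from a clock, which keeps the sketch deterministic. A real monitor would also have to decide how to handle nodes that have never reported a heartbeat.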
Glossary

blade
1) A Cray XMT compute blade consists of Threadstorm processors, memory, Cray SeaStar chips, and a blade control processor. 2) From a system management perspective, a logical grouping of nodes and blade control processor that monitors the nodes on that blade.

blade control processor
A microprocessor on a blade that communicates with a cabinet control processor through the HSS network to monitor and control the nodes on the blade. See also blade, L0 controller, Hardware Supervisory System (HSS).

cabinet control processor
A microprocessor in the cabinet that communicates with the HSS via the HSS network to monitor and control the devices in a system cabinet. See also Hardware Supervisory System (HSS).

CLE
The operating system for Cray XMT systems.

fork
Occurs when processors allocate additional streams to a thread at the point where it is creating new threads for a parallel loop operation.

future
Implements user-specified or explicit parallelism by starting new threads. A future is a sequence of code that can be executed by a newly created thread that is running concurrently with other threads in the program. Futures delay the execution of code if the code is using a value that is computed by a future, until the future completes. The thread that spawns the future uses parameters to pass information from the future to the waiting thread, which then executes. In a program, the term future is used as a type qualifier for a synchronization variable or as a keyword for a future statement.

Hardware Supervisory System (HSS)
Hardware and software that monitors the hardware components of the system and proactively manages the health of the system. It communicates with nodes and with the management processors over the private Ethernet network. See also system interconnection network.
logical machine
An administrator-defined portion of a physical Cray XMT system, operating as an independent computing resource.

login node
The service node that provides a user interface and services for compiling and running applications.

metadata server (MDS)
The component of the Lustre file system that manages Metadata Targets (MDT) and handles requests for access to file system metadata residing on those targets.

node
For CLE systems, the logical group of processor(s), memory, and network components acting as a network end point on the system interconnection network. See also processing element.

phase
A set of one or more sections of code that the stream executes in parallel. Each section contains an iteration of a loop. Phases and sections are contained in control flow code generated by the compiler to control the parallel execution of a function.

processing element
The smallest physical compute group. There are two types of processing elements: a compute processing element consists of an AMD Opteron processor, memory, and a link to a Cray SeaStar chip. A service processing element consists of an AMD Opteron processor, memory, a link to a Cray SeaStar chip, and PCI-X or PCIe links.

System Management Workstation (SMW)
The workstation that is the single point of control for system administration. See also Hardware Supervisory System (HSS).

Cray XMT™ Programming Environment User's Guide
S–2479–20

© 2007–2011 Cray Inc. All Rights Reserved. This document or parts thereof may not be reproduced in any form unless permitted by contract or by written permission of Cray Inc.

Copyright (c) 2008, 2010, 2011 Cray Inc. All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. * Neither the name Cray Inc. nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. Your use of this Cray XMT release constitutes your acceptance of the License terms and conditions. U.S. GOVERNMENT RESTRICTED RIGHTS NOTICE The Computer Software is delivered as "Commercial Computer Software" as defined in DFARS 48 CFR 252.227-7014. All Computer Software and Computer Software Documentation acquired by or for the U.S. Government is provided with Restricted Rights. Use, duplication or disclosure by the U.S. Government is subject to the restrictions described in FAR 48 CFR 52.227-14 or DFARS 48 CFR 252.227-7014, as applicable. Technical Data acquired by or for the U.S. Government, if any, is provided with Limited Rights. Use, duplication or disclosure by the U.S. Government is subject to the restrictions described in FAR 48 CFR 52.227-14 or DFARS 48 CFR 252.227-7013, as applicable. 
Cray, LibSci, and PathScale are federally registered trademarks and Active Manager, Cray Apprentice2, Cray Apprentice2 Desktop, Cray C++ Compiling System, Cray CX, Cray CX1, Cray CX1-iWS, Cray CX1-LC, Cray CX1000, Cray CX1000-C, Cray CX1000-G, Cray CX1000-S, Cray CX1000-SC, Cray CX1000-SM, Cray CX1000-HN, Cray Fortran Compiler, Cray Linux Environment, Cray SHMEM, Cray X1, Cray X1E, Cray X2, Cray XD1, Cray XE, Cray XEm, Cray XE5, Cray XE5m, Cray XE6, Cray XE6m, Cray XMT, Cray XR1, Cray XT, Cray XTm, Cray XT3, Cray XT4, Cray XT5, Cray XT5h, Cray XT5m, Cray XT6, Cray XT6m, CrayDoc, CrayPort, CRInform, ECOphlex, Gemini, Libsci, NodeKARE, RapidArray, SeaStar, SeaStar2, SeaStar2+, The Way to Better Science, Threadstorm, and UNICOS/lc are trademarks of Cray Inc.

GNU is a trademark of The Free Software Foundation. ISO is a trademark of International Organization for Standardization (Organisation Internationale de Normalisation). Linux is a trademark of Linus Torvalds. Lustre and NFS are trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Opteron is a trademark of Advanced Micro Devices, Inc. Platform is a trademark of Platform Computing Corporation. RSA is a trademark of RSA Security Inc. UNIX, the “X device,” X Window System, and X/Open are trademarks of The Open Group in the United States and other countries. All other trademarks are the property of their respective owners.

RECORD OF REVISION

S–2479–20 Published May 2011
Supports release 2.0 GA running on Cray XMT compute nodes and on Cray XT 3.1UP02 service nodes. This release uses the System Management Workstation (SMW) version 5.1UP03.

1.5 Published December 2010
Supports release 1.5 running on Cray XMT compute nodes and Cray Linux Environment (CLE) release 2.241A on Cray XT service nodes. This release requires the System Management Workstation (SMW) version 4.0.UP02, which is based on the SLES10 SP3 base operating system.
1.4 Published December 2009
Supports release 1.4 running on Cray XMT compute nodes and Cray Linux Environment (CLE) release 2.241A on Cray XT service nodes. This release requires the System Management Workstation (SMW) version 4.0.UP02, which is based on the SLES10 SP3 base operating system.

1.3 Published March 2009
Supports release 1.3 running on Cray XMT compute nodes and on Cray XT 2.1.50HD service nodes. This release requires the System Management Workstation (SMW) version 3.1.09 that is based on the SLES10 SP1 base operating system.

1.2 Published August 2008
Supports general availability (GA) release 1.2 running on Cray XMT compute nodes and on Cray XT 2.0.49 service nodes. This release uses the System Management Workstation (SMW) version 3.1.04 that is based on the SLES9 SP2 base operating system.

1.1 LA Published March 2008
Supports limited availability (LA) release 1.1.01 running on Cray XMT compute nodes and on Cray XT 2.0 service nodes.

1.0 LA Published August 2007
Draft documentation to support Cray XMT limited-availability (LA) systems.

Changes to this Document

Cray XMT™ Programming Environment User's Guide S–2479–20

This rewrite of Cray XMT Programming Environment User's Guide supports the 2.0 release of the Cray XMT operating system and programming environment. For more information see the release announcement that accompanies this release.

Added information

• Two new pragmas: #pragma mta max n processors and #pragma mta max concurrency c. See Compilation Directives on page 109.
• Additional programming examples.

Revised information

• The snapshot documentation has been revised extensively. See Chapter 6, Managing Lustre I/O with the Snapshot Library on page 67.
• Technical and editorial corrections.

The conceptual content that made up the first chapters of previous versions of this guide has been moved to a new document, Cray XMT Programming Model.

Contents

Introduction [1]
1.1 The Cray XMT Programming Environment

Setting Up the User Environment [2]
2.1 Setting Up a Secure Shell
2.1.1 RSA Authentication
2.1.2 Additional Information
2.2 Using Modules
2.2.1 Modifying the PATH Variable
2.2.2 Software Locations
2.2.3 Module Commands

Developing an Application [3]
3.1 The Cray XMT Programming Environment
3.2 Overview of Cray XMT Generic and Intrinsic Functions
3.2.1 Generic Functions
3.2.1.1 Generic Write Functions
3.2.1.2 Generic Read Functions
3.2.2 Intrinsic Functions
3.3 Adding Synchronization to an Application
3.3.1 Synchronizing Data Using int_fetch_add
3.3.2 Avoiding Deadlock
3.4 Programming Considerations for Floating-point Operations
3.4.1 Differences from IEEE Floating-point Arithmetic
3.4.2 Differences from Cray Floating-point Arithmetic
3.4.3 32-bit and 64-bit Implementation of Floating-point Arithmetic
3.4.4 Rounding Results of Floating-point Operations
3.5 Using Futures in an Application
3.5.1 Improving Performance of Future Statements
3.5.2 Anonymous futures
3.6 Testing Expressions Using Condition Codes
3.7 File I/O
3.7.1 Language-level I/O
3.7.2 System-level I/O
3.8 Porting Programs to the Cray XMT
3.9 Debugging the Program

Shared Memory Between Processes [4]
4.1 Mapping a Memory Region for Data Sharing
4.2 Persisting Shared Memory

Developing LUC Applications [5]
5.1 Programming Considerations for LUC Applications
5.2 Creating and Using a LUC Client
5.3 Creating and Using a LUC Server
5.4 Communication Between LUC Objects
5.5 LUC Client/Server Example
5.6 Fast I/O Memory Usage

Managing Lustre I/O with the Snapshot Library [6]
6.1 About the Snapshot Library
6.2 The Snapshot Library Interface
6.3 Maintaining File System and I/O Parallelism
6.4 Examples
6.5 Managing File I/O on File Systems Other Than Lustre

Compiler Overview [7]
7.1 The Compilation Process
7.1.1 File Types Accepted by the Compiler
7.2 Invoking the Compiler
7.3 Setting the Compiler Mode
7.3.1 Whole-program Mode
. . . . . . . . . . . . . . . . . . . . 81 7.3.2 Separate-module Mode . . . . . . . . . . . . . . . . . . . . . 82 7.3.3 Mixed Mode . . . . . . . . . . . . . . . . . . . . . . . . 83 7.4 Inlining Functions . . . . . . . . . . . . . . . . . . . . . . . . 84 7.5 Optimizing Parallelization . . . . . . . . . . . . . . . . . . . . . 85 7.6 Incremental Recompilation and Relinking . . . . . . . . . . . . . . . . . 86 7.7 Creating New Libraries . . . . . . . . . . . . . . . . . . . . . . 87 7.8 Compiler Messages . . . . . . . . . . . . . . . . . . . . . . . 88 8 S–2479–20Contents Page 7.9 Setting Debugger Options during Compilation . . . . . . . . . . . . . . . . 88 7.10 Using Compiler Directives and Assertions . . . . . . . . . . . . . . . . . 89 Running an Application [8] 91 8.1 Launching the Application . . . . . . . . . . . . . . . . . . . . . 91 8.2 User Runtime Environment Variables . . . . . . . . . . . . . . . . . . 92 8.3 Improving Performance . . . . . . . . . . . . . . . . . . . . . . 93 Optional Optimizations [9] 95 9.1 Scalar Replacement of Aggregates . . . . . . . . . . . . . . . . . . . 95 9.2 Optimizing Calls to memcpy and memset . . . . . . . . . . . . . . . . . 98 Appendix A Error Messages 99 Appendix B User Runtime Functions 103 Appendix C Compiler Directives and Assertions 109 C.1 Compilation Directives . . . . . . . . . . . . . . . . . . . . . . 109 C.2 Parallelization Directives . . . . . . . . . . . . . . . . . . . . . . 124 C.3 Semantic Assertions . . . . . . . . . . . . . . . . . . . . . . . 125 C.4 Implementation Hints . . . . . . . . . . . . . . . . . . . . . . 130 Appendix D Condition Codes 133 Appendix E Data Types 137 Appendix F Keywords 139 Appendix G MTA_PARAMS 143 Appendix H LUC API Reference 147 H.1 LucEndpoint Class . . . . . . . . . . . . . . . . . . . . . . 147 H.2 luc_allocate_endpoint Function . . . . . . . . . . . . . . . . 149 H.3 LUC Methods . . . . . . . . . . . . . . . . . . . . . . . . . 149 H.3.1 startService Method . . . . . . 
. . . . . . . . . . . . . 149 H.3.2 stopService Method . . . . . . . . . . . . . . . . . . . . 150 H.3.3 getMyEndpointID Method . . . . . . . . . . . . . . . . . . 150 H.3.4 remoteCall Method . . . . . . . . . . . . . . . . . . . . 151 H.3.5 remoteCallSync Method . . . . . . . . . . . . . . . . . . 153 H.3.6 registerRemoteCall Method . . . . . . . . . . . . . . . . . 154 H.3.7 setConfigValue Method . . . . . . . . . . . . . . . . . . 155 H.3.8 getConfigValue Method . . . . . . . . . . . . . . . . . . 158 S–2479–20 9Cray XMT™ Programming Environment User’s Guide Page H.4 LUC Type Definitions . . . . . . . . . . . . . . . . . . . . . . 159 H.5 LUC Callback Functions . . . . . . . . . . . . . . . . . . . . . . 160 H.5.1 LUC_RPC_Function_InOut . . . . . . . . . . . . . . . . . 160 H.5.2 LUC_Mem_Avail_Completion . . . . . . . . . . . . . . . . 161 H.5.3 LUC_Completion_Handler . . . . . . . . . . . . . . . . . 162 H.6 LUC Return Codes . . . . . . . . . . . . . . . . . . . . . . . 162 Glossary 167 Procedures Procedure 1. Setting up RSA authentication with a passphrase . . . . . . . . . . . . 15 Procedure 2. Using RSA authentication without a passphrase . . . . . . . . . . . . 16 Procedure 3. Creating and using a LUC client object . . . . . . . . . . . . . . . 54 Procedure 4. Creating and using a LUC server object . . . . . . . . . . . . . . 56 Examples Example 1. Testing a shift-left operation for a carried number . . . . . . . . . . . . 34 Example 2. Retrieving a condition code and result of a previous operation . . . . . . . . . 35 Example 3. Retrieving a condition code set by a previous operation . . . . . . . . . . 35 Example 4. Calling standard I/O functions from parallel code . . . . . . . . . . . . 37 Example 5. Calling record-oriented I/O functions from parallel code . . . . . . . . . . 37 Example 6. Preventing racing when calling I/O functions . . . . . . . . . . . . . 38 Example 7. Calling UNIX I/O functions from parallel code . . . . . . . . . . . . . 
40 Example 8. Using synchronization with UNIX I/O functions . . . . . . . . . . . . 41 Example 9. Using synchronization with UNIX record-oriented I/O functions . . . . . . . . 41 Example 10. Mapping memory to share among multiple processes . . . . . . . . . . . 47 Example 11. LUC client code example . . . . . . . . . . . . . . . . . . 55 Example 12. LUC Server code example . . . . . . . . . . . . . . . . . . 57 Example 13. Allocating and using LucEndpoint objects to communicate . . . . . . . . 57 Example 14. Using dslr_snapshot and dslr_restore to save and restore data in a file. . . 71 Example 15. Using dslr_pwrite to write data to a file and dslr_pread to read back the data . . 72 Tables Table 1. mta-pe Utilities . . . . . . . . . . . . . . . . . . . . . . 19 Table 2. Condition Codes . . . . . . . . . . . . . . . . . . . . . . 133 Table 3. Condition Masks . . . . . . . . . . . . . . . . . . . . . . 133 Table 4. C/C++ Keywords Recognized by the Cray XMT Compiler . . . . . . . . . . 139 Table 5. Standard C++ Keywords Recognized by the Cray XMT Compiler . . . . . . . . . 139 10 S–2479–20Contents Page Figures Figure 1. Snapshot Library Data Paths . . . . . . . . . . . . . . . . . . 67 Figure 2. Comparison of Whole-program and Separate-module Modes . . . . . . . . . . 78 S–2479–20 11Introduction [1] This guide describes the Cray XMT Programming Environment. It includes procedures and examples that show you how to set up your user environment and build and run optimized applications. The intended audience is application programmers and users of the Cray XMT system. For information about debugging your application, see Cray XMT Debugger Reference Guide. For information about performance analysis tools that you can use to tune your application, see Cray XMT Performance Tools User's Guide. This chapter presents a general overview of the Cray XMT. Subsequent chapters of this manual cover the details for how to write programs for the Cray XMT. 
1.1 The Cray XMT Programming Environment

The Cray XMT Programming Environment (XMT-PE) includes the following:
• Cray XMT compilers for C and C++
• Cray mdb debugger, which is an adaptation of the Free Software Foundation's gdb debugger
• Apprentice2 performance analysis tool

The XMT-PE runs on a Linux operating system on a service node. You write and compile your program on the service partition and launch it from the service partition onto the compute partition.

Setting Up the User Environment [2]

Configuring your user environment on a Cray XMT system is similar to configuring a typical Linux workstation.

2.1 Setting Up a Secure Shell

Cray XMT systems use ssh and ssh-enabled applications such as scp for secure, password-free remote access to the login nodes. Before you can use the ssh commands, you must generate an RSA authentication key. The process for generating the key depends on the authentication method you use. There are two methods of passwordless authentication: with or without a passphrase. Although both methods are described here, you must use the latter method (without a passphrase) to access the compute nodes through a script or when using a single-system view (SSV) command.

2.1.1 RSA Authentication

You can set up RSA authentication with or without a passphrase.

Procedure 1. Setting up RSA authentication with a passphrase

To enable ssh with a passphrase, complete the following steps.

1. Generate the RSA keys by typing the following command and following the prompts. The program asks you to supply a passphrase.

    % ssh-keygen -t rsa

2. Create a $HOME/.ssh directory and set permissions so that only the directory's owner can access it by typing the following commands:

    % mkdir $HOME/.ssh
    % chmod 700 $HOME/.ssh

3. The public key is stored in your $HOME/.ssh directory. Copy the key to your home directory on the remote host (or hosts) by typing the following command:

    % scp $HOME/.ssh/key_filename.pub \
        username@system_name:.ssh/authorized_keys

4. Connect to the remote host by typing the following commands. If you are using a C shell, type:

    % eval `ssh-agent`
    % ssh-add

   If you are using a bash shell, type:

    $ eval `ssh-agent -s`
    $ ssh-add

5. Enter your passphrase when prompted, followed by:

    % ssh remote_host_name

Procedure 2. Using RSA authentication without a passphrase

To enable ssh without a passphrase, complete the following steps.

1. Generate the RSA keys by typing the following command:

    % ssh-keygen -t rsa -N ""

2. Create a $HOME/.ssh directory and set permissions so that only the directory's owner can access it by typing the following commands:

    % mkdir $HOME/.ssh
    % chmod 700 $HOME/.ssh

3. The public key is stored in your $HOME/.ssh directory. Copy the key to your home directory on the remote host (or hosts) by typing the following command:

    % scp $HOME/.ssh/key_filename.pub \
        username@system_name:.ssh/authorized_keys

   Note: This step is not required if your home directory is shared.

4. Connect to the remote host by typing the following command:

    % ssh remote_host_name

2.1.2 Additional Information

For more information about setting up and using a secure shell, see the ssh(1), ssh-keygen(1), ssh-agent(1), ssh-add(1), and scp(1) man pages.

2.2 Using Modules

The Cray XMT system uses modules in the user environment to support multiple versions of software, such as compilers, and to create integrated software packages. As new versions of the supported software and associated man pages become available, they are added automatically to the Programming Environment, while earlier versions are retained to support legacy applications.
By specifying the module to load, you can choose the default version of an application or another version. The modules for the compilers and associated products are:
• mta-pe for the C and C++ compilers. This is the default environment.

Modules also provide a simple mechanism for updating certain environment variables, such as PATH, MANPATH, and LD_LIBRARY_PATH. In general, you should make use of the modules system rather than embedding specific directory paths into your startup files, makefiles, and scripts. The following subsections describe the information you need to manage your user environment.

2.2.1 Modifying the PATH Variable

Do not reinitialize the system-defined PATH. The following example shows how to modify it for a specific purpose (in this case, to add $HOME/bin to the path).

If you are using a C shell, type:

    % set path = ($path $HOME/bin)

If you are using bash, type:

    $ export PATH=$PATH:$HOME/bin

2.2.2 Software Locations

On a typical Linux system, compilers and other software packages are located in the /bin or /usr/bin directories. However, on Cray XMT systems these files are in versioned locations under the /opt directory. Cray software is self-contained and is installed as follows:
• Base prefix: /opt/pkgname/pkgversion/, such as /opt/mta-pe/default
• Package environment variables: /opt/pkgname/pkgversion/var
• Package configurations: /opt/pkgname/pkgversion/etc

Note: To run a Programming Environment product, specify the command name (and arguments) only; do not enter an explicit path to the Programming Environment product. Likewise, job files and makefiles should not have explicit paths to Programming Environment products embedded in them.

2.2.3 Module Commands

The mta-pe modules are loaded by default.
To find out what modules have been loaded, type:

    % module list

To switch from one Programming Environment to another, type:

    % module swap switch_from_module switch_to_module

For example, to switch from the Cray XMT Programming Environment to the GNU Programming Environment, type:

    % module swap mta-pe PrgEnv-gnu

For further information about the module utility, see the module(1) and modulefile(4) man pages.

Developing an Application [3]

This chapter provides an overview of some Cray XMT functions and describes how to perform some common programming tasks, such as floating-point operations, sorting, dataflow, searching, and I/O.

Before you begin developing your program, you must log in to the login node using ssh. You develop, compile, debug, and launch your program from the login node.

Before developing your application, review the data types and keywords that are supported by the Cray XMT compilers. For a list of data types, see Appendix E, Data Types on page 137. For a list of keywords, see Appendix F, Keywords on page 139.

3.1 The Cray XMT Programming Environment

The Cray XMT Programming Environment (XMT-PE) contains the following modules:
• mta-pe
• xmt-tools
• mta-man

The mta-pe module contains the C/C++ compilers and some utilities that are useful during the development process. The following table lists the commands for mta-pe utilities and provides a brief description.

Table 1. mta-pe Utilities

Utility Name    Description
dis             Disassembles object code.
header          Displays a Cray XMT Executable and Linking File (ELF) header for a specified object, exec, or library file.
mdb             Starts debugger for Cray XMT programs.
nm              Lists symbols from object files.

The mta-pe module also contains support for functions that are specific to the Cray XMT environment. For more information, see Overview of Cray XMT Generic and Intrinsic Functions on page 20.
The xmt-tools module contains the tools that you use to run and monitor a program. To run a program, use the mtarun command. For more information, see Launching the Application on page 91 or the mtarun(1) man page. To monitor the program, use the mtatop or dash command. For more information, see Cray XMT System Management or the mtatop(1) man page.

The mta-man module contains the man pages for all the utilities, tools, and functions that you find in the XMT-PE.

3.2 Overview of Cray XMT Generic and Intrinsic Functions

The Cray XMT Programming Environment (XMT-PE) supports a number of Cray XMT functions. For a list of these functions, see the generics(1) and mta_intrinsics(3) man pages. You can refer to the man page for each function for details about how to use that function. Man pages for functions list the names of the header files you must include in your program when using that function.

3.2.1 Generic Functions

The Cray XMT compiler provides a number of generic functions that operate atomically on scalar variables (variables that hold a single value). The generic functions perform read and write, purge, touch, and int_fetch_add operations on variables. The most common use of the generic functions is to manipulate sync and future variables, but you can also use all of the generic functions, except for the touch function, on other types of variables.

Generic functions frequently affect, or have behavior that is dependent upon, the full-empty state of the variable. Because of this, you must know the initial full-empty state of the variable before you allocate it. For sync variables, this state is full if you initialize the variable in the declaration, and empty if you do not initialize the variable. For future variables, the initial state is full. For all other variables, the initial state is full if you initialize the variable in the declaration and undefined if you do not initialize the variable.
You should avoid using generic functions on a variable (other than a sync or future variable) that is less than a word in length. Each 8-byte word of memory is associated with only one full-empty bit. If two or more variables share the same word, they share a single full-empty bit; using a generic function to modify the full-empty state of one of the variables also changes the state of the other variable(s).

You must be careful when using multiword scalars. When you use ordinary language constructs, a read or write operation of a sync or future multiword variable occurs as if the multiple words are fused and have a single full-empty bit, even when there are other read or write operations that use the same variable.

When a set of generic functions access a multiword variable simultaneously, the resulting behavior depends on the generic functions that constitute the set. If all the generic functions in the set require the variable to be in either a full or empty state, the functions access the variable in a serialized manner and the user-visible state is consistent. However, if any generic function in the set does not depend on the full-empty state (such as the purge, readxx, and writexf functions), the ability to serialize the set is not guaranteed. If the set is not serialized, generic functions may access words in the variable in a different order, resulting in inconsistencies in one or more of the following: the state of the value returned by one or more of the generics; the memory holding the variable; the data value; or the full-empty bits.

Accessing an individual memory word that is part of a multiword variable (for example, using a cast or a union) could result in inconsistent full-empty states and a data value partially composed of both current and obsolete memory contents. It may also cause a deadlock to occur.
3.2.1.1 Generic Write Functions

The generic write functions write new values to variables, depending upon the full-empty state of the variable. If the type for a value does not match the type for the variable that stores the value, the value is cast to the correct type before being written. For example, in the following:

    int i;
    writeff(&i, 2.0);

The value 2.0 (float) is cast to 2 (int) before being written to i.

The Cray XMT compiler recognizes the following generic write functions.

writeef(&v, value)
    Writes value in variable v when v is in an empty state and sets v to a full state. This allows one or more threads waiting for v to change to a full state to resume execution. If v is in a full state, the write operation is blocked until v changes to an empty state. This generic function behaves like a write access to a sync variable.

writeff(&v, value)
    Writes value in variable v when v is in a full state and leaves v in a full state. If v is in an empty state, the write is blocked until v changes to a full state. This generic function behaves like a write access to a future variable that occurs outside the body of a future statement.

writexf(&v, value)
    Writes value in variable v and sets v to a full state. This allows one or more threads waiting for v to change to a full state to resume execution. This generic function behaves like the write of a return value that occurs at the end of the body of a future statement but is not like a write access to a variable declared with the future qualifier.

int_fetch_add(&v, i)
    Atomically adds integer i to the value at address v, stores the sum at v, and returns the original value from v (setting v to a full state). Regardless of its type, i is cast as an 8-byte integer. Neither parameter can be a multiword object. If v is less than the size of a word, the compiler generates a warning diagnostic.
    If v is an empty sync or future variable, the operation is blocked until v changes to a full state.

purge(&v)
    Writes 0, using the appropriate data type, to variable v and sets v to an empty state.

For more information, see the generics(1) man page.

3.2.1.2 Generic Read Functions

Generic read functions return the value for a variable, depending upon the full-empty state of the variable. When you invoke these functions, the data type of the return value is determined by the type of the first argument in the function call.

The Cray XMT compiler recognizes the following generic read functions.

readfe(&v)
    Returns the value of variable v when v is in a full state and sets v to an empty state. This allows one or more threads waiting for v to change to an empty state to resume execution. If v is in an empty state, the read operation is blocked until v changes to a full state. This generic function behaves like a read access to a sync variable.

readff(&v)
    Returns the value of variable v when v is in a full state and leaves v in a full state. If v is in an empty state, the read operation is blocked until v changes to a full state. This generic function behaves like a read access to a future variable.

readxx(&v)
    Returns the value of variable v but does not interact with the full-empty memory state.

touch(&v)
    Returns the value of future variable v, where v is associated with a future statement that has been spawned, but whose body may or may not have already begun execution. If the future body that writes v has not begun executing, the thread calling touch executes the future body. If the future body associated with v is currently being executed or has finished executing, touch(&v) acts like a readff(&v) function. You use the touch function with future variables that are filled by the execution of code in the body of a future statement.
For more information about futures, see Using Futures in an Application on page 31.

Touching a future variable that is in an empty state but not bound to a future results in an execution-time error.

For more information, see the generics(1) man page.

3.2.2 Intrinsic Functions

Cray provides intrinsic functions for the Cray XMT system that allow direct access to machine operations from high-level languages. You can find a list of the C intrinsic functions and the machine functions in the mta_intrinsics(3) man page.

The C intrinsic function names use the name of the machine operation and add a prefix of MTA_. So, for example, the machine operation named FLOAT_ROUND becomes the C intrinsic function named MTA_FLOAT_ROUND.

When you use an intrinsic function, it calls its associated machine operation to perform the task at the processor level using assembly language. The result of a machine operation is passed back and becomes the return value of the intrinsic function.

For parameters, when the assembly language version of an instruction names two input registers and an output register, the associated intrinsic function has only two input parameters and returns a result. For example, the machine operation that you use to multiply bit matrices, (BIT_MAT_OR t u v), uses the intrinsic C function

    _int64 MTA_BIT_MAT_OR (_int64 u, _int64 v)

where the t parameter in the machine operation becomes the return value for MTA_BIT_MAT_OR and the u and v parameters are the operands. Invoke this intrinsic function by using the following statement:

    t = MTA_BIT_MAT_OR(u, v);

For the previous statement, declare t, u, and v as integer variables by using the _int64 data type. The intrinsic functions use the _int64 data type for 64-bit signed integers and the _uint64 data type for 64-bit unsigned integers.

The intrinsic functions that may be most useful are the bit matrix arithmetic functions.
For example, if you want to count 1-bits or 0-bits, use the MTA_BIT_RIGHT_ONE, MTA_BIT_LEFT_ONE, MTA_BIT_RIGHT_ZERO, or MTA_BIT_LEFT_ZERO intrinsic functions. You can use the MTA_BIT_OR and MTA_BIT_AND intrinsic functions to perform bitwise OR and AND operations.

Intrinsic functions support most machine operations that use signed or unsigned integers (int), floating-point numbers (float), or bit vectors (bit) as variables.

If you do not use a constant argument where required, it results in an unresolved reference to the intrinsic function at link time. For example, the intrinsic MTA_TEST_CC requires a compile-time constant for its second parameter. If you supply a variable instead, the compiler issues a warning and the invocation is compiled as a call, resulting in a link-time failure.

3.3 Adding Synchronization to an Application

The tasks in this section explain how to add synchronization in your application.

3.3.1 Synchronizing Data Using int_fetch_add

Use the int_fetch_add generic function to synchronize updates to data that represents shared counters without using locks. This function has the following signature:

    int_fetch_add (&v, i)

The int_fetch_add function provides access to the underlying atomic int_fetch_add machine operation. This function atomically adds i to the value at address v, stores the sum at v, returns the original value of v, and sets the state bit to full. In short, it does the following, as a single atomic operation:

    t = v;
    v = v+i;
    return t;

You can use int_fetch_add to identify the last of a group of threads to complete a task, to partition data into groups, or to maintain a stack or queue index.

3.3.2 Avoiding Deadlock

Using sync variables can introduce deadlock into a program if, when the program executes, threads attempt to do more reads than writes to a sync variable.
When you are trying to determine how many read operations the program performs, it is important to remember that every reference to a sync variable results in a separate read of that variable, even when the references occur in the same source code statement. For example, in the following cases:

• Your program references a sync variable two or more times on the right side of an assignment statement. For example, if x$ is a sync variable:

      sum = x$ + x$;

• Your program references a sync variable two or more times in a conditional test. For example, if x$ is a sync variable:

      if ((x$ >= 10)&&(x$ <= 100)){}

In these two cases, each reference to x$ results in a separate read of that variable and requires a separate write to x$. The second write to x$ must be performed by a thread other than the one executing the code in the example.

In the first case, it might have been the intention of the programmer to add together two successive values of x$. If so, this code presents no problems provided the program contains additional code that executes concurrently with the code in the example and performs the second write to x$.

In the second case, it is doubtful that the programmer's intention was to compare two different values of x$. Also, due to the short-circuiting rules in C and C++, there is no guarantee that the second read will occur. Thus, you could end up with a deadlock whether or not you have two writes to x$. If you have two writes, but the second read does not occur due to short-circuiting, your code will deadlock due to too many writes. On the other hand, if you have one write, and the second read does occur, your code will deadlock due to too many reads.
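The read counts described above can be modeled in standard C. The following is a minimal sketch, not Cray XMT code: the counted_var type and the counted_read helper are hypothetical stand-ins for a sync variable reference, and they only count reads rather than blocking on a full-empty bit. It shows that the assignment always performs two reads, while the conditional test performs one or two depending on short-circuit evaluation.

```c
/* Hypothetical stand-in for a sync variable: it counts reads
 * instead of blocking on a full-empty bit. */
typedef struct {
    int value;
    int reads;
} counted_var;

/* Models one reference to x$: every reference is a separate read. */
static int counted_read(counted_var *v)
{
    v->reads++;
    return v->value;
}

/* Models: sum = x$ + x$;  (always two reads) */
int sum_twice(counted_var *x)
{
    return counted_read(x) + counted_read(x);
}

/* Models: if ((x$ >= 10) && (x$ <= 100))
 * The second read is skipped when the first comparison is false. */
int in_range(counted_var *x)
{
    return (counted_read(x) >= 10) && (counted_read(x) <= 100);
}
```

With a value of 5, in_range performs a single read; with a value of 50, it performs two. On a real sync variable, that difference determines how many writes the rest of the program must supply.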
In both of these cases, if the intention is to read only one value for x$, a temporary variable should be used, as in this example:

    tmpx = x$;
    if ((tmpx >= 10) && (tmpx <= 100)){}

Deadlock can also occur when two or more concurrent functions access global sync variables in a different order. For example, suppose that a$ and b$ are global sync variables, and that the function fnc1 first loads a$ and then loads b$:

    tmp_a = a$;
    tmp_b = b$;

In the same program, function fnc2 first loads b$ and then loads a$:

    tmp_b = b$;
    tmp_a = a$;

If the functions run concurrently, then there is a chance of deadlock. If fnc2 loads b$ after fnc1 loads a$, but before fnc1 loads b$, then neither function can continue unless a third concurrently running function eventually writes to either a$ or b$. You can avoid this problem by always accessing a$ and b$ in the same order each time you use them in functions that may be concurrent.

3.4 Programming Considerations for Floating-point Operations

The base arithmetic for floating-point operations on the Cray XMT uses the IEEE Standard 754 format double precision (64-bit). A 64-bit floating-point number, known as a Float64 on the Cray XMT, consists of a sign bit, an 11-bit exponent, and 52 bits of fraction. Ordinary numbers (those with a biased exponent not equal to zero or 0x7FF) have an exponent bias of 1023 (0x3FF) and their absolute value can be expressed using the following equation:

    (1.0 + fraction) << (exponent - 0x3FF)

The value is negative if the sign bit is set, positive if it is not set.

A number with a biased exponent of 2047 (0x7FF) is a special floating-point number, known as a SpecialFloat64 on the Cray XMT. If all the fraction bits are zero, the value of the number is plus or minus infinity. Infinity generally occurs in calculations as a result of an overflow or division by zero. For example, 1.0/0.0 is positive infinity, while 1.e300*-1.e300 is negative infinity.
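The Float64 layout and the infinity encoding described above can be inspected in standard C on any platform whose double is an IEEE 754 64-bit value. The following is a minimal sketch under that assumption; the float64_fields type and both function names are illustrative and are not part of the Cray XMT Programming Environment.

```c
#include <stdint.h>
#include <string.h>

/* The three fields of a 64-bit IEEE 754 value (illustrative type). */
typedef struct {
    unsigned sign;      /* 1 bit: set for negative values */
    unsigned exponent;  /* 11-bit biased exponent */
    uint64_t fraction;  /* 52-bit fraction */
} float64_fields;

/* Split a double into sign, biased exponent, and fraction. */
float64_fields float64_split(double x)
{
    uint64_t bits;
    float64_fields f;

    memcpy(&bits, &x, sizeof bits);  /* copy the raw bit pattern */
    f.sign = (unsigned)(bits >> 63);
    f.exponent = (unsigned)((bits >> 52) & 0x7FF);
    f.fraction = bits & 0x000FFFFFFFFFFFFFULL;
    return f;
}

/* Plus or minus infinity: biased exponent 0x7FF and a zero fraction. */
int float64_is_infinity(double x)
{
    float64_fields f = float64_split(x);
    return f.exponent == 0x7FF && f.fraction == 0;
}
```

For example, float64_split(1.0) yields a sign of 0, a biased exponent of 0x3FF, and a fraction of 0, which matches the equation above with an unbiased exponent of zero.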
Calculations such as 0.0/0.0 create a result that is called not a number (NaN). Any 64-bit floating-point number with a biased exponent of 0x7FF and a non-zero fraction represents NaN. After NaN enters a computation, it persists through addition, subtraction, multiplication, and division. When a calculation produces a NaN, it indicates an error in your program or data.

In arithmetic comparisons, NaN is not equal to any number, including itself. NaN is neither less than nor greater than any number. In fact, such comparisons raise an exception when one of the numbers being compared is NaN. This implies that the opposite of less than is not greater than but greater than, equal to, or unordered. In this case, unordered allows for the possibility that one of the numbers in the comparison is NaN. The Cray XMT hardware supports comparisons such as less than, equal to, or unordered, and the compilers use these comparisons as necessary when reversing the sense of a test.

There are two representations of zero in the Cray XMT hardware. The number 0x0000000000000000 represents +0.0 while 0x8000000000000000 represents -0.0. Although +0.0 and -0.0 appear to be equal to each other, you can distinguish between them when using them in computations. In particular, 1.0/0.0 equals positive infinity while 1.0/-0.0 equals negative infinity. These values obey computational rules under multiplication, as shown in the following example.

    0.0*(-1.0) = -0.0
    (-0.0)*(-1.0) = 0.0

and so on. For any finite nonzero x, x - x = +0.0. This implies that b - a is not equivalent to -(a - b). For computations with zero, the following rules hold:

    +0.0 - (+0.0) = +0.0 - (-0.0) = (-0.0) - (-0.0) = +0.0

However...

    -0.0 - (+0.0) = -0.0

Underflow in the Cray XMT hardware is gradual in accordance with the IEEE 754 standard. Computations that underflow, producing a rounded result smaller in magnitude than 0x0010000000000000, or about 2.225e-308, do not all flush to zero.
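The NaN, signed-zero, and gradual-underflow rules above can be checked with a short self-test in standard C. This is a sketch for any IEEE 754 host using the default round-to-nearest mode, not Cray XMT-specific code; check_special_values, from_bits, and to_bits are illustrative names. The volatile qualifiers are only there to keep the compiler from evaluating the expressions at compile time.

```c
#include <assert.h>
#include <float.h>
#include <stdint.h>
#include <string.h>

static double from_bits(uint64_t bits)
{
    double d;

    memcpy(&d, &bits, sizeof d);
    return d;
}

static uint64_t to_bits(double d)
{
    uint64_t bits;

    memcpy(&bits, &d, sizeof bits);
    return bits;
}

void check_special_values(void)
{
    volatile double pos_zero = 0.0;
    volatile double neg_zero = from_bits(UINT64_C(0x8000000000000000));
    volatile double min_normal = DBL_MIN;      /* 0x0010000000000000 */
    double nan_value = pos_zero / pos_zero;    /* 0.0/0.0 produces NaN */
    double half_min, x;

    /* NaN is unordered: every one of these comparisons is false. */
    assert(!(nan_value == nan_value));
    assert(!(nan_value < 1.0));
    assert(!(nan_value >= 1.0));  /* so !(a < b) does not imply a >= b */

    /* The two zeros compare equal but behave differently as divisors. */
    assert(neg_zero == pos_zero);
    assert(1.0 / pos_zero > 0.0);
    assert(1.0 / neg_zero < 0.0);

    /* Subtraction rules for zero. */
    assert(to_bits(pos_zero - neg_zero) == 0);
    assert(to_bits(neg_zero - neg_zero) == 0);
    assert(to_bits(neg_zero - pos_zero) == UINT64_C(0x8000000000000000));

    /* Gradual underflow: results below the smallest normal number
     * are kept as subnormal values instead of being flushed to zero. */
    half_min = min_normal / 2.0;
    assert(half_min > 0.0 && half_min < DBL_MIN);
    x = min_normal * 1.5;
    assert(x - min_normal == half_min);        /* difference is exact */
    assert(from_bits(1) > 0.0);                /* smallest positive pattern */
}
```

Calling check_special_values() returns silently when every rule holds. On a system that flushes underflow to zero, the gradual-underflow assertions would be expected to fail.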
If the result has an absolute value greater than or equal to min_denorm, such as 0x0000000000000001, or about 4.94e-324, it is a subnormal number. A subnormal number is one with a zero-biased exponent and a nonzero fraction such as 0x0000000000000001 or 0x800FFFFFFFFFFFFF. The absolute value for such a subnormal number is the following:

    (0.0 + fraction) >> 1022

Subnormal numbers are less precise than normalized numbers. The smallest subnormal number, min_denorm, has only one significant bit while the largest has 52 significant bits. However, whenever 0.5 <= x/y <= 2.0, the difference x - y is exact, even though it may have less precision than x and y. This is not true for machines that flush underflow to zero.

The Cray XMT floating-point hardware handles gradual underflow transparently. Unlike many systems, the Cray XMT is not slowed by the presence (or possibility) of subnormal numbers and gradual underflow in a computation.

3.4.1 Differences from IEEE Floating-point Arithmetic

The Cray XMT processors do not have 32-bit floating-point instructions. If you are performing an operation on 32-bit floating-point numbers, you must first use the MTA_FLOAT_REAL intrinsic function to convert each 32-bit number in the operation to a 64-bit number. After the operation is complete, you can use the MTA_REAL_FLOAT intrinsic function to round the results to 32-bit numbers. This double rounding (first to 64 bits and then to 32 bits) is not the same as a single rounding to 32 bits. For more information about how to use MTA_FLOAT_REAL and MTA_REAL_FLOAT, see the mta_intrinsics(3) man page.

The Cray XMT does not provide you with control over rounding precision for floating-point operations. The level of rounding precision is set on the processor during the manufacturing process.
Traps on the Cray XMT are precise, but operands can be overwritten by the results of an operation performed on the same or a different functional unit. This can make the implementation of post-substitution difficult.

There is no exponent wrapping when an operation enables or takes an overflow or underflow trap. The intent of wrapping is to provide for automatic rescaling when products or quotients are used in subsequent operations. On the Cray XMT, you must use care when rescaling.

The hardware supports fused multiply-add operations that only require a single issue of an instruction. This operation facilitates certain computations by making it easy to extract the lower half of the product of two 64-bit doubles. The problem is that the compiler can evaluate statements such as the following in several different ways, each of which may produce a different result:

    x = a*b + c*d;

The previous statement can be evaluated as either:

    temp = a*b;
    x = temp + c*d;    // For multiply-add operation

Or

    temp = c*d;
    x = a*b + temp;    // For multiply-add operation

Or

    temp1 = a*b;
    temp2 = c*d;
    x = temp1 + temp2;

The only way to override the compiler instructions for a particular multiply-add operation is to put each multiply operation on a separate line, as in the third example. You can use the -no_mul_add compiler flag to disable multiply-add operations.

Rather than using a multiply-add operation, the compiler may use a common subexpression, as shown in the following example.

    x = a*b;        // For multiply
    y = a*b + c;    // Essentially y = x + c

In cases like this, you can use the #pragma mta single round required pragma in a C program to indicate to the compiler that it must use a multiply-add operation.

The Cray XMT does not support signaling NaNs. For all data types, the Cray XMT identifies uninitialized floating-point data by throwing poison errors rather than using signaling NaNs. See Appendix A, Error Messages on page 99.
3.4.2 Differences from Cray Floating-point Arithmetic

There are several versions of floating-point arithmetic on Cray systems. Newer Cray systems, such as the Cray XMT, use formats based on IEEE 754. Older Cray systems used a proprietary format that differs from IEEE 754 (and from the Cray XMT implementation of IEEE 754) in significant ways. This older format is known as Cray floating-point arithmetic.

Cray floating-point arithmetic uses a 48-bit significand, which has less precision than the 53-bit significand used by the Cray XMT. The significand is the part of a floating-point number that contains its significant digits. Cray floating-point arithmetic has a 15-bit exponent with exponents that contain values between -8192 and 8191. This is a much larger range than the exponents for the Cray XMT that contain values between -1022 and 1023. Cray floating-point operations lack guard digits for subtraction and are known to have certain anomalies in computations.

In general, older Cray code that does not rely on the extra-large exponent range runs without modification on the Cray XMT. Otherwise, some rescaling is required for the Cray XMT. In addition, programs designed for older Cray systems may contain work-around code to handle Cray floating-point anomalies. This code is not necessary on the Cray XMT.

3.4.3 32-bit and 64-bit Implementation of Floating-point Arithmetic

The double data type in C uses the format for double-precision (64-bit) arithmetic provided by IEEE Standard 754 guidelines. Cray XMT hardware does not support IEEE Standard 754 extended precision, and all 32-bit arithmetic is done by promotion to 64-bit formats.

Rounding mode on the Cray XMT is controlled on a per-thread basis using mode bits in the stream status word (SSW). A newly created stream inherits the rounding mode of its parent.
Hardware instructions that convert from an int or unsigned int number to a floating-point number use the same rounding mode as the SSW. You can use the MTA_FLOAT_UNS intrinsic function when converting large unsigned integers to a floating-point number. You can use the current rounding mode as the basis for converting a floating-point number to an integer by using the MTA_FLOAT_ROUND intrinsic function or use explicit rounding that ignores the mode bits in the SSW by using the MTA_FLOAT_CEIL, MTA_FLOAT_CHOP, MTA_FLOAT_FLOOR, or MTA_FLOAT_NEAR intrinsic functions.

Each thread has its own set of floating-point exception flags and traps that can be enabled in its SSW. The normal mode of operation is to run with all floating-point traps disabled.

If you convert a 64-bit floating-point number to a decimal string with at least 17 significant decimal digits and then convert it back to a 64-bit floating-point number, the result matches the original. If you convert a decimal string with n decimal digits, where n is less than 15, to a 64-bit floating-point number and then convert it back to n decimal digits, the result matches the original string.

Add, subtract, and multiply operations each use one processor instruction on the Cray XMT. Divide operations use eight instructions, and square root operations require ten instructions. There is room in the divide and square-root sequences for other operations, particularly in the memory unit.

3.4.4 Rounding Results of Floating-point Operations

The standard C math and C++ cmath libraries implement a set of functions that you can use when performing basic mathematical operations such as the log function for logarithms. When you use the math library functions on the Cray XMT, these mathematical operations do not necessarily produce correctly rounded results, except for the sqrt() function. Function results are generally accurate to within one unit in the last place, but there are exceptions, especially for large arguments.
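The decimal round-trip properties stated in Section 3.4.3 can be exercised with the standard C library alone. The sketch below assumes a C library whose snprintf and strtod conversions are correctly rounded; the function names are illustrative. The second function compares strings, so it only applies to inputs already written the way the %g conversion prints them (no trailing zeros, for example).

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Print d with 17 significant digits, parse the text back, and
 * report whether the original value was recovered. */
int survives_17_digit_round_trip(double d)
{
    char text[64];

    snprintf(text, sizeof text, "%.17g", d);
    return strtod(text, NULL) == d;
}

/* Parse a decimal string with `digits` significant digits, print the
 * resulting double with the same number of digits, and report whether
 * the original text was recovered. */
int decimal_survives_round_trip(const char *decimal, int digits)
{
    char text[64];
    double d = strtod(decimal, NULL);

    snprintf(text, sizeof text, "%.*g", digits, d);
    return strcmp(text, decimal) == 0;
}
```

A typical use is checking values before writing them to a text file that another program will read back, for example survives_17_digit_round_trip(0.1) or decimal_survives_round_trip("123.456", 6).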
Trigonometric functions do infinitely precise argument reduction.

Numbers are rounded according to the IEEE Standard 754. The default rounding method is overridden when you use the following intrinsic conversion functions: MTA_FLOAT_CEIL, MTA_INT_CHOP, and MTA_UNS_FLOOR.

The current rounding mode for the math library is set to round to the nearest place (RND_NEAR). User functions that change the rounding mode must reset it to RND_NEAR before calling the math library functions.

Exceptions are handled silently by the math library. No messages are printed, and errno is not set by the library. If functions return NaN or infinity, these arguments are propagated silently by the library. Exception flags are raised as appropriate.

3.5 Using Futures in an Application

In your application, a future consists of:

• A future statement that creates a continuation pointing to a series of statements that may be executed by another thread.
• An optional future-qualified variable, known as a future variable, that synchronizes execution of other program threads upon completion of the future. The name of the future variable is also the name of the future.
• Parameters used by the spawning thread to pass values to the thread executing the future.
• The future body, which contains the statements pointed to by the continuation that may be executed by another thread. The body may end with a return statement that writes a value to the future variable.

The keyword future is used in two ways:

• As a type qualifier for a synchronization variable.

    future int x$;

  Upon allocation, the full-empty state of the future variable x$ is set to full.

• As a statement.

    future x$(i) {
        return printf("i is %d\n", i);
    }

  In the previous statement, the full-empty state for x$ is set to empty. The argument i is passed in to the future body by value. The stream places the future on a queue that executes the future bodies asynchronously.
Any stream can now dequeue the future and execute its body. The return value is stored to x$. Finally, the full-empty bit of x$ is set to full after the return value is stored in x$.

Future statements contain the name of a future variable and parameters, a body, and a return statement. The future variable's value is set by the return statement. The future variable is optional; if no future variable is specified, the return statement of the future body supplies no value. For example:

    int x, y, z;
    future int i$;
    future i$(x, y, z) {
        /* Some body statements */
        return x*y*z;
    }

In the previous example, when the computation completes, the return value is stored in the future variable i$. Subsequent accesses to i$ are delayed until the future completes.

The use of future variables is limited to scalar data types such as char, int, float, double, pointers, and array elements. The body of a future statement may contain any legal statement including function calls and other future statements.

For a recursive operation, you can eliminate some of the overhead of blocking a thread by using the keyword touch in your program. This leaves the semantics unchanged, but if the future body has not begun, the calling thread executes it directly.

    int search_tree(Tree *root, unsigned target) {
        int sum = 0;
        if (root) {
            future int left$;
            future left$(root, target) {
                return search_tree(root->llink, target);
            }
            sum = root->data == target;
            sum += search_tree(root->rlink, target);
            sum += touch(&left$);
        }
        return sum;
    }

In the previous example, the touch operation checks if any thread has started to execute the future body associated with left$. If so, it waits for the future body to complete. If not, the thread calling touch executes the future body itself.

3.5.1 Improving Performance of Future Statements

When your application is compiled, future statements cause the compiler to create continuations.
Continuations are structures that contain pointers to routines that contain the code from the body of the future statement and a list of arguments to pass to that code.

Continuations are normally allocated and deallocated from the heap. However, if the associated future variable is a scalar variable that is located on the stack, the compiler causes the continuation to be placed on the stack. This reduces the overhead associated with allocation and deallocation operations. The compiler does not do this when there is an array of future variables on the stack because this requires an array of continuations. Continuations can be large so an array of continuations might cause the stack size to become very large.

You can force the compiler to place an array of continuations on the stack by using the stack_continuations attribute in your application. This may improve the performance of the application. The attribute has the following syntax:

    __attribute__((stack_continuations))

You can add this attribute to any future-qualified stack-based array declaration in your application.

    void myFutures() {
        future int children[10] __attribute__((stack_continuations));
        // ...
    }

Another way to improve performance is by employing the autotouch compilation mode. This compilation mode automatically applies the touch generic whenever a future variable is referenced. There are three ways to use autotouch:

• The -autotouch compiler flag enables autotouch for all source modules compiled with the flag.
• The pragma directive mta autotouch can be applied to a single source module. The on option enables automatic touching, the off option disables automatic touching, and the default option reverts the autotouch mode to the default mode for that source module, which was determined by the compile-line flags.
• The gcc-style attribute future int foo$ __attribute__ ((autotouch (on|off))); allows you to change the autotouch mode on a per-variable basis.
For example, future int foo$ __attribute__ ((autotouch (on))). The on option enables autotouch for all references to this variable, regardless of the current command-line flags or pragmas. Similarly, the off option disables autotouch for all references to the variable. This attribute generates a warning if it is applied to a variable without the future qualifier.

3.5.2 Anonymous futures

Often, a concurrent computation does not have a return value. An example of such a concurrent computation is an I/O statement or a modification of global values. You can express such a computation using an anonymous future. An anonymous future has no name or return statement. If the anonymous future does not access a synchronized variable referenced by the main computation, there will be no dependence between the future and the main computation. If a future does not create a dependence, the future may not execute. An anonymous future does not need to execute or finish for a program to terminate normally.

3.6 Testing Expressions Using Condition Codes

When you use arithmetic expressions in your code, you can test the expressions by using the MTA_TEST_CC intrinsic function. This function returns condition codes that identify problems in the expression. It uses the following prototype:

    MTA_TEST_CC(expression, condition-mask)

MTA_TEST_CC evaluates the expression and generates a condition code. If the resulting condition code is a member of the set of condition values in condition-mask, true is returned; otherwise, false is returned.

The expression can be a scalar variable, a single arithmetic operation, or a machine intrinsic. If you use a scalar variable, you must assign a value to it in the statement immediately preceding the call to MTA_TEST_CC. In MTA_TEST_CC, you test the operation on the right side of the assignment statement. The condition-mask should evaluate to a compile-time constant.
The condition codes and possible values for the condition-mask are listed in Appendix D, Condition Codes on page 133.

Example 1. Testing a shift-left operation for a carried number

MTA_TEST_CC allows branching based on any of the condition codes produced by the machine intrinsics. For example, consider the problem of testing to see if one of the upper 32 bits in an integer is set. One approach is to use the MTA_SHIFT_LEFT intrinsic function, which generates a carried number if a 1 bit is shifted out. When using MTA_SHIFT_LEFT, you can use MTA_TEST_CC with the IF_CY condition mask to check for a carried number, as shown in the following example.

    enum{IF_CY = 16+32+64+128};
    if(MTA_TEST_CC(MTA_SHIFT_LEFT(i, 32), IF_CY)) {
        printf("One of the upper 32 bits was set\n");
    }

In the previous example, the compiler would emit a SHIFT_LEFT_IMM_TEST operation, followed by a conditional branch on carry.

Example 2. Retrieving a condition code and result of a previous operation

It is also possible to test the condition code generated by some earlier operation, allowing you to make use of both the condition code and the result of the operation. In the following example, MTA_TEST_CC is used to test whether there was a carry generated by MTA_BIT_LEFT_ZEROS. MTA_BIT_LEFT_ZEROS returns the number of consecutive 0 bits on the left end of the word.

    enum{IF_CY = 16+32+64+128};
    const int j = MTA_BIT_LEFT_ZEROS(i);
    if(MTA_TEST_CC(j, IF_CY)) {
        printf("i was zero\n");
    }
    else {
        printf("i had %d significant zeros\n", j);
    }

Example 3. Retrieving a condition code set by a previous operation

The operation that sets the condition code does not need to be an intrinsic function. The condition code is usually set by an ordinary addition or multiplication operation, such as the following.
    enum{IF_CY = 16+32+64+128};
    const int k = i + j;
    if(MTA_TEST_CC(k, IF_CY)) {
        printf("carry generated\n");
    }

If the expression is more complex, the condition code is only available from the last operation. For example, the expression in the following example requires two adds but only the second add affects the condition code. Because the compiler can evaluate this code in three different ways, it may not yield the correct result.

    enum{IF_CY = 16 + 32 + 64 + 128};
    const int m = i + j + k;
    if(MTA_TEST_CC(m, IF_CY)) {
        printf("carry generated\n");
    }

3.7 File I/O

The Cray XMT performs I/O to a RAM-based file system (RAMFS) and a network file system (NFS). Neither the RAMFS nor the NFS is a high-speed file system; therefore, any data over 2 gigabytes in size must be written to a high-speed file system, such as Lustre. You can use the NFS for small amounts of data, such as user files.

During the system reboot, all data is lost from the RAMFS because it is not written to disk. Any data that you need to retain across system boots must be written to the Lustre file system prior to rebooting the system. The XMT-PE provides snapshot functions that you can use to move data between the service nodes and the Cray XMT compute nodes. Once the data is on the service nodes, you can use standard Cray XT commands to move data to the Lustre file system.

The underlying details of the file system are abstracted behind UNIX library calls that you can add to your program to perform I/O. The Cray XMT system provides some support for concurrent I/O to multiple files, but you must provide explicit access control for concurrent I/O to a single file.

The following sections discuss standard language-supported forms of I/O as well as I/O using the low-level UNIX I/O functions. Each section discusses the semantics, particularly parallelism, and performance possibilities.
3.7.1 Language-level I/O

In serial code, the standard I/O functions behave as specified in the ANSI C standard. In parallel code, all calls to the standard I/O package are executed atomically. Atomic execution means that while one call is executing, no other call can interfere with what the first is doing. Each call appears to run from beginning to end without interruption.

    #pragma mta assert parallel
    for (i = 0; i < n; i++) {
        fprintf(f,"this is iteration %d\n", i);
    }

The previous code asserts that the loop is parallel. In general, it is not safe for the compiler to parallelize a loop that contains procedure calls, especially calls to I/O functions. The assertion indicates to the compiler that, in this case, it is safe to parallelize the loop. The atomicity guarantee ensures that each line written to f by this loop is of the form that follows:

    this is iteration i

Two lines are never mixed together, so the following never occurs:

    this is this is iteration j

The actual sequence of lines is random because the different iterations are all executed in parallel. However, for a sequence of calls such as the following:

    #pragma mta assert parallel
    for (i = 0; i < n; i++) {
        fprintf(f,"this is ");
        fprintf(f,"iteration %d\n", i);
    }

The output may look like the following, because only the individual calls to fprintf are atomic:

    this is iteration i
    this is this is iteration k
    iteration j

Example 4. Calling standard I/O functions from parallel code

To avoid the previous problem, you can combine the two calls to fprintf or add some sort of explicit synchronization. For example:

    sync int flag$ = 1;
    #pragma mta assert parallel
    for (i = 0; i < n; i++) {
        int j = flag$;    // lock
        fprintf(f,"this is ");
        fprintf(f,"iteration %d\n", i);
        flag$ = j;        // unlock
    }

The previous code manipulates the sync variable flag$ to create an atomic section that contains two calls to fprintf.
The actual value loaded from and stored to flag$ is not important because the code uses flag$ as a lock.

Example 5. Calling record-oriented I/O functions from parallel code

For record-oriented I/O, such as that which occurs when using a combination of fseek together with fread or fwrite, you can use explicit synchronization to ensure correct behavior, such as in the following code example:

    sync int flag$ = 1;
    #pragma mta assert parallel
    for (i = 0; i < n; i++) {
        Buf buffer;
        int j = flag$;    // lock
        fseek(f, i*sizeof(Buf), SEEK_SET);
        fread(buffer, sizeof(Buf), 1, f);
        flag$ = j;        // unlock
        // Work with buffer
    }

In the previous code, flag$ controls access to file f, ensuring that the combination of fseek and fread are executed atomically. In this case, you use SEEK_SET because SEEK_CUR (positioning relative to the current position) is not useful in a parallel context.

Example 6. Preventing racing when calling I/O functions

You use a similar technique when using ferror with another call to ensure that any error detected by the ferror call was not caused by a racing read or write call from a different thread. For example, in the following code, calls to several I/O functions are grouped together so that they are all executed atomically.

    sync int flag$ = 1;
    #pragma mta assert parallel
    for (i = 0; i < n; i++) {
        Buf buffer;
        int err;
        int j = flag$;    // lock
        fseek(f, i*sizeof(Buf), SEEK_SET);
        fread(buffer, sizeof(Buf), 1, f);
        err = ferror(f);
        flag$ = j;        // unlock
        if (!err) {
            // Work with buffer
        }
    }

In the previous code, the result of the call to ferror is saved to a variable (err) for later testing.

The same considerations apply when using futures or more complex loops, perhaps with the I/O hidden within a nest of procedure calls. Single calls always execute atomically. However, when a sequence of calls pertaining to a single file must be executed atomically, you must manage the sequence of calls explicitly.
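Example 4 notes that one alternative to explicit synchronization is to combine the calls. When the pieces of a line are produced by separate formatting steps, one way to do that is to assemble the line in a local buffer and hand it to the stdio library in a single call. The following is a minimal sketch in standard C; print_iteration_line is an illustrative name, and the sketch relies only on the per-call atomicity described in this section.

```c
#include <assert.h>
#include <stdio.h>

/* Build the whole line in a thread-local buffer, then emit it with a
 * single stdio call so that the line cannot be split by other callers.
 * Returns the number of characters written, or -1 on error. */
int print_iteration_line(FILE *f, int i)
{
    char line[64];
    size_t used = 0;

    used += (size_t)snprintf(line + used, sizeof line - used, "this is ");
    used += (size_t)snprintf(line + used, sizeof line - used,
                             "iteration %d\n", i);
    assert(used < sizeof line);   /* the longest possible line is 30 chars */

    return fputs(line, f) == EOF ? -1 : (int)used;
}
```

Inside a parallel loop, each iteration would call print_iteration_line(f, i) in place of the two fprintf calls. The formatting work then happens outside any lock, and only the final fputs is serialized by the library.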
Internally, the stdio library enforces locking for each FILE object (FILE is a data type defined in stdio.h). As a result, output to a number of different files can proceed in parallel, but output to a single file is serialized. Similarly, you can use sprintf and sscanf independently of calls to other functions because these functions do not use a FILE object. For example, for the loop in the following example, every iteration refers to a different FILE object, so each call to fprintf can run without interfering with files used by another iteration.

    #pragma mta assert parallel
    for (i = 0; i < n; i++) {
        fprintf(f[i],"this is iteration %d\n", i);
    }

If many parallel calls refer to the same file, locking forces a serial execution order. For example, in the following code, it makes little sense to run the loop in parallel because the calls to fprintf are serialized by the lock on the FILE object referred to by g. However, the interpretation of the format string is controlled by the lock.

    #pragma mta assert parallel
    for (i = 0; i < n; i++) {
        fprintf(g,"this is iteration %d\n", i);
    }

If the loop contains significant computations, such as in the following example, you may want to parallelize the loop.

    #pragma mta assert parallel
    for (i = 0; i < n; i++) {
        int j = expensive_function(i);
        fprintf(g,"f(%d) = %d\n", i, j);
    }

You cannot use the stdio functions to support concurrent file access. For example, consider the following code:

    #pragma mta assert parallel
    for (i = 0; i < n; i++) {
        Buf buffer;
        FILE *f = fopen(file_name, "r");
        fseek(f, i*sizeof(Buf), SEEK_SET);
        fread(buffer, sizeof(Buf), 1, f);
        fclose(f);
    }

There are two problems in this example:

• If n is large, the system cannot support so many open files.
• The file position (set by fseek) is shared among all open versions of the file, so races may occur.
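The fseek, fread, and ferror sequence that Examples 5 and 6 place inside the locked region can be packaged as one helper that works on a single shared FILE object, which avoids reopening the file in every iteration. The sketch below is standard C and makes several assumptions: Buf is a hypothetical fixed-size record type (the guide does not define it), read_record is an illustrative name, and the caller is responsible for holding whatever lock protects the file, such as the flag$ sync variable in the examples.

```c
#include <stdio.h>

/* Hypothetical fixed-size record; stands in for Buf in the examples. */
typedef struct {
    char bytes[16];
} Buf;

/* Seek to record `index`, read it into *out, and check for errors.
 * Returns 0 on success and -1 on failure (seek error, short read,
 * or stream error). The caller must already hold the lock for f. */
int read_record(FILE *f, long index, Buf *out)
{
    if (fseek(f, index * (long)sizeof(Buf), SEEK_SET) != 0)
        return -1;
    if (fread(out, sizeof(Buf), 1, f) != 1)
        return -1;              /* end of file or read error */
    return ferror(f) ? -1 : 0;
}
```

With this helper, the locked region in a parallel loop shrinks to a single call between the lock and unlock statements, and the status it returns can be tested after the lock is released, as Example 6 does with err.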
3.7.2 System-level I/O

There are a number of low-level functions provided by the operating system to support more flexible and efficient I/O. However, you should avoid accessing a given file using both the high-level language-dependent methods and the low-level functions. The high-level functions use buffering that may interact with the low-level functions in unpredictable ways.

Example 7. Calling UNIX I/O functions from parallel code

In serial code, the low-level UNIX functions behave as specified by the Posix standard. In parallel code, all calls are executed atomically. In this case, you must explicitly manage access to a particular file by a sequence of calls, to prevent races. For example:

    #pragma mta assert parallel
    for (i = 0; i < n; i++) {
        char line[80];
        int len = sprintf(line, "this is iteration %d\n", i);
        write(fd, line, len);
    }

The previous code asserts that the loop is parallel. In general, it is not safe for the compiler to parallelize a loop that contains procedure calls, especially calls to I/O functions. The assertion tells the compiler that, in this case, it is safe to parallelize the loop. The atomicity guarantee ensures that each line written to fd by this loop is of the form that follows:

    this is iteration i

Two lines are never mixed together, so the following never occurs:

    this is this is iteration j

The actual sequence of lines is random because the different iterations are all executed in parallel. However, for a sequence of calls such as the following:

    char part1[80];
    int len1 = sprintf(part1, "this is iteration ");
    #pragma mta assert parallel
    for (i = 0; i < n; i++) {
        char part2[80];
        int len2 = sprintf(part2,"%d\n", i);
        write(fd, part1, len1);
        write(fd, part2, len2);
    }

The output may look like the following, because only the individual calls to write are atomic:

    this is iteration this is iteration i
    k
    this is iteration j

Example 8. Using synchronization with UNIX I/O functions

To correct this problem, you can either rewrite the code in the style of the first example or add some sort of explicit synchronization, as shown in the following example.

    sync int flag$ = 1;
    char part1[80];
    int len1 = sprintf(part1, "this is iteration ");
    #pragma mta assert parallel
    for (i = 0; i < n; i++) {
        char part2[80];
        int len2 = sprintf(part2, "%d\n", i);
        int j = flag$;    // lock
        write(fd, part1, len1);
        write(fd, part2, len2);
        flag$ = j;        // unlock
    }

The previous code manipulates the sync variable flag$ to create an atomic section that contains two calls to write. The actual value loaded from and stored to flag$ is not important because the code uses flag$ as a lock.

Example 9. Using synchronization with UNIX record-oriented I/O functions

For record-oriented I/O, you can use explicit synchronization to ensure the correct behavior by using a combination of lseek together with a read or write operation, such as in the following code example.

    sync int flag$ = 1;
    #pragma mta assert parallel
    for (i = 0; i < n; i++) {
        Buf buffer;
        int j = flag$;    // lock
        lseek(fd, i*sizeof(Buf), SEEK_SET);
        read(fd, buffer, sizeof(Buf));
        flag$ = j;        // unlock
        // Work with buffer
    }

In the previous code, flag$ controls access to file fd, ensuring that the combination of lseek and read are executed atomically. In this case, you use SEEK_SET because SEEK_CUR is not useful in a parallel context.

The same considerations apply when using futures or more complex loops, perhaps with the I/O hidden within a nest of procedure calls. Single calls always execute atomically. However, when a sequence of calls pertaining to a single file must be executed atomically, you must manage the sequence explicitly.

Internally, the UNIX library enforces locking for each file descriptor so that output to multiple files can occur in parallel, but output to a single file occurs serially.
For example, in the following loop, every iteration refers to a different file descriptor, so each call to write runs without interfering with other calls.

    #pragma mta assert parallel
    for (i = 0; i < n; i++) {
        char line[80];
        int len = sprintf(line, "this is iteration %d\n", i);
        write(fd[i], line, len);
    }

If many parallel calls refer to the same file, locking forces a serial execution order. For example, in the following code, it makes little sense to run the loop in parallel because calls to write are serialized by the lock on the file descriptor fd.

    #pragma mta assert parallel
    for (i = 0; i < n; i++) {
        char line[80];
        int len = sprintf(line, "this is iteration %d\n", i);
        write(fd, line, len);
    }

If the loop contains a significant computation, such as in the following example, you may want to parallelize the loop.

    #pragma mta assert parallel
    for (i = 0; i < n; i++) {
        char line[80];
        int j = expensive_function(i);
        int len = sprintf(line, "f(%d) = %d\n", i, j);
        write(fd, line, len);
    }

You cannot use the other low-level UNIX I/O functions to support concurrent access to a single file.

3.8 Porting Programs to the Cray XMT

Use the following information when you prepare to port C and C++ programs to the Cray XMT platform.

64-bit issues

The following list describes important 64-bit issues.

Alignment
On the Cray XMT, many data types are aligned on 8-byte boundaries that other machines align on 2- or 4-byte boundaries. The Cray XMT uses the following alignments:

• 1-byte boundaries: char, __int8
• 2-byte boundaries: __short16, __int16
• 4-byte boundaries: short, __short32, float, __int32
• 8-byte boundaries: int, long, double, long double, and all pointers

Bit shift and bit mask
Be careful when using bit shift or bit mask to extract fields of a value. Problems can occur if the size of the value type on the Cray XMT is different from the size on the machine you are porting from.
Conversion of floating-point data types
In C and C++ programs, floating-point data types are converted to doubles in all expressions. This conversion is also made on the Cray XMT, except for long doubles (16 bytes long), which are not converted to doubles (8 bytes long).

Unions
Unions sometimes contain assumptions about the relative sizes of data types. For example, on some machines, two int values use the same number of bytes as a long. However, on the Cray XMT, int and long values use the same number of bytes. When in doubt, use the sizeof operator to determine the size of data types.

POSIX compliance

The following list describes issues related to IEEE Portable Operating System Interface (POSIX) compliance.

errno.h
errno is thread-specific and not a global variable. Files that use errno in the same way that it is used by library calls such as perror must include errno.h. This is required by ANSI and POSIX, but most systems do not comply with this convention. On the Cray XMT, each thread has its own value of errno, so you must include errno.h for correct behavior.

time.h
One goal of the Cray XMT is to support a POSIX-compliant application programming interface. As a result, when you port non-POSIX programs, you may have to change the header files that are included. For example, you may need to include time.h instead of, or in addition to, sys/time.h.

Executable formats
On the Cray XMT, executable programs are in ELF format instead of a.out format. Therefore, you should replace a.out.h in your programs with elf64.h. Another characteristic of the ELF format is that uninitialized and initialized global variables are intermixed in memory.

Miscellaneous issues

The following list describes important miscellaneous issues.

printf and $
Different implementations of printf have different ways of interpreting $. The implementation of printf on the Cray XMT does not give $ a special interpretation.
C and C++ structure passing
Structures cannot be passed by value from C to C++.

mmap
mmap is based on file data-block size. The data-block size for a Cray XMT file is different from that on BSD 4.4 UNIX. Although you can use mmap, the mmap_fsblk system call provides richer semantics.

Cray XMT keywords
You can disable Cray XMT specific keywords (for example, sync and future) by using the compiler flag -no_mta_ext. When this flag is not used, the C compiler for the Cray XMT reserves all keywords, even standard C++ keywords such as new, try, throw, and catch.

Preprocessor directives
The following directives are supported on the Cray XMT:

    #define    #elif      #else      #endif
    #error     #ident     #if        #ifdef
    #ifndef    #include   #line      #undef
    #pragma    #pragma fenv_access
    #pragma noalias       #pragma once

3.9 Debugging the Program

After completing your program, refer to the Cray XMT Debugger Reference Guide for debugging information.

Shared Memory Between Processes [4]

You can share memory between multiple programs by creating a shared memory region using the mmap system call.

4.1 Mapping a Memory Region for Data Sharing

A shared memory region is identified by a file name. Before your applications can use shared memory, you must create an empty readable, writable file and call mmap to map a memory region to use for shared memory. When you call mmap, it allocates the specified amount of physical memory and maps it into the caller's address space. Other programs may share the same memory region by specifying the same file name. A process may use the munmap system call to unmap the shared memory region.

Example 10. Mapping memory to share among multiple processes

The following example demonstrates how to create a file and map it to a memory location.
    #include <sys/types.h>
    #include <sys/mman.h>
    #include <fcntl.h>
    #include <stdio.h>

    #define SHARED_SIZE (256*1024*1024)

    int main(int argc, char *argv[]) {
        // O_CREAT requires a mode argument; 0600 gives the owner read/write access
        int fd = open(argv[1], O_RDWR|O_CREAT, 0600);
        if (fd == -1) {
            perror(argv[1]);
            return 1;
        }
        caddr_t data = mmap(0, SHARED_SIZE, PROT_READ|PROT_WRITE,
                            MAP_SHARED|MAP_ANON, fd, 0);
        if (data == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
        unsigned *words = (unsigned*)data;
        // words now points to a shared memory segment
        return 0;
    }

In the previous example, a new readable and writable file is created by using the open system call with the O_RDWR and O_CREAT flags. The fd file descriptor is allocated and refers to the file. The fd is specified as an argument to the mmap system call and identifies the memory region. SHARED_SIZE specifies the size of the memory region to allocate and map into the caller's address space. PROT_READ | PROT_WRITE specifies that the caller has both read and write permissions to the memory. MAP_SHARED specifies that this is a shared memory region. MAP_ANON specifies that the operating system should allocate physical memory that is not backed by a file. The mmap system call returns a pointer to the starting virtual address at which the memory was mapped. The data in the memory region is initialized to zeros and the memory state is initialized to full.

The physical memory associated with a shared memory region is normally freed when the last process that was sharing the memory unmaps the memory from its address space. The memory is unmapped either by an explicit call to the munmap system call or automatically upon termination of the process.

The persist_remember function causes the remember daemon to create a mapping to the shared memory region. This preserves the shared memory region even after all other user processes have unmapped the region. The data is preserved only until the system reboots, at which time all data that was in the shared memory region is lost.
The persist_remember function records the file name and size of the memory region across reboots, and the shared memory region is automatically reallocated upon reboot; the data in the reallocated region is initialized to zeros and the state is initialized to full. For more information, see the rememd(8) man page.

Additionally, programs that use synchronization must add calls to the mta_lock_thread and mta_set_thread_datablocked_retry runtime functions, as shown in the following example.

    mta_lock_thread();                           // Set retry > 0
    mta_set_thread_datablocked_retry(MAXINT);    // Sets retries = INF

The mta_lock_thread function locks a thread to its stream so that the thread does not block and release the stream when it takes a retry limit exception. The mta_set_thread_datablocked_retry function sets the retry limit to the maximum value. Together, these two calls cause a thread to spin if a sync-qualified or future-qualified variable is not in the appropriate state for a given memory access, until the thread gains access to the shared data. Spinning is the act of checking the full-empty state repeatedly until it changes to the state that the memory operation requires.

Spinning is necessary when synchronization operations are performed between multiple separate processes. Threads that are blocked can only be unblocked by threads within the same process, because blocking and unblocking require access to runtime internal data structures that are only accessible within the process to which the thread belongs. For more information, see the mmap(2) man page.

4.2 Persisting Shared Memory

The remember daemon rememd retains information about shared memory so that the shared memory can be preserved beyond the life cycle of the processes that use it. Shared memory is allocated by calling mmap with the MAP_ANON and MAP_SHARED flags and a valid file descriptor.
When the rememd daemon is first started, it reads in all the records from its maps file and calls mmap to map the specified memory into its virtual address space. The daemon does not repopulate the memory; it only allocates it and retains a reference. rememd does not attempt to map the same memory segment twice. Once a segment is mapped, rememd increments an internal reference count on subsequent remember requests.

Asking rememd to forget a segment does not guarantee that the memory is reclaimed as free. If another program is retaining a reference to the memory, it remains allocated. If multiple requests were made to remember the same segment, rememd decrements its internal counter for each forget request until the counter is 0 (zero), at which point it calls munmap. By holding memory references, the rememd daemon allows the memory to outlast one or more processes that might want to use the memory.

Programs that use the functionality offered by rememd must link with the libremem library. When a libremem function is called, a remote procedure call is made to rememd using UNIX domain sockets. The path used to communicate with the daemon is specified in the configuration file found at /etc/rememd.conf or in the path specified by the environment variable REMEMD_CONFIG_PATH.

Use the following functions to call rememd from a program.

persist_remember
Causes the rememd daemon to call mmap to map the shared memory into its virtual address space and write a record of it to disk. If rememd has already mapped this segment, its reference count is incremented instead. This function returns 0 on success and an errno value on failure.

persist_mmap_size
Causes the rememd daemon to return the size of the shared memory mapped into its virtual address space. This function returns the size, in bytes, of the memory region on success, and 0 if the region is not found. When an error occurs, errno is set and -1 is returned.
persist_forget
Causes the rememd daemon to decrement the specified segment's reference count. If the reference count reaches zero, the rememd daemon calls munmap to unmap the shared memory from its virtual address space and removes the record of it from disk. This function returns 0 on success and an errno value on failure.

The following example shows how to persist memory in your program.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/mman.h>
    // Also include the libremem header that declares persist_remember,
    // persist_mmap_size, and persist_forget.

    const char *remember_path = "/tmp/my_data_handle";

    void run_computation(caddr_t addr, size_t len, int ret);

    int main(int argc, char **argv) {
        caddr_t mmap_addr = 0;
        size_t mmap_len = 4096;
        int fd = -1;

        // find out if memory is mapped
        ssize_t ret = persist_mmap_size(remember_path);
        if (-1 == ret) {
            // -1 indicates there was an error
            printf("Unexpected error from libremem: %s\n", strerror(errno));
            exit(1);
        } else if (0 == ret) {
            // 0 indicates the memory has not been mapped yet
            fd = open(remember_path, O_CREAT | O_RDWR, 0600);
            if (-1 == fd) {
                printf("Unexpected error opening remember_path: %s\n", strerror(errno));
                exit(1);
            }
            mmap_addr = mmap(0, mmap_len, PROT_WRITE, MAP_ANON | MAP_SHARED, fd, 0);
            if (MAP_FAILED == mmap_addr) {
                printf("Unexpected error calling mmap: %s\n", strerror(errno));
                close(fd);
                exit(1);
            }
            int remember_ret = persist_remember(remember_path, mmap_len);
            if (0 != remember_ret) {
                printf("Unexpected error calling persist_remember: %s\n",
                       strerror(remember_ret));
                close(fd);
                munmap(mmap_addr, mmap_len);
                exit(1);
            }
        } else {
            // if ret is not -1 or 0, then it's the length of the mapped segment
            fd = open(remember_path, O_CREAT | O_RDWR, 0600);
            if (-1 == fd) {
                printf("Unexpected error opening remember_path: %s\n", strerror(errno));
                exit(1);
            }
            mmap_len = ret;
            mmap_addr = mmap(0, mmap_len, PROT_WRITE, MAP_ANON | MAP_SHARED, fd, 0);
            if (MAP_FAILED == mmap_addr) {
                printf("Unexpected error calling mmap: %s\n", strerror(errno));
                close(fd);
                exit(1);
            }
        }
        run_computation(mmap_addr, mmap_len, ret);

        int forget_ret = persist_forget(remember_path, false);
        if (0 != forget_ret) {
            printf("Unexpected error calling persist_forget: %s\n", strerror(forget_ret));
        }
        if (0 != munmap(mmap_addr, mmap_len)) {
            printf("Unexpected error calling munmap: %s\n", strerror(errno));
        }
        if (0 != close(fd)) {
            printf("Unexpected error calling close: %s\n", strerror(errno));
        }
        return 0;
    }

Developing LUC Applications [5]

This chapter describes how to use the LUC library in your application. The following tasks are discussed:
• Constructing a client
• Constructing a server
• Making remote procedure calls

5.1 Programming Considerations for LUC Applications

• On the service (Linux) nodes, int is defined as 4 bytes. On the MTK compute nodes, int is defined as 8 bytes. To avoid potential issues, use types that have explicit sizes, for example int64_t.
• There is a limit of 256 MB of data that can be transferred in a single call. This applies to both input and output buffers.
• Linux and MTK have different native byte ordering: Linux is little endian (4 bytes) and MTK is big endian (8 bytes). LUC does not byte-swap or otherwise interpret the application's input and output data, so your application must perform byte swapping for data transfers between the server and client applications.
• The number of threads that are assigned to an object during a call to startService should be determined by the length of time that function calls made by that object are expected to take. Allocate enough threads so that an operation is never stalled while waiting for a thread to become available.
• The Linux version of the library can only honor a requestedPid value other than PTL_PID_ANY for the first endpoint in an application process.
The exception is that subsequent values of requestedPid may be honored if they are equal to the requestedPid of the first endpoint for a process.

5.2 Creating and Using a LUC Client

Use the following procedure to create a client object.

Procedure 3. Creating and using a LUC client object

1. Include the LUC header file. This header file includes all of the definitions required for both the client and server endpoints, including the LucEndpoint class definition, configuration variables, and external function prototype definitions.

2. Declare a pointer to a LucEndpoint object. A LucEndpoint is the abstract base class for both the Linux and MTK implementations of LUC endpoints and defines the user interface as virtual functions. Internal to the LUC implementation, there are two subclasses that are derived from the LucEndpoint abstract class: LucPortalsEndpoint is the Linux implementation, and LucFioEndpoint is the MTK implementation. These derived classes implement the virtual functions for either Linux/Portals or MTK/Fast I/O. From the user-application perspective, both derived classes present an identical interface.

3. Allocate the object by calling luc_allocate_endpoint(). This function takes a service type as an argument and allocates the correct LucEndpoint derived class object. When compiling for a Linux node, the Linux version of the object is returned. When compiling for an MTK node, the MTK version of the object is returned.

4. Activate the client endpoint by calling startService. This causes LUC to allocate a system-wide unique endpoint identifier and to allocate the underlying Fast I/O data streams. If an error is encountered while activating the service, LUC returns an error code.

5. Prepare the input and output buffers. The input buffer is provided as input to the remote server function. The output buffer contains the output data from the remote server function.
The buffers may reside in nearby or global distributed memory.

6. Invoke a remote function synchronously by calling remoteCallSync; provide the server endpoint identifier, the service type, the function index, and the input and output buffers and lengths. The outputDataLen parameter specifies the size of the data buffer provided by the caller. On return from the function, this parameter contains the actual size of the output data, which is less than or equal to the original value provided by the caller.

7. The service type and function index are application defined and can be any integer value. As illustrated in the example that follows, the function indices need not be consecutive. The service types describe the type of service provided by the object.

8. Wait for the remote function to complete and then process the result. The remoteCallSync method does not return until the remote function has completed or an error has occurred. The return value from remoteCallSync is either a LUC error code or the return value from the remote function.

9. Stop the service by calling stopService. This releases any nearby memory that was allocated by the endpoint, closes all previously opened Fast I/O data streams, and deactivates the object.

10. Delete the object. This invokes the virtual destructor for the derived object. If an endpoint object is deleted before calling stopService, the destructor automatically stops the service and deactivates the object.

Example 11.
LUC client code example

user_application_defs.h

    // sample header
    // function index definitions
    // note that the values do not have to be contiguous
    #define FUNC_QUERY1 1
    #define FUNC_QUERY3 3
    #define FUNC_QUERY8 8

    // service type definitions
    #define QUERY_MANAGER  1
    #define QUERY_ENGINE   2
    #define UPDATE_MANAGER 3
    #define UPDATE_ENGINE  4

user_application.cpp

    // sample client code
    // Include the LUC header and user_application_defs.h here.
    #include <stdlib.h>   // malloc, free

    const int INBUF_SIZE  = (1 * 1024 * 1024);   // 1 MB input data
    const int OUTBUF_SIZE = (2 * 1024 * 1024);   // 2 MB output data

    void client(luc_endpoint_id_t serverID) {
        LucEndpoint *clientEndpoint;
        luc_error_t result;
        char *outbuf = (char *)malloc(OUTBUF_SIZE);
        char *inbuf = (char *)malloc(INBUF_SIZE);
        size_t outDataLen = OUTBUF_SIZE;

        clientEndpoint = luc_allocate_endpoint(LUC_CLIENT_ONLY);
        result = clientEndpoint->startService();
        if (result != LUC_ERR_OK) {
            // process LUC error
            delete clientEndpoint;
            free(inbuf);
            free(outbuf);
            return;
        }

        result = clientEndpoint->remoteCallSync(serverID, QUERY_ENGINE, FUNC_QUERY1,
                                                inbuf, INBUF_SIZE, outbuf, &outDataLen);
        if (result == LUC_ERR_OK) {
            // The RPC was successful.
            // outDataLen contains the size of data returned in outbuf.
        } else if (result < LUC_ERR_MAX) {
            // Result contains a LUC error code.
        } else {
            // Result is the return value from the remote function.
        }

        clientEndpoint->stopService();
        delete clientEndpoint;
        free(inbuf);
        free(outbuf);
    }

5.3 Creating and Using a LUC Server

The server allocates and activates an endpoint object in a manner similar to that of the client. Object deactivation and deletion are also similar. The primary difference is the requirement for the server to register its remote functions. Use the following steps to create a server object.

Procedure 4. Creating and using a LUC server object

1. Include the LUC header file, as well as the application-defined header file.

2. Declare a pointer to a LucEndpoint object.

3. Allocate the object by calling luc_allocate_endpoint.

4.
Call registerRemoteCall to register each function that will be serviced by this endpoint. The first parameter is the service type, the second parameter is the function index, and the third parameter is the address of the function.

5. Activate the server endpoint by calling startService. The parameter is the number of LUC worker threads to start. The default is 1. The MTK version of the library ignores this value and creates one worker thread for each RPC. This method call causes LUC to allocate a number of nearby memory buffers for incoming requests and pre-post these receive buffers with the Fast I/O driver. The worker threads service the client requests as they come in.

6. Wait for a request to halt the service. There are many ways to accomplish this. In the following MTK example, the main application server thread waits to be told to halt the service by doing a synchronized read on an empty memory location. When the request is received, the application stops the service and deletes the endpoint. The application coordinates the notification to the server to shut down the service. For instance, if a serious application internal error occurs or an application shutdown request is received, the server must be told to halt by the application.

Example 12. LUC server code example

    // Include the LUC header and the application-defined header here.
    // haltService is an application-defined memory location used to signal
    // shutdown (see step 6).

    void server() {
        LucEndpoint *svrEndpoint;
        luc_error_t err;

        svrEndpoint = luc_allocate_endpoint(LUC_SERVER_ONLY);
        err = svrEndpoint->registerRemoteCall(QUERY_ENGINE, FUNC_QUERY1, query1);
        if (err != LUC_ERR_OK) {
            // Process LUC error code
            delete svrEndpoint;
            return;
        }
        // Register more remote calls as above ....
        err = svrEndpoint->startService();
        if (err != LUC_ERR_OK) {
            // process LUC error code
            delete svrEndpoint;
            return;
        }
        readfe(&haltService);   // MTK full-empty synchronization
        svrEndpoint->stopService();
        delete svrEndpoint;
        return;
    }

5.4 Communication Between LUC Objects

The following example shows how the application uses the client and server objects to communicate.

Example 13. Allocating and using LucEndpoint objects to communicate

    // Application-specific definitions
    #define QUERY_ENGINE_ALIVE_FCTN_ID       1
    #define QUERY_ENGINE_DATA_BOUNCE_FCTN_ID 2

    //
    // This asynchronous completion handler conforms to LUC_Completion_Handler
    //
    void ClientCompletionHandler(luc_endpoint_id_t destAddr,
                                 luc_service_type_t serviceType,
                                 int serviceFunctionIndex,
                                 void * userHandle,
                                 luc_error_t remoteLucError)
    {
        // In the example given, 'userHandle' will equal 0xf00
        return;
    }

    void LucClientOnlyUsageModel(void)
    {
        // First create an endpoint. This is used to make the remote calls.
        LucEndpoint *client = luc_allocate_endpoint(LUC_CLIENT_ONLY);

        // In order to issue the remote calls, we need to know where to send them.
        // The library uses the abstract 64 bit 'luc_endpoint_id_t' value, so the
        // client application has to get this value from the server by some other
        // means.
        luc_endpoint_id_t serverEndpointId;
        // This example assumes that 'serverEndpointId' is filled in by some
        // other means; environment variable, command line option, etc.

        // Enable the local endpoint. This will create worker threads and allocate
        // resources.
        luc_error_t lucError = client->startService();
        if (LUC_ERR_OK != lucError) {
            // error case
            delete client;
            return;
        }

        // Once the client object has been started successfully, the application
        // can use it to make synchronous and asynchronous calls.

        // A synchronous (blocking) call.
        // The application is responsible for setting serviceType and
        // serviceFunctionIndex to something meaningful (i.e.
        // something registered by the object at 'serverEndpointId').
        luc_service_type_t serviceType = LUC_ST_QueryEngine;
        int serviceFunctionIndex = QUERY_ENGINE_ALIVE_FCTN_ID;

        // This particular remote call passes no data.
        lucError = client->remoteCallSync(serverEndpointId,
                                          serviceType,
                                          serviceFunctionIndex,
                                          NULL,   // void *inputData,
                                          0,      // size_t inputDataLen,
                                          NULL,   // void *outputData,
                                          0);     // size_t *outputDataLen);
        if (lucError == LUC_ERR_OK) {
            // RPC was successful
        } else if (lucError < LUC_ERR_MAX) {
            // LUC library generated error code
        } else {
            // user remote function return value
        }

        //
        // An asynchronous (non-blocking) call.
        // Return data is not supported for asynchronous callers.
        //
        void *myMeaningfulHandle = (void *)0xf00;
        lucError = client->remoteCall(serverEndpointId,
                                      serviceType,
                                      serviceFunctionIndex,
                                      NULL,   // void *inputData,
                                      0,      // size_t inputDataLen,
                                      myMeaningfulHandle,
                                      ClientCompletionHandler);

        // The application can do other work while the remote call is in progress.
        // ClientCompletionHandler will fire in some other context at a later time.

        // When the application is finished with the endpoint object, it should
        // be stopped.
        lucError = client->stopService();
        // and destroyed.
        delete client;
        return;
    }

    // ServerQueryEngineAliveFunction:
    //   implements {LUC_ST_QueryEngine, QUERY_ENGINE_ALIVE_FCTN_ID}
    //   conforms to LUC_RPC_Function_InOut prototype
    //
    luc_error_t ServerQueryEngineAliveFunction(void * inData,
                                               u_int64_t inDataLen,
                                               void ** outData,
                                               u_int64_t * outDataLen,
                                               void ** completionArg,
                                               LUC_Mem_Avail_Completion * completionFctn,
                                               luc_endpoint_id_t callerEndpoint)
    {
        // This function is a simple case. It does not accept or return any data.
        if (*outData) *outData = NULL;
        if (*outDataLen) *outDataLen = 0;

        // Since this function is not returning data, it does not need to register
        // a memory-available (or dereference) handler.
        *completionFctn = NULL;
        return LUC_ERR_OK;   // successful return code
    }

    void LucServerOnlyUsageModel(void)
    {
        // First create a communication endpoint. This is used to accept calls from
        // remote clients.
        LucEndpoint *server = luc_allocate_endpoint(LUC_SERVER_ONLY);

        // These values correspond to values used by clients of this service.
        luc_service_type_t serviceType = LUC_ST_QueryEngine;
        int serviceFunctionIndex = QUERY_ENGINE_ALIVE_FCTN_ID;

        // The registration routine simply records the desired function in a
        // table so that future client requests know which function to fire.
        luc_error_t lucError = server->registerRemoteCall(serviceType,
                                                          serviceFunctionIndex,
                                                          ServerQueryEngineAliveFunction);

        // The LucEndpoint object must be started before it can accept remote
        // function call requests.
        // This example creates two server worker threads; one to do main processing
        // and one to execute the ServerQueryEngineAliveFunction when it's called.
        uint_t totalThreadCount = 2;
        // This server doesn't need a specific Portals PID value.
        uint_t requestedPid = PTL_PID_ANY;
        lucError = server->startService(totalThreadCount, requestedPid);

        // If the server wants to report its endpoint id, via printf or socket-based
        // communication to some other application, it can get its endpoint ID
        // with the following function.
        luc_endpoint_id_t myEndpointId = server->getMyEndpointId();

        // A proper service can go do other work here, wait for a termination
        // signal, or exit this thread (as long as the server object isn't
        // destroyed).

        // The endpoint object will accept remote function requests
        // until stopped at some later time with stopService.
        lucError = server->stopService();
        delete server;
        return;
    }

5.5 LUC Client/Server Example

This example implements a server-side sum of values provided by the client, with the sum returned to the client.
The program should be run once using the following command to start the server:

    % exluc -s

Then the client can be run multiple times using the following command:

    % exluc -c id

where id is the server endpoint ID printed to the command line when the server starts.

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <arpa/inet.h>   // htonl/ntohl byte swapping
    // Also include the LUC header here.

    // The service type is an application-specific major service id.
    // It identifies the general type of service requested by the client.
    // One server may implement one or more service types.
    // For this example, one service type is defined.
    int svc_type = 0;

    // The function index is an application-specific minor service id.
    // It identifies a specific server function out of the functions defined
    // by the server within one of its supported service types.
    // Each service type may implement one or more functions.
    // For this example, one function within the svc_type service type is defined.
    int reduce_func_idx = 0;

    #define NREDUCE 10   // number of values to be summed

    // Opteron uses little-endian byte order and XMT uses big-endian byte order.
    // When an Opteron client uses an XMT server, byte swapping is required
    // to convert the data between the two systems.
    // This example uses network byte order (big-endian) for all LUC data transfers,
    // and converts to host byte order before using LUC data.
    #if defined(__MTA__) || defined(NO_BYTE_SWAP)
    // Host byte order is the same as network byte order on XMT,
    // so no conversion is necessary.
    #define NetworkToHost(b,l)
    #define HostToNetwork(b,l)
    #else
    // Byte swap to convert between host and network byte ordering.
    void ByteSwap(void *buf, size_t len) {
        char *c = (char *) buf;
        // Reverse the bytes of each 8-byte word; all data exchanged in this
        // example are doubles.
        for (size_t i = 0; i + 8 <= len; i += 8) {
            for (int j = 0; j < 4; j++) {
                char t = c[i + j];
                c[i + j] = c[i + 7 - j];
                c[i + 7 - j] = t;
            }
        }
    }
    #define NetworkToHost(b,l) ByteSwap((b),(l))
    #define HostToNetwork(b,l) ByteSwap((b),(l))
    #endif

    // The LUC client runs on the XMT login node, and acts as
    // the application user interface.
    // Return value is 0 for success, 1 for error.
    int client(luc_endpoint_id_t serverID) {
        double input[NREDUCE];                        // input data
        double output;                                // result
        size_t in_size = sizeof(double) * NREDUCE;    // input size in bytes
        size_t out_size = sizeof(double);             // output size in bytes
        luc_error_t err;                              // result code from LUC calls

        // Initialize the input data.
        for (int i = 0; i < NREDUCE; i++)
            input[i] = i;

        // Create the LUC client endpoint.
        LucEndpoint *clientEndpoint = luc_allocate_endpoint(LUC_CLIENT_ONLY);

        // Initialize the endpoint (connect to the server).
        err = clientEndpoint->startService();
        if (err != LUC_ERR_OK) {
            fprintf(stderr, "client: LUC startService error %d\n", err);
            delete clientEndpoint;   // free memory
            return 1;                // error
        }

        HostToNetwork(input, in_size);   // convert data to network byte order

        // Send the request to the server and wait for a response.
        // In this example, the array of values to be summed is sent,
        // and the sum is returned as the result.
        err = clientEndpoint->remoteCallSync(serverID, svc_type, reduce_func_idx,
                                             input, in_size, &output, &out_size);
        if (err != LUC_ERR_OK) {
            // err contains a LUC error code.
            fprintf(stderr, "client: LUC remoteCallSync error %d\n", err);
        } else {
            // out_size contains the size of data returned in output
            NetworkToHost(&output, out_size);   // convert data to host byte order
            printf("The sum of the %d values is %lf\n", NREDUCE, output);
        }

        clientEndpoint->stopService();   // disconnect from the server
        delete clientEndpoint;           // free memory
        return (err != LUC_ERR_OK) ? 1 : 0;
    }

    // Reduction service.
    // This routine is called by the LUC server library
    // when a client request of type
    // (svc_type,reduce_func_idx) is received.
    luc_error_t reduce(void *inPtr, u_int64_t inDataLen,
                       void **outPtr, u_int64_t *outDataLen,
                       void **completionArg,
                       LUC_Mem_Avail_Completion *completionFctn,
                       luc_endpoint_id_t callerEndpoint)
    {
        double *input = (double *) inPtr;      // input data
        double *output = NULL;
        int n = inDataLen / sizeof(double);    // number of values to sum

        // Default (error) return will be no output data
        *outPtr = NULL;
        *completionArg = NULL;
        *completionFctn = NULL;

        // Allocate space for the return data
        output = (double *)malloc(sizeof(double));
        if (NULL == output) {
            return LUC_ERR_RESOURCE_FAILURE;   // or use a custom code
        }

        NetworkToHost(input, inDataLen);   // convert data to host byte order

        // Perform the reduction.
        double sum = 0;
        for (int i = 0; i < n; i++)
            sum += input[i];
        *output = sum;                            // set result value
        HostToNetwork(output, sizeof(double));    // convert result to network byte order
        *outDataLen = sizeof(double);             // set result size
        *outPtr = (void *)output;                 // set result / output pointer

        // Tell LUC to call 'free' when it is done with the output data.
        // Pass the 'output' pointer to free()
        *completionArg = output;
        *completionFctn = free;
        return LUC_ERR_OK;
    }

    // The LUC server can run on the XMT login node or in the compute partition.
    // Return value is 0 for success, 1 for error.
    int server(int threadCount) {
        luc_error_t err;   // result code from LUC calls

        // Create the LUC server endpoint.
        LucEndpoint *svrEndpoint = luc_allocate_endpoint(LUC_SERVER_ONLY);

        // Register routines which implement the services.
        err = svrEndpoint->registerRemoteCall(svc_type, reduce_func_idx, reduce);
        if (err != LUC_ERR_OK) {
            fprintf(stderr, "server: LUC registerRemoteCall error %d\n", err);
            delete svrEndpoint;
            return 1;   // error
        }

        // Begin offering services (begin listening for requests).
    err = svrEndpoint->startService(threadCount);
    if (err != LUC_ERR_OK) {
        fprintf(stderr,"server: LUC startService error %d\n",err);
        delete svrEndpoint;
        return 1;  // error
    }

    // Print out the endpoint id for the server. This value is a required
    // input for the client.
    fprintf(stderr,"server: Server ready. My endpoint id is %ld\n",
            svrEndpoint->getMyEndpointId());

    // At this point, the main server thread waits while requests
    // to the server are handled by other threads.
    // A "terminate server" client request can be defined by the
    // application to handle server shutdown, or else the server can
    // simply be killed when the server is no longer needed.
    // For this example, the server waits until it is killed.
    getc(stdin);

    // The server has been requested to shut down.
    svrEndpoint->stopService();  // stop listening for requests
    delete svrEndpoint;          // free memory
    return 0;
}

// The main program either calls the server routine or the client
// routine. The server (-s) should be started first, then the
// client (-c id) can be run multiple times.
// Shut down by killing the server process.
int main(int argc, char **argv)
{
    luc_endpoint_id_t id;
    int i;

    while ((i = getopt(argc,argv,"c:s")) != EOF) {
        switch (i) {
        case 'c':
            id = strtoul(optarg, NULL, 0);
            return client(id);  // make a request to server with this endpoint id
        case 's':
            return server(1);   // start server with 1 request-processing thread
        }
    }

    // If no valid options were given, print the program usage message.
    fprintf(stderr,"Usage: exluc -c id | -s\n");
    fprintf(stderr,"-c id Run as a client with the given endpoint id.\n");
    fprintf(stderr,"-s Run as a server, printing the endpoint id.\n");
    return 1;
}

5.6 Fast I/O Memory Usage

The MTK Fast I/O Library performs all data transfer operations through nearby memory. Nearby memory is memory on the same node as the Threadstorm processor where the LUC endpoint was started.
The library transfers user data into and out of nearby memory buffers automatically. Use configuration variables to control the amount of nearby memory used by the library. The MTK Fast I/O Library uses one or two regions of nearby memory for each local endpoint as I/O buffers. The library requires one region for all small allocations and allows for an optional region for large allocations. The small region is used for core RPC data structures that are sent over the high speed network. Small data transfer buffers may use the small region as well. The optional large memory region is used for large transfer requests and many concurrent smaller requests. The large region may be sized to support one very large RPC request or several smaller requests. To control the size of the small memory region use the configuration variable LUC_CONFIG_MAX_SMALL_NEARMEM_SIZE. Legal values range from 1 MB (1,048,576) to 256 MB (268,435,456), inclusive, in power-of-two increments. The size of the largest allowable request on this memory region may be specified with the LUC_CONFIG_MAX_SMALL_MEM_REQUEST variable. Legal values range from 64 KB (65,536) to one half of the current small memory region size, inclusive, in power-of-two increments. To control the size of the large memory region use the configuration variable LUC_CONFIG_MAX_LARGE_NEARMEM_SIZE. Legal values range from 1 MB (1,048,576) to 2 GB (2,147,483,648), inclusive, in power-of-two increments. While the library allows for a very large nearby memory region, the system may not be configured with enough nearby memory to support a maximum size nearby memory region. The size of the largest allowable request on this memory region may be specified with the LUC_CONFIG_MAX_LARGE_MEM_REQUEST variable. Legal values range from 1 MB (1,048,576) to the current large memory region size or 256 MB, whichever is less. The maximum request size must be an integral power-of-two. To disable the large memory region specify a requested size of zero. 
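
The legal ranges described above can be checked before the values are handed to the library. The following is a minimal sketch of such a check, not part of the LUC API: the helper names (isPowerOfTwo, validSmallRegionSize, and so on) are hypothetical and introduced here only for illustration, and the limits are the ones stated in this section.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers for illustration; they are not part of the LUC API.
const std::uint64_t KB = 1024ULL;
const std::uint64_t MB = 1024ULL * KB;
const std::uint64_t GB = 1024ULL * MB;

// True when v is a nonzero power of two.
bool isPowerOfTwo(std::uint64_t v) { return v != 0 && (v & (v - 1)) == 0; }

// Small region size: 1 MB to 256 MB inclusive, power of two.
bool validSmallRegionSize(std::uint64_t size) {
    return isPowerOfTwo(size) && size >= 1 * MB && size <= 256 * MB;
}

// Largest request on the small region: 64 KB to half of the region size.
bool validSmallMaxRequest(std::uint64_t request, std::uint64_t regionSize) {
    return isPowerOfTwo(request) && request >= 64 * KB &&
           request <= regionSize / 2;
}

// Large region size: zero disables the region; otherwise 1 MB to 2 GB.
bool validLargeRegionSize(std::uint64_t size) {
    if (size == 0) return true;
    return isPowerOfTwo(size) && size >= 1 * MB && size <= 2 * GB;
}

// Largest request on the large region: 1 MB to the region size or 256 MB,
// whichever is less.
bool validLargeMaxRequest(std::uint64_t request, std::uint64_t regionSize) {
    std::uint64_t cap = regionSize < 256 * MB ? regionSize : 256 * MB;
    return isPowerOfTwo(request) && request >= 1 * MB && request <= cap;
}
```

For example, validSmallMaxRequest(65536, 1048576) accepts a 64 KB limit on a 1 MB small region, while a 1 MB limit on the same region is rejected because it exceeds half of the region size.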
The memory region variables are initialized from the global variables when the LUC endpoint object is created. Changes to the global variables are propagated to new endpoint objects, not to objects that already exist. An endpoint's memory configuration variables may be changed by using the LucEndpoint::setConfigValue() method until the endpoint is started. Once the endpoint starts, the sizes of the nearby memory regions and the maximum transfer sizes are locked in and may not be modified until you stop the endpoint. While the endpoint is running, attempts to change these configuration variables by using LucEndpoint::setConfigValue() fail with LUC_ERR_INVALID_STATE. If you change the global configuration variables, the changes do not propagate to started endpoints. Attempts to set invalid memory sizes or maximum request sizes fail with LUC_ERR_BAD_PARAMETER.

Managing Lustre I/O with the Snapshot Library [6]

6.1 About the Snapshot Library

The Cray XMT snapshot library provides a high speed bulk data transfer facility that moves data between memory regions within an MTK application and files hosted on the XMT Linux service partition. The primary use of the snapshot library is to load and save large data sets that are stored on a Lustre file system. For example, an application might use the snapshot library to load a large data set at the beginning of a run, process the data, then use the snapshot library to save the processed data in a file at the end of the run. An application might also use the snapshot library to save intermediate copies of the processed data during the course of a run.

The snapshot library uses the Fast I/O (FIO) mechanism on the compute partition to transfer data, in parallel, to and from files on the service partition. It does this through instances of a helper program called fsworker, which provide file system access on login nodes. Multiple instances of fsworker can be used in parallel to provide higher throughput.
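
To make the idea of a parallel transfer through several fsworker instances concrete, the following sketch divides a buffer into one contiguous piece per worker. It is only an illustration under stated assumptions: this guide does not describe how the snapshot library actually partitions data, and the Chunk type and the splitAmongWorkers function are hypothetical names that do not exist in the library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One contiguous piece of the buffer handled by one worker (hypothetical type).
struct Chunk {
    std::size_t offset;  // byte offset into the buffer and the file
    std::size_t length;  // number of bytes in this piece
};

// Split totalBytes into workerCount contiguous chunks whose sizes differ
// by at most one byte. Returns an empty vector when workerCount is zero.
std::vector<Chunk> splitAmongWorkers(std::size_t totalBytes,
                                     std::size_t workerCount) {
    std::vector<Chunk> chunks;
    if (workerCount == 0) return chunks;
    std::size_t base = totalBytes / workerCount;
    std::size_t extra = totalBytes % workerCount;  // first 'extra' chunks get one more byte
    std::size_t offset = 0;
    for (std::size_t w = 0; w < workerCount; ++w) {
        std::size_t length = base + (w < extra ? 1 : 0);
        Chunk c = { offset, length };
        chunks.push_back(c);
        offset += length;
    }
    return chunks;
}
```

Because every chunk covers a distinct byte range, each worker in such a scheme could read or write its own part of the file independently of the others, which is the property that lets throughput grow with the number of workers.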
This figure shows the most common data communication paths between an application using the snapshot library and a file on the service partition. The data moves, in four distinct stages, between a global memory buffer in the application and a file on a Lustre file system hosted by the service partition.

Figure 1. Snapshot Library Data Paths
[Figure: the application data buffer in global memory on the Threadstorm compute nodes (snapshot client); FIO connects the compute nodes to fsworker (FSW) instances in the Linux service partition; Portals connects the fsworkers to the Lustre OSS nodes; Fibre Channel (FC) connects the OSS nodes to the Lustre file system storage.]

The easiest way to understand this is to imagine data going to a file from the application. In this case, the data is copied by each compute node into the FIO transport and sent to its corresponding fsworker on a login node in the Linux service partition. Each fsworker then uses Linux system calls to write data into the Lustre file, which results in the data moving across the Portals transport from the login node to one or more Lustre OSS nodes. From there, the data moves through Fibre Channel (FC) to the actual storage device. Moving data from a file to the application simply reverses the order of the stages and the direction of the data flow through each stage, ultimately resulting in data being copied from compute nodes into the application's global memory buffer.

6.2 The Snapshot Library Interface

Note: Effective with Cray XMT version 2.0, the snap_* functions are replaced by dslr_* equivalents. The snap_* functions are deprecated and will be removed in a future release.

The snapshot library interface consists of these functions:

dslr_snapshot
    Copies data in parallel from a buffer in the application to a file on the service partition.
dslr_restore
    Copies data in parallel from a file on the service partition to a buffer in the application.
dslr_pread
    Allows the application to specify an offset into a file from which to read data. Does not move data in parallel.
dslr_pwrite
    Allows the application to specify an offset into a file at which to write data. Does not move data in parallel.
dslr_stat
    Allows the application to obtain file status from a file, similar to the stat function.
dslr_truncate
    Truncates a file to a specified length.

For more information on any of these functions, see the associated man page.

For large data transfers starting at the beginning of a file, the best functions to use are dslr_snapshot and dslr_restore, because they are able to transfer data in parallel to achieve high throughput. To store data, the application calls dslr_snapshot, specifying the buffer to be copied, the length of the data, and the name of the file receiving the data. To read back (restore) data from the file into application memory, the application calls dslr_restore, specifying the buffer receiving the data, the length of the data to read, and the name of the file providing the data. Because this name is used by all instances of fsworker to open and read or write the file, the file name should be an absolute path name to the location of the file on the service partition. A relative path name could be ambiguous or meaningless to a particular fsworker.

A typical application might use dslr_restore and dslr_snapshot in the following manner:

1. Start up and allocate a large buffer to hold a data set.
2. Call dslr_restore, specifying the name of the file providing the data, the buffer allocated in step 1, and the length of that buffer.
3. Process and change the data set.
4. Call dslr_snapshot to store the data set back to the file (or to a new modified data file).
5. If necessary, repeat steps 3 and 4, using the snapshots as a way to preserve forward progress.

The dslr_pwrite and dslr_pread functions are provided for transferring smaller amounts of data between a buffer and arbitrary locations in a file.
To write data to a file, the application calls dslr_pwrite, specifying the endpoint ID of a single fsworker, the name of the file, the offset of the data in the file, a pointer to a buffer from which to take the data, and the length of the data to be written. To read data from a file, the application calls dslr_pread, specifying the endpoint ID of a single fsworker, the name of the file, the offset of the data in the file, a pointer to a buffer into which to put the data, and the length of the data to be read. Again, absolute path names for files are strongly recommended.

A typical application might use dslr_pread and dslr_pwrite in the following manner:

1. Start up and allocate a small buffer to be initialized from a file.
2. Call dslr_pread, specifying the name of the file providing the data, the offset of the data in the file, a pointer to the buffer allocated in step 1, and the length of the data.
3. Process and change the data.
4. Call dslr_pwrite to store the data back to the file (or to a new modified data file).
5. Repeat steps 3 and 4 as often as needed, using snapshots as a way to preserve forward progress in case of failure or for the sake of sharing the system.

It is possible to mix uses of dslr_snapshot/dslr_restore and uses of dslr_pwrite/dslr_pread as needed in an application.

Caution: The snapshot library functions can only be used one at a time; they cannot be used in parallel. Any attempt to use snapshot library functions in parallel will eventually result in corruption of the snapshot data and possible uncontrolled failure of the snapshot library or of one or more instances of fsworker.

6.3 Maintaining File System and I/O Parallelism

The snapshot library is intended primarily for saving and retrieving large data sets on platforms with a Lustre file system.
Lustre supports parallel access and is highly tunable, allowing users and administrators to set many options, including file stripe widths and block sizes. With proper provisioning and tuning, Lustre can sustain many gigabytes per second of throughput. Because the performance of the underlying Lustre configuration bounds the throughput of most snapshot library operations, careful Lustre tuning is essential for optimal snapshot performance.

A detailed discussion of Lustre provisioning, configuration, and tuning is beyond the scope of this document. One rule of thumb, however, makes a good starting point when using dslr_snapshot and dslr_restore in single-file mode with multiple fsworkers: setting the block size to 32 megabytes and the file stripe width to all object storage server (OSS) nodes (-1) generally yields good results. Typically, for multi-file mode the directory is striped to a single object storage target (OST). The lfs command allows a user to set these parameters on a per-directory basis. See the setstripe/getstripe documentation in the lfs man page for more information. Contact your system administrator for more detailed information on tuning Lustre to the requirements of a particular application.

If the underlying file system is naturally serial (NFS, for example), the performance of the snapshot library is constrained by the serial performance of the file system and by any contention introduced by trying to use the file system in parallel. Again, the throughput of the snapshot library is bounded by the file system performance, so when using a serial file system a single fsworker provides the best throughput for the snapshot library.

Note that fsworkers are not resilient. If a transaction fails, all involved fsworkers must be terminated and restarted. If the file system is full, a snapshot function may return success even though the file was not written, or was only partially written.

6.4 Examples

Example 14.
Using dslr_snapshot and dslr_restore to save and restore data in a file.

Note that this example waits for the call to dslr_snapshot to complete before calling dslr_restore. While this is logical in this example, it is also crucial for correct operation. (See the caution about using snapshot library functions in parallel above.)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
// Also include the snapshot library header that declares the dslr_* functions.

const size_t DEFAULT_BUFFER_SIZE = 1024 * 1024 * 1024;
const char DEFAULT_FILENAME[] = "/mnt/lustre/myusername/snapshot.data";

int main(int argc, char *argv[])
{
    void *testBuffer = NULL;
    int64_t err;
    int64_t snapError = 0;

    // Allocate a large buffer to be transferred.
    if (NULL == (testBuffer = malloc(DEFAULT_BUFFER_SIZE))) {
        fprintf(stderr,"Failed to malloc %d byte snapshot buffer.\n",
                DEFAULT_BUFFER_SIZE);
        return -1;
    }
    memset(testBuffer, 'a', DEFAULT_BUFFER_SIZE);

    // Snapshot the testBuffer to disk
    // All file system workers must be able to access the specified path.
    err = dslr_snapshot((char *)DEFAULT_FILENAME, testBuffer,
                        DEFAULT_BUFFER_SIZE, &snapError);
    if (dslr_ERR_OK != err) {
        fprintf(stderr,"Failed to snapshot the dataset. Error %d.\n",err);
        free(testBuffer);
        return -1;
    }

    memset(testBuffer, 0, DEFAULT_BUFFER_SIZE);

    // Restore a snapshot dataset from disk back into memory.
    err = dslr_restore((char *)DEFAULT_FILENAME, testBuffer,
                       DEFAULT_BUFFER_SIZE, &snapError);
    if (dslr_ERR_OK != err) {
        fprintf(stderr,"Failed to restore the dataset. Error %d.\n",err);
        free(testBuffer);
        return -1;
    }

    // At this point, the testBuffer should be full of 'a'
    free(testBuffer);
    return 0;
}

Example 15. Using dslr_pwrite to write data to a file and dslr_pread to read back the data.

Note that the calls to dslr_pwrite and dslr_pread accept the value dslr_ANY_SW to specify the endpoint ID of the fsworker, allowing libsnapshot to use any registered endpoint.
Therefore, the fsworkerID is automatically set to dslr_ANY_SW rather than requiring the user to enter the endpoint either manually or through the environment.

Also note that, while the function call interface appears to invite parallel use of dslr_pwrite and dslr_pread, the functions cannot be used in parallel. Concurrent calls to these or any other snapshot library functions result in the problems described in the caution statement above. Regardless of how the endpoint is set, only one thread of one instance of fsworker is applied to any given call to dslr_pwrite or dslr_pread.

These functions are useful for transferring small quantities of data to or from arbitrary locations in files but, because they are unable to benefit from parallelism, they are not useful for bulk data transfer. You should not expect throughput greater than 100 MB/second when using dslr_pwrite or dslr_pread.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <sys/types.h>
// Also include the snapshot library header that declares the dslr_* functions.

const size_t DEFAULT_BUFFER_SIZE = 1024 * 1024;  // Relatively short buffer
const char DEFAULT_FILENAME[] = "/mnt/lustre/myusername/snapshot.data";

int main(int argc, char *argv[])
{
    void *testBuffer = NULL;
    int64_t err;
    int64_t snapError = 0;
    uint64_t fsworkerID = dslr_ANY_SW;
    off_t fileOffset = 0;
    int rc = 0;

    // Allocate a small buffer to be transferred.
    if (NULL == (testBuffer = malloc(DEFAULT_BUFFER_SIZE))) {
        fprintf(stderr,"Failed to malloc %d byte snapshot buffer.\n",
                DEFAULT_BUFFER_SIZE);
        return -1;
    }
    memset(testBuffer, 'a', DEFAULT_BUFFER_SIZE);

    // pwrite the testBuffer to disk
    fileOffset = 0;
    err = dslr_pwrite((char *)DEFAULT_FILENAME, fsworkerID, testBuffer,
                      DEFAULT_BUFFER_SIZE, fileOffset, &snapError);
    if (dslr_ERR_OK != err) {
        fprintf(stderr,"Failed to pwrite the dataset. Error %d.\n",err);
        free(testBuffer);
        return -1;
    }

    memset(testBuffer, 0, DEFAULT_BUFFER_SIZE);

    // pread the testBuffer from disk.
    err = dslr_pread((char *)DEFAULT_FILENAME, fsworkerID, testBuffer,
                     DEFAULT_BUFFER_SIZE, fileOffset, &snapError);
    if (dslr_ERR_OK != err) {
        fprintf(stderr,"Failed to pread the dataset. Error %d.\n",err);
        free(testBuffer);
        return -1;
    }

    // At this point, the testBuffer should be full of 'a'
    free(testBuffer);
    return 0;
}

6.5 Managing File I/O on File Systems Other Than Lustre

Using the snapshot library to read and write files on a file system that does not support high performance parallel I/O, such as NFS, can overload the underlying file system with data requests and transfers. Cray does not support this use of the snapshot library on any system with more than a single login node, as even file transfers of a few hundred MB can cause unacceptable network congestion.

The standard operating system I/O functions open(2), close(2), read(2), and write(2) are available for reading and writing files on NFS file systems that are cross-mounted to the compute partition. Files larger than 1 GB should always be read or written using the dslr_* functions to a high performance parallel file system, such as Lustre.

Compiler Overview [7]

This chapter provides an overview of the Cray XMT compilers. You need to understand these concepts before you compile your program.

The Cray XMT platform includes Cray XMT compilers for C and C++ applications. These compilers optimize programs to improve performance. Their features include:

Debugging support
    The Cray XMT compilers support multiple levels of debugging. Each level receives some degree of optimization, but the level of optimization decreases as the level of debugging support increases. For example, the compilation process suppresses parallelization of loops at the highest debugging level.
Optimization
    The Cray XMT compilers optimize parallelization, loop restructuring, and software pipelining, in addition to the classical scalar optimizations.
Inlining
    The Cray XMT compilers support automatic and programmer-directed inlining within source files and among multiple source files. In addition, the compilers support inlining from separately compiled libraries. For a discussion of inlining, see Inlining Functions on page 84.
Incremental recompiling and relinking
    The Cray XMT compiler detects unmodified functions and avoids recompiling them, even when other functions in the same file have been changed. The Cray XMT compiler uses incremental linking to avoid relinking an entire executable when some, but not all, of the functions have been modified.

Each compiler is organized as a language-dependent front end. Both compilers use a common set of backend subprograms for translating, optimizing, and linking. The Cray XMT C compiler supports ANSI X3.159-1989 standard C. The Cray XMT C++ compiler supports the draft ISO/IEC 14882 C++ standard.

Because of the commonality between the Cray XMT C and C++ compilers, they are referred to collectively as the compiler in the remainder of this chapter.

7.1 The Compilation Process

There are two major phases of building a program executable from a number of source files.

Compilation
    The compiler creates object files by invoking subprograms that translate the source files and optimize functions in the program. The compiler starts by invoking the front end. When the front end finishes, the compiler invokes the translator, which is the subprogram that optimizes and parallelizes code, and generates object files.
Linking
    The compiler creates an executable program by invoking subprograms that create links between object files created during the compilation process and any associated libraries. Links can be created between two or more object files, in any combination, including the startup file, any specified object files or compile results, and user-created or standard libraries.

For a traditional UNIX compiler, you use the cc -c file1.c command to translate the source file file1.c into an object file, which, by default, is called file1.o. You then link a set of object files using the cc file1.o file2.o command. This creates an executable called a.out.

Unfortunately, this approach to compilation decreases the efficiency of the resultant executable program because each file of functions is first compiled independently and then linked together in a separate process. Using this approach, information that the compiler uses to optimize functions during the first compilation is not available during the linking phase when the object files are combined to form an executable. As a result, the compiler cannot perform some optimizations between object files that might seem simple to a programmer.

In response to this problem, the Cray XMT compiler supports a compilation mode that enables information to be captured from individual modules and used when compiling multi-module programs. In this mode, each function is compiled in the context of a complete program, and the compiler may use facts about that context to optimize the translation of the function. The compiler retains this information so that when you modify your program's functions in the future, the compiler only needs to recompile the modified functions, resulting in a shorter recompile time. This mode is called whole-program compilation. The Cray XMT compiler also supports a mode for the traditional UNIX style of compiling called separate-module compilation.

The compilation processes for these modes differ in the following ways:

Whole-program compilation
    This is the preferred method for compiling applications. In whole-program compilation, the compilation phase is made up of several sub-phases. The compiler first parses (partially compiles) each source file.
    During this phase, the compiler gathers information about every module in the program and saves it to the program library. The next phase is the translation phase. During this phase, the program is translated and optimized. The compiler optimizes each function in the program using information from within that function's module or other modules, including linked libraries, that the compiler gathered earlier. Finally, in the linking phase, the compiler links separate modules into a program executable. Information about all modules is stored, and passed between phases, in the program library.
Separate-module compilation
    The compiler creates a separate object file for each source file and optimizes the functions within each source file using information about functions within that file. Then, the separate modules are linked to create a program executable.

Whole-program compilation generally produces more highly optimized code than separate-module compilation. You can compile a program using one mode or the other, or a combination of the two.

The following diagram shows the object files that the compiler creates when compiling the same arnoldi.cc and blas.cc files in different modes.

Figure 2. Comparison of Whole-program and Separate-module Modes
[Figure: two panels. Whole-program Compilation: skinny .o files arnoldi.o and blas.o; the program library test.pl, labeled with parsed source code, call graph, object code, and debugger information; and the executable test, labeled with executable code. Separate-module Compilation: fat .o files arnoldi.o and blas.o, each labeled with parsed source code, partial call graph, object code, and debugger information; the program library test.pl, labeled with debugger information; and the executable test, labeled with executable code.]

In whole-program mode, all the traditional object information for a program is contained in a single program library file. The program library has a .pl filename extension. The compilation process also produces an object file with a .o extension for each source file.
This file is used as a time stamp to drive build processes. Each .o file corresponds to a module contained in the program library. The object information, or modules, for a program's source files is packaged together. This enables the compiler to optimize each function within the context of the entire program.

In separate-module mode, the .o files are true object files. The compiler optimizes each object file, or module, separately from the others. The link step produces a program library, although this program library primarily contains information that directs the debugger to various object files.

Because of the relative sizes of the .o files in the two compilation modes, the qualifier skinny refers to whole-program mode and its products (such as the .o files) and the qualifier fat refers to separate-module mode and its products.

During the compilation process, the compiler creates the following files:

a.out
    The executable file.
a.out.pl
    The program library.
LOCK.a.out.pl
    The temporary lock file. The lock file prevents other compilers from accessing a program library when it is already in use. The compiler removes this file after use, unless the compiler terminates before completion.
*.o
    Relocatable object files.

7.1.1 File Types Accepted by the Compiler

The compiler accepts files that use the following extensions:

.c
    C file when invoked with cc, C++ file when invoked with c++.
.cc, .cpp
    C++ file.
.o
    In whole-program compilation, a time stamp file that does not need to be compiled but participates in any link step. Also referred to as a skinny .o file. In separate-module compilation, a true object file. Also referred to as a fat .o file.
.pl
    Program library. Used to support incremental recompiling and debugging. In whole-program compilation, used to support inter-module analysis.
.a
    Archive or library file.

File prefixes used in the compilation process include the following:

LOCK
    Temporary lock file used to prevent concurrent updates to the associated program library.

7.2 Invoking the Compiler

You can only use the Cray XMT compiler when the Cray XMT Programming Environment (mta-pe) module is loaded. The commands to use to invoke the compiler are cc for a C program and c++ for a C++ program.

You can control the operation of the compiler by setting various options when running the compiler command. The compiler uses driver options, language options, parallelization options, and debugging options.

The driver options control how the compiler invokes subprograms. The compiler mode is set using driver options. The driver options that you use most often are the following:

-c filename
    Compiles a specified source file.
-o filename
    Links files and creates an executable.
-pl filename
    Places object code and other data generated by the compiler into a program library file. This option is used for whole-program compilation.

For example, if you specify both the -c and -pl driver options, the compiler compiles the program in whole-program mode, but it does not link the files into an executable. For more information, see Setting the Compiler Mode on page 80.

The language options control how the front end processes information. For example, the -E option indicates that the compiler should preprocess source files but not compile them. The -no_float_opt option prevents floating-point optimization.

The parallelization options control parallelism in the program. For example, the -par1 option compiles a program so that it runs in parallel on a single processor. For more information, see Optimizing Parallelization on page 85.

The debugging options control how the debugger works. For more information, see Setting Debugger Options during Compilation on page 88.

Each compiler uses the same set of command-line options.
For a complete list of command-line options, see the cc(1) or c++(1) man pages.

7.3 Setting the Compiler Mode

To set the compiler mode to whole-program mode, run the cc or c++ command with the -pl option. This option builds a program library.

The following examples show how to use the compiler options for various compiler tasks using the whole-program and separate-module modes.

Whole-program:
    c++ -c a.cc -pl prog.pl            (parses a.cc)
    c++ -c b.cc -pl prog.pl            (parses b.cc)
    c++ -pl prog.pl -o prog a.o b.o    (translates a.o, b.o; links prog)

Or, as a shortcut:
    c++ a.cc b.cc -o prog              (compiles a.cc, b.cc; links prog, and creates prog.pl)

Separate-module:
    c++ -c a.cc                        (parses and translates a.cc)
    c++ -c b.cc                        (parses and translates b.cc)
    c++ -o prog a.o b.o                (links prog)

7.3.1 Whole-program Mode

With whole-program compilation, the compiler has access to information about all functions in the program while optimizing each function. This information provides the compiler with the context for how the larger program uses each function.

For example, when you use the c++ command to compile and link the files jacobian.cc and blas.cc in a single step, the compiler has access to the entire program during all but the initial compilation phases, and compiles the program in whole-program mode. To do this, use the following command:

    c++ jacobian.cc blas.cc

The previous command produces the skinny .o files jacobian.o and blas.o, the executable a.out, and the program library a.out.pl.

Whole-program compilation enables inlining among files. The compiler can inline functions in blas into call sites in jacobian, and vice versa. The compiler can also inline functions into jacobian and blas from user-defined libraries linked with the program. See Creating New Libraries on page 87. The compiler builds the program library a.out.pl during the compilation phase.

The whole-program compilation mode can be specified while retaining the flexibility of multiple compilation steps that you typically use for separate-module compilation. To do this, use the following sequence of commands:

    c++ -pl test.pl -c ddot.cc
    c++ -pl test.pl -c svd.cc
    c++ -pl test.pl -o test svd.o ddot.o

The first two commands perform the initial compilation phase of ddot.cc and svd.cc using the program library test.pl. The last command specifies the construction of the test executable using the test.pl program library and the svd and ddot modules.

When you use the -pl and -c options to compile a source file, the compiler performs the following tasks during the compilation phase:

• Checks the source for syntax errors
• Creates an internal representation of each function in the program library
• Produces a skinny .o file

During the linking phase, the compiler performs the following tasks to create an executable:

• Performs optimizations using information about the complete program
• Builds objects for each module
• Links the modules together to produce an executable
• Stores objects in the program library to support incremental recompilation

As in traditional UNIX compilation, the -o flag specifies the executable name explicitly. To do this, use the following command:

    c++ -pl test.pl -o test svd.o ddot.o

The previous command links the svd and ddot modules that reside in test.pl and creates the executable in a file called test.

You can also specify multistep command sequences that use a mix of source and object files when using whole-program mode. To do this, use the following sequence of commands:

    c++ -pl a.out.pl -c ddot.cc
    c++ -pl a.out.pl arnoldi.cc ddot.o

The first command partially compiles ddot.cc.
The second command partially compiles arnoldi.cc; completes compilation and optimization of the modules ddot and arnoldi; links arnoldi, ddot, and any required libraries; and places the resulting executable in a.out. The compiler optimizes each function using information about the ddot and arnoldi modules.

7.3.2 Separate-module Mode

When you compile with -c, omitting the -pl flag from the compilation and link lines selects separate-module mode. Separate-module mode prevents the propagation of changes made in one module to other modules. This greatly reduces the level of optimization that occurs when using separate-module mode compared to that of whole-program mode.

To compile a single source file into its corresponding object file, use the following command.

    c++ -c ddot.cc

This produces (barring errors in the source file) a traditional, or fat, object file ddot.o.

To produce the two fat object files ddot.o and daxpy.o, each of the two source files can be compiled separately. To do this, use the following command.

    c++ -c ddot.cc daxpy.cc

Using the previous command is the same as using the following sequence of commands.

    c++ -c ddot.cc
    c++ -c daxpy.cc

When compiling a file in separate-module mode, the compiler performs inter-function optimizations within individual files.

As in whole-program mode, when the compiler constructs an executable, it also produces a program library. In separate-module mode, however, the program library is much smaller because it contains only information the debugger uses to locate more detailed debugging information in the separate fat object files.

7.3.3 Mixed Mode

Whole-program and separate-module mode may be used in combination to build a particular program. You can use mixed mode to isolate code in fat modules from changes made in other skinny or fat modules. You can also use it to share the same piece of precompiled object code among several programs, while still allowing the programs to take advantage of whole-program optimizations performed on unshared code.
The following sequence of commands shows how to use mixed mode.

    c++ -c arnoldi.cc
    c++ -pl test.pl -c jacobian.cc blas.cc
    c++ -pl test.pl -o test arnoldi.o jacobian.o blas.o

The first command compiles arnoldi.cc in separate-module mode, and produces the fat object file arnoldi.o. In this step, the compiler optimizes functions in arnoldi.cc without using information from the jacobian or blas functions.

The second command partially compiles jacobian.cc and blas.cc in whole-program mode, places the results in test.pl, and produces the skinny .o files jacobian.o and blas.o.

The third command performs final compilation and optimizations of functions from jacobian.cc and blas.cc, then links the functions to form the executable test. In this step, the compiler has knowledge of functions in arnoldi.cc, jacobian.cc, and blas.cc.

7.4 Inlining Functions

Inline expansion, commonly known as inlining, occurs when the compiler replaces a function reference with the body of the function. The advantages to using inlining include a reduction in memory usage due to the removal of function calls and returns, and the possibility of optimizing code near the function call with the function body. The disadvantages include an increase in the size of the executable and an increase in the level of complexity required during debugging.

When compiling in separate-module mode, the compiler inlines functions that are defined in the same file where they are referenced. When compiling in whole-program mode, the compiler can inline any function in the program or associated libraries.

To view functions that are inlined, use the canal or Apprentice2 performance tools. See Cray XMT Performance Tools User's Guide.

You can use either command-line switches or compiler directives to control how the compiler inlines functions.
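The inline expansion described in this section can be pictured with a small hand-written sketch. It is an illustration only, not compiler output, and the names square, sum_squares_call, and sum_squares_inlined are hypothetical. The second function shows roughly what the loop looks like once the call to square has been replaced by the function body, which removes the call and return and lets the multiplication be optimized together with the surrounding loop.

```cpp
#include <cstddef>

// Hypothetical helper: a small function that is a natural inlining candidate.
static long square(long x) {
    return x * x;
}

// Before inlining: each iteration performs a call and a return.
long sum_squares_call(const long *v, std::size_t n) {
    long total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        total += square(v[i]);
    }
    return total;
}

// After inlining (written by hand for illustration): the body of square()
// replaces the call, so the loop contains only arithmetic.
long sum_squares_inlined(const long *v, std::size_t n) {
    long total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        total += v[i] * v[i];
    }
    return total;
}
```

Both functions return the same value for the same input; only the shape of the generated code differs.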
To set inlining from the command line, you can use either -inline fcn to force the compiler to inline a specified function or -no_inline fcn to suppress inlining for the specified function. The option -no_inline_all suppresses inlining for all functions in a program.

For C++, the function name fcn must use the mangled-name format. Mangled names are internal compiler names with complete type signatures. To do this, use the following command format.

    -inline mangledfunctionname

To obtain the character string for the mangled name, use the nm -f command.

To set inlining using a directive in your C or C++ program, you can add pragma statements that require or prohibit inlining of individual functions. To do this, use one of the following directives.

    #pragma mta inline
    #pragma mta no inline

You must place one of the previous directives immediately before the function's definition in your program.

The C++ keyword inline also inlines a function, but it makes the function local to the file. In this case, if you also add the function's definition to the header file, multiple inclusions would result in many copies of this function being added to the program library. Therefore, the use of the pragma directive is usually preferable to the C++ keyword.

7.5 Optimizing Parallelization

You can control how the compiler makes your program parallel in two ways:

• You can add parallelization directives to your program.
• You can specify a compiler option from the command line that controls parallelization.

Parallelization directives and options tell the compiler how to parallelize various sections of a program. The following types of parallelization are allowed.

Single-processor parallelism
    This form of parallelism has low overhead, but does not allow the program to scale beyond a single processor. This type of parallelization takes advantage of only the streams on the processor on which the code is running.
Multiprocessor parallelism
    This form of parallelism uses more memory and has a higher startup time than single-processor parallelism. However, the number of streams available is much larger, being bounded by the number of processors on the entire machine rather than the size of a single processor.

Loop future parallelism
    Loop future parallelism runs on multiple processors. It is the highest overhead form of parallelism, but is also the only form of parallelism with the ability to dynamically increase thread and processor counts as needed while the parallel region is executing. It provides good load balancing, especially with recursive loops.

When using a directive, the parallelization type is set using the #pragma mta parallel directive. See Parallelization Directives on page 124.

When the parallelism type is set using a compiler option, the following options are available.

par
    Compiles a program to run in parallel on multiple processors.

par1
    Compiles a program to run in parallel on a single processor.

parfuture
    Compiles a program to run on multiple processors using loop future parallelism.

serial
    Compiles a program to run without automatic parallelization.

Parallelism that you specify with future statements in your program is always enabled. Compiler options have no effect on future statements.

If you do not specify a compiler option, the default is to run using the par option.

There are also parallelization directives and compiler options available that you can use to enable or disable loop restructuring. Loop restructuring includes loop transformations, loop fusion, loop unrolling, loop distribution, and loop interchange. By default, loop restructuring is enabled when parallelization is enabled, and disabled otherwise.

To enable or disable loop restructuring using a directive, use the #pragma mta restructure directive. Disabling loop restructuring may inhibit parallelization of some loops.
The #pragma mta restructure directive restructures loops from the point where it appears in the file to the end of the file. It can be disabled during the compilation process when you specify the -nopar compiler option from the command line.

You can enable loop restructuring from the command line using the -restructure compiler option. You can disable loop restructuring using the -no_restructure option. You may need to use this if you are also using the -par, -par1, or -parfuture option, because these options automatically enable loop restructuring.

You can also control whether the compiler automatically parallelizes recurrences and reductions. Recurrence parallelization is enabled by default, but you may want to disable it for a section in the program. To do this, use the #pragma mta recurrence off directive.

For information about the parallelization options, see the cc(1) or c++(1) man page. For a complete list and explanation of the parallel directives and assertions, see Appendix C, Compiler Directives and Assertions on page 109.

7.6 Incremental Recompilation and Relinking

When a previously built program library and executable are present, the compiler performs incremental recompilation and relinking, regardless of the compilation mode. An incremental recompilation saves time during the compilation process.

The compiler performs incremental recompilation on a function-by-function basis within each source file. If you repeatedly edit and compile several functions in the blas.cc file, the compiler detects which functions require recompilation after editing. For example, if you edit a particular function f, the compiler only recompiles f and any function that inlined f. But if you change a globally visible type declaration, the compiler recompiles all functions that use that type.

In whole-program mode, separate-module mode, or mixed mode, the compiler builds a program library for the executable. The compiler uses the program library during the incremental compilation.
If you delete the .pl file between compilations, the compiler cannot execute an incremental recompilation. Similarly, deleting the executable file prevents incremental linking.

7.7 Creating New Libraries

You can create a user-defined library in the same way that you build a program in whole-program mode. To do this, use the -R option to suppress the creation of an executable. For example, to build the library tinyblas.a from functions in the files ddot.cc and dgemv.cc, use the following sequence of commands.

    c++ -pl tinyblas.a -c ddot.cc dgemv.cc
    c++ -pl tinyblas.a -R ddot.o dgemv.o

In the previous example, the first command creates the initial program library, checks the two source files for syntax errors, and copies them into the program library. The second command finishes compilation of the functions in ddot and dgemv with inlining enabled between files and from the standard libraries. The -R flag directs the compiler to place the generated relocatable object code in the program library and suppresses the build of an executable.

The following sequence of commands provides the same results:

    c++ -pl tinyblas.a -c ddot.cc
    c++ -pl tinyblas.a -c dgemv.cc
    c++ -pl tinyblas.a -R ddot.o dgemv.o

Or, you can use the following single command:

    c++ -pl tinyblas.a -R ddot.cc dgemv.cc

You can update a library with an incremental compilation. To do this, use the following sequence of commands.

    c++ -pl tinyblas.a -R ddot.cc dgemv.cc
    edit dgemv.cc
    c++ -pl tinyblas.a -R ddot.cc dgemv.cc

In the previous example, the first compile creates the library as usual. The second compile examines ddot.cc (and ignores it because it remains unchanged) and then focuses on dgemv.cc, which has presumably been changed by the edit. The compiler recompiles any modified function in dgemv.cc and any function that depends on a changed function (perhaps because of inlining). The rest of the library remains the same.
There is no requirement that a library end with an .a suffix.

The inclusion of the -R flag in a separate-module compilation line enables inlining from the standard libraries into the newly created library. The library looks like a traditional (fat) object file.

7.8 Compiler Messages

There are three categories for compiler messages: errors, warnings, and remarks. Errors are the most severe and indicate problems that cause the compiler to halt after parsing without generating object code. Warnings are less severe: the compiler runs to completion and generates object code. Remarks tend to highlight conditions that prevent the code from being portable, but the resulting object code almost always behaves as expected.

7.9 Setting Debugger Options during Compilation

Rather than providing many levels of optimization, the compiler provides the -g1 and -g2 options to support progressing levels of debugging. The debugger options include the following:

-g, -g1
    At this level, the debugger displays the values of variables (including global variables and array elements) anywhere in their scope. However, this level causes some loss of optimization. Specifically, the compiler no longer restructures loops, although basic loop parallelization is still possible. The -g flag is identical to -g1.

-g2
    This is the highest level of debugging support. This level lets you view and modify variables anywhere in their scope. However, this level significantly inhibits optimization. Specifically, the compiler no longer parallelizes loops.

If you do not specify either option, the compiler runs with all optimizations enabled. Although debugging is not set, you can still perform some debugging operations. For example, you can trace control flow using breakpoints together with the step and next commands. You can also view the value of global variables, although these can sometimes be out-of-date.
The compiler also has options that perform tracing. Tracing creates a trace file, trace.out, that you use for performance tuning. You use the -trace option to turn on tracing and -trace_level n to trace functions larger than n source lines. You can also trace stack allocation by using the -trace_stack_alloc compiler option. For more information about the trace option, see the cc(1) or c++(1) man pages. For information about performance tuning, see Cray XMT Performance Tools User's Guide.

If you compile an executable using modules that have been compiled at different debugging levels, the level of debugging support changes between one module and another, whether inlined or not.

For more information about using the Cray XMT debugger, see Cray XMT Debugger Reference Guide.

7.10 Using Compiler Directives and Assertions

Directives are metalanguage constructs that you can add to a program to influence how the compiler performs a translation. In C and C++, you prefix directives with #pragma mta. Macros are allowed after the word mta in a pragma, as shown in this example:

    #define NUMSTREAMS 40
    ...
    #pragma mta use NUMSTREAMS streams

The preceding pragma is equivalent to #pragma mta use 40 streams.

You can also write compiler directives in C and C++ code using _Pragma rather than #pragma mta. In this case, the directive appears syntactically as if it were a single string argument to a function call, as shown in the following example.

    _Pragma("mta assert parallel")

The advantage to using this form of the directive is that you can use it in macros or similar locations. The disadvantage of this form is that most C and C++ compilers treat it as an actual function, which makes the code less portable.

Directives are grouped into five general categories: compilation directives, parallelization directives, semantic assertions, implementation hints, and language-extension directives.
A compilation directive is a command to compile a program in a particular way. Parallelization directives tell the compiler how to parallelize various sections of a program. Semantic assertions provide information to the compiler that could be proved true about the program even though that proof is beyond the capabilities of the compiler. Implementation hints tell the compiler about the expected behavior of the program. Language-extension directives allow you to place Cray XMT specific language features into a program without interfering with the portability of code to other systems.

For more information, see Appendix C, Compiler Directives and Assertions on page 109.

Running an Application [8]

This chapter contains procedures for launching your application on the Cray XMT.

8.1 Launching the Application

You use the mtarun command to launch and run a program. The mtarun command connects to the mtarund daemon that runs on the compute node on the backend. The daemon creates a copy of your environment and runs it on the compute nodes. Your file directories from the login node appear on the compute nodes with the same paths.

From the login node, you use the mtarun command to launch a program, as shown in the following example.

    mtarun MyProgram.out

The most common options to use with the mtarun command are -m max_procs and -t min_procs. The -m max_procs option sets the maximum number of processors for the program. This option is the same as setting the MTA_PARAMS environment variable to NUM_PROCS. The -t min_procs option sets the number of processors to use when the program starts running. By default, a program starts with one processor and adds processors, as needed.

After launching the program, mtarun acts as the frontend of the program. mtarun provides the following services to the program:

• Standard I/O forwarding. Provided by mtarun stdin, mtarun stdout and mtarun stderr.
• Signal forwarding.
  mtarun forwards all catchable signals.
• Termination management. If the program exits normally, mtarun exits with the same exit status. If the remote process is killed by a signal, mtarun terminates with the matching exit status and sends a message to stderr with information about the signal that caused the program to exit. If mtarun terminates prematurely, the mtarund daemon uses SIGKILL to kill the program.

The mtarun command uses a default configuration file, .mtarunrc, which exists in your home directory. You can modify this file to include any mtarun options, separated by spaces. The configurations in this file are overridden by options that you use from the command line.

To monitor process or CPU usage by your program, you use mtatop. For more information about using mtarun to run the program or mtatop to monitor the program, see Cray XMT System Management.

Note: When an application that was built for tracing is running, an intermediate process runs to flush trace data back to the service partition as the tracing buffers fill. To ensure that all tracing data is captured, the mtarun that launched the application will not exit until this tracing process completes. Depending on the amount of data that needs to be flushed, and the speed of the underlying file system, mtarun may not exit for some time after the application has completed. If you kill the mtarun process, in the belief that it is hung, you may get incomplete tracing data. For more information on partial tracing data see Partial Tracing in the Cray XMT Performance Tools User's Guide.

8.2 User Runtime Environment Variables

There are a number of environment variables that you can use with the user runtime known as MTA_PARAMS. You can use these environment variables for debugging, dumping registers, setting the number of streams, setting maximums for processors and ready pools, and so on.
For csh, use the following command:

    % setenv MTA_PARAMS "param1 param2"

For example, to set the maximum number of processors and to prevent streams from being reserved for the debugger, set MTA_PARAMS by using the following command:

    % setenv MTA_PARAMS "num_procs 100 no_prereserve"

For a bash shell, use the following command:

    % export MTA_PARAMS="param1 param2"

For example, to set the maximum number of processors to two and indicate that the program must wait for a debugger to attach in the event of a poison, you use the following command on a bash shell:

    % export MTA_PARAMS="num_procs 2 debug_data_prot"

For a list of environment variables that you can set, see Appendix G, MTA_PARAMS on page 143.

8.3 Improving Performance

For information about improving performance on your program, see Cray XMT Performance Tools User's Guide.

Optional Optimizations [9]

9.1 Scalar Replacement of Aggregates

Effective with version 2.0 of the Cray XMT software, the XMT compiler provides an optional optimization pass that performs a code transformation called scalar replacement of aggregates. This transformation replaces C++ class objects and C structures (aggregate data types) with collections of temporary scalar variables. Values are copied from the aggregate to the temporary variables and back again as needed. These scalar variables allow the compiler to perform more precise analysis in later phases, and may enable additional optimizations and parallelization of loops.

For example, consider the following code:

    class myTwoInts {
    public:
      int i;
      int j;
    };

    myTwoInts foobar2(myTwoInts t, int n, int * restrict foo) {
      for (int i = 0; i < n; i++) {
        t.i += foo[i];
      }
      return t;
    }

Without scalar replacement the compiler cannot determine whether the references to fields of the object t form a loop-carried dependence, thus it is unable to parallelize this loop.
By viewing the canal report you can see that the loop is not parallelized:

       | myTwoInts foobar2(myTwoInts t, int n, int * restrict foo) {
       |   for (int i = 0; i < n; i++){
    8S |     t.i += foo[i];
       |   }
       |   return t;
       | }

After recompiling this code with automatic scalar replacement enabled, the compiler is able to transform the foobar2 routine into something that resembles the following:

    myTwoInts foobar2(myTwoInts t, int n, int * restrict foo) {
      __tmp_t_i = t.i;
      for (int i = 0; i < n; i++) {
        __tmp_t_i += foo[i];
      }
      t.i = __tmp_t_i;
      return t;
    }

Note that the compiler does not bother creating a temporary variable for the unused field j. After this transformation, the compiler is better able to analyze the dependencies in the loop and to determine that the loop can be safely parallelized as a reduction. This can be seen in the canal report of the recompiled code:

                  | myTwoInts foobar2(myTwoInts t, int n, int * restrict foo) {
    ** scalar replacing t
                  |   for (int i = 0; i < n; i++) {
    18 P:$ 18 P:$ |     t.i += foo[i];
    ** reduction moved out of 1 loop
                  |   }
                  |   return t;
                  | }

Scalar replacement of aggregates can enable parallelization of many additional loops. However, it can also add additional memory references which can adversely affect performance. For this reason, the compiler performs scalar replacement only when requested by the programmer.

Automatic scalar replacement of aggregates can be enabled either by using a command-line flag at compile time, or by using pragmas in your code. If you compile a file with the -scalar_replacement flag, the compiler will automatically attempt to perform scalar replacement on any aggregates that it can prove are safely replaceable unless those aggregates have been marked with an mta no replace pragma. (See Semantic Assertions on page 125.)
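The effect of the scalar replacement transformation on foobar2 can be checked by hand in portable C++. The sketch below is an illustration under stated assumptions, not compiler output: it drops the restrict qualifier (which is not standard C++), declares the temporary explicitly under the hypothetical name tmp_t_i, and uses the hypothetical function names foobar2_original and foobar2_replaced so that both forms can be compared side by side.

```cpp
// Aggregate type from the example in this section.
class myTwoInts {
public:
    int i;
    int j;
};

// Original form: the loop updates a field of the aggregate directly.
myTwoInts foobar2_original(myTwoInts t, int n, const int *foo) {
    for (int i = 0; i < n; i++) {
        t.i += foo[i];
    }
    return t;
}

// Hand-written scalar replacement: the field is copied into a scalar
// temporary before the loop and copied back afterwards. The unused
// field j needs no temporary.
myTwoInts foobar2_replaced(myTwoInts t, int n, const int *foo) {
    int tmp_t_i = t.i;
    for (int i = 0; i < n; i++) {
        tmp_t_i += foo[i];  // plain scalar sum reduction
    }
    t.i = tmp_t_i;
    return t;
}
```

Both versions return the same object for the same input. In the second form the loop body touches only a local scalar, which is the shape that a compiler can recognize as a reduction.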
You can use the noalias pragmas and restrict type qualifiers as needed to indicate to the compiler that certain aggregates, or pointers to aggregates, are safe to replace.

Alternatively, you can enable scalar replacement for individual aggregates by using the mta assert can replace pragma. This pragma, which takes a list of aggregates and/or aggregate pointers, serves two purposes. First, it tells the compiler that it is safe to perform scalar replacement on the aggregates or pointers listed. The compiler follows this assertion even if it was unable to prove that the replacement was safe. Second, it is a request to replace the listed aggregates even if the code was not compiled with the -scalar_replacement flag.

This pragma is useful in situations where the compiler would not be able to verify that a key aggregate is replaceable. You can also use this pragma in situations where, because of the extra memory references, you do not want to enable scalar replacement for an entire source file, but where you need a particular aggregate to be replaced in order to achieve automatic loop parallelization. For example, consider the loop in the method doit below:

    class foo {
      int * restrict b;
      int n;
    public:
    #pragma mta no inline
      void doit(int *c) {
        int i;

    #pragma mta assert noalias *this
        for (i = 1; i < n; ++i) {
          b[i] = b[i-1] + c[i-1];
        }
      };
    };

Without scalar replacement, this parallel recurrence loop will not parallelize, because the accesses to the b array, which are accesses into a field of the aggregate *this, defy alias analysis.
By adding an mta assert can replace pragma, however, the loop will parallelize as can be seen in the canal report:

        | #pragma mta no inline
        | void doit(int *c) {
    ** scalar replacing *this
        |   int i;
        |
        | #pragma mta assert noalias *this
        | #pragma mta assert can replace *this
        |   for (i = 1; i < n; ++i) {
    5 L |     b[i] = b[i-1] + c[i-1];
        |   }
        | };
        | };

The can replace assertion also has a loop-specific variant, mta assert loop can replace, which requests scalar replacement for a specific loop instead of an entire function. In this case the compiler copies the fields into the temporaries immediately before the loop, and copies them back into the aggregate immediately after the loop. Any accesses to fields of the aggregate inside the loop will be replaced with the temporaries. This can be useful if scalar replacement is unsafe or undesirable for portions of a routine, but needed to achieve good performance in specific loops. The loop variant can also be used to achieve parallelization of the loop in the previous example:

        | #pragma mta no inline
        | void doit(int *c) {
        |   int i;
        |
        | #pragma mta assert noalias *this
        | #pragma mta assert loop can replace *this
        |   for (i = 1; i < n; ++i) {
    5 L |     b[i] = b[i-1] + c[i-1];
    ** scalar replacing *this
        |   }
        | };
        | };

The exact syntax of these pragmas is described in Appendix C.3 of Cray XMT Programming Environment User's Guide.

9.2 Optimizing Calls to memcpy and memset

The compiler option -enable_memcmd_opt enables a compiler optimization that replaces calls to memcpy/memset with versions of the functions that were built for the current parallel mode, which the compiler can inline. This allows the compiler to potentially merge the parallel region in the memory routine with any surrounding parallel region, which can reduce the cost of having to tear down and restart parallel regions in order to call memcpy or memset.
However, when this optimization is enabled and these functions are called from within a parallel loop, this creates nested parallel regions. The result is a potentially significant performance degradation. A new compiler flag, -disable_memcmd_opt, was added to disable this optimization in case there were performance problems, such as the case mentioned above. However, because the functions may be called indirectly, it may not always be easy to determine that a call to memcpy or memset is causing a performance problem. For example, this can happen if a program calls a function in the C++ STL that calls memcpy. For this reason, the default behavior of the compiler is to have this optimization disabled and allow users to enable it with the option -enable_memcmd_opt. Use this option only when you know there is no risk of memcpy or memset being called from within a parallel loop.

For additional control over the parallelism used by memcpy or memset, you can directly call versions of these functions that use a single stream, single-processor parallelism, and multiprocessor parallelism. The memcpy functions are called memcpy_ss, memcpy_sp and memcpy_mp, respectively. The corresponding memset functions are called memset_ss, memset_sp and memset_mp, respectively. These functions are declared in string.h and are documented in the memcpy(3) and memset(3) man pages.

Error Messages [A]

Execution-time errors are directly related to exceptions. An exception is an unexpected condition raised by an event in your program, the operating system, or the hardware. Exceptions can trigger a trap when the stream that issued the exception is ready for execution, unless the trap is disabled. In cases where several exceptions occur simultaneously, the trap handler decides the order in which to process the exceptions.

Use the list that follows to identify and troubleshoot common exceptions.
create
    This error occurs, for example, when you attempt to create more streams than were reserved. To prevent this error, you can use the STREAM_RESERVE operation to reserve the necessary number of streams before running the STREAM_CREATE operation again.

data_alignment
    A data-alignment error has occurred. This error can occur when you access data that the compiler assumes is on an 8-byte boundary when it is not.

data_hw_error
    A data-memory or network-hardware error has occurred. This occurs when the memory system detects an uncorrectable error while loading data from memory.

data_prot
    A data protection level error has occurred. This error is equivalent to a segmentation error. Possible causes include attempting to access protected data, operating-system data, or data outside your addressable memory space.

domain_signal
    A domain signal error has occurred. This message indicates the program is not allowing the operating system to interrupt it. This typically indicates a problem in the runtime system.

float_extension
    An error using a floating-point number has occurred. A floating-point number is using the wrong extension.

float_inexact
    An error using a floating-point number has occurred. An operation is attempting to use an inexact floating-point number. This type of error indicates an error in the source registers, the operation, or the value written to the destination.

float_invalid
    An error using a floating-point number has occurred. An operation is attempting to use an invalid floating-point number.

float_zero_divide
    An error using a floating-point number has occurred. An operation is attempting to divide a floating-point number by 0.

float_overflow
    An error using a floating-point number has occurred. An operation using a floating-point number has caused an overflow to occur. This type of error indicates an error in the source registers, the operation, or the value written to the destination.
float_underflow
    An error using a floating-point number has occurred. An operation using a floating-point number has caused an underflow to occur. This type of error indicates an error in the source registers, the operation, or the value written to the destination.

poison
    Use of a poisoned register has occurred. A register is poisoned if it contains an uninitialized value. The exception occurs when you attempt to access the value in this register. Use of a poisoned register can sometimes occur when the compiler uses speculative loading. For example, the compiler may optimize a loop for n iterations and load n+1 values. Under normal conditions, the compiler does not use the n+1 value because the program correctly stops consuming prefetched data after n iterations. However, if the program accesses the n+1 value, it raises the poison exception.

privileged
    A privilege error has occurred. This exception indicates that your program does not have the necessary privilege level to perform an operation.

prog_hw_error
    A program-memory error has occurred. This indicates that while the processor was loading an instruction, there was a temporary or permanent problem with the physical memory.

prog_prot
    A program-protection error has occurred. This error occurs when the processor attempts to execute an instruction from a PC that is not a valid PC.

unknown_trap
    An error has occurred that does not fit into any other category on this list.

User Runtime Functions [B]

Functions in the runtime library support implicit and explicit parallelism, event logging, and trap handling. The compiler inserts calls to the runtime library into your code to handle programming constructs, such as the future statement, or command-line options, such as the -trace flag. In addition, some functions in the runtime library can be called directly by the user.
This appendix contains a list of the runtime functions that you can call from your program. This list provides only a short description of the runtime functions. A more complete description of the functions and the syntax required to use them can be found on the referenced man pages.

mta_create_team
Adds teams. See the mta_create_team(3) man page.

mta_create_thread_on_team
mta_create_thread_all_teams
mta_create_stream
Creates a new thread on an existing team. See the mta_create_thread_all_teams(3) man page.

mta_disable_auto_growth
mta_enable_auto_growth
mta_assess_growth
Controls the automatic growth of processors. See the mta_disable_auto_growth(3) man page.

mta_get_all_rt_teamids
Returns the team identifiers for all runtime teams. See the mta_get_all_rt_teamids(3) man page.

mta_get_clock
Provides the number of clock ticks that have passed since the program began. See the mta_get_clock(3) man page.

mta_get_max_teams
Determines the maximum number of teams available to the program. See the mta_get_max_teams(3) man page.

mta_get_num_teams
Returns the number of currently executing teams. See the mta_get_num_teams(3) man page.

mta_get_rt_teamid
Returns the runtime identifier of the caller's team. See the mta_get_rt_teamid(3) man page.

mta_get_team_index
Returns a user runtime index for a team. See the mta_get_team_index(3) man page.

mta_get_thread_name
mta_set_thread_name
mta_remove_thread_name
Retrieves, sets, and removes user-defined thread names. See the mta_get_thread_name(3) man page.

mta_get_threadid
mta_get_parent_threadid
Returns the runtime identifier of the calling thread or its parent thread. See the mta_get_threadid(3) man page.

mta_lock_thread
mta_unlock_thread
Controls thread behavior when a synchronized data fault occurs. See the mta_lock_thread(3) man page.
mta_log_event
mta_log_short_event
mta_log_long_event
mta_log_event_record
mta_log_short_event_record
mta_log_long_event_record
Sets user-defined event logging. See the mta_log_event(3) man page.

mta_new_trap1_continuation
mta_new_trap1_continuation_block
mta_delete_trap1_continuation
mta_register_trap1_continuation
mta_unregister_trap1_continuation
mta_update_trap1_value
Creates, deletes, binds, or updates a trap 1 continuation. See the mta_new_trap1_continuation(3) man page.

mta_print_backtrace
Prints the thread's call stack. See the mta_print_backtrace(3) man page.

mta_probe_location
Probes a memory location to determine whether it can be read or written. See the mta_probe_location(3) man page.

mta_register_event_filter
Installs a filter function for user-defined event logging. See the mta_register_event_filter(3) man page.

mta_register_fatal_error_handler
Binds a new fatal error handler. See the mta_register_fatal_error_handler(3) man page.

mta_register_task_data
Stores thread-specific data used to implement a common task. See the mta_register_task_data(3) man page.

mta_register_team_exit_fn
mta_unregister_team_exit_fn
Binds or unbinds a team exit function. See the mta_register_team_exit_fn(3) man page.

mta_register_tertiary_handler
mta_get_tertiary_handler
Binds a new tertiary trap handler or returns the current tertiary trap handler. See the mta_register_tertiary_handler(3) man page.

mta_report_trap_counters
Sets reporting for trap counter statistics. See the mta_report_trap_counters(3) man page.

mta_reserve_task_event_counter
mta_get_task_counter
mta_get_team_counter
Reserves or queries hardware counters. See the mta_reserve_task_event_counter(3) man page.

mta_set_crew_limit
Sets the maximum number of crews that can be simultaneously active.
The term crew is applied to the group of processors that are used when parallelizing the iterations of a loop across multiple processors. Applications use this type of parallelization when they are compiled using the multiprocessor mode. See the mta_set_crew_limit(3) man page.

mta_set_domain_signal_mask
Enables or disables domain signals in the calling thread. See the mta_set_domain_signal_mask(3) man page.

mta_set_implicit_processors
mta_get_implicit_processors
mta_set_implicit_streams
mta_get_implicit_streams
Stores or retrieves the number of implicit processors or implicit streams that the calling thread uses for an implicitly parallelized region of code in a program. See the mta_set_implicit_processors(3) man page.

mta_set_private_data
mta_get_private_data
Stores or retrieves private data for a thread. See the mta_set_private_data(3) man page.

mta_set_rt_error_file
Redirects runtime library messages to a file. See the mta_set_rt_error_file(3) man page.

mta_set_trace_limit
Modifies the number of times an individual trace event is recorded. See the mta_set_trace_limit(3) man page.

mta_sleep
Suspends a thread. See the mta_sleep(3) man page.

mta_start_event_logging
mta_suspend_event_logging
mta_resume_event_logging
mta_is_event_logging_on
mta_set_event_flush
Trace buffer controls for user-defined event logging. See the mta_start_event_logging(3) man page.

mta_yield
Yields an active stream to any other thread that needs the stream. See the mta_yield(3) man page.

Compiler Directives and Assertions [C]

This appendix provides a complete list of compiler directives specific to the Cray XMT and accepted by the Cray XMT compiler.

C.1 Compilation Directives

A compilation directive is a command to compile a program in a particular way.
#pragma mta autotouch [on|off|default]
This directive automatically applies the touch generic whenever a future variable is referenced. The on option enables automatic touching, the off option disables automatic touching, and the default option reverts from autotouch to the default mode for that source module, as determined by the compile-line flags.

#pragma mta adjust constructor priority adj
This directive modifies the priority assigned to static constructors in a file. The adjusted priority is the priority just before the directive plus adj. The adjustment variable adj must be an integer in the range of -255 to 255, and the new priority must be in the range of 0 to 255. This directive remains in effect from the point at which it occurs until the end of the file or until another directive of the same kind is encountered.

#pragma mta complex limited range [on|off|default]
This directive specifies whether complex multiplication and division may be performed using the usual mathematical formulas for complex arithmetic or safer but slower arithmetic. The usual mathematical formulas for complex arithmetic use the following format:

    (a,b)*(c,d) = (ac-bd, ad+bc)
    (a,b)/(c,d) = ((ac+bd)/(cc+dd), (bc-ad)/(cc+dd))

The previous formulas, however, may cause spurious Not a Number (NaN) results or infinities if the norm of either complex number is larger than the maximum expressible real number or if the norm of the denominator of a division is smaller than the smallest expressible real number. Additionally, these formulas may not be as accurate as the safer complex arithmetic performed when complex limited range is off. This is especially true when the difference between two intermediate computations is very small, such as ac-bd, in the case of multiplication, and bc-ad, in the case of division.

This directive applies to whatever follows it textually in the current file.
The directive stays in effect until the end of the file or until another directive of the same kind is encountered. When the on or off options are used, the directive takes precedence over the -cxlimited and -no_cxlimited command-line options. When the default option is used, the directive enables the faster arithmetic if -cxlimited is specified on the command line. Otherwise, it disables the faster arithmetic.

#pragma mta constructor priority pri
This directive assigns a priority level of pri to the static constructors within the file, where pri is an integer in the range 0 to 255. This priority determines the treatment of constructors using the following rules:

• Static constructors with priority j are executed before those of priority i, for i < j. No order is promised between modules compiled with the same constructor priority.

• Static constructors with priority less than 200 are executed after the user runtime has been initialized. In particular, futures and system calls may be performed reliably by static constructors with priority less than 200.

• Static constructors with priority less than 100 are executed after the system libraries have been initialized. For example, input/output operations may be reliably performed by static constructors with priority less than 100.

The constructor priority directive overrides any -constructor_priority n compiler flag used on the command line. If neither the directive nor the compiler flag is used, the constructor priority defaults to 0.

The constructor priority directive may occur at any point in a source code file provided no constructor priority or adjust constructor priority directives occur at an earlier point in the same file. The directive remains in effect from the point at which it occurs until the end of the file or until an adjust constructor priority directive is encountered.
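The range and ordering rules for constructor priorities can be sketched in plain C. This is only an illustration under stated assumptions: the helper names adjust_priority, runs_before, runtime_ready, and system_libs_ready are hypothetical and are not part of the Cray XMT compiler or runtime. They simply encode the documented limits (adj between -255 and 255, resulting priority between 0 and 255, higher priority runs first, thresholds of 200 and 100).

```c
#include <stdbool.h>

/* Hypothetical helper: models "adjust constructor priority adj".
 * Stores current + adj in *result and returns true when both the
 * adjustment and the new priority are inside the documented ranges. */
bool adjust_priority(int current, int adj, int *result)
{
    if (adj < -255 || adj > 255) {
        return false;               /* adj out of range */
    }
    int next = current + adj;
    if (next < 0 || next > 255) {
        return false;               /* new priority out of range */
    }
    *result = next;
    return true;
}

/* Hypothetical helper: constructors with priority a run before those
 * with priority b when a is the larger number. Equal priorities have
 * no promised order, so this returns false for a == b. */
bool runs_before(int a, int b)
{
    return a > b;
}

/* Priorities below 200 run after the user runtime is initialized. */
bool runtime_ready(int pri)
{
    return pri < 200;
}

/* Priorities below 100 run after the system libraries are initialized. */
bool system_libs_ready(int pri)
{
    return pri < 100;
}
```

For instance, a file that starts at priority 100 and applies an adjustment of 50 ends up at priority 150, where futures and system calls are reliable (runtime_ready is true) but input/output is not yet guaranteed (system_libs_ready is false).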
#pragma mta debug level [0|1|2|default|none]
This directive sets the debug level to the integer constant 0, 1, or 2, or disables debugging when none is specified. Specifying default restores the debug level provided on the command line. This directive overrides the -g, -g1, and -g2 compiler flags. However, this directive does not affect any function that contains a call to setjmp or sigsetjmp, which is always compiled as if the -g2 option was specified. This directive has function-level granularity and affects any functions whose beginning follows the directive. This directive applies to whatever follows it textually in the current file. It stays in effect until the end of the file or until another directive of the same kind is encountered.

#pragma mta fence
This directive specifies a boundary in the source code across which the compiler is not allowed to move loads or stores of any aggregate or heap allocated variables. The effect of this directive is to limit the compiler's ability to move statements that have been marked with a fence directive. This directive is often used to prevent the compiler from moving calls to timing functions with respect to the code being timed, as in the following example.

    #pragma mta fence
    t0 = mta_get_clock(0);
    /* interval of interest */
    ......
    #pragma mta fence
    t1 = mta_get_clock(t0);

This directive may prevent some compiler optimizations from being performed.

#pragma mta fenv_access [on|off|default]
This directive specifies whether the full floating-point environment is available. When fenv_access is on, strict rules against the optimization of floating-point operations are enforced. If it is off, extra optimizations are performed, but floating-point exceptions may be lost in certain cases. The compiler is allowed to attempt either one or both of two optimization techniques when fenv_access is off.
The first technique is to evaluate floating-point operations at compile time. The second is to move floating-point operations to locations where they are executed with less frequency, such as outside a loop. In the following example, the addition in the statement that assigns a value to G can be performed at compile time, but the addition in the statement that assigns a value to F cannot.

    void sub(void) {
        float F;
        float G;
    #pragma mta fenv_access off
        G = 2.5 + 3.1;
    #pragma mta fenv_access on
        F = 2.5 + 3.1;
    }

This directive applies to whatever follows it textually in the current file. The directive stays in effect until the end of the file or until another directive of the same kind is encountered. The off and on options to the fenv_access directive take precedence over the -no_float_opt command-line option. The default option to the directive enables floating-point environment access (disables floating-point optimization) if the -no_float_opt command-line option was used. It disables floating-point environment access (enables optimization) if the command-line option was not used. The directive may also be specified in C as #pragma fenv_access [on|off|default].

#pragma mta for all streams
This directive starts up a parallel region (if the code is not already in a parallel region) and causes the next statement or block of statements to be executed exactly once on every stream allocated to the region. If the pragma appears in code that would otherwise not be parallel, it causes that code to go parallel.

You can use this pragma in conjunction with the use n streams pragma to ask the compiler to allocate a certain number of streams per processor to the job.
    #pragma mta use 100 streams
    #pragma mta for all streams
    {
        // do something
    }

However, there is no guarantee that the runtime will grant the requested number of streams if, for example, they are not available due to other jobs, the OS, or other simultaneous parallel regions in the current job.

#pragma mta for all streams i of n
This directive is similar to the for all streams pragma except that it also sets the variable n to the total number of streams executing the region, and the variable i to a unique per-stream identifier between 0 and n-1. For example:

    int i, n;
    int check_in_array[MAX_PROCESSORS * MAX_STREAMS_PER_PROCESSOR];
    for (int i = 0; i < MAX_PROCESSORS * MAX_STREAMS_PER_PROCESSOR; i++)
        check_in_array[i] = 0;
    #pragma mta for all streams i of n
    {
        check_in_array[i] = 1;
        printf("Stream %d of %d checked in.\n", i, n);
    }

Note that the integer variables i and n are declared separately from the pragma.

For more information on the for all streams pragmas see Using the Cray XMT for all streams Pragmas in the CrayDoc Knowledge Base at http://docs.cray.com/kbase.

#pragma mta fused muladd [on|off|default]
This directive specifies whether the compiler is allowed to combine floating-point operations into a fused multiply-add operation. Default behavior is to allow fused multiply-add operations to be performed only when float optimization is turned on. When this option is turned on, the compiler is allowed to, but not required to, fuse multiply-add operations into one instruction. This directive applies to whatever follows it textually in the current file. The directive stays in effect until the end of the file or until another directive of the same kind is encountered. When the on or off option is used, the directive takes precedence over the -no_mul_add command-line option.
When the default option is used, the directive disables the fused multiply-add operation if the -no_mul_add command-line option was used; it enables the fused multiply-add operation if no command-line option was used. The single round required directive overrides the fused muladd off directive.

#ident "string-constant"
This directive inserts string-constant into the executable file generated from this code. Strings that have been incorporated into the executable in this manner can be retrieved from the executable using commands such as strings or in some cases what. One possible use of this directive would be to incorporate a version string such as the following into the executable.

    #ident "compiling.texinfo,v 1.15 2007/02/10 23:20:09"

This directive can be placed anywhere in a C file and is equivalent to declaring a static string constant.

#pragma mta [no] inline
When this directive is inserted immediately before a function declaration, the compiler inlines that function wherever possible throughout the user source program. If used with the no option, inlining of the specified function is prevented. When the [no] inline directive is not used, the compiler uses a standard, internal heuristic to decide whether a function should be inlined. When there is a conflict between the no inline directive and the command-line options -no_inline_all, -inline_all, -inline, or -no_inline, no inline takes precedence, regardless of whether it was specified on the command line or in a directive. The command-line option -no_inline_directed disables the inline directive but does not affect the no inline directive.

#pragma mta instantiate [none|all|used|local|default]
When used inside a template declaration, the effect of this directive is limited to the uses of that template. When used outside a template declaration, this directive sets the template instantiation mode for the text following the directive and stays in effect until the end of the file or until another directive of the same kind is encountered. This directive takes one of the following options:

none
No instantiations are created for any template entities.

used
All template entities that were used in the compilation, including all static data members for which there are template definitions, are instantiated.

all
All template entities that are declared or referenced in the compilation unit are instantiated. For each fully instantiated template class, all of its member functions and static data members are instantiated, whether used or not. Nonmember template functions are instantiated even if the reference was only a declaration.

local
Those template entities that were used in the compilation are instantiated. This option is similar to the used option, except that in this case, the functions are given internal linkages. That is, the compiler instantiates the functions and static data members used in the compilation as local static functions and local static variables.

default
The instantiate mode switches back to either the mode specified by the -instantiate switch on the compiler command line, or, if no command line switch was present, to the none option, which is the default behavior when no mode is specified.
Where the mode specified with the instantiate pragma differs from that specified with the -instantiate switch on the compiler command line, the instantiate pragma takes precedence.

#pragma mta max concurrency c
The max concurrency c directive indicates that the next loop should limit the concurrency to c. This directive can be used on any parallel loop. For single processor parallel loops, the directive limits the number of streams used by the parallel loop to no more than c. For multiprocessor parallel loops, the directive estimates the number of processors to use for the loop as max(1,c/num_streams), where num_streams is the number of streams the compiler requests for each processor. For loop future parallel loops, the directive limits to c the number of futures created. The directive is ignored for explicitly serial loops and cannot be used on a loop that also uses the use n streams directive.

This directive is useful for managing nested parallelism in applications that have multiple parallel loops running concurrently, and to reduce or prevent contention for resources.

For more information on using this pragma see Limiting Loop Parallelism in Cray XMT Applications in the CrayDoc Knowledge Base at http://docs.cray.com/kbase.

#pragma mta max n processors
The max n processors pragma limits the number of processors used by a multiprocessor parallel loop. This is useful for load balancing in applications that have multiple parallel loops running concurrently.

For more information on using this pragma see Limiting Loop Parallelism in Cray XMT Applications in the CrayDoc Knowledge Base at http://docs.cray.com/kbase.

#pragma mta max n streams per processor [may merge]
This directive sets a limit of n on the number of streams per processor that will execute a parallel loop. This limit applies to an entire parallel region.
Thus, by default, the compiler will not combine loops with different maximum stream specifications into the same region. This includes cases where one loop has a specified maximum and the other loop does not. However, if you add the optional may merge parameter, the compiler will ignore maximum stream specifications when deciding how to construct parallel regions (i.e., loops that would have been placed in the same region with no max streams pragma will still be placed in the same region if max streams pragmas with may merge are added). You can view how parallel regions are constructed in the canal report (see the Cray XMT Performance Tools User's Guide).

For example, consider the following two loops:

    for (int i = 0; i < size_foobar; i++) {
        bar[i] = size_foobar - i;
    }
    for (int i = 0; i < size_foobar; i++) {
        foo[i] += bar[i]/2;
    }

The output from canal shows that they are both placed into parallel region 1:

          | for (int i = 0; i < size_foobar; i++) {
      3 P |     bar[i] = size_foobar - i;
          | }
          |
          | for (int i = 0; i < size_foobar; i++) {
      5 P |     foo[i] += bar[i+c]/2;
          | }
    ...
    Parallel region 1 in main
    ...
    Loop 2 in main in region 1
    ...
    Loop 3 in main at line 4 in loop 2
    ...
    Loop 4 in main in region 1
    ...
    Loop 5 in main at line 8 in loop 4

If you add a max streams pragma to one of the loops, they are no longer placed in the same region:

          | for (int i = 0; i < size_foobar; i++) {
      3 P |     bar[i] = size_foobar - i;
          | }
          |
          | #pragma mta max 50 streams per processor
          | for (int i = 0; i < size_foobar; i++) {
      6 P |     foo[i] += bar[i+c]/2;
          | }
    ...
    Parallel region 1 in main
    ...
    Loop 2 in main in region 1
    ...
    Loop 3 in main at line 4 in loop 2
    ...
    Parallel region 4 in main
    Using max 50 streams per processor
    ...
    Loop 5 in main in region 4
    ...
    Loop 6 in main at line 9 in loop 5

Notice that canal also tells us that the requested maximum was applied to region 4, which is the region that contains the loop with the max streams pragma.
However, when you add the may merge option these two loops remain in the same region:

          | for (int i = 0; i < size_foobar; i++) {
      3 P |     bar[i] = size_foobar - i;
          | }
          |
          | #pragma mta max 50 streams per processor may merge
          | for (int i = 0; i < size_foobar; i++) {
      5 P |     foo[i] += bar[i+c]/2;
          | }
    ...
    Parallel region 1 in main
    Using max 50 streams per processor
    ...
    Loop 2 in main in region 1
    ...
    Loop 3 in main at line 4 in loop 2
    ...
    Loop 4 in main in region 1
    ...
    Loop 5 in main at line 9 in loop 4

Note that the compiler has placed both loops into the same region and that the stream limit was applied to the entire region. If multiple limits are specified for the same region, the compiler uses the smallest limit.

Two restrictions apply to the use of this pragma:

• You cannot use this pragma with loop future loops.

• If this pragma is used within the same region as a use n streams pragma with a conflicting value (for example, a use value that is higher than the max value), the max n streams per processor pragma will take precedence over the use n streams pragma.

#pragma mta no mem init
This directive affects only the declaration statement immediately following the directive and tells the compiler not to specially initialize the full/empty bit (or bits) of any sync- or future-qualified variables defined in that declaration statement. The directive affects only the definition of variables, including class instance variables; it may not be used on field declarations inside classes.
For example:

    struct C {
        /* note that a '#pragma mta no mem init' would be ineffective here */
        sync int k;
    };
    main() {
    #pragma mta no mem init
        static C c; /* use the pragma on the instance of the class
                       rather than on the class definition */
    }

When the no mem init directive is not used, the compiler initializes the full/empty bit of a sync-qualified variable to full if the variable itself is initialized or to empty if the variable itself is not initialized. When the no mem init directive is used immediately before a declaration statement, the full/empty bits for any variables defined in that declaration are initialized to full if the variable itself is initialized. If the variable itself is not initialized, the initial state of the full/empty bit is undefined (although, in practice, uninitialized variables stored as static or global variables end up with their full/empty bit initialized to full). For example:

    /* full-empty bit is set to full for a[0] and empty for a[1]. */
    sync int a[2]={0};
    #pragma mta no mem init
    /* full-empty bit is set to full for b[0] and is undefined for b[1]. */
    sync int b[2]={0};
    main(){}

#pragma mta no scalar expansion
This directive instructs the compiler not to expand scalar variables to vector temporaries in the next loop. Such expansion allows you to distribute the loop to enhance available parallelism or make effective use of registers. However, if the loop iterates only a few times, the increase in memory usage for the expansion may outweigh the benefits. In this case, you can use the no scalar expansion pragma to prevent expansion. For example, in the following code, the use of no scalar expansion ensures that the definition of T and its use remain in the same loop.
    void no_scalar_example(double X[], const int N) {
        extern double Y[], Z[];
    #pragma mta no scalar expansion
        for (int i = 0; i < N; i++) {
            const double T = Y[i*2];
            X[i] = T + Z[i*3];
        }
    }

#pragma mta once
This directive, when placed inside an included file, instructs the C preprocessor to include this file only once in any single compilation unit regardless of the number of #include directives encountered. In the following example, the file foo.h is included in the file foo.c one time only.

    file foo.h:
    #pragma mta once
    int i;

    file foo.c:
    #include "foo.h"
    #include "foo.h"
    #include "foo.h"
    main() { }

This directive may also be specified as #pragma once. The directive may occur at any point in the file to be included.

#pragma mta single round required
This directive specifies that the compiler generate a fused multiply-add instruction for every expression (or subexpression) of the form X + Y*Z, X - Y*Z, or Y*Z - X. This selection can be ambiguous, as shown in the following:

    A = B*C + D*E

In this case, the compiler is forced to choose one of two possible implementations. To avoid ambiguity when control of rounding is important, you should use a sequence of simpler assignments to make the meaning clear. The scope of this directive is the entire source file. The use of this directive overrides the -no_mul_add compiler flag and the #pragma mta fused muladd off directive.

#pragma mta trace [on|off|default]
Enables or disables tracing of functions or returns to the default heuristic if trace default is used. In order to actually use the tracing information, however, a compiler flag must be set. By default, a heuristic is used to decide whether to trace a function based upon its size. This directive remains in effect until end-of-file or until overridden by another directive of the same type. This directive affects any function whose beginning follows the directive textually in the current file.
#pragma mta trace level [int-const]
This directive enables the tracing of functions that contain at least int-const lines, and disables the tracing of functions that contain fewer lines. This directive is disabled unless either the -trace or -trace_level option was specified on the command line. But after it is enabled, this directive takes precedence over the -trace and -trace_level command-line options. This directive remains in effect until end-of-file or until overridden by another directive of the same type. This directive affects any function whose beginning follows the directive textually in the current file.

#pragma mta trace "string-name"
This directive generates a user-defined tracepoint in the executable code. The tracepoint generated is named the value passed in string-name. Using the -notrace option on the compiler command line causes this directive to be ignored. For more information, see Cray XMT Performance Tools User's Guide.

#pragma mta update
This directive tells the compiler that the next statement is an update to a variable, and that the update should be done atomically. By default, the compiler does not necessarily make updates atomic. Using this directive does not place any restrictions on code movement around this update statement such as would occur if the variable were declared to be a sync-qualified variable. The variable to be updated may be of any simple arithmetic or logical type. The variable to be updated must occur as the target on the left side of the statement and must occur exactly once as a subexpression on the right side of the statement. For example,

    void update_example(double A[], int i, int j){
        extern double V;
        extern double X;
        // This is allowed
    #pragma mta update
        V = 1.0 + X + 3.0*V;
        // This is allowed
    #pragma mta update
        A[i] = A[i] + A[j];
        // But this is not allowed
    #pragma mta update
        A[i] = A[i] + A[i]; // compiler reports an error
    }

This directive applies to the next statement only.

The following four directives control how the compiler parallelizes the loop that immediately follows.

#pragma mta block schedule
When this directive appears before a loop that the compiler parallelizes, each thread assigned to the execution of the loop performs a contiguous subset of the total iterations. Each thread executes the same number of iterations, within 1. For example, if 100 iterations are performed by 20 threads, the first thread executes the first 5 iterations of the loop, the second thread executes the next 5 iterations, and so forth.
#pragma mta block dynamic schedule
This scheduling method combines aspects of both block and dynamic scheduling. At execution time, threads are assigned one block of iterations at a time through the use of a shared counter. After completing an assigned block, each thread receives its next block by accessing the counter. The number of blocks executed by each thread depends on the execution time of the particular iterations in the blocks assigned to the thread.

#pragma mta interleave schedule
When this directive appears before a loop that the compiler parallelizes, each thread assigned to the execution of the loop performs a subsequence of the total iterations, where the members of the subsequence are regularly spaced. Each thread executes the same number of iterations, within 1. For example, if 100 iterations are performed by 20 threads, the first thread executes iteration 1, iteration 21, iteration 41, and so forth. This scheduling leads to better load balancing for triangular loops. For example:

    void interleave_example(const double X[100][100], const double Y[100],
                            double Z[100], const int N) {
    #pragma mta interleave schedule
        for (int i = 0; i < N; i++) {
            double sum = 0.0;
            for (int j = 0; j < i; j++) {
                sum += X[i][j] * Y[j];
            }
            Z[i] = sum;
        }
    }

Here, a block schedule results in poor load balancing with the first threads finishing before the last threads. With an interleaved schedule, the work is much better balanced.

#pragma mta dynamic schedule
At execution time, threads are assigned one iteration at a time through the use of a shared counter. After completing an assigned iteration, each thread receives its next iteration by accessing the counter. The number of iterations executed by each thread depends on the execution time of the particular iterations assigned to the thread. One thread may happen to receive all the long-running iterations, and thus might execute fewer iterations than any other thread.
This method is preferred when the execution time for individual iterations may vary greatly, although its overhead makes it less desirable for general use.

#pragma mta use n streams

This directive indicates that the compiler should request at least n threads per processor for the next loop. When multiple loops are contained in the same parallel region, the largest n is used. In the absence of a directive, the compiler determines the number of threads needed to saturate the processor. This directive affects the next loop only.

C.2 Parallelization Directives

The compiler recognizes the following parallelization directives.

#pragma mta parallel [on|off|default|single processor|multiprocessor|future]

This directive enables or disables automatic generation of parallel code for a section of the program and selects the form of parallelism to use. The single processor, multiprocessor, and future flags indicate the type of parallelism to use. The off flag turns off parallelism until it is turned back on or the end of the file is reached. The on flag turns on parallel-code generation using the last specified form of parallelism. The default flag uses the command-line option or the default form of parallelism. By default, automatic generation of multiprocessor parallel code is enabled. This directive applies to whatever follows it textually in the current file. It stays in effect until the end of the file or until another directive of the same kind is encountered. The directive is ignored if the -nopar flag is used on the command line.

#pragma mta recurrence [on|off|default]

This directive enables/disables automatic parallelization of recurrences and reductions. By default, recurrence-relation parallelization is enabled. Recurrence relations are parallelized, however, only in areas in which parallelization is otherwise allowed. This directive applies to whatever follows it textually in the current file.
It stays in effect until the end of the file or until another directive of the same kind is encountered. The directive is ignored if the -nopar flag is used on the command line.

#pragma mta restructure [on|off|default]

This directive enables/disables loop restructuring and loop transformations. By default, loop restructuring is allowed in areas in which parallelization is allowed and it is turned off in areas in which parallelization is not allowed. This directive applies to whatever follows it textually in the current file. It stays in effect until the end of the file or until another directive of the same kind is encountered. The directive is ignored if the -nopar flag is used on the command line.

#pragma mta loop loop_mod[, loop_mod, ...]

This directive takes a comma-separated list of parallelization modes, loop_mod, consisting of no more than one selection from each of the following sets of possible loop modes:

restructure, norestructure
Enables/disables loop restructuring.

recurrence, norecurrence
Allows/disallows automatic parallel processing of recurrences.

single processor, multiprocessor, future, serial
Enables either a single or multiple processor or a future form of parallelism or disables parallelism.

This directive enables the appropriate parallelization mode (or modes) for the next loop only. It is ignored if the -nopar flag is used on the command line.

#pragma mta serial

This directive disables parallelization for a section of the program. It is equivalent to the parallel off directive. It is ignored if the -nopar flag is used on the command line.

C.3 Semantic Assertions

Semantic assertions provide information to the compiler that could be proved true about the program even though that proof is beyond the capabilities of the compiler. Asserting this information often yields more effective compilation. In the following list, the term variable-list refers to a comma-separated list of variable names.
The compiler recognizes the following semantic assertions:

#pragma mta assert can replace variable-list

This directive asserts that it is safe to use scalar replacement of the aggregates (objects or structs) in variable-list and the aggregates pointed to by pointers in variable-list. This pragma is also a request for scalar replacement of those aggregates even if the code was not compiled with the -scalar_replacement option. Items in variable-list must be aggregates or pointers to aggregates. Any pointers must either be marked with a noalias pragma or qualified with the restrict type qualifier. In addition, pointers must point only to a single aggregate during a given invocation of the routine in which the pragma appears. See Scalar Replacement Section of Optimization Guide for more information.

#pragma mta assert loop can replace variable-list

This directive asserts that it is safe to use scalar replacement of the aggregates (objects or structs) in variable-list and the aggregates pointed to by pointers in variable-list for the loop that immediately follows the pragma. This pragma is also a request for scalar replacement of those aggregates even if the code was not compiled with the -scalar_replacement option. Items in variable-list must be aggregates or pointers to aggregates. Any pointers must either be marked with a noalias pragma or qualified with the restrict type qualifier. In addition, pointers must point only to a single aggregate within the loop. See Scalar Replacement Section of Optimization Guide for more information.

#pragma mta assert no replace variable-list

This directive tells the compiler not to use scalar replacement of the aggregates (objects or structs) in variable-list and any aggregates pointed to by pointers in variable-list. This is useful for fine-tuning files that are compiled with the -scalar_replacement option.
See Scalar Replacement Section of Optimization Guide for more information.

#pragma mta assert parallel

This directive can appear before a loop construct and asserts that the separate iterations of the loop may execute concurrently without synchronization. It does not guarantee that the compiler parallelizes the loop, but it is a strong suggestion to the compiler. This directive affects the next loop only. The directive is ignored if the -nopar flag is used on the command line.

#pragma mta assert local variable-list

This directive can appear inside a loop or inside the body of a function, or at the top of the loop or function. For a loop, it asserts that at the beginning of each iteration, the compiler can treat the listed variables as undefined, and that their values are not referenced after the completion of that iteration. For a function, it asserts that the variables are undefined on entry to the function, and that their values are not referenced after exiting the function. The behavior of this directive is the same regardless of whether the loop or function to which it is attached executes in a parallel or serial context.

    void assert_local_example(double B[], const int N) {
        double A[2];
        for (int i = 0; i < N; i++) {
            #pragma mta assert local A
            A[0] = i;
            A[1] = 2*i;
            B[i] = A[0]*A[1];
        }
    }

In the previous example, the directive asserts that A is used as a scratch array in the loop. This directive must be inside the loop in order to affect the loop.

#pragma mta assert no dependence variable-list
#pragma mta assert nodep variable-list

This directive can appear before a loop construct and asserts that if a word of memory is accessed during execution of the loop through any load or store derived from a variable in variable-list, the word is accessed from exactly one iteration of the loop. You can also use the word nodep in place of no dependence.
For example:

    void nodep_example(const int INDEX[], double IA[100][100], const int N) {
        // You know that INDEX[i] is never 1.
        #pragma mta assert noalias *IA
        #pragma mta assert no dependence *IA
        for (int i = 0; i < N; i++) {
            IA[i][1] = IA[i][INDEX[i]];
        }
    }

#pragma mta assert may reorder variable-list
#pragma mta may reorder variable-list

This directive allows the compiler to reorder accesses of the variables in variable-list with respect to other volatile and global references in the code. This directive is used to remove unnecessary restrictions that may be placed on the order of execution. For example, in the following code, if SYNCARRAY$ is a sync-qualified array, accesses to the various elements of the array are serialized, and the loop is not parallelized:

    void may_reorder_example(sync int SYNCARRAY$[10000]) {
        for (int i = 0; i < 10000; i++) {
            SYNCARRAY$[i] = 0;
        }
    }

However, if we add a #pragma mta may reorder SYNCARRAY$ directive before the loop, each reference to SYNCARRAY$ may occur before or after any of the other references. Explicit serialization is not imposed, and the loop is parallelizable.

    void may_reorder_example(sync int SYNCARRAY$[10000]) {
        #pragma mta may reorder SYNCARRAY$
        for (int i = 0; i < 10000; i++) {
            SYNCARRAY$[i] = 0;
        }
    }

#pragma mta assert may not reorder variable-list
#pragma mta may not reorder variable-list

This directive is used to deactivate the preceding may reorder directive. The following example tells the compiler that accesses to SYNCARRAY$ can be reordered only in the loop shown.

    void maynot_reorder_example(sync int SYNCARRAY$[10000]) {
        int i;
        for (i = 0; i < 10000; i++) {
            #pragma mta may reorder SYNCARRAY$
            SYNCARRAY$[i] = 0;
            #pragma mta may not reorder SYNCARRAY$
        }
    }

#pragma mta assert noalias variable-list
#pragma mta noalias variable-list

This directive tells the compiler that the variables in variable-list are not used as aliases for any other variables.
This information allows the compiler to perform a more accurate dependence analysis of loops involving these variables and to more aggressively parallelize the code. This directive must follow the declaration of the variables in variable-list and must lie within the scope in which these variables are defined. The directive may also take the form #pragma noalias variable-list.

#pragma mta assert par_newdelete

This directive is placed before the definition of a new array to indicate that when the elements of the array are constructed, the constructors should be invoked in parallel. To do this, use the following syntax for automatic or external definitions.

    #pragma mta assert par_newdelete
    aclass foo[100];

In this case, the destructors are not fired in parallel; there is no way to cause destructors to be fired in parallel for these kinds of definitions.

Alternatively, you can use the following syntax for dynamically allocated arrays.

    #pragma mta assert par_newdelete
    foo = new aclass[100];

The directive can also be placed before the deletion of a dynamically allocated array to indicate that when the elements of the array are destructed, the destructors should be invoked in parallel. To do this, use the following syntax:

    #pragma mta assert par_newdelete
    delete [] foo;
    foo = 0;

C.4 Implementation Hints

The following directives provide implementation hints to the compiler about the expected behavior of the program. The intent is to provide guidance for effective optimization.

#pragma mta expect count integer-expression

This directive can appear before a loop construct. The integer-expression is a constant expression and serves as an estimate of the number of times the loop will iterate. The compiler optimizes the implementation of the loop based on this value. A constant integer-expression is one that can be evaluated completely by the front end of the compiler.
It may not use the following:

• An expression that syntactically looks like a function call (such as sizeof or C++-style type conversions)
• Floating-point literals
• GNU extensions

It may refer to members of enumerations.

#pragma mta expect [true|false]

This directive can appear before a logical if and specifies the expected value of the associated predicate. You can use this directive for branch prediction and choosing the best parallel implementation of a containing loop depending on sparse versus dense branching.

#pragma mta expect case n

This directive is similar to the expect [true|false] directive except that n is an integer. This directive must only appear before a switch statement. It tells the compiler that case arm n is expected. The compiler tests for case n first, and all other cases after that. n must be an integer constant, in any radix. It may not be an integer expression, nor may it be a member of an enumeration.

#pragma mta expect (predicate)

This directive can appear before any executable statement and suggests that the compiler should optimize code near that point. This suggestion is based on the assumption that the predicate typically evaluates to true. This directive is deprecated and should not be used.

#pragma mta expect parallel

Deprecated form of expect parallel context directive that follows.

#pragma mta expect parallel context

This directive is inserted immediately before a function declaration. It tells the compiler that the following function is expected to be called in a highly parallel context. In this case, the compiler reduces the total number of instructions issued by the function rather than the serial execution time. By default, the compiler assumes that a function is called in a serial context unless the function is marked with the expect parallel context directive or the -parcontext flag was used on the compiler command line. This directive affects the next function only.
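The expect directives described so far in this section might be combined as in the following sketch. The function count_small, its parameters, and the estimate of 1000 iterations are hypothetical. The pragma placement follows only the descriptions above and has not been checked against a Cray XMT compiler. Other compilers generally ignore pragmas they do not recognize, possibly with a warning, so the function returns the same result there.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical example: counts the entries of `values` that are below `limit`.
// The pragmas are hints only and are not meant to change the result.

// Hint: this function is expected to be called from highly parallel code.
#pragma mta expect parallel context
std::size_t count_small(const int *values, std::size_t n, int limit) {
    assert(values != nullptr || n == 0);
    std::size_t hits = 0;
    // Hint: the loop is expected to iterate about 1000 times.
    #pragma mta expect count 1000
    for (std::size_t i = 0; i < n; i++) {
        // Hint: the predicate is expected to be false for most elements.
        #pragma mta expect false
        if (values[i] < limit) {
            hits++;
        }
    }
    return hits;
}
```

The estimate passed to expect count is written as a plain integer literal, which satisfies the restriction that the expression be evaluable by the compiler front end.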
#pragma mta expect serial context

This directive is inserted immediately before a function declaration. It tells the compiler that the following function is expected to be called in a serial context. In this case, the compiler reduces the serial execution time for the function. By default, the compiler assumes that a function is in a serial context unless the -parcontext flag was used on the compiler command line or the function is marked with the expect parallel context directive in the code. The expect serial context directive overrides the -parcontext compiler flag for the function immediately following the directive. This directive affects the next function only.

Condition Codes [D]

You can test the condition codes generated by an expression by using the MTA_TEST_CC intrinsic. The eight possible condition code values and their default meanings are shown in the following table. The Examples column shows the operations that meet the criteria for the condition code, where 0, p, and n stand for zero, a positive integer, and a negative integer, respectively. For more information, see Testing Expressions Using Condition Codes on page 34 and Chapter 4 of the Cray XMT Principles of Operation.

Table 2. Condition Codes

Name              Meaning                   Examples
COND_ZERO_NC      Zero, no carry            0 = 0+0
COND_NEG_NC       Negative, no carry        n = p+n, n = p-p
COND_POS_NC       Positive, no carry        p = p+p, p = p-n
COND_OVFNAN_NC    Overflow/NaN, no carry    n = p+p, n = p-p
COND_ZERO_C       Zero, carry               0 = n+p, 0 = n-n
COND_NEG_C        Negative, carry           n = n+n, n = n-p
COND_POS_C        Positive, carry           p = n+p, p = n-n
COND_OVFNAN_C     Overflow/NaN, carry       p = n+n, p = n-p

Most of the important condition masks have one or more names. The named condition masks are shown in Table 3. For more information, see Cray XMT Programming Model.

Table 3.
Condition Masks

Name         Description

Condition Mask: Manifest
IF_ALWAYS    Always
IF_NEVER     Never

Condition Mask: Equality
IF_EQ        y = z (integer, unsigned, float)
IF_ZE        x = 0 (integer, unsigned, float)
IF_F         x = 0 (logical)
IF_NE        y != z (integer, unsigned, float)
IF_NZ        x != 0 (integer, unsigned, float)
IF_T         x != 0 (logical)

Condition Mask: Integer Comparison
IF_ILT       y < z (integer)
IF_IGE       y >= z (integer)
IF_IGT       y > z (integer)
IF_ILE       y <= z (integer)
IF_IMI       x < 0 (integer)
IF_IPZ       x >= 0 (integer)
IF_IPL       x > 0 (integer)
IF_IMZ       x <= 0 (integer)

Condition Mask: Unsigned Comparison
IF_ULT       y < z (unsigned)
IF_UGE       y >= z (unsigned)
IF_UGT       y > z (unsigned)
IF_ULE       y <= z (unsigned)

Condition Mask: Float Comparison
IF_FLT       y < z (float)
IF_FGE       y >= z (float)
IF_FGT       y > z (float)
IF_FLE       y <= z (float)

Condition Mask: Other Tests
IF_IOV       x overflowed (integer)
IF_FUN       y and z are unordered (float)
IF_CY        Carry
IF_NC        No carry

Condition Mask: Specific Conditions
IF_0         Zero, no carry
IF_1         Negative, no carry
IF_2         Positive, no carry
IF_3         Overflow/NaN, no carry
IF_4         Zero, carry
IF_5         Negative, carry
IF_6         Positive, carry
IF_7         Overflow/NaN, carry
IF_N0        Not Zero, no carry
IF_N1        Not Negative, no carry
IF_N2        Not Positive, no carry
IF_N3        Not Overflow/NaN, no carry
IF_N4        Not Zero, carry
IF_N5        Not Negative, carry
IF_N6        Not Positive, carry
IF_N7        Not Overflow/NaN, carry

Data Types [E]

This chapter provides information about the C and C++ language data types that you can use with Cray XMT compilers. The floating-point types are float, double, and long double. Their sizes are 4, 8, and 16 bytes, respectively. The integer types short and unsigned short are each 4 bytes long. The data types int, long, long long, and their unsigned equivalents are each 8 bytes long.
The compiler flag -short16 converts all short and unsigned short integers to 2 bytes. The compiler flag -i4 converts all short and unsigned short integers to 2 bytes and all int and unsigned int to 4 bytes. The two character types char and unsigned char are each 1 byte long. Additionally, the C++ compiler supports a 1-byte boolean type, bool, and the boolean constants true and false. The compiler flag -no_bool turns off recognition of these three keywords.

The Cray XMT C and C++ compilers also support the ten nonstandard integer types in the following list. The -short16 and -i4 compiler flags do not affect the size of these types, so it is preferable that you use these in exported include files.

__short16             A 2-byte (16-bit) value.
unsigned __short16    A 2-byte (16-bit) value.
__short32             A 4-byte (32-bit) value.
unsigned __short32    A 4-byte (32-bit) value.
__int16               A 2-byte (16-bit) value.
unsigned __int16      A 2-byte (16-bit) value.
__int32               A 4-byte (32-bit) value.
unsigned __int32      A 4-byte (32-bit) value.
__int64               An 8-byte (64-bit) value.
unsigned __int64      An 8-byte (64-bit) value.

Keywords [F]

The C and C++ languages reserve certain words for use as keywords. You cannot use these words for any other purpose. For example, you cannot use them as identifiers such as variable names. Some of these reserved words are required by the standards for the C and C++ languages; others support programming on the Cray XMT.

Table 4. C/C++ Keywords Recognized by the Cray XMT Compiler

auto        default    float       return    switch      while
break       do         for         short     typedef
case        double     goto        signed    union
char        else       int         sizeof    unsigned
const       enum       long        static    void
continue    extern     register    struct    volatile

When you use the -traditional compiler switch on the C command line, it disables the keywords const, signed and volatile.

Table 5.
Standard C++ Keywords Recognized by the Cray XMT Compiler

and       const_cast      namespace    protected           try
and_eq    delete          new          public              typeid
bitand    dynamic_cast    not          reinterpret_cast    typename
bitor     explicit        not_eq       static_cast         using
bool      false           operator     template            virtual
catch     friend          or           this                wchar_t
class     inline          or_eq        throw               xor
compl     mutable         private      true                xor_eq

The -no_bool compiler switch disables the bool, false and true keywords. The -no_wchar compiler switch disables the wchar_t keyword. The -cfront compiler switch disables the bool, explicit, false, true and typename keywords. The -no_alternative_tokens compiler switch disables the alternate operator keywords and, and_eq, bitand, bitor, compl, not, not_eq, or, or_eq, xor, and xor_eq.

In addition to the keywords required by the language standards, the Cray XMT platform uses several additional reserved words. Most of the additional keywords reserved by Cray for use on the Cray XMT have two forms: one beginning with an alphabetic character and one beginning with a double underscore (__). Use the -no_mta_ext compiler switch to disable Cray XMT keywords beginning with a letter of the alphabet. However, Cray XMT keywords beginning with a double underscore are not affected by the -no_mta_ext compiler switch. In addition, the keywords __int16, __int32, __int64, __short16 and __short32 are not affected by the -i4 and -short16 compiler switches. For this reason, you sometimes see the double underscore format in header files to preserve the meaning of the keywords.

When using the type qualifier keywords to qualify a pointer type, follow the same rules as for the standard C and C++ type qualifiers. For example, in the following declaration:

    int * sync f;

f is a sync variable of type pointer to int, but in the following declaration:

    sync int * f;

f is a pointer to a sync variable of type int.
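Because the Cray XMT qualifiers follow the placement rules of the standard qualifiers, the same distinction can be checked on any C++11 compiler by substituting const for sync. The sketch below is only an analogy: the sync keyword itself requires the Cray XMT compiler, and the alias names are made up for illustration.

```cpp
#include <type_traits>

// Analogy only: `const` stands in for `sync`.
using QualifiedPointer   = int * const;  // like `int * sync f;`  the pointer itself is qualified
using PointerToQualified = const int *;  // like `sync int * f;`  the pointed-to int is qualified

// A qualifier written after the `*` applies to the pointer.
static_assert(std::is_const<QualifiedPointer>::value,
              "qualifier after * applies to the pointer");

// A qualifier written before the `*` applies to the pointee, not the pointer.
static_assert(!std::is_const<PointerToQualified>::value,
              "the pointer itself is unqualified");
static_assert(std::is_const<std::remove_pointer<PointerToQualified>::type>::value,
              "the pointee is qualified");
```

All three checks are evaluated at compile time, so the file compiling at all confirms the reading of the two declarations given above.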
The following reserved words have been added by Cray to both the C and C++ languages for use on the Cray XMT platform.

future __future
Both a type qualifier and a statement. Future variables are initially set to a full state. A future variable is set to an empty state when the future statement executes and set to a full state when the return statement of the future executes. A read or write operation runs successfully when a future variable is set to a full state and leaves the variable set to a full state. For an example that shows the use of the future variable and future statement, see Cray XMT Programming Model.

__int16
Integer type. A 2-byte value; may be signed or unsigned. See Appendix E, Data Types on page 137.

__int32
Integer type. A 4-byte value; may be signed or unsigned. See Appendix E, Data Types on page 137.

__int64
Integer type. An 8-byte value; may be signed or unsigned. See Appendix E, Data Types on page 137.

restrict
Type qualifier. Similar in function to the noalias compiler directive. See Semantic Assertions on page 125. When you declare a pointer with the restrict type, it indicates that the code does not use aliases for that pointer and the compiler can perform additional optimizations, such as the implicit parallelization of loops.

__short16
Integer type. A 2-byte value; may be signed or unsigned. See Appendix E, Data Types on page 137.

__short32
Integer type. A 4-byte value; may be signed or unsigned. See Appendix E, Data Types on page 137.

sync __sync
Type qualifier. The system atomically reads sync variables when in a full state and then sets them to an empty state. The system atomically writes sync variables when in an empty state and then sets them to a full state. The system automatically sets uninitialized sync variables to an empty state unless you use the -no_purge compiler switch; the system sets initialized sync variables to a full state.

task __task
Reserved for future use.
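The full/empty behavior described for sync variables can be modeled in portable C++ for reasoning purposes. The class below is a hypothetical single-threaded model, not the hardware mechanism and not a Cray API: where a real sync access would wait for the required state, the model simply reports failure.

```cpp
#include <cassert>

// Hypothetical model of one sync-qualified int and its full/empty state.
// A real sync access waits until the required state is reached; this
// single-threaded model returns false instead of waiting.
class SyncIntModel {
public:
    // Uninitialized sync variables start empty (the default per the text).
    SyncIntModel() : value_(0), full_(false) {}
    // Initialized sync variables start full.
    explicit SyncIntModel(int initial) : value_(initial), full_(true) {}

    // A write succeeds only when empty, then leaves the variable full.
    bool try_write(int v) {
        if (full_) return false;
        value_ = v;
        full_ = true;
        return true;
    }

    // A read succeeds only when full, then leaves the variable empty.
    bool try_read(int &out) {
        if (!full_) return false;
        out = value_;
        full_ = false;
        return true;
    }

    bool is_full() const { return full_; }

private:
    int value_;
    bool full_;
};

// Round trip: a write fills an empty variable and the next read empties it.
inline int write_then_read(SyncIntModel &s, int v) {
    const bool wrote = s.try_write(v);
    assert(wrote);  // the caller must pass an empty variable
    int out = 0;
    const bool read = s.try_read(out);
    assert(read);
    (void)wrote;
    (void)read;
    return out;
}
```

In this model a second consecutive write or a second consecutive read fails, which mirrors the alternation between full and empty states that the sync qualifier enforces.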
The following reserved words have been added by Cray to the C language for use on the Cray XMT platform. Keywords beginning with an underscore have also been added by Cray to the C++ language. The keywords new, delete, and protected are required by the C++ standard and did not need to be added to that language.

new __new
Unary operator; has the same format as the new operator in the C++ language. Allocates space for an object of the specified type, initializes the full-empty bit of any sync or future variables that the new object contains, and returns the address of the new object. The system initializes the sync variable to an empty state and the future variable to a full state. The actual contents of these variables, as for any variables contained by the new object, is undefined.

delete __delete
Unary operator; has the same format as the delete operator in the C++ language. Deallocates space that was previously allocated using the new operator.

protected __protected
Reserved for future use.

MTA_PARAMS [G]

The environment variable MTA_PARAMS is used by the Cray XMT user runtime. The following list contains the values that you can set for MTA_PARAMS.

debug_data_prot
Waits for the debugger to attach rather than exiting when a data protection or poison error occurs. This parameter is useful while troubleshooting a specific problem. However, Cray does not recommend that you use this parameter during normal operations because any error that occurs causes the runtime to wait for the debugger to attach. This results in the runtime holding on to resources previously used by the program.

do_backtrace
Dumps registers of all active streams when a trap occurs. This parameter may be useful during troubleshooting, although it generates a lot of information. If the runtime system has become corrupted, the registers may fail to dump.

echo
Prints a list of parameters to the screen.
This parameter toggles on and off.

exit_on_trace_fail
Sets the default behavior to kill the program execution when tracing fails to initialize.

ft_traps options
Enables various floating-point traps depending upon which options you set. See the section of Programming Considerations for Floating-point Operations in Cray XMT Programming Model. You can select from the following list of options:

i    Invalid. Traps invalid floating-point numbers.
z    Zero-divide. Traps operations that are attempting to divide a floating-point number by 0.0. This type of operation would create a NaN.
o    Overflow. Traps overflows that occur.
u    Underflow. Traps underflows that occur. Underflows produce a rounded result smaller in magnitude than 0x0010000000000000, or about 2.225e-308.
x    Inexact. Traps subnormal numbers.

max_readypool_retries n
Sets the maximum number n of retries that an idle thread can take when checking random ready pools for new work.

mmap_buffer_size n
Sets the variable size of the persistent mmap buffers, where n is the size in words. The maximum value, which is also the default, is 16,777,216 words (16 GB). The size of the persistent buffers determines how much tracing data can be gathered before requiring a dump of the gathered data to the trace.out file.

must_dump_size n
Specifies the minimum number of words that must be present in a trace buffer before allowing the trace buffer to dump to the mmap buffer. The default value is 512 words. If an application terminates prematurely and the trace.out file is missing information, reduce the size of this buffer to force more frequent dumping.

num_procs n
Sets the maximum number of processors to use. This parameter is the same as using the command mtarun -m n.

num_readypools n
Sets the maximum number n of ready pools available for the entire task. Ready pools are used to schedule futures.
no_prereserve
Prevents the runtime from reserving 3 streams to use for attaching the debugger.

pc_hash n, m, l
Specifies the hash size n, age threshold m, and dump threshold l of an event. The hash size determines the number of event types that can be hashed at one time. The age threshold determines the age at which an event is considered stale, in which case it will be discarded rather than reported. The age threshold also determines the frequency with which events are captured in event records. The dump threshold is the minimum number of events that must have been hashed to a particular location before that location is captured as an event record when the next age threshold sample is taken.

stream_limit n
Sets the maximum number of streams to use on each processor. The system-imposed limit is 100 streams. However, while debugging a program, it may be easier to perform debugging if this parameter is set to a smaller number. The minimum value is 5.

LUC API Reference [H]

The XMT-PE contains two user-level libraries for LUC, libluc.a, that use a C++ interface. One version of libluc.a is built for Linux applications and one is built for MTK applications. Both versions present the same interface to LUC applications. For LUC applications, you use the

header file.

H.1 LucEndpoint Class

The LucEndpoint class defines a LucEndpoint object. The LucEndpoint class provides the interface methods that the application uses to call functions on a remote server.

    class LucEndpoint {
    public:
        /*********************************************
         * Shared functions
         *********************************************/

        // initialize the service and start the client or server thread
        virtual luc_error_t startService(uint_t threadCount=1,
                                         uint_t myRequestedPid=PTL_PID_ANY);

        // stop the client or server thread and shutdown the service
        virtual luc_error_t stopService(void);

        // returns the endpoint ID
        virtual luc_endpoint_id_t getMyEndpointId(void);

        // set per-endpoint configuration values
        virtual luc_error_t setConfigValue(luc_config_key_t key, uint64_t value);

        // read per-endpoint configuration values
        virtual luc_error_t getConfigValue(luc_config_key_t key, uint64_t *value);

        /*********************************************
         * Client functions
         *********************************************/

        // client asynchronous RPC
        virtual luc_error_t remoteCall(luc_endpoint_id_t serverEndpoint,
                                       luc_service_type_t serviceType,
                                       int serviceFunctionIndex,
                                       void *userData,
                                       size_t userDataLen,
                                       void * userHandle,
                                       LUC_Completion_Handler userCompletionHandler);

        // client synchronous RPC
        virtual luc_error_t remoteCallSync(luc_endpoint_id_t serverEndpoint,
                                           luc_service_type_t serviceType,
                                           int serviceFunctionIndex,
                                           void *inputData,
                                           size_t inputDataLen,
                                           void *outputData,
                                           size_t *outputDataLen);

        /*********************************************
         * Server functions
         *********************************************/

        virtual luc_error_t registerRemoteCall(luc_service_type_t serviceType,
                                               int serviceFunctionIndex,
                                               LUC_RPC_Function_InOut theFunction);
    };

H.2 luc_allocate_endpoint Function

Use luc_allocate_endpoint to construct LucEndpoint objects.
The default value for LucServiceType is LUC_CLIENT_SERVER. See LUC Type Definitions on page 159.

    LucEndpoint *luc_allocate_endpoint(LucServiceType_t etype);

H.3 LUC Methods

The LucEndpoint class uses the following methods:

• startService
• stopService
• getMyEndpointId
• remoteCall
• remoteCallSync
• registerRemoteCall
• setConfigValue
• getConfigValue

H.3.1 startService Method

Initializes the LucEndpoint object.

Syntax

    luc_error_t startService(uint_t threadCount=1,
                             ptl_pid_t requestedPid = PTL_PID_ANY);

This method puts the object into a state where it can initiate and respond to RPC requests. It initializes internal network components and creates the required number of threads. The MTK version of the library allocates I/O buffers for the endpoint as part of this initialization. For client-only objects, the threadCount parameter is ignored. The MTK version of the library ignores both parameters.

Parameters

threadCount
Specifies the number of server threads that are assigned to an object.
Note: The MTK LUC library ignores the threadCount parameter.

requestedPid
Specifies a Portals process ID to use when setting up the endpoint. By default, the LUC library chooses a Portals process ID to use.
Note: MTK ignores the requestedPid parameter.

Return Codes

LUC_ERR_OK
The service was started.

LUC_ERR_ALREADY_STARTED
User attempted to startService on a previously started LucEndpoint object.

H.3.2 stopService Method

Stops the LucEndpoint object.

Syntax

    luc_error_t stopService(void);

Undoes the work of startService. stopService waits for running threads to finish, then terminates them. It frees up any memory and network resources associated with the endpoint that were allocated in a previous startService call.

Return Codes

LUC_ERR_OK
The service was stopped.

LUC_ERR_NOT_STARTED
The service has not yet been started. To start the service, use the startService method.
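The return codes listed for startService and stopService imply a simple two-state lifecycle. The sketch below models only that bookkeeping. MockEndpoint and the local luc_error_t enum are hypothetical stand-ins defined here so that the example compiles without the LUC library; the real type, its values, and the real class come from the LUC header.

```cpp
#include <cassert>

// Hypothetical stand-in: the real luc_error_t and its values are defined
// by the LUC library, not here.
enum luc_error_t {
    LUC_ERR_OK,
    LUC_ERR_ALREADY_STARTED,
    LUC_ERR_NOT_STARTED
};

// Models only the started/stopped state described for startService and
// stopService; it creates no threads and touches no network resources.
class MockEndpoint {
public:
    MockEndpoint() : started_(false) {}

    luc_error_t startService() {
        if (started_) return LUC_ERR_ALREADY_STARTED;
        started_ = true;
        return LUC_ERR_OK;
    }

    luc_error_t stopService() {
        if (!started_) return LUC_ERR_NOT_STARTED;
        started_ = false;
        return LUC_ERR_OK;
    }

private:
    bool started_;
};

// Typical bracketed use on a stopped endpoint: start, do work, stop.
inline void with_service(MockEndpoint &ep) {
    luc_error_t rc = ep.startService();
    assert(rc == LUC_ERR_OK);
    // ... remote calls would be issued here ...
    rc = ep.stopService();
    assert(rc == LUC_ERR_OK);
    (void)rc;
}
```

Checking the code returned by each call, as with_service does, catches a double start or a stop on an endpoint that was never started.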
H.3.3 getMyEndpointId Method

Returns the ID of the LucEndpoint object.

Syntax

    luc_endpoint_id_t getMyEndpointId(void);

Gets the ID of the endpoint. This method is valid only after startService has returned.

Return Codes

This method returns the endpoint's identifier on successful completion.

LUC_ENDPOINT_INVALID
    The endpoint is invalid because the service has not yet been started. To start the service, use the startService method.

H.3.4 remoteCall Method

Makes an asynchronous remote procedure call.

Syntax

    luc_error_t remoteCall(luc_endpoint_id_t serverEndpoint,
                           luc_service_type_t serviceType,
                           int serviceFunctionIndex,
                           void *userData,
                           size_t userDataLen,
                           void * userHandle,
                           LUC_Completion_Handler userCompletionHandler);

The asynchronous RPC mechanism is useful in cases where the caller does not need assurance that the remote call actually happened. Locally detected errors may be returned, but remote errors are not returned directly. Remote-side success or failure is reported only if the caller provides a completion handler. The completion handler is guaranteed to execute once and only once, when the remote call is known to have executed or has been abandoned.

This method call is valid only on started objects. Multiple concurrent callers of this method and the synchronous version are supported.

Parameters

serverEndpoint
    Specifies the endpoint identifier for the desired server of this RPC.

serviceType
serviceFunctionIndex
    These parameters specify the particular remote function to invoke on a server. The server uses the same values in its registerRemoteCall method.

userData
userDataLen
    Specify an optional pointer to input data and the length of the data.

userHandle
    Contains the value passed to the specified userCompletionHandler when it is invoked.

userCompletionHandler
    Contains a function pointer for a function to be called when the remote procedure call completes.
Return Codes

LUC_ERR_OK
    The remote procedure call was launched.

LUC_ERR_IO_ERROR
    An underlying transport error occurred. The remote procedure call may or may not have launched.

LUC_ERR_TOO_LARGE
    The remote procedure is trying to return more data than the client can accept. This return code is generated when servers return data to an asynchronous caller.

LUC_ERR_NOT_STARTED
    The service has not yet been started. To start the service, use the startService method.

LUC_ERR_BAD_ADDRESS
    Indicates an attempt to use a NULL input or output buffer while specifying a non-zero size for the corresponding buffer. This error is returned by the remoteCall and remoteCallSync methods.

H.3.5 remoteCallSync Method

Makes a synchronous remote procedure call.

Syntax

    luc_error_t remoteCallSync(luc_endpoint_id_t serverEndpoint,
                               luc_service_type_t serviceType,
                               int serviceFunctionIndex,
                               void *inputData,
                               size_t inputDataLen,
                               void *outputData,
                               size_t *outputDataLen);

The synchronous procedure call is used in synchronous programming models or in cases where the caller expects the remote function to return data.

This method is valid only on started objects. Multiple concurrent callers of this method and the asynchronous version are supported.

Parameters

serverEndpoint
    Specifies the endpoint identifier for the desired server of this RPC.

serviceType
serviceFunctionIndex
    Specify the particular remote function to invoke on a server and its service type. The server uses the same values in its registerRemoteCall method.

inputData
inputDataLen
    Specify an optional pointer to input data and the length of the data.

outputData
    (input parameter) Specifies an optional buffer for return data from the RPC.

outputDataLen
    (input/output parameter) As an input parameter, specifies the maximum amount of data that the application will accept from the RPC (the allocated size of outputData).
    When remoteCallSync returns, this value will be changed to the actual amount of returned data.

Return Codes

LUC_ERR_OK
    The remote procedure call was completed. Data may have been returned.

LUC_ERR_NOT_STARTED
    The service has not yet been started. This error is returned by the stopService, remoteCall, and remoteCallSync methods. To start the service, use the startService method.

LUC_ERR_BAD_ADDRESS
    Indicates an attempt to use a NULL input or output buffer while specifying a non-zero size for the corresponding buffer.

asynchronous return codes
    Any of the return codes and completion handler codes of the asynchronous call may be returned to indicate failures. Refer to the return codes section in remoteCall Method on page 151.

application defined return codes
    Any return codes defined by the application.

H.3.6 registerRemoteCall Method

Registers a remote application function with the server.

Syntax

    luc_error_t registerRemoteCall(luc_service_type_t serviceType,
                                   int serviceFunctionIndex,
                                   LUC_RPC_Function_InOut theFunction);

This method registers the specified function to be executed whenever an incoming request matches the specified service type and function index.

This method operates independently of startService and stopService. It may be called for an object in any state. (Remote procedure calls are unregistered only when the object is destroyed.)

Parameters

serviceType
    Specifies the service type of the service being provided.

serviceFunctionIndex
    Specifies the specific function (by index) being provided by theFunction.

theFunction
    Specifies the application defined function to be called by LUC when RPC requests arrive at the endpoint with a matching serviceType and serviceFunctionIndex.

Return Codes

LUC_ERR_OK
    The function was registered successfully.

LUC_ERR_BAD_PARAMETER
    The specified service type or function index is out-of-range.
LUC_ERR_ALREADY_REGISTERED
    The specified service type or function index is already occupied.

LUC_ERR_OTHER
    The library can handle only a fixed number of function registrations for each server object.

H.3.7 setConfigValue Method

Sets configuration values for LUC.

Syntax

    luc_error_t setConfigValue(luc_config_key_t key, uint64_t value);

Parameters

key
    Identifies the configuration option to set. The following options can be set:

LUC_CONFIG_LOG_LEVEL
    This configuration key alters the amount of LUC internal debugging information that is printed to standard error.
    Values to use for this option:
    LUC_DBG_NONE: The library logs assertions that are fatal to the application.
    LUC_DBG_LOW: The library logs fatal assertions and errors.
    LUC_DBG_MEDIUM: The library logs errors and warnings.
    LUC_DBG_HIGH: The library logs errors, warnings, and verbose information about RPCs and the endpoints.

LUC_CONFIG_SERVER_RPC_COUNT
    This configuration key sets the number of RPCs that a server endpoint should be able to handle at once.
    Values to use for this option: 1 to 13106, inclusive.

LUC_CONFIG_CLIENT_RPC_TIMEOUT
    The number of seconds that a client endpoint will wait for an expected message from a server before failing the RPC.
    Values to use for this option: Any number greater than zero.

LUC_CONFIG_SERVER_RPC_TIMEOUT
    The number of seconds that a server endpoint will wait for an expected message from a client before failing the RPC.
    Values to use for this option: Any number greater than zero.

LUC_CONFIG_MAX_NEARMEM_SIZE
    This configuration key adjusts the amount of nearby memory allocated for the endpoint's small I/O buffer region. This buffer region may not be disabled. This key is not valid for Linux endpoints.
    Values to use for this option: powers of two from 1 MB to 256 MB, inclusive.
LUC_CONFIG_SWAP_CLIENT_INBOUND
LUC_CONFIG_SWAP_CLIENT_OUTBOUND
LUC_CONFIG_SWAP_SERVER_INBOUND
LUC_CONFIG_SWAP_SERVER_OUTBOUND
    These configuration keys are Boolean flags that enable byte swapping on messages sent to a LUC client, from a LUC client, to a LUC server, and from a LUC server, respectively. These keys are not valid for Linux endpoints.
    Values to use for these options: 0 and 1.

LUC_CONFIG_CLIENT_RPC_COUNT
    This configuration key sets the maximum number of concurrent client RPCs on a single endpoint.
    Values to use for this option: 1 to 13106, inclusive.

LUC_CONFIG_MAX_LOCAL_ENDPOINTS
    This configuration key sets the maximum number of started LUC endpoints that may exist in a single Linux process. This key is not valid for MTK endpoints.
    Values to use for this option: 1 to 512, inclusive.

LUC_CONFIG_MAX_LARGE_NEARMEM_SIZE
    This configuration key adjusts the amount of nearby memory allocated for the endpoint's large I/O buffer region. This key is not valid for Linux endpoints.
    Values to use for this option: powers of two from 1 MB to 2 GB, inclusive. A special value of zero (0) may be used to disable this memory region and force all I/O memory requests to be handled by the small memory buffer.

LUC_CONFIG_MAX_LARGE_MEM_REQUEST
    This configuration key sets the largest internal memory request that will be handled by the endpoint's large I/O buffer region. This key is not valid for Linux endpoints.
    Values to use for this option: powers of two from 1 MB to 256 MB, inclusive.

LUC_CONFIG_SMALL_NEARMEM_SIZE
    This configuration key adjusts the amount of nearby memory allocated for the endpoint's small I/O buffer region. This key is not valid for Linux endpoints.
    Values to use for this option: powers of two from 1 MB to 256 MB, inclusive. This buffer region may not be disabled.
LUC_CONFIG_MAX_SMALL_MEM_REQUEST
    This configuration key sets the largest internal memory request that will be handled by the endpoint's small I/O buffer region. This key is not valid for Linux endpoints.
    Values to use for this option: powers of two from 64 KB to 256 MB, inclusive.

value
    Identifies the value to set for the corresponding configuration key.

Return Codes

LUC_ERR_OK
    The operation was successful.

LUC_ERR_INVALID_KEY
    The key parameter is not one of the predefined LUC configuration keys (LUC_CONFIG_*).

LUC_ERR_INVALID_STATE
    The setConfigValue method cannot change the key value because of the endpoint's current state. The endpoint must be stopped to set the nearby memory region configuration values.

H.3.8 getConfigValue Method

Returns the value for a specified configuration option for LUC.

Syntax

    luc_error_t getConfigValue(luc_config_key_t key, uint64_t *value);

Parameters

key
    Identifies the configuration option to get. For a list of configuration options, see setConfigValue Method on page 155.

value
    Points to the location that receives the value for the corresponding configuration key.

Return Codes

LUC_ERR_OK
    The operation was successful.

LUC_ERR_INVALID_KEY
    The key parameter is not one of the predefined LUC configuration keys (LUC_CONFIG_*).

H.4 LUC Type Definitions

LucServiceType defines the type of the LucEndpoint object.

    typedef enum {
        LUC_SERVER_ONLY = 1,
        LUC_CLIENT_ONLY,
        LUC_CLIENT_SERVER
    } LucServiceType_t;

Endpoints may be constructed to behave as a client and a server, or they can be specialized to be one or the other. The LucServiceType typedef describes what type of LucEndpoint object is being created.

LUC remote procedure calls can be grouped by their intended service type. The following service types are predefined. The programmer can specify other application specific values or use the predefined values.
    typedef u_int32_t luc_service_type_t;
    #define LUC_ST_QueryManager  0
    #define LUC_ST_QueryEngine   1
    #define LUC_ST_Coordinator   2
    #define LUC_ST_Restore       3
    #define LUC_ST_Snapshot      4
    #define LUC_ST_UpdateManager 5
    #define LUC_ST_UpdateEngine  6
    #define LUC_ST_OutputLog     7
    #define LUC_ST_Any           8
    #define LUC_ST_ErrorLog      9

Error return codes are described with the methods that return them. The programmer can specify other application specific error return codes or use the predefined values.

    typedef int32_t luc_error_t;

H.5 LUC Callback Functions

The LucEndpoint class uses the following callback functions:

• LUC_RPC_Function_InOut
• LUC_Mem_Avail_Completion
• LUC_Completion_Handler

H.5.1 LUC_RPC_Function_InOut

The LUC runtime calls the LUC_RPC_Function_InOut callback when a remote client makes a request. The application must call the registerRemoteCall method to register LUC_RPC_Function_InOut callback functions. The callback should return LUC_ERR_OK when successful. It should not return or redefine any other predefined return codes.

Syntax

    typedef luc_error_t (*LUC_RPC_Function_InOut)(void *inData,
                                                  uint64_t inDataLen,
                                                  void ** outData,
                                                  uint64_t *outDataLen,
                                                  void ** completionArg,
                                                  LUC_Mem_Avail_Completion *completionFctn,
                                                  luc_endpoint_id_t callerEndpoint);

Parameters

inData
    (input parameter) Specifies a pointer to a buffer containing input data to the remote function. NULL if there is no input data.

inDataLen
    (input parameter) Specifies the length of the inData buffer.

outData
    (output parameter) Specifies a pointer to the output data returned by the application. NULL if there is no output data.

outDataLen
    (output parameter) Specifies the length of the data returned by the application if there is returning data.

completionArg
    (output parameter) Specifies the value to pass to completionFctn.
completionFctn
    (output parameter) Specifies a pointer to a LUC_Mem_Avail_Completion callback function called when the buffer is available. Used when LUC_RPC_Function_InOut returns data to the LUC runtime and needs to be notified that the buffer is available for use.

callerEndpoint
    (input parameter) Specifies the endpoint identifier of the client's LucEndpoint object that made the request.

H.5.2 LUC_Mem_Avail_Completion

The LUC_Mem_Avail_Completion callback function notifies LUC_RPC_Function_InOut that its buffer is available for use.

Syntax

    typedef void (*LUC_Mem_Avail_Completion)(void * userHandle);

Parameters

userHandle
    LUC passes in the completionArg value returned by the initiating LUC_RPC_Function_InOut function.

H.5.3 LUC_Completion_Handler

The LUC_Completion_Handler callback function is used by a client for asynchronous remote procedure calls.

Syntax

    typedef void (*LUC_Completion_Handler)(luc_endpoint_id_t originalDestAddr,
                                           luc_service_type_t originalServiceType,
                                           int originalFunctionIndex,
                                           void * userHandle,
                                           luc_error_t remoteError);

When the remote call has completed, the LUC runtime calls the function that was specified in the remoteCall method. That function must follow this signature.

Parameters

originalDestAddr
originalServiceType
originalFunctionIndex
    Specify the destination address, service type, and function index. LUC passes in the values used by the remoteCall method that initiated the RPC being completed.

userHandle
    LUC passes in the value specified by the remoteCall method that initiated the RPC being completed.

remoteError
    The ultimate error code for the RPC from either the LUC library or the server application's registered function. All of the values returned by remoteCallSync (including application defined return codes) may be specified here.

H.6 LUC Return Codes

The meaning of some predefined return codes depends on the method that returns the code.
Applications may define application specific codes.

LUC_ENDPOINT_INVALID
    Indicates that the object has not been started and does not have a valid endpoint identifier.

LUC_ERR_OK
    • The function was registered successfully.
    • This object is ready to accept remote requests.
    • The remote procedure call was launched.
    • The remote procedure call was completed.
    • The endpoint has been stopped successfully.
    • The function was prepared for transmission. The application's completion handler is guaranteed to fire with a real status at some later point.

LUC_ERR_MAX
    Special value set to be the highest numerical error code generated by the library. Applications may specify their own error codes to be greater than this value.

LUC_ERR_BAD_ADDRESS
    Indicates an attempt to use a NULL input or output buffer while specifying a non-zero size for the corresponding buffer. This error is returned by the remoteCall and remoteCallSync methods.

LUC_ERR_NOT_REGISTERED
    The caller tried to make an RPC call to a service type/function index pair that was not registered with registerRemoteCall.

LUC_ERR_OTHER
    • The library can handle only a fixed number of function registrations for each server object. The library supports the registration of 64 functions for each endpoint.
    • Failed to create the desired threads.

LUC_ERR_ALREADY_REGISTERED
    The specified service type or function index is already occupied.

LUC_ERR_BAD_PARAMETER
    • The specified service type or function index is out of range.
    • The specified configuration value is out of range.

LUC_ERR_RESOURCE_FAILURE
    A transient resource allocation failure has occurred. The caller should retry the operation at a later time.

LUC_ERR_TOO_LARGE
    The remote procedure is trying to return more data than the client is able to accept. This return code will be generated whenever servers try to return data to an asynchronous caller.
LUC_ERR_LIBRARY
    The (Linux) LUC Library received an unexpected error from the Portals Library.

LUC_ERR_ALREADY_STARTED
    User attempted to startService on a previously started LucEndpoint object.

LUC_ERR_TIMEOUT
    Client failed to get a response from the server in a timely manner. The server is busy or a message was lost in transit.

LUC_ERR_NOT_IMPLEMENTED
    Method not implemented. Returned by remoteCall and remoteCallSync for objects that were created as LUC_SERVER_ONLY. Returned by registerRemoteCall for objects that were created as LUC_CLIENT_ONLY.

LUC_ERR_FIO
    The (MTK) LUC Library received an unexpected error from the Fast I/O System Call Library.

LUC_ERR_INVALID_ENDPOINT
    The endpoint parameter to the method was invalid.

LUC_ERR_ALREADY_STOPPED
    User attempted to stopService on a previously stopped, or never started, LucEndpoint object.

LUC_ERR_IO_ERROR
    An underlying transport error occurred. The remote procedure call may or may not have fired.

LUC_ERR_NOT_STARTED
    The service has not yet been started. This error is returned by the stopService, remoteCall, and remoteCallSync methods. To start the service, use the startService method.

LUC_ERR_CANCELLED
    Endpoint was stopped while this RPC was in progress.

LUC_ERR_INVALID_KEY
    The key parameter for the setConfigValue or getConfigValue methods is not one of the predefined LUC configuration keys (LUC_CONFIG_*). LUC configuration keys are defined in setConfigValue Method on page 155.

LUC_ERR_INVALID_STATE
    The setConfigValue method cannot change the key value because of the endpoint's current state. The endpoint must be stopped to set the nearby memory region configuration values.

Glossary

barrier
    In code, a barrier is used after a phase. The barrier delays the streams that were executing parallel operations in the phase until all the streams from the phase reach the barrier.
    Once all the streams reach the barrier, the streams begin work on the next phase.

block scheduling
    A method of loop scheduling used by the compiler where contiguous blocks of loop iterations are divided equally and assigned to available streams. For example, if there are 100 loop iterations and 10 streams, the compiler assigns 10 contiguous iterations to each stream. The advantages to this method are that data in registers can be reused across adjacent iterations, and that there is no overhead due to accessing a shared iteration counter.

dependence analysis
    A technique used by the compiler to determine if any iteration of a loop depends on any other iteration (this is known as a loop-carried dependency).

dynamic scheduling
    In a dynamic schedule, the compiler does not bind iterations to streams at loop startup. Instead, streams compete for each iteration using a shared counter.

fork
    Occurs when processors allocate additional streams to a thread at the point where it is creating new threads for a parallel loop operation.

full-empty state
    Indicates whether a variable contains a value (full) or not (empty). Generic read and write operations use this state to determine whether they can perform an operation on the variable. For example, a writeef operation can only write a value to a variable if the state is empty. After the write operation, it sets the state to full.

future
    Implements user-specified or explicit parallelism by creating a continuation that points to a sequence of statements that may be executed by another idle thread. Futures also optionally contain a return value. Execution of code that uses the return value is delayed until the future completes. The thread that spawns the future uses parameters to pass data to the thread that executes the future.
    In a program, the term future is used as a type qualifier for a synchronization variable used to return the value of a future or as a keyword for a future statement.

induction variable
    A variable that is increased or decreased by a fixed amount on each iteration of a loop.

inductive loop
    A loop that contains no loop-carried dependencies and has the following characteristics: a single entrance at the top of the loop; controlled by an induction variable; and has a single exit that is controlled by comparing the induction variable against an invariant.

interleaved scheduling
    A method of executing loop iterations used by the compiler where contiguous iterations are assigned to distinct streams. For example, for a loop with 100 iterations and 10 streams, one stream performs iterations 1, 11, 21, ... while another stream performs iterations 2, 12, 22, ..., and so on. This method is typically used for triangular loops because it reduces imbalances. One disadvantage to using this method is that there is loss of data reuse between loop iterations because adjacent iterations are not executed by the same stream.

join
    Occurs when threads that are forked for a parallel operation finish the operation. As threads finish and drop the streams they are running on, the streams join back together until there is a single stream running the thread.

linear recurrence
    A special type of recurrence that can be parallelized.

loop-carried dependences
    The value from one iteration of a loop is used during a subsequent iteration of the loop. This type of loop cannot be parallelized by the compiler.

recurrence
    Occurs when a loop uses values computed in one iteration in subsequent iterations. These subsequent uses of the value imply loop-carried dependences and thus usually prevent parallelization. To increase parallelization, use linear recurrences.

reduction
    A simple form of recurrence that reduces a large amount of data to a single value.
    It is commonly used to find the minimum and maximum elements of a vector. Although similar to a recurrence, it is easier to parallelize and uses less memory.

region
    An area in code where threads are forked in order to perform a parallel operation. The region ends at the point where the threads join back together at the end of the parallel operation.

Cray XMT™ Programming Model
S–2367–20

© 2011 Cray Inc. All Rights Reserved. This document or parts thereof may not be reproduced in any form unless permitted by contract or by written permission of Cray Inc.

U.S. GOVERNMENT RESTRICTED RIGHTS NOTICE

The Computer Software is delivered as "Commercial Computer Software" as defined in DFARS 48 CFR 252.227-7014. All Computer Software and Computer Software Documentation acquired by or for the U.S. Government is provided with Restricted Rights. Use, duplication or disclosure by the U.S. Government is subject to the restrictions described in FAR 48 CFR 52.227-14 or DFARS 48 CFR 252.227-7014, as applicable. Technical Data acquired by or for the U.S. Government, if any, is provided with Limited Rights. Use, duplication or disclosure by the U.S. Government is subject to the restrictions described in FAR 48 CFR 52.227-14 or DFARS 48 CFR 252.227-7013, as applicable.
Cray, LibSci, and PathScale are federally registered trademarks and Active Manager, Cray Apprentice2, Cray Apprentice2 Desktop, Cray C++ Compiling System, Cray CX, Cray CX1, Cray CX1-iWS, Cray CX1-LC, Cray CX1000, Cray CX1000-C, Cray CX1000-G, Cray CX1000-S, Cray CX1000-SC, Cray CX1000-SM, Cray CX1000-HN, Cray Fortran Compiler, Cray Linux Environment, Cray SHMEM, Cray X1, Cray X1E, Cray X2, Cray XD1, Cray XE, Cray XEm, Cray XE5, Cray XE5m, Cray XE6, Cray XE6m, Cray XMT, Cray XR1, Cray XT, Cray XTm, Cray XT3, Cray XT4, Cray XT5, Cray XT5h, Cray XT5m, Cray XT6, Cray XT6m, CrayDoc, CrayPort, CRInform, ECOphlex, Gemini, Libsci, NodeKARE, RapidArray, SeaStar, SeaStar2, SeaStar2+, The Way to Better Science, Threadstorm, and UNICOS/lc are trademarks of Cray Inc.

Linux is a trademark of Linus Torvalds. Lustre is a trademark of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. UNIX, the “X device,” X Window System, and X/Open are trademarks of The Open Group in the United States and other countries. All other trademarks are the property of their respective owners.

RECORD OF REVISION

S–2367–20 Published May 2011. Supports release 2.0 GA running on Cray XMT and Cray XMT Series compute nodes and on Cray XT 3.1UP02 service nodes. This release uses the System Management Workstation (SMW) version 5.1UP03.

Contents

Introduction [1]  5
Hardware [2]  7
    2.1 Processors  7
        2.1.1 Processor Streams  7
        2.1.2 Processor Instructions  8
        2.1.3 Processor Saturation  9
    2.2 Physical Memory  9
    2.3 Data Memory  10
Compilers [3]  13
Implicit and Explicit Parallelism [4]  15
    4.1 Loops  15
        4.1.1 Inductive Loops  15
        4.1.2 Loops with Independent Iterations  16
        4.1.3 Linear Recurrences  17
        4.1.4 Reductions  18
    4.2 Compiler Analysis and Optimization of Parallelism  18
        4.2.1 Optimizing Parallelization Across Multiple or Single Processors  22
        4.2.2 Compiler Implementation of Parallelization  22
        4.2.3 Nested Loop-level Parallelism  26
        4.2.4 Loop Future Parallel Region Implementation  26
    4.3 Explicit Parallelism  28
The User Runtime Library [5]  29
    5.1 Work Pools  30
Synchronization [6]  31
    6.1 Future Variables  31
    6.2 Sync Variables  32
    6.3 Cray XMT System Implementation of Synchronization  32
Shared Memory [7]  35
Lightweight User Communication Library (LUC) [8]  37
    8.1 LUC Endpoints  37
    8.2 LUC Data Flow  39
    8.3 LUC Communication Flow  40
The Snapshot Library [9]  43
Programming Scenarios [10]  45
    10.1 Creating a Dataflow Algorithm  45
    10.2 Creating a Breadth-first Search  47

Figures

Figure 1. Data Word with Tag Bits  10
Figure 2. LUC Endpoints  39
Figure 3. LUC RPC Data Flow  40
Figure 4. LUC Communication Flow  41
Figure 5. Snapshot Library Data Paths  43

Introduction [1]

This document contains conceptual content that was originally published in various technical notes and in Cray XMT Programming Environment User's Guide.

This document provides an overview of the physical and logical structure of the Cray XMT system and the concepts for parallel programming on the Cray XMT. You need to understand these concepts before you design your program.

The features of the Cray XMT system that enable parallel programming include:

• Hardware that enables multithreaded operations. This includes multithreaded processors and globally accessible memory.
• The Multithreaded Kernel (MTK) operating system. The operating system runs an MTK kernel on the Cray XMT system compute nodes. The OS is based on BSD 4.4 UNIX and provides most standard UNIX commands and shells.
• The Cray XMT compilers. The compilers are part of the Cray XMT Programming Environment (XMT-PE) and include compilers for both C and C++ programs.
• The User Runtime Library.
• The Snapshot Library.

Hardware [2]

This chapter introduces hardware concepts and provides an overview for how the Cray XMT hardware is leveraged for parallelization. For a complete description of Cray XMT Series hardware see Cray XMT System Overview.

To parallelize operations efficiently at each level of the program, the hardware is designed to enable:

• Single-cycle context switching. This is the ability of the processor to switch quickly and efficiently between multiple threads of execution without using the operating system.
• Global memory access. All processors have access to the many physical memory banks on the Cray XMT system.
• Lightweight synchronization operations. Threads can synchronize accesses to shared memory by using load and store instructions that interact with the full-empty bit available on each word of physical memory. This synchronization occurs without the use of the operating system.
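The full-empty synchronization described in the last item can be modeled in software to show the protocol. The following is a rough sketch on ordinary C++ threads, not Cray XMT code: FullEmptyWord, write_when_empty, read_when_full, and handoff_sum are hypothetical names, and the mutex and condition variable stand in for what the XMT hardware does on each memory word without operating system involvement. The assumed semantics are that a write waits for the empty state and leaves the word full, and that the matching read waits for the full state and leaves the word empty.

```cpp
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <thread>

// Software model of one memory word that carries a full-empty bit.
class FullEmptyWord {
public:
    // Wait until the word is empty, store the value, then mark it full.
    void write_when_empty(std::int64_t v) {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !full_; });
        value_ = v;
        full_ = true;
        cv_.notify_all();
    }

    // Wait until the word is full, load the value, then mark it empty.
    std::int64_t read_when_full() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return full_; });
        full_ = false;
        cv_.notify_all();
        return value_;
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    std::int64_t value_ = 0;
    bool full_ = false;  // the word starts in the empty state
};

// A producer thread hands the values 1..n to the calling thread through a
// single word. The full-empty state guarantees that every value is read
// exactly once, so the function returns the sum 1 + 2 + ... + n.
std::int64_t handoff_sum(int n) {
    FullEmptyWord word;
    std::int64_t sum = 0;
    std::thread producer([&word, n] {
        for (int i = 1; i <= n; ++i) word.write_when_empty(i);
    });
    for (int i = 0; i < n; ++i) sum += word.read_when_full();
    producer.join();
    return sum;
}
```

The point of the model is that the data word and its state travel together, so a producer and a consumer need no separate lock or flag. On the Cray XMT this pairing is a property of every word of memory, which is why the text calls the operation lightweight.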
2.1 Processors

The processors switch among threads without using software. On each clock cycle, each processor switches to a different thread and issues one instruction from that thread. If a particular thread is waiting for resources, such as memory, I/O, or synchronization, the processor remains busy executing instructions from other threads. The processor uses a pipeline to accept an instruction from a different thread at each clock tick.

2.1.1 Processor Streams

Each physical processor in the Cray XMT system supports 128 instruction streams at a time. Instruction streams, or streams for short, are programmed like a wide-instruction RISC processor. Each stream has 32 64-bit general-purpose registers. The processor selects streams for execution and executes a single instruction from each stream in turn. Each instruction consists of three operations. The processor allocates, creates, and destroys streams dynamically.

Streams are either active or idle. A stream is active while it is holding the state of a running thread. While a thread is running, its stream continually offers the next ready instruction to the processor for execution until there are no more instructions. An active stream competes with other streams to issue instructions, while idle streams do not. The processor switches between active streams and issues an instruction from a new stream at every clock tick.

Code is executed sequentially by a single stream on a processor. When a parallel loop is encountered, the user runtime allocates additional streams to execute each iteration of the loop in parallel. The additional streams can be from the same processor or from other processors.

2.1.2 Processor Instructions

An active stream issues processor instructions. Every instruction is 64 bits long and contains four fields known as lookahead, M-operation, A-operation, and C-operation.
lookahead
    Provides the hardware with the maximum number of processor instructions that may begin execution before the memory reference for a single active operation is completed.

M-operation
    Memory operations, such as loads, stores, and FETCH_ADD operations.

A-operation
    Arithmetic operations, such as floating-point arithmetic (including multiply-add), integer arithmetic (including an integer multiply-add), and bitwise operations.

C-operation
    Control operations, such as conditional branch operations, unconditional jumps, function calls, and returns. Additionally, this operation may be from a subset of the floating-point and integer arithmetic operations.

An instruction may be encoded to contain more than one operation field. For example, an instruction may contain both memory and control operations (MC-operation) or memory, arithmetic, and control operations (MAC-operation). Each instruction requires one clock tick to issue, regardless of the number of operations it contains.

The lookahead field controls the overlap of memory operations by allowing the execution of additional instructions after a memory reference. A stream that issues a memory reference can execute the number of instructions set in the lookahead field before being forced to wait for the memory reference to complete.

For example, in the following loop, the compiler schedules the loads from b[j], c[j], and d[j] before they are used to compute a[j], and sets the lookahead for each load to point to its corresponding use. This lookahead enables the stream to issue all three loads without having to wait for any of them to return. Thus, all three loads can be in flight at the same time.

    for (int j = 0; j < n; j++) {
        a[j] = b[j] + c[j] + d[j];
    }

The maximum possible value for the lookahead field is 7.

2.1.3 Processor Saturation

A processor can potentially issue a new instruction at every clock tick, or 500 million per second.
If no stream is ready at a particular tick, no instruction is issued. This missed opportunity to issue an instruction is a phantom. If the processor is 100% busy, it is issuing instructions at every clock tick and there are no phantoms. When this occurs, the processor is saturated.

Another factor that affects saturation is the length of the processor pipeline. The pipeline consists of instructions from different streams that are in the process of being executed. A stream may only have one instruction in the pipeline at a time. Instructions from all active streams are multiplexed onto the pipeline by the processor. Although a processor can potentially issue a new instruction at every clock tick, the processor pipeline only allows an individual stream to progress at a rate of one instruction for every 21 clock ticks. Consequently, a single stream cannot saturate a processor; at least 21 streams must be ready to issue before every clock tick can be used.

You can use the machine's low-level performance counters to measure both the number of clock ticks and the number of phantoms over a particular region of code, such as a loop. For more information, see Cray XMT Performance Tools User's Guide.

2.2 Physical Memory

Memory is categorized in two ways: nearby and global. Each processor has an associated memory unit which is nearby, that is, local, to it. At boot time, the operating system sets aside a portion of the nearby memory on each processor to use for global memory. All processors on a machine have access to all memory on the machine.

For global memory, the hardware hashes memory addresses so that cells of data memory that appear to be adjacent are actually distributed across different memory banks. By distributing logically adjacent data across different banks of physical memory, the risk of network and memory bank bottlenecks is greatly reduced. However, there are cases where the use of nearby memory is more efficient because the processor takes fewer cycles to access it.

System performance is dependent on having a large number of threads executing concurrently.
This is how the system tolerates memory latency: while one thread waits for a memory reference to complete, other streams issue instructions from their threads to the processor pipeline.

One aspect of the memory subsystem that can affect system performance is a hot spot. A hot spot can occur when many streams try to access the same location in memory. To avoid a certain class of hot spots, the software replicates read-only data (such as floating-point constants) when a program is running on multiple processors. Even though the hardware and software mitigate the performance bottlenecks caused by hot spots, you should consider the potential for hot spots when you write your code.

2.3 Data Memory

A cell of data memory on the system is a 66-bit, byte-addressable word: 64 bits of data plus two tag bits that hold the memory state for the word, full-empty and extended. The following diagram illustrates a data word.

Figure 1. Data Word with Tag Bits
[Figure: two tag bits (full-empty, extended) followed by the data value in bits 63 through 0]

The first state bit is the full-empty bit. The full-empty bit is used for synchronized memory operations, such as accesses to sync and future variables. You can also interact with the full-empty bit using the XMT generic functions.

The second state bit is the extended bit. If the extended bit is set, then additional information about the hardware, such as traps, is associated with the word.

Data memory addressing on the Cray XMT is big-endian.

Compilers [3]

The Cray XMT Programming Environment (XMT-PE) includes compilers for C and C++ programs. The compilers automatically find and implement parallelism contained in your program. The compiler first analyzes the complete program, and then it uses a variety of techniques to parallelize and optimize loops at different levels in your program.
The compilers recognize specific C and C++ directives and language constructs that are used by the Cray XMT system. These directives and language constructs give access to architectural features of the machine. For example, you change the full-empty bit on a data word when you use a C/C++ function that changes the full-empty state. These extensions enable you to leverage the machine hardware to achieve maximum performance without having to learn the details of the hardware architecture. The compilers also support direct access to certain machine instructions, which are described in the Cray XMT Programming Environment User's Guide.

The compilers automatically link your program against the user runtime library, librt, which is included in the XMT-PE. The user runtime library provides support for many multithreaded features, such as lightweight synchronization and user-level traps, user thread management and load balancing, event logging, and other services.

Many discussions in this guide and in the Cray XMT Programming Environment User's Guide focus on how your program can use the compiler to its best advantage.

Implicit and Explicit Parallelism [4]

Implicit parallelism is expressed as a loop using the constructs that are part of the standard programming languages. In C or C++, the loops might be for loops, while loops, or even some recursive functions.

It is not possible for the compiler to parallelize all loops. The compiler must be able to determine how many iterations a loop contains before the loop begins. This type of loop is known as an inductive loop. The loop must also use a syntax that the compiler recognizes.

4.1 Loops

The compiler can only parallelize loops that fall into one of the following categories:

• Loops with independent iterations
• Linear recurrences
• Reductions

You cannot express every task in your program in the form of one of these loops.
You need procedure calls, straight-line code, conditionals, and other types of loop constructs. The compiler can handle all of these constructs correctly, performing the usual sorts of optimizations applicable to serial code.

When the compiler parallelizes a loop, Amdahl's law still holds in that the sequential portions of a program may limit the improvement available from parallelism. When you write your program, you should formulate as many computations as possible in the form of loops that the compiler can automatically parallelize.

4.1.1 Inductive Loops

An inductive loop has these characteristics:

• It has a single entrance at the top of the loop.
• It is controlled by a linear induction variable. An induction variable is a variable that is increased or decreased by a fixed amount (the stride) on every iteration of a loop.
• It has a single exit that is controlled by comparing the induction variable against an invariant.

The compiler can determine the number of iterations that an inductive loop contains, and the values of the induction variables in each iteration, without having to execute the loop. This allows iterations to potentially execute in parallel and independently calculate the values of their induction variables.

Any loop structures created by standard C and C++ language constructs and loops built from explicit goto statements can be inductive. The compiler must be able to recognize an induction variable, its stride, and the loop exit test. Additionally, the compiler must be able to determine that the stride and the other components of the loop-exit test are all loop invariant. For example, consider the following C++ loop:

    for (i = 1; i <= n; i++) {
        x[i] = 0.0;
    }

Here the induction variable is i, its stride is 1, and the only other component in the loop-exit test is the loop-invariant term n, so this loop is inductive. In this case, the number of iterations is the maximum of 0 and n.
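The reasoning above, that the iteration count and the per-iteration induction values of an inductive loop are known before the loop runs, can be sketched in C. This is only an illustration under stated assumptions: trip_count and induction_value are hypothetical helper names, not part of the XMT-PE, and the sketch covers only upward-counting loops with an inclusive exit test.

```c
#include <assert.h>

/* Hypothetical helper: number of iterations of
 *     for (i = first; i <= last; i += stride)
 * for a positive, loop-invariant stride. */
long trip_count(long first, long last, long stride)
{
    assert(stride > 0);      /* sketch covers upward-counting loops only */
    if (last < first)
        return 0;            /* the loop body never runs */
    return (last - first) / stride + 1;
}

/* Value of the induction variable in iteration k (k = 0, 1, 2, ...).
 * No earlier iteration has to run first, which is what allows the
 * iterations to be handed to different streams. */
long induction_value(long first, long stride, long k)
{
    return first + k * stride;
}
```

For the loop shown above, trip_count(1, n, 1) evaluates to the maximum of 0 and n, matching the count given in the text.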
It is also possible to have inductive loops with many induction variables, such as in the following C++ example:

    for (j = 0; j < n; j += 10) {
        x[i++] = y[j];
    }

In this case, the loop has two induction variables, i and j, with strides of 1 and 10, respectively.

4.1.2 Loops with Independent Iterations

If each iteration of an inductive loop can execute independently of the others, then the loop is parallelizable. For example, the compiler can parallelize a loop that zeroes the elements of a vector:

    for (i = 1; i <= n; i++) {
        v[i] = 0.0;
    }

The compiler uses a technique called dependence analysis to determine whether any iteration of the loop depends on any other iteration. If there are no dependences between iterations (such dependences are known as loop-carried dependences), the iterations may execute in parallel. A loop-carried dependence exists if one iteration writes a location in memory and another iteration reads from or writes to the same location. Since the compiler is conservative in its dependence analysis, it may conclude that there is a loop-carried dependence that prevents parallelization even when there is no such dependence. You can sometimes help the compiler determine that there are no dependences by using the semantic assertion pragmas or the restrict type qualifier. See Cray XMT Programming Environment User's Guide for more information on pragmas and type qualifiers.

In the previous example, the order in which the loop clears elements does not matter so long as it clears all elements eventually. Therefore, the loop may be run in parallel. The only location that is written to is v[i]. Because no two iterations can write to the same location in v, the compiler can conclude that there is no loop-carried dependence and that the iterations may execute in parallel.

4.1.3 Linear Recurrences

A recurrence is a loop that uses values computed in one iteration in subsequent iterations. These subsequent uses imply loop-carried dependences and thus usually prevent parallelization. However, a special class of recurrence, the linear recurrence, can be solved in parallel. For example, the following loop contains a recurrence because x[i - 1] refers to a value computed by the previous iteration.

    for (i = 1; i < n; i++) {
        x[i] = x[i - 1] + m;
    }

You can rewrite the previous example so that it runs in parallel. For example, in the following code snippet, there are no loop-carried dependences, so the compiler can parallelize the loop.
    for (i = 1; i < n; i++) {
        x[i] = x[0] + i * m;
    }

However, some code cannot be rewritten so easily, such as in this example:

    for (i = 1; i < n; i++) {
        x[i] = x[i - 1] + y[i];
    }

The previous example can be efficiently solved in parallel using a form of cyclic reduction. In general, the compiler attempts to parallelize recurrences of the following form:

    x[i] = x[i - k] * f[i] + g[i]

A loop containing a recurrence of this form can be parallelized if it meets the following conditions:

• The operators * and + are sufficiently like multiplication and addition, for example, bitwise AND and OR operations.
• All loop-carried dependences are simple and cross a small constant number of iterations. In this case, dependences are carried for k iterations.

4.1.4 Reductions

A reduction is a simple form of recurrence that reduces a large amount of data to a single value. For example, the following loop computes the sum of the first n elements of the vector x.

    s = 0;
    for (i = 0; i < n; i++) {
        s += x[i];
    }

An example from linear algebra is the calculation of the inner product, also known as the dot product, of two vectors.

    s = 0;
    for (i = 0; i < n; i++) {
        s += x[i] * y[i];
    }

A common use of reduction is to find the maximum and minimum elements of a vector. Although similar to a recurrence, a reduction is simpler to solve in parallel and requires less memory.

4.2 Compiler Analysis and Optimization of Parallelism

The compiler can optimize both serial and parallel loops using a variety of techniques.

Serial loops are loops that cannot be run in parallel. Loops that are not run in parallel impact performance of the program. When optimizing a serial loop, the compiler attempts to minimize the number of operations in the loop body and pack the operations into as few instructions as possible. Each iteration is run sequentially.
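Before turning to how parallel loops execute, it may help to see why the reductions of Section 4.1.4 parallelize well. The following is a minimal sketch, not the compiler's actual implementation: the sum is split into contiguous blocks, each block is reduced independently, and the partial results are combined. The names partial_sum and blocked_sum are hypothetical, and the block loop is written serially here; its iterations are the part that could run concurrently.

```c
#include <assert.h>
#include <stddef.h>

/* Sum of x[begin..end): the work one hypothetical worker would do. */
static long partial_sum(const long *x, size_t begin, size_t end)
{
    long s = 0;
    for (size_t i = begin; i < end; i++)
        s += x[i];
    return s;
}

/* Hypothetical sketch: split the n elements into nblocks contiguous
 * blocks, reduce each block independently, then combine the partial
 * results. The iterations of the block loop do not depend on each
 * other, so they could be executed in parallel. */
long blocked_sum(const long *x, size_t n, size_t nblocks)
{
    assert(nblocks > 0);
    size_t chunk = (n + nblocks - 1) / nblocks;   /* ceiling of n / nblocks */
    long total = 0;
    for (size_t b = 0; b < nblocks; b++) {
        size_t begin = b * chunk;
        size_t end = begin + chunk;
        if (begin > n) begin = n;                 /* trailing blocks may be empty */
        if (end > n) end = n;
        total += partial_sum(x, begin, end);
    }
    return total;
}
```

The sketch uses integers so that the regrouped sum is identical to the sequential one. With floating-point data, regrouping the additions can change the rounding of the result.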
When the compiler encounters a parallel loop, such as the following:

    for (i = 0; i < n; i++) {
        x[i] = a*x[i] + y[i];
    }

it generates code with the following structure:

    Fork
      ...execute some iterations
    Join

The fork operation creates a number of new threads. Each thread, including the original, executes a subset of the loop's iterations. As each thread completes its share of the iterations, it falls into the join operation, where each thread turns itself off until only one thread remains. The last thread to arrive at the join proceeds to execute the serial code that follows the parallel loop.

The number of threads created by the fork depends upon several factors. You can use a compiler directive to specify a lower bound for the number of threads to request. Otherwise, the compiler estimates the number of threads required to saturate the processors and/or the network and compares this with an estimate of the number of loop iterations, adjusting the number to achieve good load balance. At execution time, the fork operation requests this number of threads and, if they are available, uses exactly that number. If the machine is relatively busy, fewer threads may be available, so fewer are allocated to the loop. As a result, this particular loop may run more slowly, but the machine as a whole remains busy.

The compiler uses one of three different approaches to choose iterations for each thread to execute:

• Block scheduling
• Interleaved scheduling
• Dynamic scheduling

In a block-scheduled loop, the compiler assigns a contiguous subset of the iterations to each stream. If the loop has 100 iterations and 10 streams are available, the compiler assigns a contiguous block of 10 iterations to each stream.

The advantages of block scheduling include a lower overhead than dynamic scheduling, and the possibility for reuse of data in registers across adjacent iterations.
For example, consider the following loop:

    for (i = 1; i < n; i++) {
        b[i] = (a[i - 1] + a[i])/2;
    }

Using block scheduling, the value of a[i] that a stream loads on iteration i is reused as a[i - 1] on the next iteration. If the compiler uses the interleaved or dynamic scheduling methods, the next iteration may not be performed by the same stream, so each operation requires a separate load.

Block scheduling may lead to load imbalances if the execution times of all the loop iterations are not about the same. For example, consider a loop nest such as the following:

    for (i = 0; i < n; i++) {
        for (j = 0; j < i; j++) {
            a[i][j] = 0.0;
        }
    }

Because the later i iterations run longer than the early ones, a block schedule for the outer loop assigns more work to the last stream than to the first. This can lead to a degradation in performance. The problem can be avoided by using either of the other two scheduling methods.

A minor disadvantage of block scheduling is the cost of computing the loop bounds and the chunk size. The chunk size is the number of iterations per block assigned to any individual stream.

In an interleaved schedule for a parallel loop, the compiler assigns contiguous iterations to distinct streams. For a loop with 100 iterations and 10 streams available, one stream performs iterations 1, 11, 21, ..., another stream performs iterations 2, 12, 22, ..., and so on. The compiler uses this form of scheduling for triangular loops, as shown in the previous example.

Interleaved scheduling cannot eliminate load imbalances entirely. It only reduces imbalances for triangular loops. Loops that contain iterations that are data-dependent can still have load-balance problems.

The use of interleaved scheduling also avoids the startup costs of blocked scheduling. The major disadvantage of an interleaved schedule is the loss of data reuse between loop iterations in comparison to blocked schedules.
A minor disadvantage is that the generated code for inner loops is not as good when the stride of the induction variables is unknown.

In a dynamic schedule, the compiler does not bind iterations to streams at loop startup. Instead, streams compete for each iteration using a shared counter. For example, consider the following loop:

    #pragma mta dynamic schedule
    #pragma mta assert parallel
    for (i = 0; i < n; i++) {
        compute(i, x, y, z);
    }

The compiler will use dynamic scheduling to execute the iterations of this loop. When the program reaches the beginning of the loop, a shared counter is initialized to 0. Each stream then performs an atomic INT_FETCH_ADD operation to determine its value of i and then begins execution of iteration i. When that iteration is complete, the stream again performs an INT_FETCH_ADD operation on the shared counter to find its next iteration.

Dynamic scheduling is a good choice for avoiding load-balancing problems. The disadvantages are the overhead that occurs by using the atomic counter (the other two types of scheduling have no such overhead) and some loss of optimization for the loop body.

Block dynamic scheduling is an alternative form of dynamic scheduling that assigns small blocks of iterations rather than individual iterations. This accomplishes many of the load balancing benefits of pure dynamic scheduling with less contention on the shared counter. Effective with the 1.5 release, the compiler will favor block dynamic scheduling over pure dynamic scheduling to avoid hotspotting issues seen on larger systems.

The compiler automatically chooses a good approach for each parallel loop; however, you can override this choice by using a compiler directive.

A particular strength of the compiler for the Cray XMT is its ability to take advantage of the hardware's lightweight synchronization to achieve atomic updates. These atomic updates enable the compiler to parallelize a larger class of algorithms.
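As an aside on the three scheduling policies described above, the way each one maps iterations to streams can be sketched in portable C. This is a sketch under stated assumptions, not the code the compiler generates: block_range, interleaved_count, and claim_iteration are hypothetical names, iterations are numbered from 0 (the text numbers them from 1), and the C11 atomic_fetch_add call stands in for the XMT INT_FETCH_ADD operation.

```c
#include <assert.h>
#include <stdatomic.h>

/* Block schedule: stream s of p gets the contiguous range [*begin, *end). */
void block_range(long n, long p, long s, long *begin, long *end)
{
    assert(n >= 0 && p > 0 && s >= 0 && s < p);
    long chunk = (n + p - 1) / p;      /* chunk size: iterations per block */
    long b = s * chunk;
    long e = b + chunk;
    *begin = b < n ? b : n;
    *end = e < n ? e : n;
}

/* Interleaved schedule: stream s of p runs iterations s, s + p, s + 2p, ...
 * Returns how many of those iterations are below n. */
long interleaved_count(long n, long p, long s)
{
    assert(n >= 0 && p > 0 && s >= 0 && s < p);
    return s < n ? (n - s + p - 1) / p : 0;
}

/* Dynamic schedule: claim the next unclaimed iteration from a shared
 * counter, or return -1 once all n iterations have been handed out.
 * atomic_fetch_add is an assumption standing in for INT_FETCH_ADD. */
long claim_iteration(atomic_long *counter, long n)
{
    long i = atomic_fetch_add(counter, 1);
    return i < n ? i : -1;
}
```

With 100 iterations and 10 streams, block_range gives stream 3 the range 30 to 39, and interleaved_count gives every stream 10 iterations. In the dynamic case, a stream would call claim_iteration repeatedly until it returns -1, so faster streams simply claim more iterations.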
The compiler can use atomic updates to handle loop-carried dependences involving associative operations. For example, in the following loop, the compiler parallelizes the loop even though the loop has a dependence on bucket.

    for (i = 0; i < n; i++) {
        k = key[i];
        bucket[k] = bucket[k] + 1.0;
    }

In this example, the compiler uses the hardware's lightweight synchronization to lock the location bucket[k] while the thread updates it. This avoids the race condition that would otherwise occur if another thread simultaneously attempted to update the same location in bucket.

4.2.1 Optimizing Parallelization Across Multiple or Single Processors

The compiler controls whether to parallelize loops in your program for use on a single processor or multiple processors. The compiler optimizes the parallelization of your code in different ways depending upon whether you select single-processor mode or multiprocessor mode when you compile.

Single-processor parallelism uses less memory and has lower startup costs than multiprocessor parallelism. In single-processor mode, the compiler implements fork and join operations inline, using very short instruction sequences. However, in single-processor mode, only limited speedup is possible. The speedup is the speed at which a parallel algorithm runs, in comparison to a sequential algorithm. While the speedup depends on the exact code and other active threads, a speedup of 20 to 40 times is typical in single-processor mode.

The speedup is limited by two factors. First, there must be adequate parallelism: a loop with only 10 iterations does not offer much potential for speedup. Second, speedup is limited by processor saturation: if 30 threads are enough to saturate a processor, then no greater speedup is possible in single-processor mode.

Multiprocessor mode uses more memory than single-processor mode, but it offers much greater potential for speedup.
In multiprocessor mode, the compiler implements fork and join operations using scalable techniques suitable for thousands of threads. In this case, much greater speedups are possible if there is sufficient parallelism available in the loop.

You can set the compiler to use the appropriate mode when you compile your program. See Cray XMT Programming Environment User's Guide for more information.

4.2.2 Compiler Implementation of Parallelization

The compiler minimizes the overhead that occurs with parallelism by generating code for each function as a whole. It does this by creating a top-level serial structure for each function. This serial structure has ordinary code and control flow with embedded parallel regions. A parallel region is a sequence of phases separated by barriers; each phase is a set of sections.

When the program is run on the Cray XMT system, a parallel region begins with a fork. A fork creates new threads to execute loop iterations and other parallel code. After the fork, the threads begin executing the first phase of the region.

A phase is a set of one or more sections of code that the program executes in parallel. The code in a section may consist of either a parallel loop or a serial block of code. If the section is a serial block of code, it is executed by a single stream while other streams execute code from other sections. If the section is a parallel loop, it is executed by many streams.

Within a single phase, there may be more than one section consisting of a parallel loop. The streams execute these loops in series within the phase. As soon as a stream completes its share of one loop, it immediately begins work on the next loop.

The region ends when the threads involved in the parallel computation finish executing the code in the region and return to serial code. At this point, the program no longer needs multiple streams, so the streams join back into a single stream.
The streams implement the join by using a shared counter initialized with the number of streams. As each stream finishes the region, it decrements the counter and quits until there is only one stream left.

The following is a schematic representation of a sample parallel region.

              +-------------------
              | Fork
              |
              |         +----------
              |         | Section A
              |         | Section B
              | Phase 1 | Section C
              |         | Section D
              |         | Section E
              |         +----------
    Region    |
              | Barrier
              |
              |         +----------
              | Phase 2 | Section W
              |         +----------
              |
              | Barrier
              |
              |         +----------
              |         | Section X
              | Phase 3 | Section Y
              |         | Section Z
              |         +----------
              |
              | Join
              +-------------------

At the beginning of the region, a fork creates a number of threads, and execution of the first phase begins. As each thread in Phase 1 completes its share of the work in Phase 1, it falls into a barrier. The barrier delays the threads until every thread reaches the barrier. Once all the threads reach the barrier, the threads begin work on the next phase. In the example, a barrier delays the threads twice. By grouping several sections into Phases 1 and 3, the compiler minimizes the lag time created by the barriers.

As each thread completes its share of the last phase, it falls into the join and, with the exception of the final thread, turns itself off. The last thread to reach the join continues normal execution after the region.

A barrier has approximately the same overhead as a join. Using barriers instead of join-fork pairs between phases saves the cost of the intermediate forks, and helps amortize the cost of the initial fork.

For example, consider the following loops:

    a[0] = x;
    for (i = 1; i < n; i++) {
        a[i] = i;
    }
    for (j = 0; j < n; j++) {
        b[j] = c[j] + d[j];
    }

The compiler can execute the first and second loops in parallel. They may have a different number of iterations.
A naive implementation by the compiler would look like this:

    a[0] = x
    +------------
    | Fork
    |
    |   +----------
    |   | do iterations of 1st loop
    |   +----------
    |
    | Join
    +------------
    +--------------
    | Fork
    |
    |   +----------
    |   | do iterations of 2nd loop
    |   +----------
    |
    | Join
    +--------------

A better implementation would look like this:

    a[0] = x
    +------------
    | Fork
    |
    |   +----------
    |   | do iterations of 1st loop
    |   +----------
    |
    | Barrier
    |
    |   +----------
    |   | do iterations of 2nd loop
    |   +----------
    |
    | Join
    +--------------

The second diagram is better because it substitutes a single barrier for a join-fork combination. The following diagram shows the best implementation:

    +------------
    | Fork
    |
    |   +----------
    |   | a[0] = x
    |   | do iterations of 1st loop
    |   | do iterations of 2nd loop
    |   +----------
    |
    | Join
    +--------------

In principle, all the sections in a phase (this example has three sections in a single phase) can be executed in parallel. In the last implementation, one thread does the assignment to a[0] and then falls into the first loop. The other threads immediately begin the first loop. As threads finish the first loop, they begin the second loop.

4.2.3 Nested Loop-level Parallelism

When your code contains multiple levels of loop structures nested within one another, the compiler can often parallelize such structures. The compiler may collapse a pair of nested loops into a single loop, or the pair may be implemented as nested parallel regions, or only the outer loop may be parallelized. The structure of the loop nest determines how the compiler implements parallelism. However, the compiler will not implement nested parallel regions in multiprocessor mode. Typically, loop collapse is preferable to nested parallel regions because less overhead is required for thread management.
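To make loop collapse concrete, the following sketch shows one common way to rewrite a rectangular loop nest as a single loop by hand. It is an illustration only and does not claim to show the transformation the XMT compiler applies; fill_nested and fill_collapsed are hypothetical names, and the array is assumed to be stored row by row.

```c
#include <assert.h>

/* Nested form: a[i][j] = i + j over an n-by-m array stored row by row. */
void fill_nested(long *a, long n, long m)
{
    for (long i = 0; i < n; i++)
        for (long j = 0; j < m; j++)
            a[i * m + j] = i + j;
}

/* Collapsed form: one loop of n * m independent iterations. Each
 * iteration recovers (i, j) from the single induction variable k,
 * so one parallel region can cover the whole nest. */
void fill_collapsed(long *a, long n, long m)
{
    assert(n >= 0 && m >= 0);
    for (long k = 0; k < n * m; k++) {
        long i = k / m;
        long j = k % m;
        a[i * m + j] = i + j;
    }
}
```

Both functions write the same values. The collapsed form exposes all n * m iterations to a single fork and join, which is the reason less thread-management overhead is needed than with a parallel region nested inside another.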
For more information on managing loop-level parallelism in your application, see Limiting Loop Parallelism in Cray XMT Applications and Optimizing Loop-Level Parallelism in Cray XMT Applications in the CrayDoc Knowledge Base at http://docs.cray.com/kbase.

4.2.4 Loop Future Parallel Region Implementation

The MTK runtime can assign threads to the execution of a parallel region in three ways:

• Single processor, where the runtime assigns a fixed number of streams within a single processor to the job of executing iterations of the loop or loops in the parallel region.
• Multiple processors, where more than one processor can assign a fixed number of streams to the parallel region.
• Loop future, where the number of streams assigned to execute the parallel region can grow during execution if more streams become available.

Because of its high overhead, loop future is probably the least common parallel region implementation style on the Cray XMT. However, it is an appropriate style to use when the amount of work in a loop is dynamic and unpredictable, or if load balancing between iterations is an issue. In particular, loop future is often the best approach for implementing a deeply nested, parallel, recursive computation.

The compiler generates a work descriptor, which is a piece of code that can execute multiple iterations of a loop in a dynamically scheduled fashion. The runtime places a continuation, which is a pointer to the work descriptor, on a work queue that is accessed by streams looking for new work. The runtime presupposes that the loop has many iterations that can be executed in parallel, so when a stream dequeues a continuation, the runtime enqueues additional continuations and then begins claiming iterations of the loop to execute. As more streams become free and look for new tasks on the work queue, they do the same: dequeue one continuation, enqueue up to eight more, and begin claiming loop iterations.
The continuations for every active loop future parallel region are placed on the same work queue, which enables some load balancing across multiple regions, and which may correspond to multiple levels of a recursively parallel loop.

Be aware that recursive adding of the continuations in the work queue can result in the allocation of many streams to this loop before the procedure completes. Another loop, nested inside the loop that is implemented with loop future, may find few streams available for its execution. This resource intensity is in addition to the already high overhead that loop futures entail.

In contrast to loop future parallel region implementation, a future variable is a variable that is defined in a way that allows the programmer to assume that a thread may eventually be spawned to compute its value. Future variables are similar to loop futures in regard to the treatment of the work queue by the runtime. A future variable can be assigned a value normally, but it can also be associated with a future statement, which resembles the transfer of control to a function that computes the value of the future variable. However, when program execution reaches the future statement, a continuation pointing to the function is placed on the work queue. Subsequently, a stream that is looking for work may remove this continuation and invoke the function to compute the value of the future variable. Full-empty bits indicate whether the value has been stored in the future variable.

Following the apparent invocation of the future variable's defining function, if a thread references the future variable before its value has been computed, that thread waits until the future variable is assigned a value. Anticipating this delay, a programmer may choose to invoke the touch function on the future variable. When a thread calls touch, it effectively invokes the future statement function and computes the value of the future variable itself, if it has not already been computed.
Thus, using future variables amounts to a form of lazy evaluation, where the value of a variable may or may not be computed ahead of the time at which it is referenced.

4.3 Explicit Parallelism

You may not be able to use loop parallelism in certain situations, such as searching a linked data structure or implementing a recursive algorithm in parallel. In these cases, you can use a construct that supports explicit parallelism. This construct, known as futures, explicitly indicates which sections of code may execute concurrently with other sections.

Futures implement user-specified or explicit parallelism by creating work that can be claimed and executed by a new thread. A future is a sequence of code that can be executed by a newly created thread that is running concurrently with other threads in the program. Futures can also delay the execution of code that uses values computed by the future, until the future completes. The thread that spawns the future uses parameters to pass information to the future.

For more information on using futures in your code, see Cray XMT Programming Environment User's Guide.

The User Runtime Library [5]

The Cray XMT user runtime library provides software support for futures and synchronization, event logging, and compiler-generated parallelism. The user runtime environment is unique because it performs tasks that, on some systems, are typically done by the operating system, such as trap handling and work scheduling.

The runtime library provides the following functionality:

• Trap handling
• Asynchronous operating system calls
• Debugging support
• Event logging
• Resource acquisition and management
• Work scheduling and management

Jobs executing concurrently on the system must share all available resources, such as individual instruction streams and functional units of a particular processor, adjacent words in memory modules, all network connections, and I/O.
The operating system partitions these resources for load balancing and efficient usage. User runtime processes compete for the same resources, so the operating system imposes high-level limits on each user runtime, such as a limit on the number of streams it issues for each job on each processor. The operating system manages these limits.

The operating system also limits the number of processors that each job can use. If a job needs additional processors, it must request them from the operating system. If a job does not make a specific request, the operating system cannot resolve dynamic resource management or load balancing issues. The runtime handles most of these types of resource management issues in your program.

5.1 Work Pools

The user runtime environment maintains two pools that contain representations of work executed by a single thread: the ready pool and the unblocked pool. The ready pool represents continuations, which contain all the information necessary to execute a future, including a pointer to the future body and an argument list. The unblocked pool represents threads that have suspended execution but are now ready to continue.

Note: For the Cray XMT, unblocked pools are distributed among the processors so that a thread running on a particular processor is put on that processor's unblocked pool. This preserves any locality of references that occurs when structures are allocated in nearby memory.

When a stream executes a future statement, it adds a continuation to the ready-pool queue. This continuation does not become a thread until it begins execution, at which point the user runtime allocates stack space for the new thread. The ready pool can represent large amounts of potential parallelism while occupying little space.

As a thread finishes using resources for which another thread is waiting, the user runtime places the waiting thread on the unblocked pool.
The user runtime processes user threads from the unblocked pool and removes threads from the unblocked pool as resources are available to run them. Because threads occupy more space than continuations, the user runtime removes threads on the unblocked pool before it removes continuations from the ready pool.

Synchronization [6]

Synchronization controls the execution of one thread through the full-empty state of variables that it shares with other threads. For example, when one thread waits before reading a value written by another thread, the two threads synchronize at the read. If the value is already written, the first thread does not wait. A reference to a future variable following the spawning of an associated future is a common example of synchronization.

Synchronization on the Cray XMT is efficient for the following reasons:

• Little computation is expended in determining whether a thread should suspend or resume.
• The acts of suspension and resumption require only a few operations.
• The resources of a suspended thread are quickly reassigned to a ready thread.

Each word of memory includes 2 bits of memory state. One of these state bits is the full-empty bit, upon which the full-empty state of synchronization variables is based. A synchronization variable holds the value that is being shared by multiple threads in your program. There are two types of synchronization variables: future and sync.

References to future or sync variables respect the full-empty state of the variable by completing only if the state is appropriate to satisfy the particular type of reference. If the full-empty state is not appropriate, the thread issuing the reference suspends until the state changes. The stream that was executing the suspended thread becomes available to execute other code, so suspension does not idle the processor.

6.1 Future Variables

Future variables are used to store the results of future computations.
Once a future variable has been associated with a spawned future, references to the variable will succeed only after the future completes.

Upon allocation, the future variable is set to full. Subsequent read and write operations succeed only if the variable is full and do not change the full-empty state. When a future is spawned, the compiler sets the associated future variable to empty. When the future returns, the variable is automatically set to full. Also, if the future returns a value, the value is written into the future variable. References to the future variable that are made after the future is spawned but before it completes are thus delayed.

Future variables can be used independently of the future construct. To facilitate synchronization, your program can explicitly change the full-empty state of a future variable to empty by using the purge generic function.

6.2 Sync Variables

Just as future variables naturally implement barriers, sync variables naturally implement locks. Upon allocation, an uninitialized sync variable is empty and an initialized sync variable is full. Subsequent references to sync variables respect the full-empty state of the variable. A read completes only if the state is full, and a write completes only if the state is empty. Each kind of access toggles the full-empty state. If several threads are suspended waiting to read a sync variable and the variable is written, only one suspended thread resumes to read and empty the variable. The remaining suspended threads wait for subsequent writes to the variable.

6.3 Cray XMT System Implementation of Synchronization

Knowing how synchronization is implemented can help you understand the performance implications of various synchronization methods more clearly. The compiler translates accesses to normal, sync variables into underlying machine operations that correspond closely to generic functions.
For example, a load from a variable x of type sync int is much like the readfe(x) generic function. When the full-empty bit at x is full, the stream loads the value and resets the state of the full-empty bit at x to empty. This occurs at the same rate as a load from a normal variable. It requires only one load operation.

If the full-empty bit is empty, the load operation fails, and the stream retries it. At the processor, the failed operation is given low priority so it does not interfere with other operations that might succeed immediately. Sometimes, the operation waits a number of instructions before forcing its turn. But more often, the streams that are issuing will occasionally not make use of the slot allotted for memory operations, allowing the retry to be issued without forcing. If the full-empty bit is still empty, the load fails and continues to retry until the full-empty bit becomes full or until it fails more than the retry-count number of times. Such a failure is called a retry timeout exception.

While the operation is retried, the stream that issued the memory operation may execute up to seven more instructions, depending on the dependences within the code, as represented by this instruction's lookahead number. When the stream has no more instructions to execute, it must wait until the operation finishes execution or the retry timeout occurs. During this time, the stream does not compete for processor resources, so waiting is efficient. Although the stream is unavailable to other threads, the waiting has no impact on processor throughput.

In the case of a retry timeout, the operation traps. The trap handler saves the program counter's value along with the stream's register values and puts this thread context on a list of threads waiting for the cell's full-empty bit to become full.
If this is the first thread waiting on this memory location, the runtime writes a special value into the memory location to indicate that there are waiting threads. The stream returns to user runtime code and searches the work pools for additional work. In this way, the stream becomes available again for productive work after being delayed by the retry timeout and the trap handler code. The trap handler is user runtime code and does not interact with the operating system. The next time that a stream accesses this memory cell, instead of retrying the memory operation, it raises the data block exception and traps immediately. The stream then enters the trap-handler code. If the memory operation is also a load, the stream enters the thread's context into the waiting list. But if the operation is a store, it moves the context of the first waiting thread to the unblocked pool, records the value to store in that context, and continues with the thread from which it trapped. If the load operation was the only one on the list, the user runtime writes the new value in the memory location. In the previous scenarios, each stream issues about one hundred instructions. Delays caused by user runtime processes only affect the synchronized streams. When synchronization occurs within the time of a retry timeout, the number of instructions issued by each stream is even lower. The user runtime synchronizes future-qualified accesses in a similar way. If the first waiting thread that fails a memory operation is an access to a future variable, the user runtime places all waiting threads on the unblocked pool when the variable becomes full (e.g., when the future completes). The record keeping that the user runtime uses for synchronization is not visible to the compiler. Because the user runtime handles synchronous operations dynamically, your program must not deallocate a memory location when it is being used for synchronization. 
Functions that return to their caller before satisfying all pending synchronization on automatic variables can cause errors.

Shared Memory [7]

The Cray XMT system enables separate, unrelated processes to share data by mapping the same physical memory into the address space of each process. Each process that has the memory mapped can access the same data, which allows processes to communicate using memory. This approach is more efficient than creating a copy of the data in different parts of memory.

Physical memory that is shared by multiple programs is known as shared memory. Shared memory can be accessed by both synchronized and nonsynchronized memory operations.

An instance of shared memory is referred to as a shared memory region and is identified by the name of a file in a file system. To create a shared memory region, you create an empty file, open the file, then call mmap to allocate the shared memory. Other processes may then specify the same file name to identify the shared memory region and mmap it into their address spaces. The file name is used only for identification purposes. The data in the shared memory region exists only in memory while the system is operating and is not written into the file. For this reason, the data is lost across system reboots.

MTK does not support the traditional UNIX file mmap semantics where data contained in a file is read into memory and mapped into the process address space. Every program that shares memory can perform read-and-write operations on the data, depending upon its permissions.

Lightweight User Communication Library (LUC) [8]

The Lightweight User Communication Library (LUC) enables communication between Linux service nodes and the MTK operating system that runs on compute nodes.
An application can take advantage of the extended capabilities of the service node while its client applications running on the compute nodes perform computational tasks.

The LUC implementation uses a client/server remote procedure call (RPC) paradigm where communication occurs between endpoint objects that sit on both the client and server. For intersystem communication to occur, at least one client-side and one server-side endpoint must be active. Each endpoint encapsulates all of the information needed to make and respond to RPC calls between two processes.

When the client has a package of data that it needs to deliver to the server, it sends a message to the server endpoint over the high-speed network. On the client side, the LUC library allocates nearby memory in the user buffer for data storage. On the server side, the LUC library allocates nearby memory for the transfer of data and later either copies the data to distributed memory or leaves it in nearby memory. Data is transferred over Cray SeaStar chips using Cray Fast I/O.

The LUC programming model comprises the following components:

• Endpoints
• Data flow
• Communication flow

8.1 LUC Endpoints

In the LUC programming model, an endpoint is an instance of the LucEndpoint class, which is defined in the LUC class library. A LucEndpoint object can be a server object, a client object, or both a client and a server object. For each server object, the programmer provides a set of function calls that can be called from a client object. The LucEndpoint class provides the interface methods that the application uses to call the functions on the remote server. The properties and methods of LucEndpoint are described in Cray XMT Programming Environment User's Guide.

When an application allocates an instance of a LucEndpoint object, it specifies the service type as being one of the following: LUC_SERVER_ONLY, LUC_CLIENT_ONLY, LUC, or some other application-defined type.
If the object is a server object, the application calls the registerRemoteCall method to register its remote functions. LucEndpoint objects, whether client or server, are allocated in a passive state and are not able to respond to requests until activated. The application calls the startService method to activate the object. Linux applications may specify the number of threads that are to be created to service the RPC calls. The number of threads is dependent on the number and length of function calls that will be made—assign more threads to an object that is expected to have many lengthy calls, so that the object is not waiting for threads to be freed. The MTK version of the library creates the number of threads given as an argument to the startService call. The default thread count is 1. During object initialization, the LUC runtime assigns a system-wide unique identifier to each object. Because LUC does not provide a name service, the application must provide a way for clients to determine the Luc_endpoint_id associated with the servers that the clients will communicate with. When the service is ready to exit, the application calls the stopService method to deactivate the endpoint. This terminates any worker threads that were created by LUC. In addition, the Fast I/O data streams are closed, and all Fast I/O requests are cancelled. Figure 2 illustrates multiple LucEndpoint objects executing on compute (MTK) nodes and service (Linux) nodes. This shows how multiple LucEndpoint objects can execute on a single compute node or service node, and are multiplexed transparently through the single Cray SeaStar chip associated with the node. The multiple endpoints on the node share the available bandwidth. For simplicity, this figure shows each LucEndpoint as either a client or server; however, it is also possible for a single endpoint to serve as both a client and a server and a single server object can communicate with multiple client objects concurrently. 
In addition, a single compute node or service node process can host multiple LucEndpoint objects.

LucEndpoint objects are not inherited across fork or execve. When a process calls fork(2), the child process has the same memory image as the parent, but any LucEndpoint objects that were allocated in the parent are not active in the child process. Any calls the child process makes using the LucEndpoint object will fail and return an error code. When a process calls execve(2), any active LUC endpoint objects in the process will be deactivated.

Figure 2. LUC Endpoints
[Diagram: endpoints labeled A through E hosted on compute nodes (Threadstorm P0 and P1) and service nodes (Opteron S100 and S101), each node connected to the high-speed network (HSN) through its SeaStar chip. Annotations: a unique endpoint ID and type (client or server) are included in the data stored in each endpoint; client endpoints pass through the HSN, then communicate with the server endpoint specified in the RPC.]

8.2 LUC Data Flow

Figure 3 shows how data is moved through the system. MTK application buffers may be in distributed or nearby memory. Distributed memory buffers will be copied into nearby memory buffers by the LUC library. Linux memory is nearby and suitable for remote direct memory access (RDMA), so Linux buffers do not require memory copies. RDMA is a data transfer protocol in which the data is transmitted directly from the memory of one computer to the memory of another without involving the CPU.

The LUC library allocates nearby memory buffers suitable for data transfer and performs the necessary copies to nearby memory for RPC input data and from nearby memory for RPC output data. The Cray Fast I/O driver issues requests to the firmware to do an RDMA into the remote Linux node's preallocated memory buffers.

Figure 3. LUC RPC Data Flow
[Diagram: on the MTK side, a user buffer in distributed or nearby memory and a LUC buffer in nearby memory, joined by a copy; on the Linux side, a user buffer in nearby memory; RDMA connects the MTK and Linux sides.]

8.3 LUC Communication Flow

Figure 4 describes the communication that occurs between the client and server LucEndpoint objects when the client submits a request for a remote procedure call. The communications are implemented within the LUC library and are transparent to the application. They are presented here to describe how LUC works.

In this example, the client resides on a Linux service node and invokes a remote function call to a server that resides on an MTK compute node. If the remote function requires input data, the data is sent as a part of the remote function call request (for small input data), or it is sent as a separate packet (for large input data). This example illustrates the case where a large amount of input data is required.

When the server receives the RPC request, it allocates a nearby user-level memory buffer to receive the input data, then notifies the client. The client also has a user-level buffer, which contains the large input data residing in its address space. The Cray Portals driver issues requests to the firmware to do an RDMA into the remote MTK node's preallocated local memory buffers. The RDMA is performed directly from the user-level buffers into the compute node memory. No copying of data is required.

Figure 4. LUC Communication Flow
[Diagram: message exchange between a client and a server. Step labels include Send RPC Command, Receive RPC Command, Identify Input Buffer, Send Alive / OK to Send, Receive Alive / OK to Send, Send Large Input Data (Optional), Receive Large Input Data, Execute Function, Send RPC Response, Receive RPC Response, Send Large Output Data (Optional), and Receive Large Output Data. Legend: buffers allocated by the application and mapped by LUC; buffers allocated by LUC.]

When the RDMA completes, the firmware executing on the MTK Cray SeaStar chip queues a notification for the multiplexor driver.
The completion notification is passed along to the LUC library, and the remote procedure call that was waiting for the input data is executed.

Note: There is no copy operation; the remote procedure call uses the nearby memory input buffer provided by the LUC library.

When the remote function call completes, the output data is either sent as part of the RPC response (for small output data), or is sent as a separate packet (for large output data). If a separate output packet is required, the server remote function provides a user-level buffer that contains the output. This buffer may reside in global distributed memory. On the compute nodes, because the system interconnect can only perform RDMA operations to nearby memory, LUC copies the data from the user buffer into a nearby memory buffer that it has allocated for this purpose. LUC then issues a Fast I/O request to transfer the data to the user's output buffer. On the service nodes, the MTK Cray SeaStar driver performs the RDMA directly from the LUC output data buffer into the user's output buffer on the Linux node, and no copying of the data is required.

LUC provides both synchronous and asynchronous communication. The remoteCallSync method provides synchronous communication. The method returns when the remote function has been executed or an error has occurred. The return value is the value returned from the remote function. The remoteCall method provides asynchronous communication. The programmer provides the address of a callback function, LUC_Completion_Handler, as a parameter to the remoteCall method. The LUC run time calls LUC_Completion_Handler when the remote function completes or an error occurs.

A LUC operation fails if a message does not arrive at its intended destination, such as when the server is busy and drops a message. LUC has internal timeouts to detect the loss of messages, and returns an error to the application when a timeout occurs.
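Because a timeout reaches the application as an error return, a caller may want a bounded retry wrapper around each remote call. The following is a minimal sketch in plain C under stated assumptions: the status codes, the rpc_attempt_fn callback type, and both function names are hypothetical and are not part of the LUC API. A real application would put its own remoteCallSync invocation inside the callback and test for the library's actual timeout code.

```c
#include <assert.h>

/* Hypothetical status codes; the real LUC error codes are not shown here. */
enum { RPC_OK = 0, RPC_TIMEOUT = 1, RPC_OTHER_ERROR = 2 };

/* Hypothetical callback type: performs one attempt at a remote call. */
typedef int (*rpc_attempt_fn)(void *ctx);

/* Retry only on timeout, for at most max_attempts tries in total.
 * Any other status (success or a different error) is returned at once. */
static int call_with_retry(rpc_attempt_fn attempt, void *ctx,
                           int max_attempts, int *attempts_used)
{
    int status = RPC_TIMEOUT;
    int tries = 0;
    while (tries < max_attempts) {
        status = attempt(ctx);
        tries++;
        if (status != RPC_TIMEOUT)
            break;
    }
    if (attempts_used)
        *attempts_used = tries;
    return status;
}

/* Stand-in for a busy server: times out *ctx times, then succeeds. */
static int flaky_attempt(void *ctx)
{
    int *timeouts_left = (int *)ctx;
    if (*timeouts_left > 0) {
        (*timeouts_left)--;
        return RPC_TIMEOUT;
    }
    return RPC_OK;
}
```

Capping the number of attempts keeps a persistently oversubscribed server from stalling the caller indefinitely; whether to also pause between attempts is a design choice this sketch leaves out.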
RPCs may also fail with a timeout if the server is oversubscribed. This may occur on the first transfer in the LUC protocol for an RPC request, in which the RPC command buffer is sent from the client to the server, if there is no available buffer on the server to receive the RPC command. In all cases where the RPC fails due to a timeout, the application is responsible for retrying the RPC call.

The Snapshot Library [9]

The Cray XMT snapshot library provides a high speed bulk data transfer facility that moves data between memory regions within an MTK application and files hosted on the XMT Linux service partition. The primary use of the snapshot library is to load and save large data sets that are being stored on a Lustre file system. For example, an application might use the snapshot library to load a large data set at the beginning of a run, process the data, then use the snapshot library to save the processed data in a file at the end of a run. An application might also use the snapshot library to save intermediate copies of the processed data during the course of a run.

The snapshot library uses the Fast IO (FIO) mechanism on the compute partition to transfer data, in parallel, to and from files on the service partition using instances of a helper program called fsworker that provide file system access on login nodes. Multiple instances of fsworker are used in parallel to provide higher throughput.

Figure 5 shows the most common data communication paths between an application using the snapshot library and a file on the service partition. The data moves, in four distinct stages, between a global memory buffer in the application and a file on a Lustre file system hosted by the service partition.

Figure 5. Snapshot Library Data Paths
[Diagram: an application data buffer in global memory on the Threadstorm compute nodes; a snapshot client on each compute node connected through FIO to fsworker (FSW) instances on the Linux service partition; the FSW instances connected through Portals to Lustre OSS nodes, which reach the Lustre file system through FC.]

The easiest way to understand this is to imagine data going to a file from the application. In this case, the data is copied by each compute node into the FIO transport and sent to its corresponding fsworker on a login node in the Linux service partition. Each fsworker then uses Linux system calls to write data into the Lustre file, which results in the data moving across the Portals transport from the login node to one or more Lustre OSS nodes. From there, the data moves through Fibre Channel (FC) to the actual storage device. Moving data from a file to the application simply reverses the order of the stages and the direction of the data flow through each stage, ultimately resulting in data being copied from compute nodes into the application's global memory buffer.

Programming Scenarios [10]

This chapter provides two algorithms that illustrate concepts for developing an application for the Cray XMT system.

10.1 Creating a Dataflow Algorithm

Dataflow algorithms are used for computations when the order of operations is irregular or unknown at compile time. Logically, a dataflow algorithm is a graph where the nodes are operations, and the edges are the inputs and outputs of the operations. All operations whose input edges are full (that is, have data values) can execute.

Synchronized read and write operations are used as a lightweight mechanism to control the execution of computations. Synchronization allows operations to wait for inputs and write output edges that enable downstream neighbors to execute.
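The firing rule for a dataflow graph can be illustrated without XMT hardware. The sketch below is a single-threaded model in portable C, offered only as an illustration: the Edge and Node types and both function names are hypothetical, and a boolean flag stands in for the hardware full-empty bit. On the Cray XMT itself, the waiting would be done by synchronized memory operations rather than by a polling sweep.

```c
#include <assert.h>
#include <stdbool.h>

/* An edge carries a value plus a full/empty flag (a software stand-in
 * for the hardware full-empty bit). */
typedef struct { bool full; double value; } Edge;

typedef enum { OP_ADD, OP_MUL } Op;

/* A node reads two input edges and writes one output edge. */
typedef struct { Op op; Edge *in0, *in1, *out; bool fired; } Node;

/* Fire a node if both of its inputs are full; returns true if it ran. */
static bool try_fire(Node *n)
{
    if (n->fired || !n->in0->full || !n->in1->full)
        return false;
    double a = n->in0->value, b = n->in1->value;
    n->out->value = (n->op == OP_ADD) ? a + b : a * b;
    n->out->full = true;            /* enables downstream neighbors */
    n->fired = true;
    return true;
}

/* Sweep the graph until no node can fire; returns how many nodes fired. */
static int run_dataflow(Node *nodes, int count)
{
    int fired = 0;
    bool progress = true;
    while (progress) {
        progress = false;
        for (int i = 0; i < count; i++) {
            if (try_fire(&nodes[i])) {
                fired++;
                progress = true;
            }
        }
    }
    return fired;
}
```

For a two-node graph that computes (a + b) * c, the multiply node cannot fire until the add node has filled the edge between them, whatever order the nodes appear in the array. If any of the three input edges is left empty, no downstream node fires and run_dataflow returns without producing a result.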
Use the following generic functions to synchronize read and write operations:

• readfe(&addr)
• readff(&addr)
• writeef(&addr, value)

Deadlock can occur if no boundary conditions are initialized. The boundary conditions are the operations, or nodes, at the boundary of the dataflow graph. The boundaries can be executed immediately because they require either no inputs or inputs from previous statements. If they are not initialized, the interior operations will wait indefinitely for their data. To avoid this, you must initialize the boundary cells and purge the interior cells by using the purge(&addr) generic function.

To prevent deadlock, computations must be executed in the dataflow order. You can use loops for structured dataflow and breadth-first activation lists for unstructured dataflow. For more information, see the generics(3) man page.

Example 1. Three-point wavefront stencil using a dataflow algorithm

A simple example of a dataflow algorithm is a three-point wavefront stencil. In physics, a wavefront is the locus of points having the same phase (a surface, for a wave propagating in three dimensions). Consider a two-dimensional array of size Rank with the left column and top row set to 1, and interior elements that have the following value:

    a(i, j) = (a(i-1, j) + a(i-1, j-1) + a(i, j-1)) / 3,   0 < i < Rank and 0 < j < Rank

On the Cray XMT, the simplest code for a dataflow algorithm that uses these values is shown in the following example.

    /* Assumes a is declared as double a[RANK][RANK], so a[i] + j is &a[i][j]. */

    /* Loop 1: set every element to empty. */
    for (i = 0; i < RANK; i++)
        for (j = 0; j < RANK; j++)
            purge(a[i] + j);

    /* Loop 2: initialize the boundary; each store leaves its element full. */
    for (j = 0; j < RANK; j++) {
        a[0][j] = 1.0;
        a[j][0] = 1.0;
    }

    /* Loop 3: each interior element waits for its three upstream neighbors. */
    #pragma mta assert parallel
    #pragma mta interleave schedule
    for (i = 1; i < RANK; i++) {
        for (j = 1; j < RANK; j++) {
            double N  = readff(a[i-1] + j);
            double W  = readff(a[i] + j - 1);
            double NW = readff(a[i-1] + j - 1);
            double V  = (N + W + NW) / 3.0;
            writeef(a[i] + j, V);
        }
    }

The preceding example contains three for loops. The first loop sets the full-empty state of each element to empty, preventing downstream elements from executing until their upstream neighbors have computed their values. The second loop sets the left column and top row to 1, setting the full-empty state of each element to full. The third loop implements the dataflow algorithm. Each interior element tries to read its upstream neighbors. The read operations wait until the neighbors set the full-empty state to full. When the read operations are complete, the element calculates its value and writes it to memory. The write operation changes the full-empty state to full and allows the read operations executed by downstream neighbors to complete. In this manner, the algorithm flows from the upper left corner of the array to the lower right corner.

The order of iterations in the third loop must be reasonably close to the execution order of the operations. The purge(a[i] + j) statement in the first loop empties every cell; the second loop then initializes the boundary cells, leaving only the interior cells empty. Additionally, you should not use a block schedule for the third loop (for example, by using the #pragma mta block schedule statement). If you do use a block schedule, at first, only the stream assigned the first block of rows has operations to execute. In fact, at any one time, at most two streams can execute. The compiler must use interleaved or dynamic scheduling to permit many streams to execute simultaneously, to maximize parallelism.

10.2 Creating a Breadth-first Search

Graph algorithms such as breadth-first search map well to the Cray XMT programming paradigm due to features such as global shared memory, lightweight threads, and the full-empty state. Breadth-first search (BFS) is a well known and widely used graph algorithm. For BFS, let G = (V, E) be an undirected, connected graph, where V is the set of vertices and E is the set of undirected edges. Starting at a node S in V, BFS systematically visits all the nodes in V exactly once.
It explores the graph by first visiting all the neighbors of S, then neighbors of the neighbors, and so on.

Example 2. Breadth-first search

A simple recursive algorithm for BFS for the Cray XMT is shown in the following C++ example.

    void BFS(int node, graph *A)
    {
        int i;
        int *Visited = A->Visited;
        int *Neighbors = A->Neighbors;
        int *numNeighbors = A->numNeighbors;
        int firstNode = numNeighbors[node];
        int lastNode = numNeighbors[node + 1];

    #pragma mta assert parallel
    #pragma mta loop future
        for (i = firstNode; i < lastNode; i++) {
            int neighbor = Neighbors[i];
            int visited = int_fetch_add(&Visited[neighbor], 1);
            if (visited == 0)
                BFS(neighbor, A);
        }
    }

BFS has a for loop that iterates through the neighbors of node. The loop is preceded by two pragmas. The first tells the compiler that all the iterations can be run in parallel. The second says to use loop future parallelism for scheduling the loop. Loop future parallelism works well for recursive loops such as this one, because it allows for better load balancing across all levels of the parallelism. The int_fetch_add generic function is used to atomically increment the visited field; this guarantees that only one thread will execute the recursive call to BFS for a given node.

Cray XMT™ Debugger Reference Guide
S–2467–20

© 2001, 2005, 2007–2009, 2011 Cray Inc. All Rights Reserved. This document or parts thereof may not be reproduced in any form unless permitted by contract or by written permission of Cray Inc.

BSD Open Source License Notice:

Copyright (c) 2008, Cray Inc. All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. * Neither the name Cray Inc. nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. Your use of this Cray XMT release constitutes your acceptance of the License terms and conditions. U.S. GOVERNMENT RESTRICTED RIGHTS NOTICE The Computer Software is delivered as "Commercial Computer Software" as defined in DFARS 48 CFR 252.227-7014. All Computer Software and Computer Software Documentation acquired by or for the U.S. Government is provided with Restricted Rights. Use, duplication or disclosure by the U.S. Government is subject to the restrictions described in FAR 48 CFR 52.227-14 or DFARS 48 CFR 252.227-7014, as applicable. Technical Data acquired by or for the U.S. Government, if any, is provided with Limited Rights. Use, duplication or disclosure by the U.S. Government is subject to the restrictions described in FAR 48 CFR 52.227-14 or DFARS 48 CFR 252.227-7013, as applicable. 
AMD and AMD Opteron are trademarks of Advanced Micro Devices, Inc. GNU is a trademark of The Free Software Foundation. Opteron is a trademark of Advanced Micro Devices, Inc. Lustre is a trademark of Sun Microsystems, Inc. in the United States and other countries. UNIX, the “X device,” X Window System, and X/Open are trademarks of The Open Group in the United States and other countries. All other trademarks are the property of their respective owners.

RECORD OF REVISION

S–2467–20 Published May 2011
    Supports release 2.0 GA running on Cray XMT compute nodes and on Cray XT 3.1UP02 service nodes. This release uses the System Management Workstation (SMW) version 5.1UP03.

1.4 Published December 2009
    Supports release 1.4 running on Cray XMT compute nodes and on Cray XT 2.2.UP01 service nodes. This release uses the System Management Workstation (SMW) version 4.0.UP02.

1.3 Published March 2009
    Supports release 1.3 running on Cray XMT compute nodes and on Cray XT 2.1.5HD service nodes. This release uses the System Management Workstation (SMW) version 3.1.09.

1.2 Published August 2008
    Supports release 1.2 running on Cray XMT compute nodes and on Cray XT 2.0.49 service nodes. This release uses the System Management Workstation (SMW) version 3.1.04.

Contents

Overview [1]
    1.1 Prerequisites
        1.1.1 Loading the Module
        1.1.2 Compiling for Debugging
        1.1.3 Working Directories
        1.1.4 Environment Variables
    1.2 Getting Started
        1.2.1 Selecting a Program to Debug
            1.2.1.1 File Commands
            1.2.1.2 Module Commands
            1.2.1.3 Object Directory Commands
            1.2.1.4 Shared Library Directory Commands
        1.2.2 Running the Program
            1.2.2.1 Working Directory
            1.2.2.2 Program I/O
            1.2.2.3 Environment Variables
            1.2.2.4 Runtime Arguments
    1.3 Debugging a Currently Running Job
    1.4 Ending a Debugging Session
Breakpoints and Watchpoints [2]
    2.1 Breakpoints and Watchpoints
        2.1.1 Setting Breakpoints
            2.1.1.1 Special Breakpoint Situations
        2.1.2 Setting Watchpoints
        2.1.3 Deleting Breakpoints and Watchpoints
        2.1.4 Disabling Breakpoints and Watchpoints
        2.1.5 Break Conditions
        2.1.6 Commands Executed on Breaking
    2.2 Continuing
    2.3 Stepping
Understanding Multithreading [3]
    3.1 Thread Names
    3.2 Thread States
    3.3 Focus Thread
Examining the Stack [4]
    4.1 Stack Frames
    4.2 Backtraces
    4.3 Selecting a Frame
    4.4 Information on a Frame
Examining Source Files [5]
    5.1 Printing Source Lines
    5.2 Searching Source Files
    5.3 Specifying Source Directories
    5.4 Examining Instructions
Examining Data [6]
    6.1 Expressions
    6.2 Program Variables
    6.3 State Bits
    6.4 Artificial Arrays
    6.5 Format Options
    6.6 Output Formats
        6.6.1 Examining Memory
    6.7 Automatic Display
    6.8 Value History
    6.9 Convenience Variables
    6.10 Registers
    6.11 Register Examples
Examining Symbols [7]
    7.1 Archive Symbol Visibility
Altering Execution [8]
    8.1 Assignment to Variables
        8.1.1 Altering Variables Kept in Registers
    8.2 Changing the Full/Empty Bit
Stored Sequences of Commands [9]
    9.1 User-defined Commands
    9.2 Command Files
    9.3 Commands for Controlled Output
Options and Arguments for mdb [10]
    10.1 Mode Options
    10.2 File-specifying Options
    10.3 Communication Options and Variables
    10.4 Breakpoint-behavior Options
    10.5 Miscellaneous Options
    10.6 Other Arguments
Appendix A GNU General Public License
    A.1 Preamble
    A.2 Terms and Conditions
    A.3 How to Apply These Terms to Your New Programs
Appendix B Using mdb under GNU Emacs
Appendix C mdb Input and Output Conventions
Glossary

Overview [1]

This guide is for application programmers who develop code that runs on Cray XMT systems. The mdb debugger is based on the Free Software Foundation's GDB debugger (version 3.5), as adapted for use on Cray XMT systems. The mdb debugger provides both source-level and machine-level debugging.

While using the mdb debugger you can:

• Launch your program and specify conditions that might affect the behavior of the program
• Set breakpoints and watchpoints to make the program either suspend execution at a specified point in the code or upon meeting a specified condition
• Examine program information after such a stop to determine exactly what is happening at this point in the program execution
• Modify conditions and resume program execution as desired

Additional information is in the mdb(1) man page and in the mdb online help system, which you can view by entering help at the mdb command line prompt.

1.1 Prerequisites

Before using the mdb debugger, verify that the programming environment module is loaded and that your application was compiled for debugging, as described in Compiling for Debugging on page 8.

1.1.1 Loading the Module

The mdb debugger is included as part of the Cray XMT programming environment, which is available by loading the mta-pe module.
Programs intended for use on Cray XMT systems cannot be compiled or debugged in a cross-compiler environment. You must be logged into the Cray XMT system in order to use the programming environment.

    workstation% ssh -X XMT_system
    Password:
    XMT_system> module load mta-pe

To see which modules are available on your system, type module avail. To see which modules are loaded in your user environment, type module list.

1.1.2 Compiling for Debugging

When you debug a program, the level of detail mdb provides depends on the level of debugging information that was generated and stored in the program library when you compiled the program. This information includes the location and data type of each variable or function, as well as the relationship between lines in the source code and addresses in the executable code.

There is an inverse relationship between debugging information and compiler optimization. The more debugging information that you request when compiling the code, the less optimization can be performed by the compiler. As a result, code compiled for debugging runs more slowly than code compiled for normal execution.

Conversely, as the level of compiler optimization is increased, the speed of program execution increases, but the correlation between your source code and the resulting executable code decreases. In some cases, this means that mdb may not be able to perform those debugging operations that depend on a close correspondence between source and executable code. In these cases, the debugger typically generates a message warning that due to compiler optimizations the effect of the debugging operation cannot be guaranteed. However, there may be situations where the debugger lacks sufficient information to determine that this problem even exists.
Use the following compiler options to determine the level of debugging information generated and compiler optimizations performed when compiling your program.

-g2
    Most debugging information; least optimization. The compiler generates parallelism for future statements but does not automatically parallelize loops or other code. The debugger supports basic debugging operations at statement boundaries. At a breakpoint, you may read or modify any visible variable or memory state, and you may resume execution in ways that are fully consistent with your source code.

-g1 | -g
    Moderate debugging information and optimization. The compiler automatically parallelizes code, with some restrictions, based in part on the assumption that only volatile-qualified data is updated from outside the program. At a breakpoint, you may read any visible variable or memory state, and you may modify any volatile-qualified data and resume execution in ways that are fully consistent with your source code.

(omit -g option)
    Least debugging information; most optimization. If you omit the -g option, the compiler places no restrictions on optimizations and retains no debugging information. You can set breakpoints and pause and resume program execution, but the source and executable code may diverge substantially. You can read global data, but the debugger may not be able to find other variables.

For example, to compile your program and generate the most debugging information and perform the fewest compiler optimizations, use the following command:

    XMT_system> cc -g2 myprogram.c

Note: As part of the optimization process, the compiler may inline a routine by replacing a statement that calls a routine with the actual body of that routine, if the compiler determines that the resulting code will be more efficient. This optimization can occur at all debugging levels, and each inlined routine retains the debugging level that was specified when it was compiled.
For routines inlined by the compiler, in most debugging operations, mdb presents the debugging information as if the routine were invoked normally rather than inlined.

1.1.3 Working Directories

The mdb debugger inherits its working directory from the shell used to launch the debugging session. This working directory also serves as the default location for debugger commands that specify files. After you launch mdb, use the cd command to change the default directory. For more information, see Working Directory on page 16.

1.1.4 Environment Variables

Environment variables are typically used to define such things as your user name, home directory, terminal type, search path, and other conditions that affect program operation. The mdb debugger inherits its environment variable settings from the shell used to launch the debugging session. After you launch mdb, use the info environment and set environment commands to view and change environment variable settings. For more information, see Environment Variables on page 17.

1.2 Getting Started

To begin using mdb:

1. Log on to the Cray XMT system.

       workstation% ssh -X XMT_system

2. Load the Programming Environment module.

       XMT_system> module load mta-pe

3. If needed, change to your working directory.

       XMT_system> cd workdir

4. Enter the shell command mdb.

       XMT_system/workdir> mdb [-mtarun-args arguments] [program]

   Use the -mtarun-args option to pass runtime arguments through to mtarun. For more information, see Running the Program on page 16.

   If you specify a program name with the mdb command, the debugger reads the symbol table in the named file. If you do not enter a program name here, you can specify it later. For more information, see Selecting a Program to Debug on page 11.

After the mdb copyright information displays, you will have an mdb prompt:

    (mdb)

Note: The default debugger prompt is (mdb). If desired, you can change this to a user-defined prompt string.
For more information, see Appendix C, mdb Input and Output Conventions on page 95.

In addition to the -mtarun-args and program arguments, the mdb command supports many other options, including the use of a configuration file. For more information, see Chapter 9, Stored Sequences of Commands on page 79.

At this point, you are ready to begin debugging a program.

Note: In addition to the mdb(1) man page, the mdb debugger includes an online help system which you can access by entering help at the (mdb) prompt. The help command recognizes many keywords; for example, to find file-related help content, enter the following command:

    (mdb) help files

For more information, see the online help system.

1.2.1 Selecting a Program to Debug

There are two ways to specify the program you want to debug.

• Enter the name of the executable file when you start mdb.

      XMT_system/workdir> mdb a.out

  In this case, mdb reads the executable file and symbol table in as part of the startup process. If you have done this, skip to Running the Program on page 16.

• Alternatively, you can use mdb commands to load or swap files after mdb is running. For example:

      (mdb) file a.out

  In this case, mdb takes less time to start up and you have more control over which files are loaded and when they are loaded.

The following subsections describe the commands used to load or swap files after mdb is running.

1.2.1.1 File Commands

Use the following commands to load files or change files during a debugging session.

attach [pid] [device_filename]
    Attach to a process that was started outside of mdb. This command takes as its argument either a process ID or a device filename. For a process ID, you must have permission to send the process a signal, and it must have the same effective uid as the debugger. For a device filename, the file must be a connection to a remote debug server.
    Before using attach you must use the exec-file command to specify the program running in the process, and the symbol-file command to load its symbol table. Alternatively, use the file command, which performs both functions.

        (mdb) file filename
        (mdb) attach device_filename or pid

info files
    Print the path and name of the executable file currently loaded into mdb and the path and name of the file from which the symbol table was loaded. For example:

        (mdb) info files
        Executable file "/XMT_system/users/smith/a.out"
        Symbols from "XMT_system/users/smith/a.out"
        (mdb)

file filename
    Load the specified executable file into mdb, along with the associated symbol table. If you do not specify a directory and the file is not in the current working directory, mdb uses the environment variable PATH as a list of directories to search for the file. If a file is already resident, you are asked to confirm that you want to load the new file. For example:

        (mdb) file b.out
        Load new executable from "b.out"? (y or n) y
        Reading executable from XMT_system/users/smith/b.out
        Reading symbols from XMT_system/users/smith/b.out
        done
        (mdb)

exec-file filename
    Load only the executable file. Do not load the symbol table.

symbol-file filename
    Load the symbol table from the specified file. If you do not specify a directory and the file is not in the current working directory, mdb uses the environment variable PATH as a list of directories to search for the file.

    The symbol-file command does not actually read the symbol table in full. Instead, it scans the symbol table quickly to determine which source files and symbol tables are present. The details are read later, one source file at a time, as needed. The purpose of this two-stage reading strategy is to make mdb start up faster. For the most part, it is invisible to the user, except for occasional messages indicating that the symbol table details for a particular source file are being read.
    Use the set verbose command to control whether these messages are printed; for more information, see Appendix C, mdb Input and Output Conventions on page 95.

symbol-file
    To clear the symbol table, enter symbol-file without specifying a filename.

Note: While the file, exec-file, and symbol-file commands accept both absolute and relative file names as arguments, the file names are always stored as absolute file names.

Using the symbol-file command causes mdb to purge the contents of its convenience variables, value history, and all breakpoints and auto-display expressions. This is done because these values may contain pointers to the internal data recording symbols and data types that are part of the old symbol table which is being discarded.

1.2.1.2 Module Commands

The mdb debugger gets the information it needs to run your program not only from the executable file, but also from any module linked with your program. Note that in this context the term module does not refer to a software module, but rather to a source file and all of the files it explicitly includes, as well as the corresponding objects that result from compilation of the source module (typically, a portion of a program library or a traditional object file).

The information contained in a module includes type definitions, source line mappings, and the locations of local and static variables. The mdb debugger assimilates the information in each module of your program, and reads in a module when your program needs information from that module.

If, during a debugging session, you reach a point where mdb has yet to incorporate the module information you want to use (for example, the full definition of a type structure) and you know where the information is located, use the following commands to force mdb to load the appropriate module.

info modules
    Print the names of all modules in the program.
    If the module is compiled separately, it is listed separately. Constituents of larger files (for example, archives built with ar or program libraries produced by whole-program compilation) are listed as part of the parent library.

load modulename
    Read information from modulename, including type definitions and source file information. mdb uses the relative path information specified in the root program library to locate modulename. If you have moved your program executable after building it, use the set linkdir command to provide the necessary path information. For more information, see Specifying Source Directories on page 50.

1.2.1.3 Object Directory Commands

The path to each object file used by the linker is recorded in the root program library. If the path is absolute, it is used as-is. If the path is relative, mdb appends the path to the linkdir variable. (For more information about the linkdir variable, see Specifying Source Directories on page 50.) However, if you provide mdb with an object file search path, the debugger looks for an object file first in the directories in the search path, and then in the path used by the linker.

The information from an object file that is already loaded into mdb is not affected by later modifications to the object path.

When you start mdb, the object file search path is empty. Use the following commands to add or change object file search paths.

info objdirectories
info objdir
info obj
    Print the current object search path as a list of directories.

objdirectory dirname
objdir dirname
obj dirname
    Add directory dirname to the front of the object search path. You may specify several directory names with this command, by separating the directories with a colon (:) or a blank space. If you specify a directory that is already in the object search path, it is moved forward and searched earlier.

objdirectory
    To clear the search path, enter objdirectory without specifying a dirname.
    You are asked to confirm that you want to clear the object file search path.

1.2.1.4 Shared Library Directory Commands

Each executable file contains a list of shared libraries, along with the set of paths to the shared libraries that was specified when the executable was linked. The mdb debugger reads in the symbols for the executable from each of the shared libraries, using these paths to find the libraries. However, the shared libraries or the executable may be moved between compilation and debugging.

The mdb debugger uses the shared library path to provide a list of directories to search for shared library files. For each shared library, the debugger tries the directories in the list, in order, until it finds a file with the desired name. The set of shared library paths from the executable is permanently affixed to the end of the list.

When you start mdb, the shared library search path is empty. Use the following commands to add or change shared library search paths.

info sharedlibpath
    Print the shared library path, and show which directories it contains.

sharedlibpath dirname
    Add directory dirname to the front of the shared library search path. You may specify several directory names with this command, by separating the directories with a colon (:) or a blank space. If you specify a directory that is already in the path, it is moved forward and searched earlier.

    If you know before the start of the debugging session that you need to use the sharedlibpath command, start the debugger without using a file name argument. After the debugger is initialized, use the sharedlibpath command to specify a list of directories, and then use the file command to load the executable and read in the symbol table.

sharedlibpath
    To clear the search path, enter sharedlibpath without specifying a dirname. You are asked to confirm that you want to clear the shared library search path.
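
The search-path behavior described for objdirectory and sharedlibpath (new directories go to the front, a directory already present is moved forward, and the paths recorded in the executable stay at the end) can be modeled in a few lines. The following C++ sketch is only an illustration of those rules; the class SearchPath and its member names are hypothetical and are not part of mdb.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical model of a debugger search path; illustrative only, not mdb code.
class SearchPath {
public:
    // Add one directory to the front. A directory that is already in the
    // path is removed from its old position first, so it moves forward.
    void addFront(const std::string& dir) {
        assert(!dir.empty());
        dirs_.erase(std::remove(dirs_.begin(), dirs_.end(), dir), dirs_.end());
        dirs_.insert(dirs_.begin(), dir);
    }

    // Clear the user-supplied directories (mdb asks for confirmation first).
    void clear() { dirs_.clear(); }

    // User-supplied directories, in search order.
    const std::vector<std::string>& dirs() const { return dirs_; }

    // Full lookup order: user directories first, then the paths recorded
    // in the executable at link time, which always stay at the end.
    std::vector<std::string> lookupOrder(const std::vector<std::string>& linked) const {
        std::vector<std::string> order(dirs_);
        order.insert(order.end(), linked.begin(), linked.end());
        return order;
    }

private:
    std::vector<std::string> dirs_;
};
```

In this model, adding /a, then /b, then /a again leaves the user path as /a followed by /b, and clearing the path leaves only the link-time paths in the lookup order. How mdb orders several directories given in a single command is not covered by this sketch.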
1.2.2 Running the Program

After your program is loaded, use the run command to execute it.

    (mdb) run

The run command creates an inferior process, loads the program into the inferior process, and sets it in motion.

The execution of your program is affected by certain information the inferior process receives from its superior. The mdb debugger provides ways to specify this information, which you must do before executing the program. (You can change runtime conditions after starting the program, but these changes do not take effect until the program is restarted.) The following subsections discuss the different runtime conditions.

1.2.2.1 Working Directory

Each time you start your program with run, it inherits its working directory from the current working directory of mdb. The mdb debugger in turn inherits its working directory from its parent process, which is typically the shell.

The mdb working directory also serves as the default directory for the file-handling commands described in Selecting a Program to Debug on page 11. Use the following commands to view or reset the working directory.

pwd
    Print the current working directory.

cd directory
    Reset the working directory to directory.

Note: The ls command cannot be used within an mdb debugging session.

1.2.2.2 Program I/O

By default, a program run under mdb pipes I/O to the same terminal that is used by mdb. Use sh-style redirection commands in the run command to redirect input and output. For example, to start your program and redirect its output to the file outfile, enter this command.

    (mdb) run > outfile

1.2.2.3 Environment Variables

Environment variables are used to specify your user name, home directory, search paths, and so on. The mdb debugger inherits its environment variables from the shell session used to start the debugging session. There is one environment variable that is specific to mdb.
MDB_MTARUN_ARGS
    If this environment variable is set when mdb is invoked, the contents of MDB_MTARUN_ARGS are passed along as command-line arguments to mtarun, to be used when your program is executed.

    Note: If this environment variable is set and mdb is invoked using the -mtarun-args option, the arguments listed in the -mtarun-args option take precedence.

Use the following commands to view and change the values of environment variables.

info environment
    Print the names and values of all environment variables currently set. You can abbreviate this command to i env.

info environment varname
    Print the value of the environment variable varname. You can abbreviate this command to i env varname.

set environment varname [value]
set environment varname=[value]
    Set the environment variable varname to value. The value is optional; if it is omitted, the variable is set to a null value. You can abbreviate this command to set e varname value.

    When set from within a debugging session, the environment variable value applies only to the program being debugged. When you exit from the debugging session, the environment variable is restored to its previous value or state.

unset environment varname
delete environment varname
    Remove the environment variable varname from the environment. This is different from using set environment to set the variable to a null value, because removing the variable renders it undefined. You can abbreviate this command to d e varname.

    When unset from within a debugging session, the environment variable no longer applies to the program being debugged. However, when you exit from the debugging session, the environment variable is restored to the value or state it had before entering the debugging session.

1.2.2.4 Runtime Arguments

In normal operations, many programs require the use of runtime arguments appended to the mtarun command in order to run.
There are several ways to pass these runtime arguments into a debugging session:

• Use the mdb command -mtarun-args arguments option when starting the debugger to specify mtarun arguments. If you use this method, the arguments you specify remain in force unless superseded within the debugging session.
• You can set the MDB_MTARUN_ARGS environment variable, either before or after starting the debugging session. If it is set before starting the debugging session and you invoke mdb using the -mtarun-args option, the -mtarun-args arguments supersede the environment variable values. If you set it after starting the debugging session, the environment variable values override the -mtarun-args arguments (if used), but are unset when you exit the debugging session.
• After starting the debugging session, use the set mtarun-args command from within mdb to specify mtarun arguments. The values you set remain in force until superseded or unset, or until the end of the debugging session.
• You can use the Cray Extensions to the mdb command line arguments. For more information, see the mdb(1) man page.

Additionally, your program may require runtime arguments specific to your program. These can be set from within the debugging session by using the set args arguments command prior to issuing the run command. The arguments you set this way remain in force until superseded or until the end of the debugging session; to unset these arguments, use the set args command with no arguments.

1.3 Debugging a Currently Running Job

The most straightforward way to debug a running job is to issue the mdb command with the following arguments:

    mdb file pid

where file is the running program and pid is its process ID. In this case, mdb attaches automatically to a process that was started outside of mdb.
Alternatively, if mdb is already running, use the following sequence of commands to load the executable file and its associated symbol file, then attach to the running process:

    (mdb) exec-file filename
    (mdb) symbol-file filename
    (mdb) attach device_filename or pid

The attach command takes as an argument either a process ID or a device filename. To use a process ID, you must have permission to send the process a signal, and it must have the same effective uid as the debugger. To use a device filename, the file must be a connection to a remote debug server.

1.4 Ending a Debugging Session

To end a debugging session and exit mdb, enter either quit or q at the (mdb) prompt. You will exit to your current working directory.

    (mdb) quit
    XMT_system/workdir>

The Ctrl-C command does not exit from the debugger, but rather terminates the action of any debugger command currently in progress and returns to the (mdb) prompt. It is generally safe to use Ctrl-C at any time, because the debugger attempts to synchronize the interrupt to a time when it is safe. However, there is a possibility that using Ctrl-C during expression evaluation may leave locks in a held state.

To kill the currently running inferior process, use the mdb kill command. Be aware that, on large systems, the kill command may take some time to complete. The default time-out for the kill command is 100 seconds. Use the mdb set kill-timeout option to change this value.

Breakpoints and Watchpoints [2]

The primary purpose of using a debugger is to stop your program before its planned point of normal termination, or to examine it if it fails to run that far, so that you can investigate its behavior and find out what went wrong.

When mdb stops your program, all threads stop. The state of your entire program is suspended, and you can examine and modify the state, depending on the debugging level with which you compiled the source code.
2.1 Breakpoints and Watchpoints

A breakpoint stops all the threads in your program whenever some thread reaches a certain point in the program. You set breakpoints explicitly with mdb commands, specifying the place where the program should stop by line number, function name, or exact address in the program. You can add other conditions to control whether the program stops.

A watchpoint is a data breakpoint that stops all threads in your program when a watched expression changes. A watched expression stops the program when its value is written, though not necessarily changed, or when the full/empty bit of any constituent memory word changes (see State Bits on page 55). The full/empty bit is the only state bit that can toggle as a result of a read; the others change only during writes. For example, an assignment of zero to a variable whose previous value is zero stops the program. Similarly, a read of a watched sync variable stops the program, because the full/empty bit changes from full to empty.

After suspending your program, mdb tells you the previous value of the changed watchpoint expression. You can see the new value by printing the expression.

When mdb suspends your program due to a watchpoint, the current instruction of a thread that changed a watched expression may be some distance beyond the instruction that triggered the watchpoint. For example, a watchpoint expression may have been changed when control was in the previous stack frame. This long stopping distance is caused by a combination of instruction pipelining by the hardware, multiple operations per instruction, jump operations, and compiler optimizations. To reduce this effect, compile your program at a lower optimization level.

Watchpoints in mdb are as efficient to use as breakpoints. The implementation of watchpoints is based on the hardware trap bits associated with each data memory word (see State Bits on page 55).
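The trigger rule for a watched expression described above can be sketched as a small Python model. This is only an illustration of the rule as stated in the text, not how mdb is implemented; WordState and watch_triggers are hypothetical names introduced for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WordState:
    """Toy snapshot of one watched memory word (hypothetical, for illustration)."""
    value: int
    full: bool  # the full/empty state bit


def watch_triggers(before: WordState, after: WordState, written: bool) -> bool:
    """Return True if the access would stop the program under the stated rule.

    The rule modeled here: any write triggers, even one that stores the same
    value, and any change of the full/empty bit triggers, even on a read.
    """
    return written or before.full != after.full


# Writing zero over zero still triggers, because the word was written.
same_value_write = watch_triggers(WordState(0, True), WordState(0, True), written=True)

# Reading a sync variable toggles full -> empty, so the read triggers.
sync_read = watch_triggers(WordState(7, True), WordState(7, False), written=False)

# A read that leaves the full/empty bit alone does not trigger.
plain_read = watch_triggers(WordState(7, True), WordState(7, True), written=False)

print(same_value_write, sync_read, plain_read)  # True True False
```

The model compares state before and after an access, which mirrors why mdb can report the previous value of the expression after the program stops.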
Watchpoints and breakpoints are differentiated by the commands you use to create them (see Setting Breakpoints on page 23 and Setting Watchpoints on page 26). Most of the commands for enabling, disabling, and deleting breakpoints also apply to watchpoints (see Deleting Breakpoints and Watchpoints on page 26 and Disabling Breakpoints and Watchpoints on page 27).

Each breakpoint and watchpoint is assigned a number when it is created; these numbers are successive integers starting with 1. In many of the commands for controlling various features of breakpoints and watchpoints, you use this number to say which point you want to change. Each breakpoint or watchpoint may be enabled or disabled; if a point is disabled, it has no effect on the program until you enable it again.

The command info breakpoints or info watchpoints prints a list of each breakpoint and watchpoint that is set but not deleted: its number, type (breakpoint or watchpoint), disposition (whether the point is marked to be disabled or deleted when reached), whether or not the point is enabled, where in the program it is, and any special features in use for the point (conditions, command sets). Disabled points are included in the list, but marked as disabled (not enabled).

You can abbreviate info breakpoints as info break or even i b. info break with an integer argument lists only the associated breakpoint or watchpoint.

The following example shows a breakpoint on main, and a watchpoint on the variable foo.

[1] (mdb) info breakpoints
Num Type  Disp En Address What
1   break keep y  0x524   in main (/home/users/xxx/main.c line 33)
2   watch keep y          foo

(Unlike GDB, info break or info watch in mdb does not set either the convenience variable $_ or the default examining-address for the x command.)

When your program stops due to a breakpoint, mdb prints out the name of the function containing the breakpoint and the function argument values.
You can cause mdb to omit the argument values by issuing the set print-function-args off command (see Format Options on page 57).

When a set of threads hits multiple breakpoints or watchpoints simultaneously, mdb displays the names of the threads that hit them. If the thread that previously had the focus is among the stopped threads, it retains the focus. However, if this thread has expired or is not stopped because of a breakpoint, watchpoint, step, or fatal error, mdb arbitrarily chooses a stopped thread to be the focus. All breakpoint commands are executed at this time.

2.1.1 Setting Breakpoints

Breakpoints are set with the break command (abbreviated b). You have several ways to say where the breakpoint should go. Two ways require mdb to be focused: break without any argument and break with an offset argument.

break function

Set a breakpoint at entry to function function. (If your program is linked with an archive, the state of the mdb visibility variable may either affect your ability to access function or determine in which of several functions named function the breakpoint is set. See Archive Symbol Visibility on page 70 for details.)

break +offset, break -offset

Within the current source file, set a breakpoint some number of lines forward or back from the position at which the focus thread stopped in the currently selected frame. The focus thread and selected frame determine the current source file: the current source file contains the line at which the focus thread stopped executing in the selected frame. (See Chapter 5, Examining Source Files on page 47.)

break linenum

Set a breakpoint at line linenum in the current source file. The breakpoint stops the program immediately before any thread executes any of the code on that line.

You may not be able to set a breakpoint by line number within a file-static function compiled without debugging information.
The compiler may inline these functions at every call and not maintain a stand-alone version. If this is the case, mdb has insufficient information to set a breakpoint by line number at an arbitrary, inlined instruction of the function. However, you can set a breakpoint using the function name, which sets a breakpoint at the first line of every inlined instance of the function.

(mdb) break 10
No line number 10.
(mdb) break foo
Breakpoint 1 at (0:0x412) (main): foo.c line 10.

(In the last line above, (main) is an artifact of the program not having a stand-alone implementation of foo.)

break filename:linenum

Set a breakpoint at line linenum in source file filename.

break filename:functionname

Set a breakpoint at entry to function functionname found in file filename. Specifying a file name as well as a function name is superfluous except when multiple files contain similarly named functions.

break *address

Set a breakpoint at address address. You can use this to set breakpoints in parts of the program that do not have debugging information or source files. If the instruction at address is inlined code, mdb does not set any additional breakpoints in corresponding locations in the stand-alone version or other inlined instances; here mdb does not maintain the illusion of a normal function call.

break

Set a breakpoint at the next instruction to be executed by the focus thread in the selected stack frame (see Chapter 4, Examining the Stack on page 41).

break ... if cond

Set a breakpoint with condition cond; evaluate the expression cond each time the breakpoint is reached, and stop only if the value is nonzero. ... stands for one of the possible arguments described above (or no argument) specifying where to break. See Break Conditions on page 28 for more information on breakpoint conditions.

tbreak args

Set a breakpoint enabled only for one stop.
args are the same as in the break command, and the breakpoint is set in the same way, but the breakpoint is automatically disabled the first time it is hit. See Disabling Breakpoints and Watchpoints on page 27.

mdb allows you to set any number of breakpoints at the same place in the program. This is useful when the breakpoints are conditional (see Break Conditions on page 28).

2.1.1.1 Special Breakpoint Situations

If you set a breakpoint by any means other than break *address, and the breakpoint is within code that has been inlined by the compiler, mdb maintains the illusion of a normal function call. The breakpoint appears to be set within the body of the function, and thus is set in all other inlined copies of the function. If you set a breakpoint by its address, and the breakpoint is within inlined code, mdb creates a single breakpoint.

If you set a breakpoint on a line that contains a future statement, the break occurs at the first statement within the future body, rather than immediately before the future statement is executed.

 8  foo(){
 9    j = 5;
10    future i () {
11      k += 3;

The command break 10 places a breakpoint immediately before the statement in line 11, as opposed to immediately after the statement in line 9. Setting a breakpoint on a line with a future and another statement, however, can cause the break to take place outside the future body.

mdb does not allow breakpoints on a small set of instructions (instructions that contain a MAC operation, or any of the following operations: DATA_OPA_SAVE, DATA_OPD_SAVE, DATA_OP_REDO, LEVEL_ENTER, LEVEL_RTN, RESULTCODE_SAVE, TRAP_RESTORE, or TRAP_SAVE). If you try to set a breakpoint on one of these instructions, mdb will ask you to choose a different instruction for the breakpoint.
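The two placement rules in this section (inlined copies, and the instructions that cannot carry a breakpoint) can be summarized in a short Python sketch. It is a toy model under stated assumptions, not mdb's implementation: the function names, the idea of representing an instruction as a list of operation names, and the "MAC" label are all hypothetical; only the excluded operation names come from the text.

```python
# Operation names are copied from the text. "MAC" is a placeholder label for
# "an instruction that contains a MAC operation" (the real encoding is unknown).
NO_BREAK_OPS = frozenset({
    "MAC", "DATA_OPA_SAVE", "DATA_OPD_SAVE", "DATA_OP_REDO",
    "LEVEL_ENTER", "LEVEL_RTN", "RESULTCODE_SAVE",
    "TRAP_RESTORE", "TRAP_SAVE",
})


def breakpoint_allowed(operations):
    """True if none of the instruction's operations is in the excluded set."""
    return NO_BREAK_OPS.isdisjoint(operations)


def breakpoint_sites(copies, address=None):
    """Return the addresses that would receive a breakpoint.

    copies: hypothetical addresses of one source location, one per inlined
    copy (plus the stand-alone version, if any).
    address: set when the request was of the form break *address.
    """
    if address is not None:
        # Break by address: a single breakpoint, no illusion of a normal call.
        return [address]
    # Break by line or function name: every copy gets the breakpoint.
    return sorted(copies)


print(breakpoint_allowed(["TRAP_SAVE"]))                     # False
print(breakpoint_sites({0x7A0, 0x412}))                      # [1042, 1952]
print(breakpoint_sites({0x7A0, 0x412}, address=0x7A0))       # [1952]
```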
2.1.2 Setting Watchpoints

Use a watchpoint to stop your program immediately after the value of an expression is written, without having to identify the thread updating the expression or the instruction where the modification takes place.

Watchpoints are set with the watch command (abbreviated w) and an expression (see Expressions on page 53).

watch expression

Set a watchpoint on expression. If expression is a data memory address, the address is preceded with an asterisk.

[1] (mdb) watch *0x40102bc000

watch ... if cond

Set a watchpoint with condition cond; evaluate the expression cond each time the watched expression changes, and stop only if the value is nonzero. ... stands for a new expression or the number of a previous watchpoint. To ensure that cond is in scope, cond cannot reference any stack variables. See Break Conditions on page 28 for more information on watchpoint conditions.

twatch expression

Set a watchpoint enabled only for one stop. The watchpoint is automatically disabled the first time it is hit. See Disabling Breakpoints and Watchpoints on page 27.

info watchpoints

This command prints a list of watchpoints and breakpoints. It is the same as info break.

2.1.3 Deleting Breakpoints and Watchpoints

With the clear command you can delete breakpoints according to where they are in the program. With the delete command you can delete an individual breakpoint or watchpoint by specifying its number.

If your program stopped because one or more breakpoints were hit, it is not necessary to delete the breakpoints for the breaking threads to proceed past them. mdb automatically ignores all breakpoints in the first instruction to be executed by each breaking thread.

clear

Delete any breakpoints at the next instruction to be executed by the focus thread in the selected stack frame (see Selecting a Frame on page 43).
When the innermost frame is selected, this is a good way to delete the breakpoint at which the focus thread is stopped.

clear function, clear filename:function

Delete any breakpoints set at entry to the function function.

clear linenum, clear filename:linenum

Delete any breakpoints set at or within the code of the specified line.

delete bnum(s)

Delete the breakpoint(s) or watchpoint(s) of the numbers specified as arguments.

Note: If you delete or disable a watchpoint while it is being processed, your program may behave in incorrect or undefined ways. Before deleting or disabling watchpoints, check each thread to make sure none is in the vicinity of the use of a watched location. If you find one, try advancing it using step or next, or continue to hit the watchpoint.

2.1.4 Disabling Breakpoints and Watchpoints

You disable and enable breakpoints and watchpoints with the enable and disable commands, specifying one or more numbers as arguments. Use info break or info watch to print a list of breakpoints and watchpoints if you do not know which numbers to use.

A breakpoint or watchpoint can have any of four different states:

• Enabled. The point stops the program. A breakpoint made with the break command or a watchpoint made with the watch command starts out in this state.
• Disabled. The point has no effect on the program.
• Enabled once. The point stops the program, but when it does so, it becomes disabled. A breakpoint made with the tbreak command or a watchpoint made with the twatch command starts out in this state.
• Enabled for deletion. The point stops the program, but immediately after it does so, it is deleted permanently.

You change the state of a breakpoint or watchpoint with the following commands:

disable breakpoints bnum(s)
disable bnum(s)

Disable the specified breakpoint(s) or watchpoint(s). A disabled point has no effect but is not forgotten.
All options such as ignore counts, conditions, and commands are remembered in case the breakpoint or watchpoint is enabled again later.

enable breakpoints bnum(s)
enable bnum(s)

Enable the specified breakpoint(s) or watchpoint(s).

enable breakpoints once bnum(s)
enable once bnum(s)

Enable the specified breakpoint(s) and watchpoint(s) temporarily. Each will be disabled again the next time it stops the program (unless you have used one of these commands to specify a different state before that time comes).

enable breakpoints delete bnums
enable delete bnums

Enable the specified breakpoints and watchpoints to work once and then be deleted. Each point is deleted the next time it stops the program (unless you have used one of these commands to specify a different state before that time comes).

2.1.5 Break Conditions

You can also specify a condition for a breakpoint or a watchpoint. A condition is a boolean expression in your programming language. (See Expressions on page 53.) A breakpoint or watchpoint with a condition evaluates the expression each time a thread reaches the breakpoint or modifies the watched expression, and the program stops only if the condition is true. Because the expression cond must always be in scope for watchpoints, cond cannot reference any stack variables.

Note: Avoid break conditions with side effects, such as printing diagnostics, updating counters, or using generic functions on sync or future variables (see Changing the Full/Empty Bit on page 75). The behavior of break conditions with side effects is unpredictable: a break condition may be evaluated multiple times when the breakpoint or watchpoint is hit, and the evaluation order of conditions for several breakpoints sharing the same address or several watchpoints associated with a common memory word is undetermined.

Use breakpoint or watchpoint commands as an alternative to break conditions with side effects.
Command sets behave predictably and are usually more convenient and flexible for the purpose of performing side effects when a breakpoint or watchpoint is reached (see Commands Executed on Breaking on page 30).

You can specify break conditions when a breakpoint or watchpoint is set by using if in the arguments to the break or watch command. See Setting Breakpoints on page 23 and Setting Watchpoints on page 26. Break conditions can also be changed at any time with the condition command.

condition bnum expression

Specify expression as the break condition for breakpoint or watchpoint number bnum. From now on, this point stops the program only if the value of expression is true (nonzero, in C). expression is not evaluated at the time the condition command is given. See Expressions on page 53 for more information.

condition bnum

Remove the condition from breakpoint or watchpoint number bnum. It becomes an ordinary unconditional breakpoint or watchpoint.

A special case of a breakpoint or watchpoint condition is to stop only when the point has been reached a certain number of times. Every breakpoint and watchpoint has an ignore count, which is an integer. Most of the time, the ignore count is zero, and therefore has no effect. But if a thread reaches a breakpoint or watchpoint whose ignore count is positive, then instead of stopping the program, it decrements the ignore count by one and continues. As a result, if the ignore count value is n, the breakpoint or watchpoint does not stop the program the next n times it is reached.

ignore bnum count

Set the ignore count of breakpoint or watchpoint number bnum to count. The next count times the point is reached, it will not stop. To make the breakpoint or watchpoint stop the next time it is reached, specify a count of zero.

cont count

Continue execution of all threads in the program, setting the ignore count of the breakpoint or watchpoint that the program stopped at to count minus one.
Thus, the program does not stop at this point again until the count-th time it is reached. This command is allowed only when the program stopped due to a breakpoint or watchpoint. At other times, the argument to cont is ignored. See Continuing on page 31 for more information.

Note: If a breakpoint or watchpoint has both a condition and an ignore count greater than 0, the condition is not checked.

2.1.6 Commands Executed on Breaking

You can give any breakpoint or watchpoint a series of commands to execute when the program stops due to that point. For example, you might want to print the values of certain expressions, or enable other breakpoints or watchpoints.

commands bnum

Specify commands for breakpoint or watchpoint number bnum. The commands themselves appear on the following lines. Type the end command to terminate the commands. To remove all commands from a breakpoint or watchpoint, use the command commands and follow it immediately by end; that is, give no commands. With no arguments, commands refers to the last breakpoint or watchpoint set.

When your program is suspended due to a set of breakpoints or watchpoints being hit simultaneously, for each breaking thread mdb executes the command sequence of each point in the set, in an arbitrary order. mdb automatically resumes execution only if each breakpoint or watchpoint in the set has a command sequence and each command sequence includes the cont command. For each command sequence, mdb ignores all continuation commands other than cont, as well as all commands that follow any continuation command, including those after cont. For example, mdb ignores step or finish in a command sequence, along with any commands that follow.
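The command-sequence rules above can be sketched in Python as follows. This is a reading of the text under stated assumptions, not mdb's implementation: effective_commands and auto_resume are hypothetical names, the set of "continuation commands" is assumed to be cont plus the stepping commands this guide describes, and a cont that appears after an ignored continuation command is assumed not to count.

```python
# Assumption: "continuation commands" means cont plus the stepping commands
# described in this guide. The text names only cont, step, and finish.
CONTINUATION = {"cont", "step", "next", "finish", "until", "stepi", "nexti"}


def effective_commands(sequence):
    """Return (commands that would run, whether the sequence requests cont).

    Everything from the first continuation command onward is dropped; only
    cont itself is honored as a request to resume.
    """
    kept = []
    for command in sequence:
        words = command.split()
        name = words[0] if words else ""
        if name in CONTINUATION:
            return kept, name == "cont"
        kept.append(command)
    return kept, False


def auto_resume(sequences):
    """True only if every point that was hit has a sequence requesting cont.

    sequences: one command list per breakpoint or watchpoint in the hit set;
    an empty list (or None) stands for a point with no commands.
    """
    if not sequences:
        return False
    return all(bool(seq) and effective_commands(seq)[1] for seq in sequences)


print(effective_commands(["silent", "output x", "cont", "echo late"]))
# (['silent', 'output x'], True)
print(effective_commands(["output x", "step", "cont"]))
# (['output x'], False)
print(auto_resume([["cont"], ["output x", "cont"]]))  # True
print(auto_resume([["cont"], []]))                    # False
```

The second call shows the consequence of the assumption noted above: because step ends the sequence, the cont that follows it is never seen.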
If you use a stepping command like step or next to advance a thread to an instruction that has some number of breakpoints, each of which has a command sequence that includes a cont command, mdb executes each command sequence but does not resume execution.

If the first command specified in a breakpoint or watchpoint command sequence is silent, the usual message about stopping at a breakpoint is not printed. This may be desirable for points that are to print a specific message and then continue. If the remaining commands also print nothing, you will see no sign that the breakpoint or watchpoint was reached at all. silent is not really a command; it is meaningful only at the beginning of the commands for a breakpoint or watchpoint.

The commands echo and output that allow you to print precisely controlled output are often useful in silent breakpoints or watchpoints. See Commands for Controlled Output on page 81.

For example, here is how you could use breakpoint commands to print the value of x at entry to foo whenever it is positive.

[1] (mdb) break foo if x>0
[1] (mdb) commands
silent
echo x is\040
output x
echo \n
cont
end

One application for breakpoint commands is to correct one bug so you can test another. Put a breakpoint immediately after the erroneous line of code, give it a condition to detect the case in which something erroneous has been done, and give it commands to assign correct values to any variables that need them. End with the cont command so that the program does not stop, and start with the silent command so that no output is produced. Here is an example:

[1] (mdb) break 403
[1] (mdb) commands
silent
set x = y + 4
cont
end

A similar pseudo-command is once. When multiple threads hit a breakpoint at the same time, by default all of the commands for that breakpoint will be executed for each thread. If the first command specified is once, then the commands will be executed for only one of the threads.
The silent and once commands may be used together; for example, if silent is the first command then once may be second.

2.2 Continuing

After your program stops, you will most likely want it to run some more if the bug you are looking for has not yet occurred.

cont

Continue running the entire program from the current suspended state; all threads are active (not only the focus thread).

cont/1

Continue running only the focus thread from the current suspended program state; all other threads remain suspended by mdb, regardless of individual thread state.

If the program stopped because a thread hit a breakpoint, you might expect that continuing would stop immediately at the same breakpoint when it was hit again by the same thread. In fact, cont takes special care to prevent that from happening. You do not need to delete the breakpoint to proceed through it after stopping at it.

You can, however, specify an ignore count for the breakpoint that the program stopped at, by means of an argument to the cont command. See Break Conditions on page 28.

2.3 Stepping

Stepping means setting only the focus thread in motion for a limited time, so that control returns automatically to the debugger after the focus thread executes one line of code or one machine instruction. While the focus thread is stepping, all other threads remain suspended.

During a step, the focus thread may toggle the full/empty state (see State Bits on page 55) of a sync or future variable on which threads are blocked waiting for the opposite full/empty state. If this happens, the toggling read or write triggers in the same step an action that satisfies the request of one of the blocked threads to read or write the variable (possibly changing the value of the variable or the full/empty state).
This type of read or write by an initially blocked thread is the only non-focus thread activity that occurs during stepping; threads unblocked in this fashion remain suspended while the current focus thread is stepping, though in a state that differs as a result of the read or write assisted by the trap handler.

mdb must be focused on a thread (see Focus Thread on page 37) to execute any of the following stepping commands.

step

Continue running the focus thread until it reaches a different line, then stop the focus thread and return control to the debugger. This command is abbreviated s.

This command may be given when a thread is within a function for which there is no debugging information. In this case, execution proceeds until the thread reaches a different function, or is about to return from this function. An argument repeats this action.

step count

Continue running as in step, but do so count times.

next

Similar to step, but any function calls appearing within the line of code are executed by the focus thread without stopping. Execution stops when the focus thread reaches a different line of code at the stack level that was executing when the next command was given. This command is abbreviated n.

An argument is a repeat count, as in step.

next within a function without debugging information acts as step does; any function calls appearing within the code of the function are executed by the focus thread without stopping.

finish

Continue running the focus thread until immediately after the selected stack frame returns (or until there is some other reason to stop, such as a fatal signal or a breakpoint). Print the value returned by the selected stack frame (if any).

until

This command is used to avoid single-stepping a thread through a loop more than once.
It is like the next command, except that when until encounters a jump, it automatically continues execution of the focus thread until the program counter is greater than the address of the jump. This means that when the focus thread reaches the end of a loop after single-stepping through it, until causes the program to continue execution until the loop is exited. In contrast, a next command at the end of a loop steps the focus thread back to the beginning of the loop, which forces the thread to step through the next iteration.

until always stops the focus thread if it attempts to exit the current stack frame.

Note: until may produce somewhat counter-intuitive results if the order of the source lines does not match the actual, optimized order of execution. For example, in a typical C for loop, the third expression in the for statement (the loop-step expression) is executed after the statements in the body of the loop, but is written before them. Therefore, the until command appears to step the focus thread back to the beginning of the loop when it advances to this expression. However, it has not really done so, at least not in terms of the actual machine code.

until location

Continue running the focus thread until either the specified location is reached, or the current (innermost) stack frame returns. This form of the command uses breakpoints, and hence is quicker than until without an argument.

stepi, si

Execute the focus thread for one machine instruction, then stop and return to the debugger.

It is often useful to do display/i $pc when stepping by machine instructions. This causes the next instruction to be executed by the focus thread to be displayed automatically at each stop. See Automatic Display on page 62.

An argument is a repeat count, as in step.

nexti, ni

Execute the focus thread for one machine instruction, but if the instruction is a subroutine call, proceed until the subroutine returns.
nexti within a trap handler acts as stepi does, stopping at the next machine instruction that is outside the trap handler.

An argument is a repeat count, as in next.

A typical technique that uses stepping is to put a breakpoint (see Breakpoints and Watchpoints on page 21) at the beginning of the function or the section of the program in which a problem is believed to lie, and then step one or more threads through the suspect area, examining the variables that are interesting, until the problem happens.

You can achieve the effect of some of the stepping commands within a trap handler by setting breakpoints at each line or instruction of the trap handler. Other debugging commands, such as examining and altering memory, work within trap handlers as they normally do.

You can use the cont/1 command after stepping to resume execution of only the focus thread until the next breakpoint or signal. See Continuing on page 31.

Your program is probably linked with some number of standard libraries such as libc, libm, and librt. By default, if you step the focus thread through code that contains a call into one of the functions in these libraries, mdb will step completely over the function. You can alter this behavior and have mdb step into the function by setting the mdb variable enter-stdlib to true.

set enter-stdlib true

Setting enter-stdlib to false restores the default behavior.

Understanding Multithreading [3]

The Cray XMT compilers automatically identify sections of source code that can be partitioned into independent and parallel operations. When you execute this compiled code on a Cray XMT system, each program starts as a single thread. As the program executes, it spawns new threads to perform simultaneous and parallel operations, while existing threads may be waiting for a resource or the completion of a memory reference, and yet other threads are completing their tasks and disappearing.
As a result, the set of threads executing in your program changes dynamically throughout the life of the program, in both number and nature. At any intermediate point in the execution of your program, mdb knows only about the threads that are currently part of the execution; that is, the threads that are running or waiting to run. When your program is suspended under mdb, you can ask mdb for information about the current set of threads and perform debugging operations on individual threads (see Focus Thread on page 37).

Each thread originates either from a future statement in your source code or as part of an automatic compiler optimization. Each future statement explicitly determines a primary thread responsible for executing the body of the future statement, while compiler optimizations are performed automatically, when the compiler recognizes an opportunity to improve code performance by executing certain sections of code simultaneously.

For example, the compiler may recognize a loop whose iterations are relatively independent. The compiler in this case partitions the entire execution of the loop (where each resulting component is usually one or more loop iterations) and directs the single thread that encounters the loop to split into a set of threads, called a fray. Each thread of the fray independently executes some component of the overall execution of the loop, in parallel with the other fray members. If the fray size, which is determined at runtime, is less than the number of execution components, a fray thread may execute more than one component.

3.1 Thread Names

The mdb debugger learns about the threads in your program progressively and assigns each new thread a unique integer identifier. The mdb debugger reassigns the identifiers, starting from 0, each time your program is run. Due to timing issues inherent in parallel programs, the mapping of identifiers to threads may be different from one program run or debugging session to the next.
The runtime system also assigns a name to each thread in your program. These names are of the form a.b or a.b.c, where a, b, and c are non-negative integers, and they conform to the following rules.

• 0.1 names the initial thread.
• The form a.b designates a thread that is usually determined by a future statement in your source code.
• The runtime name form a.b.c designates a thread that is generated strictly by the compiler. For example, a fray thread has a three-component name.

The runtime system gives related names to threads in a group generated by the compiler to execute a particular section of code. A fray is a typical example of such a group. The names of all threads in a compiler-generated group differ only in the third component.

You can change the form of thread name mdb uses with the set id-style command.

set id-style system

Use the runtime system names of the form a.b or a.b.c for thread names.

set id-style mdb

Use integers for thread names. This is the default.

You may also start mdb using the command-line option -id-style with the same argument choice.

On rare occasions, the thread name in the prompt may be a negative integer, indicating a runtime thread running on a dedicated stream. See Focus Thread on page 37 for more details.

3.2 Thread States

The state of a thread persists during suspension of the program by mdb. For example, if a running thread t hits a breakpoint (see Breakpoints and Watchpoints on page 21) and mdb stops your program, when mdb resumes execution, t will be running.

A thread t is in one of the following states:

running

Thread t is executing.

startable

A thread in your program has executed the future statement that establishes thread t; t has not run; t will run when execution resources become available.

blocked

Thread t is waiting for a memory reference to complete and has released the execution resources it was using.
spinning
Thread t is waiting for a memory reference to complete; while it waits, it continues to execute but does not make progress.

resumable
The memory reference on which thread t blocked has completed; t will resume running when execution resources become available.

aborted
Thread t has experienced a fatal error and is not resumable.

indeterminate
Thread t is in transition between two of the previous states; mdb is unable to determine the recent or impending state of t.

mdb retains no knowledge of threads that have completed and disappeared.

3.3 Focus Thread

When you run your program under mdb, you or mdb may use the thread command to designate a single thread as the focus thread, the thread of particular interest. Many debugging operations refer implicitly to the focus thread.

If mdb is focused on a single thread, the identifier of the focus thread appears on the same line as, and in front of, the prompt.

    [1] (mdb)

You can change the format of the thread state description with the set state-length command.

Once your program has begun running, one or more threads execute your program. When mdb suspends execution, you can examine the threads currently comprising the execution of your program by issuing the info threads command.

info threads
Print a table of the current set of threads in your program sorted by thread state and breakpoint number. By default, a compressed list of thread IDs is printed.

info threads/l
Print the table in long format (all thread IDs).

info threads/f
Sort the table by the name of the function where execution is currently stopped as well as thread state and breakpoint number.
info threads/n
Print only the total number of threads in each state or function.

info threads/b number
Print only threads stopped at breakpoint number.

info threads/v
Print verbose information about the current set of threads, including thread ID, system name, and name of the function where execution is currently stopped. By default, the number of threads displayed in verbose mode is limited to 500. Use set info-limit to change the limit.

set state-length
Print the current value of state-length.

set state-length length
Set the length of the state description mdb prints before the prompt for threads that are not in state running. The argument length may take on one of the following values.

long
Print the state as a long name. This is the default.

short
Print the state as the first character of the long name.

none
Print no state description.

You can explicitly set the focus thread to be one of the threads through the thread command. The focus thread is the target thread of any mdb command that pertains to a single thread, such as step or backtrace. If mdb is not focused when you issue such a command, mdb returns an error message stating the need for a focus.

thread thread-name
Set the target thread of subsequent mdb commands that pertain to a single thread to thread-name.

In the following example, info threads lists the names of the threads in the program, as well as the state of each thread. Thread 1 is running, since no state is printed to the left of the thread identifier.

    [1] (mdb) info threads
    Thread  State      Brk
    1       running    1
    2       startable
    3       startable
    . . .
    [1] (mdb)

Below, the thread command changes the focus thread from 1 to 2 then back to 1. Line 149 is the current source position of startable thread 2. Line 30 is the current source position of thread 1.
    [1] (mdb) thread 2
    149 future $left_done (data, left) { // fork left
    [2] (mdb) thread 1
    30 for (int j = 1; j < size; j++) {
    [1] (mdb)

For the thread command, the square brackets ([]) around the thread name are optional.

Program execution may be interrupted if your program receives certain UNIX signals or if some set of parallel threads hits a breakpoint or watchpoint, raises a fatal exception, or completes an mdb stepping command. After execution is suspended, mdb determines if there is a thread of particular interest. If so, mdb focuses on that thread; otherwise mdb does not set a focus thread. For example, when you type Ctrl-C, mdb remembers the previous focus thread, which mdb retains as the new focus thread if the thread is still active. If the previous focus thread is blocked or otherwise unable to resume execution, however, mdb prints out a message to that effect and leaves the focus thread unset.

On rare occasions, mdb may focus on a thread whose name is a negative integer. This is a runtime thread whose underlying stream is dedicated to the runtime, such as a daemon. mdb may focus on such a thread if a runtime daemon hits a user-set breakpoint or watchpoint (see Breakpoints and Watchpoints on page 21). These runtime threads executing on dedicated streams are never listed in the output of info threads.

If you set the focus thread to be a thread that has just executed an instruction that raised an exception and caused the thread to trap, mdb sets the current source position to the line containing the next instruction to execute after the trap is handled, which may be up to eight dynamic instructions beyond the trapping instruction. In this case, any instructions between the trapping instruction and the current source position are executed before the trap is handled.
Additionally, if the trapping instruction contains a function return, the current source line may even be in a source file different from the one where the trap occurred.

Examining the Stack [4]

Each time a thread performs a function call, the information about where in the program the call was made from is saved in a block of data called a stack frame. The frame also contains the arguments of the call and the local variables of the function that was called. For each thread, all the stack frames are allocated in a region of memory called the call stack.

mdb recognizes when the compiler optimizes your code by inlining a function call (substituting the function body for the call statement). For inlined functions, mdb maintains the illusion of a normal function call.

When your program stops, the mdb commands for examining the stack of the focus thread allow you to see all of the information saved in the stack frame.

When the program stops, mdb automatically selects the currently executing frame of the focus thread and describes the frame briefly as the frame command does (see Information on a Frame on page 45). Whenever you ask mdb for the value of a variable in the program, the value is found in the selected frame. There are special mdb commands to select whichever frame of the focus thread you are interested in.

4.1 Stack Frames

The call stack of a thread is divided up into contiguous pieces called stack frames, or frames for short; each frame is the data associated with one call to one function. The frame contains the arguments given to the function, variables local to that function, and the address at which the function is executing.

When a thread starts, its stack has only one frame. This is called the initial frame or the outermost frame. Each time a function is called by the thread, a new frame is made. Each time the thread returns from a function, the frame for that function invocation is eliminated.
If a function is recursive, there can be many frames for the same function. The frame for the function in which the thread is actually executing is called the innermost frame. This is the most recently created of all the thread stack frames that still exist.

Inside your program, stack frames are identified by their addresses. A stack frame consists of many words, each of which has its own address. The address of the first word of the frame serves as the address of the frame itself. For each thread, this address is kept in a register called the stack pointer register while the thread is executing in that frame.

For each thread, mdb assigns numbers to all existing stack frames, starting with zero for the innermost frame, one for the frame that called it, and so on upward. These numbers do not really exist in your program, but give you a way of talking about stack frames in mdb commands.

Many mdb commands refer implicitly to one stack frame of one parallel thread. The implied thread can be selected by you or by mdb (see Focus Thread on page 37). Once focused, mdb records a stack frame that is called the selected stack frame; you can select any frame of the focus thread by using one set of mdb commands, and then other commands will operate on that frame. When your program stops, if mdb focuses on a thread, mdb automatically selects the innermost frame for the focus thread.

4.2 Backtraces

A backtrace is a summary of how a thread got where it is. It shows one line per frame, for many frames, starting with the currently executing frame (frame zero), followed by its caller (frame one), and on up the stack of the thread. Backstops are initial (outermost) frames on the stack: main for the first thread in your program, the initial frame for any subsequent thread.

backtrace, bt
Print a backtrace of the entire stack of the focus thread: one line per frame for all frames in the stack.
You can stop the backtrace at any time by typing Ctrl-C.

backtrace n, bt n
Similar, but print only the innermost n frames.

backtrace -n, bt -n
Similar, but print only the outermost n frames.

backtrace/all, bt/all
Backtrace all current threads. This is equivalent to issuing backtrace for each thread in your program.

The names where and info stack are additional aliases for backtrace, and they require that mdb be focused.

Every line in the backtrace shows the frame number, the function name, and the program counter value. If the function is in a source file whose symbol table data has been fully read, the backtrace shows the source file name and line number. The program counter value is omitted if it is at the beginning of the code for that line number.

If the symbol data in the source file has only been scanned and not fully read, this extra information is replaced with an ellipsis. You can force the symbol data for that frame's source file to be read by selecting the frame (see Selecting a Frame on page 43).

Here is an example of a backtrace.

    [1] (mdb) backtrace
    #0 foobar2()(hello.c line 11)
    #1 foobar1()(hello.c line 18)
    #2 foo()(hello.c line 28)
    #3 main(hello.c line 60)

4.3 Selecting a Frame

Most commands for examining the stack and other data in the program work on whichever stack frame is selected at the moment. Here are the commands for selecting a stack frame, usually of the focus thread.

frame n
Select frame number n of the focus thread. Recall that frame zero is the innermost (currently executing) frame. Frame one is the frame that called the innermost one, and so on. The highest-numbered frame is the initial frame of the focus thread.

frame addr
Select the frame at address addr. This is useful mainly if the chaining of stack frames has been damaged by a bug, making it impossible for mdb to assign numbers properly to all frames. In addition, this can be useful when a thread has multiple stacks and switches between them.
up n
Select the frame n frames up from the frame previously selected. For positive numbers n, this advances toward the outermost frame, to higher frame numbers, to frames that have existed longer. n defaults to one.

down n
Select the frame n frames down from the frame previously selected. For positive numbers n, this advances toward the innermost frame, to lower frame numbers, to frames that were created more recently. n defaults to one.

upto regexp
Select the first frame in the calling stack whose function name matches regexp. For instance, if a backtrace shows functions sprint, print1, print and main, and the current frame is at sprint, the command upto print would select the frame at print1. upto print$ would go to the frame at print.

upto also functions as a boolean expression and can be used as the condition for the if or while commands. When used in this manner, it must be the only expression within the condition. Also when used as a condition, no frame information is printed; use the frame command with no argument within the body of the if or while to print out the frame information, if necessary.

downto regexp
Select the last frame in the calling stack whose function name matches regexp. For instance, if a backtrace shows functions sprint, print1, print and main, and the current frame is at print, the command downto print would select the frame at print1. downto print$ would go to the frame at sprint.

downto also functions as a boolean expression and can be used as the condition for the if or while commands. When used in this manner, it must be the only expression within the condition. Also when used as a condition, no frame information is printed; use the frame command with no argument within the body of the if or while to print out the frame information, if necessary.
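The upto and downto examples above can be reproduced with a small C model. This is only a sketch of the matching behavior those examples imply, not mdb's implementation: it assumes the search starts at the frame adjacent to the current one and stops at the nearest match, and it uses POSIX extended regular expressions from regex.h, which may differ from the regexp dialect mdb accepts. The function find_frame is a hypothetical name.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Toy model of a call stack: frames[0] is the innermost frame and
 * frames[count - 1] is the outermost. Returns the index of the nearest
 * frame above (step = +1, like upto) or below (step = -1, like downto)
 * `current` whose function name matches `pattern`, or -1 if none does.
 */
int find_frame(const char *const *frames, int count, int current,
               int step, const char *pattern)
{
    regex_t re;
    int found = -1;

    assert(step == 1 || step == -1);
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;                       /* invalid pattern */
    /* Start at the neighbor of the current frame, never at the frame itself. */
    for (int i = current + step; i >= 0 && i < count; i += step) {
        if (regexec(&re, frames[i], 0, NULL, 0) == 0) {
            found = i;
            break;
        }
    }
    regfree(&re);
    return found;
}
```

With the stack sprint, print1, print, main (innermost first), searching upward from sprint for print returns the index of print1, and searching for print$ returns the index of print, because the pattern is unanchored at the start and print1 does not end in print.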
All of these commands (except upto and downto when used as a condition for if or while) end by printing some information on the frame that has been selected: the frame number, the function name, the arguments, the source file and line number of execution in that frame, and the text of that source line. For example:

    #3 main (argc=3, argv=??, env=??) at main.c, line 67
    67 read_input_file (argv[i]);

After such a printout, the list command with no arguments will print ten lines centered on the point of execution in the frame. See Printing Source Lines on page 47.

4.4 Information on a Frame

There are several other commands to print information about a stack frame, usually the selected frame of the focus thread.

frame
Print a brief description of the selected stack frame. You can abbreviate it f. With an argument, this command is used to select a stack frame; with no argument, it does not change which frame is selected, but still prints the same information.

info frame
Print the stack level, the address of the frame, and the program counter of the selected stack frame. This description is useful when something has gone wrong that has made the stack format fail to fit the usual conventions.

info frame addr
Print the address of the selected frame along with its program counter, function name, and source location (if known).

info args
Print the arguments of the selected frame, each on a separate line. Also, see the set print-function-args command in Format Options on page 57.

info locals
Print the local variables of the selected frame, each on a separate line. Every variable declared static or automatic in the current scope is printed.

Examining Source Files [5]

mdb knows which source files your program was compiled from, and can print parts of their text.
When your program stops, if mdb automatically determines the current focus thread, then mdb spontaneously prints the line the focus thread stopped in. Likewise, when you select a stack frame (see Selecting a Frame on page 43), mdb prints the current source line in which the focus thread stopped executing in that frame.

The current source file contains the line in which the focus thread stopped executing in the selected stack frame. If the program has not yet been run, the current source file is that of main for C/C++ programs.

mdb only knows about source files encountered during the course of running your program. If you wish to access source information yet to be seen by mdb, use the load command with the pertinent module name as an argument. (See Module Commands on page 13.)

You can also print parts of source files by explicit command.

5.1 Printing Source Lines

To print lines from a source file, use the list command (abbreviated l). There are several ways to specify what part of the file you want to print.

Here are the forms of the list command most commonly used:

list linenum
Print ten lines centered around line number linenum in the current source file.

list function
Print ten lines centered around the beginning of function function.

list
Print ten more lines. If the last lines printed were printed with a list command, this prints ten lines following the last lines printed; however, if the last line printed was a solitary line printed as part of displaying a stack frame (see Chapter 4, Examining the Stack on page 41), this prints ten lines centered around that line.

list -
Print the ten lines immediately before the lines last printed.

Repeating a list command with RET discards the argument, so it is equivalent to typing only list. This is more useful than listing the same lines again. An exception is made for an argument of -; that argument is preserved in repetition so that each repetition moves up in the file.
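The windowing behavior of list, list with no argument, and list - can be modeled in a few lines of C. This is a rough sketch only: the text says "ten lines centered around" a line without defining how an even-sized window is centered, so the choice below (five lines before, the line itself, four lines after) is an assumption, as are the names line_window, centered_window, next_window, and previous_window.

```c
#include <assert.h>

/* Inclusive range of 1-based source line numbers; empty if first > last. */
typedef struct {
    int first;
    int last;
} line_window;

/* Ten lines around `linenum`, clamped to a file of `file_lines` lines. */
line_window centered_window(int linenum, int file_lines)
{
    assert(file_lines >= 1);
    line_window w;
    w.first = linenum - 5;               /* assumed centering rule */
    if (w.first < 1)
        w.first = 1;
    w.last = w.first + 9;
    if (w.last > file_lines)
        w.last = file_lines;
    return w;
}

/* Model of a bare "list": the ten lines following the last lines printed. */
line_window next_window(line_window prev, int file_lines)
{
    line_window w;
    w.first = prev.last + 1;
    w.last = w.first + 9;
    if (w.last > file_lines)
        w.last = file_lines;
    return w;
}

/* Model of "list -": the ten lines immediately before the last lines printed. */
line_window previous_window(line_window prev)
{
    line_window w;
    w.last = prev.first - 1;
    w.first = w.last - 9;
    if (w.first < 1)
        w.first = 1;
    return w;
}
```

Under these assumptions, a window centered on line 67 covers lines 62 through 71; a following bare list covers 72 through 81, and list - issued instead covers 52 through 61.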
In general, the list command takes zero, one, or two linespecs as arguments. A linespec is a way in which a particular line in the source file can be specified; there are several ways of writing them but the effect is always to specify some source line. Here is a complete description of the possible arguments for list:

list linespec
Print ten lines centered around the line specified by linespec.

list first,last
Print lines from first to last. Both arguments are linespecs.

list ,last
Print ten lines ending with last.

list first,
Print ten lines starting with first.

list +
Print the ten lines immediately after the lines last printed.

list -
Print the ten lines immediately before the lines last printed.

list
As described in the preceding table.

Here are the ways of specifying a single source line, that is, all the kinds of linespec.

linenum
Specifies line linenum of the current source file. When a list command has two linespecs, this refers to the same source file as the first linespec.

+offset
Specifies the line offset lines after the last line printed. When used as the second linespec in a list command that has two, this specifies the line offset lines down from the first linespec.

-offset
Specifies the line offset lines before the last line printed.

filename:linenum
Specifies line linenum in the source file filename.

function
Specifies the line that begins the body of the function function.

filename:functionname
Specifies the line that begins the body of the function functionname in the file filename. The file name is needed with a function name only for disambiguation of identically named functions in different source files.

*address
Specifies the line containing the program address address. address may be any expression.

Two commands relate source lines and program addresses.

info line linenum
Print the starting and ending addresses of the compiled code for source line linenum.
mdb reports an address range for each inlined instance of the source line linenum. Unlike GDB, info line in mdb does not set either the default examine address for the x command or the convenience variable $_.

info pc address
Print the source lines from which the operations in the instruction at address are derived. Because of compiler optimizations, mdb may not be able to identify the source lines for the single given instruction. When this happens, mdb prints the source lines for a small range of instructions that includes the instruction at address.

    (mdb) info pc 0x401
    Source lines for pc range: 0x401..0x403
    main.c:11 (foo())
    11 {
    12 for (int i=0; i<20; i++) {

The default address argument for info pc is the instruction at which the focus thread is stopped.

5.2 Searching Source Files

There are two commands for searching through the current source file for a regular expression.

The command forward-search regexp checks each line, starting with the one following the last line listed, for a match for regexp. It lists the line that is found. You can abbreviate the command name as for.

The command reverse-search regexp checks each line, starting with the one before the last line listed and going backward, for a match for regexp. It lists the line that is found. You can abbreviate this command with as little as rev.

5.3 Specifying Source Directories

The path to the source file passed to the compiler or calculated by the front end (for include files) is recorded in the corresponding program library. If the executable moves or if any directories move between the compilation and your debugging session, you must tell mdb where to find the source files for your program. mdb has a list of directories to search for source files; this is called the source path. Each time mdb wants a source file, it tries in order the directories in the source path, until it finds a file with the desired name.
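The in-order search of the source path can be pictured with the following C sketch. It is an illustration of the lookup rule just described, not mdb code: find_source is a hypothetical name, and testing for a file by trying to open it for reading is an assumption about how such a probe might be done.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Try each directory of `source_path` in order and write the first
 * "dir/name" that can be opened for reading into `out`.
 * Returns 1 on success and 0 if no directory yields the file, in which
 * case `out` is set to the empty string.
 */
int find_source(const char *const *source_path, int ndirs,
                const char *name, char *out, size_t outlen)
{
    assert(out != NULL && outlen > 0);
    for (int i = 0; i < ndirs; i++) {
        int n = snprintf(out, outlen, "%s/%s", source_path[i], name);
        if (n < 0 || (size_t)n >= outlen)
            continue;                    /* candidate path does not fit in `out` */
        FILE *f = fopen(out, "r");
        if (f != NULL) {
            fclose(f);
            return 1;                    /* earliest directory wins */
        }
    }
    out[0] = '\0';
    return 0;
}
```

Because the first hit wins, a stale copy of a file in an early directory hides a newer copy in a later one, which is why the order of the source path matters.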
Note that the executable search path is not used for this purpose. The current working directory is always the last item in the source path, and is displayed as $cwd.

If mdb cannot find a source file in the source path, and the program library records a directory, mdb tries that directory too. If the source path is empty, and there is no record of the compilation directory, mdb looks in the current directory as a last resort.

Whenever you reset or rearrange the source path, mdb clears out any information it has cached about where source files are found and where each line is in the file.

When you start mdb, its source path is empty. To add other directories, use the directory command.

directory dirname, dir dirname
Add directory dirname to the front of the source path. Several directory names may be given to this command, separated by : or white space. You may specify a directory that is already in the source path; this moves it forward, so mdb searches it sooner.

You can use the string $cwd to refer to the current working directory. $cwd is not the same as . (a period): the former tracks the current working directory as it changes during your mdb session, while the latter is immediately expanded to the current directory at the time you add an entry to the source path.

directory
Reset the source path to $cwd again. This requires confirmation.

info directories
Print the source path: show which directories it contains.

set linkdir dir
If you moved your executable after it was linked, tell mdb that your executable was linked from directory dir. This enables mdb to find the modules for your program based on the information in your root program library. Note that the root program library should be in its original directory.

If your source path is cluttered with directories that are no longer of interest, mdb may sometimes cause confusion by finding the wrong versions of source. You can correct the situation as follows:

1. Use directory with no argument to reset the source path to $cwd.
2. Use directory with suitable arguments to reinstall the directories you want in the source path. You can add all the directories in one command.

5.4 Examining Instructions

Sometimes it is useful to examine the low-level machine instructions generated by the compiler. The specialized command disassemble dumps a range of memory as machine instructions.

disassemble
Disassemble the function surrounding the program counter of the selected frame of the focus thread.

disassemble function
Disassemble the specified function.

disassemble pc
Disassemble the function surrounding the specified program counter.

disassemble start_pc end_pc
Disassemble the range of memory locations between start_pc and end_pc.

When a program gets a data exception such as a data protection violation or data alignment error, the info opa command can be used to try to determine the offending machine instruction. The info opa command prints out the list of instructions that may be responsible for the trap. It takes as arguments the value of the opa register, the contents of the t1 register, and the program counter where the data exception occurred. All three are printed out when an exception is encountered. If invoked without arguments, info opa uses the current values of the opa and t1 registers and the program counter.

Examining Data [6]

The usual way to examine data in your program is with the print command (abbreviated p). It evaluates and prints the value of any valid expression of the language the program is written in. Enter:

    [1] (mdb) print exp

where exp is any valid expression, and the value of exp is printed in a format appropriate to its data type.

You may need to provide mdb with type information if your program has yet to encounter the type name or type definition you wish to use in exp.
Use the load command to inform mdb about type and other information contained in a module yet to be assimilated by mdb (see Module Commands on page 13).

If you use a function or variable name from a linked archive in an expression as part of an mdb command, the state of the mdb visibility variable determines whether you can access the symbol, as well as which one of several entities with the same name is being used. See Archive Symbol Visibility on page 70 for details.

A more low-level way of examining data is with the x command. It examines data in memory at a specified address and prints it in a specified format.

6.1 Expressions

Many different mdb commands accept an expression and compute its value. Any kind of constant, variable, or operator defined by the programming language you are using is legal in an expression in mdb. This includes conditional expressions, function calls, casts, and string constants. It unfortunately does not include symbols defined by preprocessor #define directives.

For parsing expressions and formatting printed data, mdb uses by default either the language of the current module in your executable program or the most recently known language, if the language information for the module cannot be found. This default language mode is called auto. If you want mdb to use a specific language regardless of the current module, use the set language command with either C or C++. The command set language auto returns the mdb language mode to the default. For the purposes of parsing expressions and formatting data, mdb considers C and C++ to be the same language.

The info language command returns the current mdb expression language.

If evaluating an expression involves calling a function in your program, any side effects of the call are realized. In particular, any data references as a result of the call change state bits as if the references were executed by a thread in your program.
If you type Ctrl-C while mdb is evaluating an expression, mdb tries to interrupt the evaluation at a point where no locks are held. It may fail, however, and locks may be left in an abnormal state on return from the interrupt. Typically, interrupting the printing of a large array or structure can be done safely.

mdb does not currently support calling a function defined in your program that contains a future statement. If you call such a function from mdb, mdb may hang.

Casts are supported in C and C++. It is often useful to cast a number into a pointer so as to examine a structure at that address in memory.

mdb supports three kinds of operators, in addition to those of programming languages:

@
@ is a binary operator for treating parts of memory as arrays. See Artificial Arrays on page 56 for more information.

::
:: allows you to specify a variable in terms of the file or function it is defined in. This use is in addition to its use when specifying class or namespace membership in C++. See Program Variables on page 54.

{typename} addr
Refers to an object of type typename stored at address addr in memory. addr may be any expression whose value is an integer or pointer (but parentheses are required around nonunary operators, as with a cast). This construct is allowed regardless of what kind of data is officially supposed to reside at addr.

6.2 Program Variables

The most common kind of expression to use is the name of a variable in your program.

Variables in expressions are understood in the selected stack frame. (See Selecting a Frame on page 43 and Focus Thread on page 37.) Variables must either be global (or static) or be visible according to the scope rules of the programming language from the point of execution in that frame.
This means that in the function:

    foo (int a)
    {
        bar (a);
        {
            int b = test ();
            bar (b);
        }
    }

the variable a is visible whenever the focus thread is executing within the function foo, but the variable b is visible only while the focus thread is executing inside the block in which b is declared.

6.3 State Bits

Every physical data memory cell contains a 64-bit (word) value and has associated with it four access state bits: trap 0, trap 1, forward, and full/empty. Use the x command to view the value of these memory state bits for a particular word. See Examining Memory on page 60.

When mdb prints the value of a variable in your program, mdb may also print the state bits associated with the variable. Variables whose types occupy less than a word may be packed several to a memory word. Each packed variable shares its memory word state bits with other variables packed into the same word. Variables whose type occupies one, two, or four words have a corresponding number of sets of state bits. The examples and descriptions assume that each variable occupies no more than a word and has a single set of state bits unless stated otherwise.

If the variable is normal, that is, if it is not qualified as being sync or future, mdb ignores the full/empty bit. For each word the variable occupies, mdb prints the names of any trap 0 and trap 1 bits that are on.

When printing the value of a sync or future variable, mdb always lists the state of the full/empty bit (full or empty), as well as any of the trap 0 and trap 1 bits that are on, for each word the variable occupies.

For variables whose value is determined by following an address chain defined by one or more set forward bits, mdb prints the value at the end of the chain. When printing a forwarded variable, mdb gives no indication of the set forward bits. Use the x command on the address of the word where the variable is stored to see the state of forward bits (see Examining Memory on page 60).
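The idea of following a chain of set forward bits to the word that actually holds the value can be sketched in C as follows. This is a toy model for illustration only: the word structure, its target field, and the resolve function are hypothetical, and the model says nothing about how the hardware encodes the forwarding address or how mdb walks the chain internally.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of one memory word and its four access state bits. */
typedef struct word {
    uint64_t value;
    bool trap0;
    bool trap1;
    bool forward;          /* when set, the value lives in another word */
    bool full;             /* full/empty bit: true means full */
    struct word *target;   /* assumption: where a forwarded word points */
} word;

/*
 * Follow set forward bits to the end of the chain and return the word
 * that holds the value. `max_hops` guards against a cyclic chain;
 * NULL is returned if the chain is broken or longer than max_hops.
 */
const word *resolve(const word *w, int max_hops)
{
    while (w != NULL && w->forward) {
        if (max_hops-- <= 0)
            return NULL;
        w = w->target;
    }
    return w;
}
```

In this model, printing a forwarded variable corresponds to reading resolve(&w, limit)->value, while inspecting w.forward directly corresponds to using the x command on the variable's nominal address.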
When mdb prints a variable, mdb leaves the state bits unchanged. In particular, mdb does not change the full/empty bit from full to empty when printing a sync or future variable. Rather than "consuming" the value, mdb looks at it.

Suppose a$ and b$ are sync variables.

    [1] (mdb) print a$
    $1 = 5 (full)
    [1] (mdb) print a$
    $2 = 5 (full)
    [1] (mdb) print b$
    $3 = 2 (empty)
    [1] (mdb) print b$
    $4 = 2 (empty)

You may change the full/empty bit of a sync or future variable, thereby perhaps unblocking any threads that happen to be blocked on that variable, using one of the Cray XMT generic functions to simulate an "active" read or write of a sync or future variable. (See Changing the Full/Empty Bit on page 75.)

Similarly, when mdb assigns a value to variables, no state bits are changed.

    [1] (mdb) print b$
    $3 = (empty) 2
    [1] (mdb) set b$ = 10
    [1] (mdb) print b$
    $5 = (empty) 10

If any of the forward bits or the trap bits of a variable are set, the actual value of the variable may be only indirectly accessible from its nominal address (see Examining Memory on page 60). The ability of mdb to print the correct value of the variable and state is not affected.

When mdb calls a function in your program as part of evaluating an expression, any resulting data references that normally change state bits do indeed change state bits.

6.4 Artificial Arrays

It is often useful to print out several successive objects of the same type in memory, such as a section of an array, or an array of dynamically determined size for which only a pointer exists in the program.

You can do this by constructing an artificial array with the binary operator @. The left operand of @ should be the first element of the desired array, as an individual object. The right operand should be the length of the array. The result is an array value whose elements are all of the type of the left argument.
The first element is actually the left argument; the second element comes from bytes of memory immediately following those that hold the first element, and so on. Here is an example. If a program contains:

    int *array = (int *) malloc (len * sizeof (int));

you can print the contents of array with:

    [1] (mdb) p *array@len

The left operand of @ must reside in memory. Array values made with @ in this way behave as other arrays in terms of subscripting: they are coerced to pointers when used in expressions. (It would probably appear in an expression using the value history, after you had printed it out.)

6.5 Format Options

mdb provides a few ways to control how arrays and structures are printed.

info format
    Display the current settings for the format options.

set prettyprint on
    Cause mdb to print structures in an indented format with one member per line, like this:

    $1 = {
      next = 0x0,
      flags = {
        sweet = 1,
        sour = 1
      },
      meat = 0x54 "Pork"
    }

set prettyprint off
    Cause mdb to print structures in a compact format, like this:

    $1 = {next = 0x0, flags = {sweet = 1, sour = 1}, meat = 0x54 "Pork"}

    This is the default format.

set unionprint on
    Tell mdb to print unions that are contained in structures. This is the default setting.

set unionprint off
    Tell mdb not to print unions that are contained in structures. For example, given the declarations:

    typedef enum {Tree, Bug} Species;
    typedef enum {Big_tree, Acorn, Seedling} Tree_forms;
    typedef enum {Caterpillar, Cocoon, Butterfly} Bug_forms;

    struct thing {
      Species it;
      union {
        Tree_forms tree;
        Bug_forms bug;
      } form;
    };

    struct thing foo = {Tree, {Acorn}};

    with set unionprint on in effect, p foo prints:

    $1 = {it = Tree, form = {tree = Acorn, bug = Cocoon}}

    and with set unionprint off in effect, it prints:

    $1 = {it = Tree, form = {...}}

set stringprint on
    Tell mdb to automatically print the value of character strings. This is the default setting.
set stringprint off
    Tell mdb not to print the value of character strings. C arrays of characters not on the heap are unaffected.

set print-function-args off
    Turn off printing of function argument values when displaying function information. By default, mdb prints function argument values. You can start an mdb session with argument printing turned off by invoking mdb with the command-line option -no-function-args.

set print-function-args on
    Turn on printing of function argument values when displaying function information. This is the default.

set array-max number-of-elements
    If mdb is printing a large array, it stops printing after it has printed the number of elements set by the set array-max command. This limit also applies to the display of strings. The default number of array elements printed is 200.

6.6 Output Formats

mdb normally prints all values according to their data types. Sometimes this is not what you want. For example, you might want to print a number in hex, or a pointer in decimal. Or you might want to view data in memory at a certain address as a character string or an instruction. You can do these things with output formats.

The simplest use of output formats is to say how to print a value already computed. This is done by starting the arguments of the print command with a slash and a format letter. The format letters supported are:

x   Regard the bits of the value as an integer, and print the integer in hexadecimal.
d   Print as integer in signed decimal.
u   Print as integer in unsigned decimal.
o   Print as integer in octal.
a   Print as an absolute address in hex.
c   Regard as an integer and print it as a character constant.
f   Regard the bits of the value as a floating-point number and print using typical floating-point syntax.
For example, to print the program counter of the focus thread in hex (see Registers on page 66), type:

    [1] (mdb) p/x $pc

Note that no space is required before the slash; this is because command names in mdb cannot contain a slash.

To reprint the last value in the value history with a different format, you can use the print command with only a format and no expression. For example, p/x reprints the last value in hex.

See Expressions on page 53 for details of the set language command, which directs mdb to format printed data in a specific programming language.

6.6.1 Examining Memory

You can use the command x to examine data memory without reference to the data types within the program. The format in which you wish to examine memory is instead explicitly specified. The allowable formats are a superset of the formats described in the previous section.

You cannot specify a qualified format for the x command. In particular, the x command examines data memory without regard to whether the program considers the data to be qualified as sync or future, or to be unqualified. x always prints the data value of the actual memory word, as well as the value of the full/empty bit associated with the examined word (even if you are looking at only a part of the word), and any of the trap 0, trap 1, and forward bits that are on.

The x command prints the value of the state bits on the examined memory if the state bits are not in what is considered the default state. This description comes after the value. With the /v option, for verbose, the state bits are printed without regard for their values. The default states print as ~trap0, ~trap1, empty, ~fwd, while the non-defaults are trap0, trap1, full and fwd.

If the trap 0 bit is set for a word w, then the value printed for w is not the value of the variable that your program stores in w.
Instead, w holds an identifier used by the runtime to locate the actual value of the variable associated with w. To see the value of the variable, use the mdb print command (see Chapter 6, Examining Data on page 53) in conjunction with info address symbol (see Chapter 7, Examining Symbols on page 69).

If the trap 1 bit is set for a word, the value of that word may not be the value of the variable your program associates with the word. As in the case for a set trap 0 bit, the word may instead contain an identifier.

If the forward bit is on for the word w, you can examine the forwarded value of w by examining the memory location stored in w.

x is followed by a slash and an output format specification, followed by an expression for an address. The expression need not have a pointer value (though it may). It is used as an integer, as the address of a byte of memory. See Expressions on page 53 for more information on expressions. For example, x/4xw $sp prints the four words of memory above the stack pointer in hexadecimal.

The output format in this case specifies how big a unit of memory to examine and how to print the contents of that unit. It is done with one or two of the following letters.

These letters specify the size of unit to examine:

b   Examine individual bytes.
h   Examine halfwords (4 bytes each).
w   Examine words (8 bytes each).
g   Examine giant words (16 bytes each).

These letters specify the way to print the contents:

x   Print as integers in unsigned hexadecimal.
d   Print as integers in signed decimal.
u   Print as integers in unsigned decimal.
o   Print as integers in unsigned octal.
a   Print as an absolute address in hex.
c   Print as character constants.
f   Print as floating point. This works only with sizes w and g.
s   Print a null-terminated string of characters. The specified unit size is ignored; instead, the unit is however many bytes it takes to reach a null character (including the null character).
i   Print a machine instruction in assembler syntax (or nearly). The specified unit size is ignored; the number of bytes in an instruction varies depending on the type of machine, the opcode and the addressing modes used.

If either the manner of printing or the size of unit fails to be specified, the default is to use the same one that was used last. If you do not want to use any letters after the slash, you can omit the slash as well.

You can also omit the address to examine. Then the address used is immediately after the last unit examined. This is why string and instruction formats actually compute a unit-size based on the data: so that the next string or instruction examined will start in the right place. The print command sometimes sets the default address for the x command; when the value printed resides in memory, the default is set to examine the same location.

When you use RET to repeat an x command, it does not repeat exactly the same: the address specified previously (if any) is ignored, so that the repeated command examines the successive locations in memory rather than the same ones.

You can examine several consecutive units of memory with one command by writing a repeat count after the slash (before the format letters, if any). The repeat count must be a decimal integer. It has the same effect as repeating the x command that many times except that the output may be more compact with several units per line. For example,

    [1] (mdb) x/10i $pc

The previous command prints ten instructions, starting with the one to be executed next, by the focus thread in the selected frame. After doing this, you could print another ten instructions using the following command:

    [1] (mdb) x/10

in which the format and address are allowed to default.

The addresses and contents printed by the x command are not put in the value history because there are often too many of them.
Instead, mdb makes these values available for subsequent use in expressions as values of the convenience variables $_ and $__.

After an x command, the last address examined is available for use in expressions in the convenience variable $_. The contents of that address, as examined, are available in the convenience variable $__.

If the x command has a repeat count, the address and contents saved are from the last memory unit printed; this is not the same as the last address printed if several units were printed on the last line of output.

6.7 Automatic Display

To print the value of an expression frequently (to see how it changes), you can add the expression to the automatic display list, a list of expressions that are displayed each time the program stops. Each element in the list is numbered; to remove an expression from the list, you specify that number. The automatic display looks like this:

    2: foo = 38
    3: bar[5] = (struct hack *) 0x3804

showing item numbers, expressions and their current values.

If the expression refers to local variables, then it does not make sense outside the lexical context for which it was set up. Such an expression is printed only when execution is inside that lexical context. For example, if you give the command display name while inside a function with an argument name, this argument is displayed whenever the program stops inside that function, but not when it stops elsewhere (because this argument does not exist elsewhere).

display exp
    Adds the expression exp to the list of expressions to display each time the program stops. See Expressions on page 53.

display/fmt exp
    For fmt specifying only a display format and not a size or count, adds the expression exp to the auto-display list and arranges to display exp each time in the specified format fmt.
display/fmt addr
    For fmt i or s, or for fmt including a unit size or a number of units, adds the expression addr as a memory address to be examined each time the program stops. Examining means in effect doing x/fmt addr. See Examining Memory on page 60.

display/i $pc
    Displays the next instruction to be executed by the focus thread.

undisplay dnum(s), delete display dnum(s)
    Removes the item number(s) dnum(s) from the list of expressions to display.

disable display dnum(s)
    Disables the display of item number(s) dnum(s). A disabled display item is not printed automatically, but is not forgotten. It may be reenabled later.

enable display dnum(s)
    Enables display of item number(s) dnum(s). It becomes effective once again in auto display of its expression, until you specify otherwise.

display
    Displays the current values of the expressions on the list, as is done when the program stops.

info display
    Prints the list of expressions previously set up to display automatically, each one with its item number, but without showing the values. This includes disabled expressions, which are marked as such. It also includes expressions that are not displayed right now because they refer to automatic variables not currently available.

6.8 Value History

Every value printed by the print command is saved for the entire session in the mdb value history so that you can refer to it in other expressions.

The values printed are given history numbers for you to refer to them by. These are successive integers starting with 1. print shows you the history number assigned to a value by printing $num = before the value; here num is the history number.

To refer to any previous value, use $ followed by the history number of the value. The output printed by print is designed to remind you of this. A single $ refers to the most recent value in the history, and $$ refers to the value before that.
For example, to see the contents of a structure to which you have printed a pointer:

    [1] (mdb) p *$

If you have a chain of structures where the component next points to the next one, you can print the contents of the next one:

    [1] (mdb) p *$.next

It might be useful to repeat this command many times by typing RET.

Note that the history records values, not expressions. If the value of x is 4 and you type this command:

    [1] (mdb) print x
    [1] (mdb) set x=5

then the value recorded in the value history by the print command remains 4 even though the value of x has changed.

By extension, the type of a history value does not change when circumstances are altered. For example, by continuing to a breakpoint in a different module or, in a multi-threaded context, by focusing on a new thread, you may find that the new context harbors a type whose name is identical to that of the history value type but whose structure differs; however, printing the value continues to produce the same result as in the original context.

info values
    Print the last ten values in the value history, with their item numbers. This is like p $$9 repeated ten times, except that info values does not change the history.

info values n
    Print ten history values centered on history item number n.

info values +
    Print the ten history values immediately after the values last printed.

6.9 Convenience Variables

mdb provides convenience variables that you can use within mdb to hold on to a value and refer to it later. These variables exist entirely within mdb; they are not part of your program, and setting a convenience variable has no effect on further execution of your program. That is why you can use them freely.

Convenience variables have names starting with $. You can use any name starting with $ for a convenience variable, unless it is one of the predefined set of register names (see Registers on page 66).
You can save a value in a convenience variable with an assignment expression, as you would set a variable in your program. Example:

    [1] (mdb) set $foo = *object_ptr

saves in $foo the value contained in the object pointed to by object_ptr.

Using a convenience variable for the first time creates it; but its value is void until you assign a new value. You can alter the value with another assignment at any time.

Convenience variables have no fixed types. You can assign a convenience variable any type of value, even if it already has a value of a different type. The convenience variable as an expression has whatever type its current value has.

info convenience
    Print a list of convenience variables used so far, and their values. Abbreviated i con.

One of the ways to use a convenience variable is as a counter to be incremented or a pointer to be advanced. For example:

    [1] (mdb) set $i = 0
    [1] (mdb) print bar[$i++]->contents

...repeat that command by typing RET.

Some convenience variables are created automatically by mdb and given values likely to be useful.

$_
    The variable $_ is automatically set by the x command to the last address examined (see Examining Memory on page 60).

$__
    The variable $__ is automatically set by the x command to the value found in the last address examined.

6.10 Registers

Each thread in your program has an identically named set of machine registers. mdb tracks the full set of registers for only the innermost frame. You can refer to register contents of the focus thread in expressions as variables with names starting with $. Use info registers to see the names and values of the focus thread registers.

exception
    Identifies raised exceptions.

resultcode
    Further refines the value of the exception register.

mslots_flag
    Used by the trap handler.

dc[0,1,...,7], dv[0,1,...,7]
    Used by the trap handler.

t[0,1,...,7]
    Target registers. t0 always holds the value of the primary trap handler.
r[0,1,...,31]
    General purpose registers. r0 always holds the value 0.

instcount
    One use by mdb is to step a thread some number of instructions.

ssw
    Holds the program counter ($pc), various condition codes, and trap masks. The value is for the innermost frame, regardless of the selected frame.

In addition, mdb recognizes aliases for certain registers.

$pc
    Program counter. Lower bits of ssw.

$sp
    Stack pointer. Points to the current stack frame. Same as r1.

$er
    Exception register. Same as exception.

$eps
    Pointer to end of memory block allocated for stack. Same as r5. This use of $eps is valid only when the focus thread begins executing the function; the compiler may use r5 as a general purpose register during execution of the function body.

$ccb
    Pointer to the control block of the focus thread. Same as r2.

Register values are relative to the selected stack frame (see Selecting a Frame on page 43). This means that you get the value that the register would contain if all stack frames farther in were exited and their saved registers restored. Registers that were not saved may hold values irrelevant to the selected stack frame. In order to see the real contents of all registers, you must select the innermost frame (with frame 0).

Note: Currently, mdb only provides correct values for registers in the innermost frame.

info registers
    Print the names and values of all registers for the focus thread relative to the selected frame. If you are not at the frame where execution is currently stopped (that is, in a frame that is not innermost), some registers may not be tracked and can retain values from lower frames.

info registers regname
    Print the value of register regname for the focus thread. regname may be any valid register name, with or without the initial $.
When mdb recognizes that a general-purpose register contains a named variable from your program (as opposed to a compiler-generated temporary or some other value), it prints the name of the variable.

6.11 Register Examples

You could print the program counter of the focus thread in hex with:

    [1] (mdb) p/x $pc

or print the instruction to be executed next with:

    [1] (mdb) x/i $pc

You can assign registers directly, but if the register holds the value of a variable, the variable may exist in multiple locations (that is, in other registers and memory). In this case, the compiled code does not have to keep the values in these locations consistent or even use the locations in subsequent branching decisions if it can obtain information about the current value of the variable from analysis of the code.

Examining Symbols [7]

The commands described in this chapter allow you to make inquiries for information about the symbols (names of variables, functions and types) defined in your program. This information is found by mdb in the program symbol table, one or more program libraries, or one or more object files. This symbol information is inherent in the text of your program and does not change as the program executes.

whatis exp
    Print the data type of expression exp. exp is not actually evaluated, and any side-effecting operations (such as assignments or function calls) inside it do not take place. See Expressions on page 53.

whatis
    Print the data type of $, the last value in the value history.

info address symbol
    Describe where the data for symbol is stored. For a register variable, this says which register it is kept in. For a non-register local variable, this prints the stack-frame offset at which the variable is always stored. Note the contrast with print &symbol, which does not work at all for a register variable, and for a stack local variable prints the exact address of the current instantiation of the variable.
ptype typename
    Print a description of data type typename. typename may be the name of a type, or for C code it may have the form struct struct-tag, union union-tag or enum enum-tag.

info sources
    Print the names of all source files in the program. For standard shared libraries such as libc, libm, and librt, only the names of the source files referenced by the program are printed.

info modules
    Print the names of all object files in the program. Each object file is listed either as a stand-alone fat .o file or as one of several components of a program library.

info functions
    Print the names and data types of all defined functions.

info functions regexp
    Print the names and data types of all defined functions whose names contain a match for regular expression regexp. Thus, info fun step finds all functions whose names include step; info fun ^step finds those whose names start with step.

info variables
    Print the names and data types of all variables that are declared outside of functions (that is, except for local variables).

info variables regexp
    Print the names and data types of all variables (except for local variables) whose names contain a match for regular expression regexp.

info types
    Print all data types defined in the program.

info types regexp
    Print all data types that are defined in the program whose names contain a match for regular expression regexp.

printsyms filename
    Write a complete dump of the debugger symbol data into the file filename.

See also the info files command in File Commands on page 11 and the info modules command in Module Commands on page 13.

7.1 Archive Symbol Visibility

On the Cray XMT, when your program is linked with an archive, your program may or may not see a particular global symbol in the archive. If annotate has been run on the archive, the linker allows your program to use only those global archive symbols explicitly made available to user programs.
If annotate has not been run, all global symbols within the archive are available. Using annotate to hide symbols provides for a measure of safety, analogous to that provided by static symbols in multiple-module user programs, and allows identical names in multiple archives, as well as in freestanding object files (not contained within an archive), to represent separate entities.

Consider the global symbols in an archive with which your program is linked, where annotate has been run on the archive. The program treats the global archive symbols exported by annotate as visible, and those global archive symbols that are not exported as hidden.

Symbols appear in your source code in one of two contexts: as a definition or as a use. When the linker encounters a use of a global symbol within a freestanding module, it locates the symbol definition by searching the visible symbols defined in freestanding modules and archives. When the linker encounters the use of a hidden global symbol within an archive, symbols defined within the archive take precedence over external names.

If you request information from mdb about a hidden archive symbol or try to set a breakpoint on a hidden function, mdb uses the internal variable visibility to determine whether to grant access to the symbol and to resolve the ambiguity if there are multiple hidden symbols by that name.

set visibility value
    Determine the way mdb resolves conflicts between visible and hidden global symbols. The possible states for value are:

auto
    This is the default state. Given the choice between a global exported symbol and a hidden symbol of the same name in an archive, mdb selects the hidden symbol if the current stack frame belongs to a function within that archive. When the choice is between multiple hidden symbols, mdb selects the local symbol rather than the one residing in another archive. In this case, if no local symbol exists, mdb chooses one of the symbols arbitrarily.
hidden
    mdb resolves all conflicts in favor of hidden symbols. When multiple hidden symbols with the same name exist, mdb displays a menu. Consider using this mode when there are potential conflicts between exported and hidden symbols in expressions involving several global variables.

any
    mdb makes no attempt to resolve ambiguities. When multiple global symbols of the same name are present, you can choose the symbol you want from a menu.

visible
    mdb resolves all conflicts in favor of the exported symbol. If none exists, an error message is issued.

You can change visibility either strictly for archive function names or strictly for archive variables by setting one of the subsidiary variables code-visibility or data-visibility, respectively, to auto, any, visible, or hidden.

You may also use abbreviated forms for all these variables and values. The variables visibility, code-visibility, and data-visibility may be abbreviated as v, cv, and dv, respectively, and the values as any unambiguous prefix. Thus, the following command sets the visibility for archive function names to the value any.

    [1] (mdb) set cv an

Altering Execution [8]

Once you think you have found an error in the program, you might want to find out for certain whether correcting the apparent error leads to correct results in the rest of the run. You can find the answer by experiment, using the mdb features for altering execution of the program.

For example, you can store new values into variables or memory locations, give the program a signal, restart it at a different address, or even return prematurely from a function to its caller.

8.1 Assignment to Variables

The ability of mdb to change the value of a variable depends on the debugging level with which you compiled your code (see Compiling for Debugging on page 8), as well as on the nature of the variable.

Evaluating an assignment expression is one way to alter the value of a variable. See Expressions on page 53.
For example,

    [1] (mdb) print x=4

stores the value 4 into the variable x and then prints the value of the assignment expression (which is 4).

All the assignment operators for C are supported, including the increment operators ++ and --, and combining assignments such as += and <<=.

If you are not interested in seeing the value of the assignment, use the set command instead of the print command. set is really the same as print except that the value of the expression is not printed and is not put in the value history (see Value History on page 64). The expression is evaluated only for side effects.

Whenever the beginning of the argument string of the set command appears identical to a set subcommand, it may be necessary to use the set variable command. This command is identical to set except for its lack of subcommands. For example, the first of the following two commands sets the mdb variable rw (equivalently register-warning), and the second sets the program variable rw.

    [1] (mdb) set rw 0
    [1] (mdb) set variable rw 0

If the value of a variable is kept in a register, mdb may not always be able to update the variable in ways that are fully consistent with normal execution. See Altering Variables Kept in Registers on page 74 for a discussion of how mdb handles such an assignment.

If the use of a Cray XMT generic to assign a sync or future variable value changes the full/empty state to full (see Changing the Full/Empty Bit on page 75), and there are blocked threads waiting for the variable to become full, the write is accompanied by a subsequent trap handling action that satisfies the request of one of the blocked threads to write or read the variable (possibly changing the value of the variable or resetting the variable to empty).
Thus, if you print a sync or future variable after writing to it, without resuming any threads after the write operation, the value or full/empty state of the variable may be different than you expected. The unblocked thread remains suspended, though in a state that differs as a result of the read or write assisted by the trap handler, until the entire program resumes or the particular thread is explicitly set in motion from the mdb command line. See Changing the Full/Empty Bit on page 75, for another means of changing the full/empty bit.

mdb allows more implicit conversions in assignments than C does; you can freely store an integer value into a pointer variable or vice versa. You can also convert any structure to any other structure that is the same length or shorter.

To store values into arbitrary places in memory, use the {...} construct to generate a value of specified type at a specified address (see Expressions on page 53). For example, {int}0x401033b918 refers to memory location 0x401033b918 as an integer (which implies a certain size and representation in memory), and:

    [1] (mdb) set {int}0x401033b918 = 20

stores the value 20 into that memory location.

8.1.1 Altering Variables Kept in Registers

Under certain compiler optimizations, the value of a variable is sometimes kept in one or more registers. If you change the value of the variable from mdb, mdb may change only one of the copies. The multiple copies of the variable may not have identical values, and further execution may have unexpected behavior. You can prevent the compiler from performing this kind of optimization by compiling your program at the -g2 debugging level (see Compiling for Debugging on page 8).

For example, when the following program is compiled at the -g1 debugging level, setting i to a new value at line 4 from mdb has no visible effect, because the parameter to printf is in a different register than the one changed by mdb.
Also, the loop control is in yet a third register.

    main() {
      int i;
      for (i=0; i<20; i++) {
        printf("%d\n", i); // line 4
      }
    }

When you change the value of a variable stored in a register, mdb issues a warning that the new value may not be propagated to other copies of the variable.

    [1] (mdb) set foo = 10
    Warning: Variable in register. New value may not propagate to all copies.

You can adjust the frequency of these warning messages either by invoking mdb with a command line option:

    % mdb -register-warning frequency
    % mdb -rw frequency

or by setting an mdb variable during your debugging session.

    set register-warning frequency
    set rw frequency

In either method, frequency is one of never, first, or always (equivalently, 0, 1, or 2, respectively). The default warning frequency is always. A value of first means the warning is given the first time an assignment is made to a variable on a per-variable basis.

8.2 Changing the Full/Empty Bit

You can use the generic functions to manipulate the full/empty bit or bits of a variable v while simultaneously reading from or writing to v. See Cray XMT Programming Environment User's Guide for more information.

You cannot directly manipulate any of the other state bits through mdb commands (see State Bits on page 55). You can change the other state bits by writing an appropriate function, compiling it into your program, and calling the function from mdb.

If multiple threads are blocked on a sync variable, all are either waiting for the variable to become full or waiting for the variable to become empty. When the state of the variable changes, one thread resumes running.

If multiple threads are blocked on a future variable, then the variable is empty. When the state of the variable changes from empty to full, all waiting threads resume running.

mdb prints the value of a variable without changing any of the state bits.
In particular, when mdb prints the value of a sync or future variable, the full/empty state does not change.

Suppose you want to cause resumption of a thread that is blocked (see Thread States on page 36) waiting to write the one-word sync variable v$ when v$ becomes empty. To simulate an emptying read of v$, use the generic function readfe.

    [1] (mdb) print v$
    $1 = 4 (full,trap0)
    [1] (mdb) print readfe(&v$)
    $2 = 4
    [1] (mdb) print v$
    $3 = 9 (full, trap0)

The set trap 0 bit in the second line indicates that there is at least one thread blocked on v$, waiting for the full/empty bit to become empty. The generic function readfe reads v$ when the full/empty bit is full, simultaneously changing the full/empty bit to empty. Because there are threads blocked on v$, immediately following the readfe a trap handler also satisfies one of the blocked threads--in this example, a writer changes the value of v$ to 9, also toggling the full/empty bit to full. Because the trap 0 bit of v$ is still set after the readfe operation and subsequent trap handling, more than one thread was initially blocked on v$. If only one thread were waiting for v$ to become empty, the last line of the example above would read $3 = (Full)9. When you continue your program, the thread whose write request (value 9) was satisfied by the trap handling resumes running; any other threads blocked on v$ remain blocked.

In normally executing code, if v$ had been empty, the readfe operation would have blocked until v$ was set to full. If v$ is empty and you issue a readfe of v$ from mdb, mdb returns a message saying the operation was not done because it would have blocked. When mdb halts execution of your program, as long as all the threads are suspended, it makes little sense for you to issue a generic operation on an empty variable or memory location that depends on the variable or location being full.
Similarly, generic operations that require an object to be empty hang indefinitely on a full object, in the absence of running threads in the program.

The generic functions are intended for use on only sync or future variables, although neither mdb nor the compilers enforce this. Your program may behave incorrectly if you change the full/empty state of a normal (not sync or future) variable with one of these generics. mdb prints a warning message if you access a normal variable with one of these functions as part of an mdb command. Also, if you change the full/empty bit of a normal variable whose size is less than a word, because the full/empty bit actually belongs to the memory word containing the variable, the full/empty bit is changed for any other variables contained in the same word (see State Bits on page 55).

If a watchpoint is set on the variable, that is, the x command reveals that the trap 1 bit is set (see Examining Memory on page 60), the generics will not manipulate the full/empty bit properly, nor are they guaranteed to wake up any thread blocked on that variable. Try disabling the watchpoint before proceeding.

In your program, generics may be used on sync or future variables of types such as long double in C, which are composed of multiple words. This functionality is not yet implemented in mdb.

Stored Sequences of Commands [9]

mdb provides two ways to store sequences of commands for execution as a unit: user-defined commands and command files.

9.1 User-defined Commands

A user-defined command is a sequence of mdb commands to which you assign a new name as a command. This is done with the define command.

define [commandname $arg1 $arg2 ...]
    Define a command named commandname with optional arguments $arg1, $arg2, .... If there is already a command by that name, you are asked to confirm that you want to redefine it.
    The definition of the command is made up of other mdb command lines, which are given following the define command. The end of these commands is marked by a line containing end.

    Formal arguments must each start with a dollar sign ($) and may contain letters, digits, and underscores. The arguments may be used within the command lines that follow. When you invoke commandname, you must supply the same number of actual arguments as there are formals. The text of the actual argument is substituted for that of the formal argument before execution of each command line. Arguments may be contained within double quoted material. To avoid substitution, prefix a backslash (\) before the dollar sign.

document commandname
    Give documentation to the user-defined command commandname. The command commandname must already be defined. This command reads lines of documentation the same way that define reads the lines of the command definition, ending with end. After the document command is finished, help on command commandname prints the documentation you have specified.

    You may use the document command again to change the documentation of a command. Redefining the command with define does not change the documentation.

When user-defined commands are executed, the commands of the definition are not printed. An error in any command stops execution of the user-defined command.

Commands that ask for confirmation if used interactively proceed without asking when used inside a user-defined command. Many mdb commands that normally print messages to say what they are doing omit the messages when used in a user-defined command.

9.2 Command Files

A command file for mdb is a file of lines that are mdb commands. Comments (lines starting with #) may also be included. An empty line in a command file does nothing; it does not mean to repeat the last command, as it would from the terminal.
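As an illustration of the define and document commands described under User-defined Commands, the following command file is a minimal sketch. The file name showvar.mdb, the command name showvar, and the formal argument $v are hypothetical and chosen only for this example; the file uses only commands described in this chapter (define, document, echo, and output), and it assumes that document behaves the same way in a command file as it does at the terminal.

```
# showvar.mdb: hypothetical command file (all names are examples only)
# Define a command that prints the text of its argument, then its value.
define showvar $v
echo $v:\n
output $v
echo \n
end

# Attach help text to the new command.
document showvar
Print the text of the argument expression followed by its value.
end
```

After the file has been read with source showvar.mdb, entering showvar foo would substitute foo for $v in each line of the definition before that line is executed.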
When mdb starts, it automatically executes its init files, which are command files named .mdbinit. mdb reads the init file (if any) in your home directory and then the init file (if any) in the current working directory. (The init files are not executed if mdb is invoked with the -nx option.) You can also request the execution of a command file with the source command:

source filename
    Execute the command file filename.

The lines in a command file are executed sequentially. They are not printed as they are executed. An error in any command terminates execution of the command file.

Commands that ask for confirmation if used interactively proceed without asking when used in a command file. Many mdb commands that normally print messages to say what they are doing omit the messages when used in a command file.

9.3 Commands for Controlled Output

During the execution of a command file or a user-defined command, the only output that appears is what is explicitly printed by the commands of the definition. This section describes three commands useful for generating exactly the output you want.

echo text
    Print text. You can include non-printing characters in text using C escape sequences, such as \n to print a newline. No newline is printed unless you specify one. In addition to the standard C escape sequences, a backslash followed by a space stands for a space. This is useful for outputting a string with spaces at the beginning or the end, because leading and trailing spaces are trimmed from all arguments. Thus, to print " and foo = " (with one leading and one trailing space), use the command echo "\ and foo =\ ", where the quotation marks only delimit the command text and are not typed.

    You can use a backslash at the end of text, as in C, to continue the command onto subsequent lines. For example,

        echo This is some text\n\
        that is continued\n\
        onto several lines.\n

    produces the same output as:

        echo This is some text\n
        echo that is continued\n
        echo onto several lines.\n

output expression
    Print only the value of expression: no newlines, no $nn =.
    The value is not entered in the value history either. See Expressions on page 53 for more information on expressions.

output/fmt expression
    Print the value of expression in format fmt. See Output Formats on page 59 for more information.

printf string, expressions...
    Print the values of the expressions under the control of string. The expressions are separated by commas and may be either numbers or pointers. Their values are printed as specified by string, in general exactly as if the program were to execute:

        [1] (mdb) printf (string, expressions...);

    (One minor exception is that integers are currently treated as 32 bit numbers by mdb.) As an example, you can print two values in hex like this:

        [1] (mdb) printf "foo, bar-foo = 0x%x, 0x%x\n", foo, bar-foo

    The only backslash-escape sequences that you can use in the string are the simple ones that consist of backslash followed by a letter.

Options and Arguments for mdb [10]

When you invoke mdb, you can specify arguments telling it what files to operate on and what other things to do.

10.1 Mode Options

-nx
    Do not execute commands from the init files .mdbinit. Normally, the commands in these files are executed after all the command options and arguments have been processed. See Command Files on page 80.

-q
    Quiet. Do not print the usual introductory messages.

-batch
    Run in batch mode. Exit with code 0 after processing all the command files specified with -x (and .mdbinit, if not inhibited). Exit with nonzero status if an error occurs in executing the mdb commands in the command files.

-fullname
    This option is used when Emacs runs mdb as a subprocess. It tells mdb to output the full file name and line number in a standard, recognizable fashion each time a stack frame is displayed (which includes each time the program stops).
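The mode options above can be combined with a command file for an unattended run. The following is a sketch only: the file name check_i.mdb, the executable name a.out, the breakpoint line, and the variable i are hypothetical, and the -se and -x options used in the suggested invocation are described under File-specifying Options. The sketch assumes that a focus thread is available when the breakpoint is reached, and it does not show what happens to the stopped program when mdb exits.

```
# check_i.mdb: hypothetical command file for a non-interactive run.
# One possible invocation (-se is given before -x because options are
# processed in sequential order):
#   % mdb -q -batch -se a.out -x check_i.mdb
break 4
run
printf "i = %d\n", i
```

Because an error in any command terminates execution of a command file, and -batch exits with nonzero status in that case, the exit status of such a run can be checked by a calling script.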
10.2 File-specifying Options

All the options and command line arguments given are processed in sequential order. The order makes a difference when the -x option is used.

-s filename
    Read symbol table from filename.

-e filename
    Use filename as the executable file to execute when appropriate, and for examining pure data.

-se filename
    Read symbol table from filename and use it as the executable file.

-x filename
    Execute mdb commands from filename.

-d directory
    Add directory to the path to search for source files.

-cd directory
    Use directory as the working directory for mdb.

10.3 Communication Options and Variables

In each of the following pairs, the first item is the command-line option form, the second item is the variable setting that will evoke the option behavior for subsequent run commands.

-rm, set remote-manual
    Start in remote-manual mode. mdb does not start the inferior; it waits until the inferior is started manually.

-open-socket, set communication open-socket
    Use a socket as the communication channel between mdb and the target program. mdb creates the socket.

-socket hostname,portnumber, set communication socket host,port
    Use the socket from hostname using portnumber as the communication channel between mdb and the target program.

10.4 Breakpoint-behavior Options

-ox
    Execute instructions at breakpoints by creating and calling a pseudo-function that simulates the behavior of the original instruction. The other option, restoring the original instruction and executing it in place, allows other activities to proceed past the breakpoint without stopping. -ox is the default. The -out-of-line-execution option is identical to this option.

-ix
    Execute instructions at breakpoints by restoring the original instruction to its rightful address, single-stepping across it and then restoring the breakpoint.
    You can use this if a bug is suspected in the pseudo-function created with -ox, but it is not recommended for general use. Conditional breakpoints cannot be used with this option. The -inline-execution option is identical to this option.

10.5 Miscellaneous Options

-ximm command
    Execute command immediately.

10.6 Other Arguments

If there are arguments to mdb that are not options or associated with options, the first one specifies the symbol table and executable file name (as if it were preceded by -se). A second unassociated argument should be a decimal number which is treated as the process id (PID) of the running process to which mdb should attach. When mdb attaches to a process, the process halts until you enter the run command. After you enter run, mdb resumes execution of the process until either the program exits, you type Ctrl-C, or the process reaches the next breakpoint.

GNU General Public License [A]

Version 1, February 1989

Copyright (C) 1989 Free Software Foundation, Inc.
675 Mass Ave, Cambridge, MA 02139, USA.

Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.

A.1 Preamble

The license agreements of most software companies try to keep users at the mercy of those companies. By contrast, our General Public License is intended to guarantee your freedom to share and change free software--to make sure the software is free for all its users. The General Public License applies to the Free Software Foundation's software and to any other program whose authors commit to using it. You can use it for your programs, too.

When we speak of free software, we are referring to freedom, not price.
Specifically, the General Public License is designed to make sure that you have the freedom to give away or sell copies of free software, that you receive source code or can get it if you want it, that you can change the software or use pieces of it in new free programs; and that you know you can do these things.

To protect your rights, we need to make restrictions that forbid anyone to deny you these rights or to ask you to surrender the rights. These restrictions translate to certain responsibilities for you if you distribute copies of the software, or if you modify it.

For example, if you distribute copies of a such a program, whether gratis or for a fee, you must give the recipients all the rights that you have. You must make sure that they, too, receive or can get the source code. And you must tell them their rights.

We protect your rights with two steps: (1) copyright the software, and (2) offer you this license that gives you legal permission to copy, distribute and/or modify the software.

Also, for each author's protection and ours, we want to make certain that everyone understands that there is no warranty for this free software. If the software is modified by someone else and passed on, we want its recipients to know that what they have is not the original, so that any problems introduced by others will not reflect on the original authors' reputations.

The precise terms and conditions for copying, distribution and modification follow.

A.2 Terms and Conditions

1. This License Agreement applies to any program or other work that contains a notice placed by the copyright holder saying it may be distributed under the terms of this General Public License. The Program, below, refers to any such program or work, and a work based on the Program means either the Program or any work containing the Program or a portion of it, either verbatim or with modifications. Each licensee is addressed as you.

2.
You may copy and distribute verbatim copies of the Program's source code as you receive it, in any medium, provided that you conspicuously and appropriately publish on each copy an appropriate copyright notice and disclaimer of warranty; keep intact all the notices that refer to this General Public License and to the absence of any warranty; and give any other recipients of the Program a copy of this General Public License along with the Program. You may charge a fee for the physical act of transferring a copy.

3. You may modify your copy or copies of the Program or any portion of it, and copy and distribute such modifications under the terms of Paragraph 1 above, provided that you also do the following:

• Cause the modified files to carry prominent notices stating that you changed the files and the date of any change; and

• Cause the whole of any work that you distribute or publish, that in whole or in part contains the Program or any part thereof, either with or without modifications, to be licensed at no charge to all third parties under the terms of this General Public License (except that you may choose to grant warranty protection to some or all third parties, at your option).

• If the modified program normally reads commands interactively when run, you must cause it, when started running for such interactive use in the simplest and most usual way, to print or display an announcement including an appropriate copyright notice and a notice that there is no warranty (or else, saying that you provide a warranty) and that users may redistribute the program under these conditions, and telling the user how to view a copy of this General Public License.

• You may charge a fee for the physical act of transferring a copy, and you may at your option offer warranty protection in exchange for a fee.
Mere aggregation of another independent work with the Program (or its derivative) on a volume of a storage or distribution medium does not bring the other work under the scope of these terms.

4. You may copy and distribute the Program (or a portion or derivative of it, under Paragraph 2) in object code or executable form under the terms of Paragraphs 1 and 2 above provided that you also do one of the following:

• Accompany it with the complete corresponding machine-readable source code, which must be distributed under the terms of Paragraphs 1 and 2 above; or,

• Accompany it with a written offer, valid for at least three years, to give any third party free (except for a nominal charge for the cost of distribution) a complete machine-readable copy of the corresponding source code, to be distributed under the terms of Paragraphs 1 and 2 above; or,

• Accompany it with the information you received as to where the corresponding source code may be obtained. (This alternative is allowed only for noncommercial distribution and only if you received the program in object code or executable form alone.)

Source code for a work means the preferred form of the work for making modifications to it. For an executable file, complete source code means all the source code for all modules it contains; but, as a special exception, it need not include source code for modules that are standard libraries that accompany the operating system on which the executable file runs, or for standard header files or definitions files that accompany that operating system.

5. You may not copy, modify, sublicense, distribute or transfer the Program except as expressly provided under this General Public License. Any attempt otherwise to copy, modify, sublicense, distribute or transfer the Program is void, and will automatically terminate your rights to use the Program under this License.
However, parties who have received copies, or rights to use copies, from you under this General Public License will not have their licenses terminated so long as such parties remain in full compliance.

6. By copying, distributing or modifying the Program (or any work based on the Program) you indicate your acceptance of this license to do so, and all its terms and conditions.

7. Each time you redistribute the Program (or any work based on the Program), the recipient automatically receives a license from the original licensor to copy, distribute or modify the Program subject to these terms and conditions. You may not impose any further restrictions on the recipients' exercise of the rights granted herein.

8. The Free Software Foundation may publish revised and/or new versions of the General Public License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns.

Each version is given a distinguishing version number. If the Program specifies a version number of the license that applies to it and any later version, you have the option of following the terms and conditions either of that version or of any later version published by the Free Software Foundation. If the Program does not specify a version number of the license, you may choose any version ever published by the Free Software Foundation.

9. If you wish to incorporate parts of the Program into other free programs whose distribution conditions are different, write to the author to ask for permission. For software that is copyrighted by the Free Software Foundation, write to the Free Software Foundation; we sometimes make exceptions for this. Our decision will be guided by the two goals of preserving the free status of all derivatives of our free software and of promoting the sharing and reuse of software generally.

NO WARRANTY

10.
BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM AS IS WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION.

11. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

END OF TERMS AND CONDITIONS

A.3 How to Apply These Terms to Your New Programs

If you develop a new program, and you want it to be of the greatest possible use to humanity, the best way to achieve this is to make it free software that everyone can redistribute and change under these terms.

To do so, attach the following notices to the program. It is safest to attach them to the start of each source file to most effectively convey the exclusion of warranty; and each file should have at least the copyright line and a pointer to where the full notice is found.

    one line to give the program's name and a brief idea of what it does.
    Copyright (C) 19yy name of author

    This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 1, or (at your option) any later version.

    This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

    You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.

Also add information on how to contact you by electronic and paper mail.

If the program is interactive, make it output a short notice like this when it starts in an interactive mode:

    Gnomovision version 69, Copyright (C) 19yy name of author
    Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
    This is free software, and you are welcome to redistribute it
    under certain conditions; type `show c' for details.

The hypothetical commands show w and show c should show the appropriate parts of the General Public License. Of course, the commands you use may be called something other than show w and show c; they could even be mouse-clicks or menu items--whatever suits your program.

You should also get your employer (if you work as a programmer) or your school, if any, to sign a copyright disclaimer for the program, if necessary. Here a sample; alter the names:

    Yoyodyne, Inc., hereby disclaims all copyright interest in the program `Gnomovision' (a program to direct compilers to make passes at assemblers) written by James Hacker.

    signature of Ty Coon, 1 April 1989
    Ty Coon, President of Vice

That's all there is to it!
Using mdb under GNU Emacs [B]

A special interface allows you to use GNU Emacs to view (and edit) the source files for the program you are debugging with mdb.

To get the interface under Emacs 19.25, put the following line in your .emacs file:

    (autoload 'mdb "gud" "grand unified debugging mode" t)

To use this interface, use the command M-x mdb in Emacs. Give the executable file you want to debug as an argument. This command starts mdb as a subprocess of Emacs, with input and output through a newly created Emacs buffer.

If Emacs produces an error message instead of starting mdb, you may be using an older file. Remove the autoload line from your .emacs file and use M-x gdb. Then, substitute mdb for gdb in the minibuffer.

Using mdb under Emacs is like using mdb normally except for two things:

• All terminal input and output goes through the Emacs buffer. This applies to mdb commands and their output, and to the input and output done by the program you are debugging. This is useful because it means that you can copy the text of previous commands and input them again; you can even use parts of the output in this way. All the facilities of the Emacs Shell mode are available for this purpose.

• mdb displays source code through Emacs. Each time mdb displays a stack frame, Emacs automatically finds the source file for that frame and puts an arrow (=>) at the left margin of the current line. Explicit mdb list or search commands still produce output as usual, but you probably have no reason to use them.

In the mdb I/O buffer, you can use these special Emacs commands:

C-c C-s
    Execute to another source line, like the mdb step command.

C-c C-n
    Execute to next source line in this function, skipping all function calls, like the mdb next command.

C-c C-i
    Execute one instruction, like the mdb stepi command.
C-c C-b
    Set a breakpoint on the current line, like the mdb break linenum command, where linenum corresponds to the position of (=>) in the source file buffer.

C-c C-f
    Execute until exit from the selected stack frame, like the mdb finish command.

C-c C-r
    Continue execution of the program, like the mdb cont command.

C-c C-p
    Evaluate the expression immediately following the cursor, like the mdb print exp command where exp is the expression immediately following the cursor in the mdb buffer.

C-c <
    Go up the number of frames indicated by the numeric argument, like the mdb up command.

C-c >
    Go down the number of frames indicated by the numeric argument, like the mdb down command.

In any source file, the Emacs command C-x SPC (mdb-break) tells mdb to set a breakpoint on the source line point is on.

The source files displayed in Emacs are in ordinary Emacs buffers that are visiting the source files in the usual way. You can edit the files with these buffers if you wish; but keep in mind that mdb communicates with Emacs in terms of line numbers. If you add or delete lines or characters from the text, the line numbers that mdb knows will cease to correspond properly to the code.

mdb Input and Output Conventions [C]

To invoke mdb, enter the shell command mdb. Once started, mdb reads commands from the terminal until you tell it to exit.

An mdb command is a single line of input. There is no limit on how long it can be. It starts with a command name, which is followed by arguments whose meaning depends on the command name. For example, the command step accepts an argument that is the number of times to step, as in step 5. You can also use the step command with no arguments. Some command names do not allow any arguments.

mdb command names may always be abbreviated if the abbreviation is unambiguous.
Sometimes even ambiguous abbreviations are allowed; for example, s is specially defined as equivalent to step even though there are other commands whose names start with s. Possible command abbreviations are often stated in the documentation of the individual commands.

A blank line as input to mdb means to repeat the previous command verbatim. Certain commands do not allow themselves to be repeated this way; these are commands for which unintentional repetition might cause trouble and that you are unlikely to want to repeat. Certain others (list and x) act differently when repeated because that is more useful.

A line of input starting with # is a comment; it does nothing. This is useful mainly in command files (see Command Files on page 80).

mdb indicates its readiness to read a command by printing a string called the prompt. This string is normally (mdb). If a thread currently has the focus (see Focus Thread on page 37), the focus thread ID is printed in square brackets to the left of (mdb). If the thread is in a non-running state (see Thread States on page 36), mdb prints the state of the focus thread leftmost in the prompt, within angle brackets.

Use the set prompt command to change the prompt string. You can also include system information in the prompt.

set prompt newprompt
    Directs mdb to use newprompt as its prompt string henceforth.

In addition to a literal prompt string, newprompt may include any of the following two-character specifications for system information:

    %T    focus thread name
    %S    focus thread state
    %R    focus thread state, if not running
    %F    function name
    %U    source file name
    %L    line number
    %M    module name (.o file)
    %P    program library name (.a or .pl file)
    %C    program counter
    %H    history number

Many of the specifications described above result in an empty string if the relevant information is unknown or unavailable.
For instance, when you initially start mdb, no thread has the focus, so %T results in an empty string. You can use the following three-character specifications to control the printing of characters near the resulting strings.

%pX
    Immediately precedes a two-character specification %? from the list above, where X is a single character of your choice. If %? is printed, X is printed to its left; otherwise X is omitted.

%sX
    Immediately follows a two-character specification %?, where X is a single character of your choice. If %? is printed, X is printed to its right; otherwise X is omitted.

If mdb cannot determine system information included in the prompt, mdb prints nothing. The default prompt specification is %p<%R%s>%s %p[%T%s]%s (mdb).

To exit mdb, use the quit command (abbreviated q). Ctrl-C does not exit from mdb, but rather terminates the action of any mdb command that is in progress and returns to mdb command level. It is generally safe to type Ctrl-C at any time because mdb attempts to synchronize the interrupt to a time when it is safe. However, there is the possibility that Ctrl-C during expression evaluation may leave locks in a held state.

Certain commands to mdb may produce large amounts of information output to the screen. To help you read all of it, mdb pauses and asks you for input at the end of each page of output. Press Enter when you want to continue the output. Normally mdb knows the size of the screen from the termcap database together with the value of the TERM environment variable. To change the screen size use the set screensize command:

set screensize lpp, set screensize lpp cpl
    Specify a screen height of lpp lines and (optionally) a width of cpl characters. If you omit cpl, the width does not change.

    If you specify a height of zero lines, mdb will not pause during output no matter how long the output is. This is useful if output is to a file or to an editor buffer.
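The prompt and screen-size settings described above can be collected in an init file. The fragment below is a sketch, not a recommended configuration: the prompt layout is hypothetical, and it assumes that the %p and %s specifications combine in a user-defined prompt the same way they do in the default prompt specification.

```
# Hypothetical .mdbinit fragment (values are examples only).
# Focus thread ID in brackets, then the current function name;
# each part, with its surrounding characters, is omitted when unknown.
set prompt %p[%T%s]%s %F%s (mdb)
# Disable paging, for example when output is captured in a file.
set screensize 0
```

Here %p[ and %s] wrap the thread ID in brackets only when a thread has the focus, and each "%s " (with a trailing space) adds a separating space only when the preceding item is printed.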
Also, mdb may at times produce more information about its own workings than is of interest to you. You can turn some of these informational messages on and off with the set verbose command:

set verbose on
    Re-enables mdb output of certain informational messages.

set verbose off
    Disables mdb output of certain informational messages.

Currently, the messages controlled by set verbose are those that announce that the symbol table for a source file is being read (see File Commands on page 11, in the description of the symbol-file command).

Glossary

blade
    1) A Cray XMT compute blade consists of Threadstorm processors, memory, Cray SeaStar chips, and a blade control processor. 2) From a system management perspective, a logical grouping of nodes and blade control processor that monitors the nodes on that blade.

blade control processor
    A microprocessor on a blade that communicates with a cabinet control processor through the HSS network to monitor and control the nodes on the blade. See also blade, L0 controller, Hardware Supervisory System (HSS).

cabinet control processor
    A microprocessor in the cabinet that communicates with the HSS via the HSS network to monitor and control the devices in a system cabinet. See also Hardware Supervisory System (HSS).

CLE
    The operating system for Cray XMT systems.

fork
    Occurs when processors allocate additional streams to a thread at the point where it is creating new threads for a parallel loop operation.

future
    Implements user-specified or explicit parallelism by starting new threads. A future is a sequence of code that can be executed by a newly created thread that is running concurrently with other threads in the program. Futures delay the execution of code if the code is using a value that is computed by a future, until the future completes. The thread that spawns the future uses parameters to pass information from the future to the waiting thread, which then executes.
    In a program, the term future is used as a type qualifier for a synchronization variable or as a keyword for a future statement.

Hardware Supervisory System (HSS)
    Hardware and software that monitors the hardware components of the system and proactively manages the health of the system. It communicates with nodes and with the management processors over the private Ethernet network. See also system interconnection network.

logical machine
    An administrator-defined portion of a physical Cray XMT system, operating as an independent computing resource.

login node
    The service node that provides a user interface and services for compiling and running applications.

metadata server (MDS)
    The component of the Lustre file system that manages Metadata Targets (MDT) and handles requests for access to file system metadata residing on those targets.

node
    For CLE systems, the logical group of processor(s), memory, and network components acting as a network end point on the system interconnection network. See also processing element.

phase
    A set of one or more sections of code that the stream executes in parallel. Each section contains an iteration of a loop. Phases and sections are contained in control flow code generated by the compiler to control the parallel execution of a function.

processing element
    The smallest physical compute group. There are two types of processing elements: a compute processing element consists of an AMD Opteron processor, memory, and a link to a Cray SeaStar chip. A service processing element consists of an AMD Opteron processor, memory, a link to a Cray SeaStar chip, and PCI-X or PCIe links.

System Management Workstation (SMW)
    The workstation that is the single point of control for system administration. See also Hardware Supervisory System (HSS).

Cray XMT™ Performance Tools User's Guide
S–2462–20

© 2007–2011 Cray Inc. All Rights Reserved.
This document or parts thereof may not be reproduced in any form unless permitted by contract or by written permission of Cray Inc. U.S. GOVERNMENT RESTRICTED RIGHTS NOTICE The Computer Software is delivered as "Commercial Computer Software" as defined in DFARS 48 CFR 252.227-7014. All Computer Software and Computer Software Documentation acquired by or for the U.S. Government is provided with Restricted Rights. Use, duplication or disclosure by the U.S. Government is subject to the restrictions described in FAR 48 CFR 52.227-14 or DFARS 48 CFR 252.227-7014, as applicable. Technical Data acquired by or for the U.S. Government, if any, is provided with Limited Rights. Use, duplication or disclosure by the U.S. Government is subject to the restrictions described in FAR 48 CFR 52.227-14 or DFARS 48 CFR 252.227-7013, as applicable. BSD Licensing Notice: Copyright (c) 2008, Cray Inc. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. * Neither the name Cray Inc. nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Your use of this Cray XMT release constitutes your acceptance of the License terms and conditions.

Cray, LibSci, and PathScale are federally registered trademarks and Active Manager, Cray Apprentice2, Cray Apprentice2 Desktop, Cray C++ Compiling System, Cray CX, Cray CX1, Cray CX1-iWS, Cray CX1-LC, Cray CX1000, Cray CX1000-C, Cray CX1000-G, Cray CX1000-S, Cray CX1000-SC, Cray CX1000-SM, Cray CX1000-HN, Cray Fortran Compiler, Cray Linux Environment, Cray SHMEM, Cray X1, Cray X1E, Cray X2, Cray XD1, Cray XE, Cray XEm, Cray XE5, Cray XE5m, Cray XE6, Cray XE6m, Cray XMT, Cray XR1, Cray XT, Cray XTm, Cray XT3, Cray XT4, Cray XT5, Cray XT5h, Cray XT5m, Cray XT6, Cray XT6m, CrayDoc, CrayPort, CRInform, ECOphlex, Gemini, Libsci, NodeKARE, RapidArray, SeaStar, SeaStar2, SeaStar2+, The Way to Better Science, Threadstorm, and UNICOS/lc are trademarks of Cray Inc.

Lustre is a trademark of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Platform is a trademark of Platform Computing Corporation. Windows is a trademark of Microsoft Corporation. UNIX, the “X device,” X Window System, and X/Open are trademarks of The Open Group in the United States and other countries. All other trademarks are the property of their respective owners.

RECORD OF REVISION

S–2462–20 Published May 2011 Supports release 2.0 GA running on Cray XMT and Cray XMT Series compute nodes and on Cray XT 3.1UP02 service nodes.
This release uses the System Management Workstation (SMW) version 5.1.UP03.

S–2462–15 Published December 2010 Supports release 1.5 running on Cray XMT compute nodes and on Cray XT service nodes running CLE 2.2.UP01. This release uses the System Management Workstation (SMW) version 4.0.UP02.

1.4 Published December 2009 Supports release 1.4 running on Cray XMT compute nodes and on Cray XT service nodes running CLE 2.2.UP01. This release uses the System Management Workstation (SMW) version 4.0.UP02.

1.3 Published March 2009 Supports release 1.3 running on Cray XMT compute nodes and on Cray XT 2.1.5HD service nodes. This release uses the System Management Workstation (SMW) version 3.1.09.

1.2 Published August 2008 Supports general availability (GA) release 1.2 running on Cray XMT compute nodes and on Cray XT 2.0.49 service nodes. This release uses the System Management Workstation (SMW) version 3.1.04.

1.1 Published March 2008 Supports limited availability (LA) release 1.1.01 running on Cray XMT compute nodes and on Cray XT 2.0 service nodes.

1.0 Published August 2007 Supports Canal, Bprof, Tview, and Cray Apprentice2 version 3.2 running on Cray XMT systems. This manual incorporates material previously published in S-2319-10, Cray MTA-2 Performance Programming Tools Reference Manual.

Changes to this Document

This manual supports the 2.0 release of the Cray XMT Performance Analysis Tools.

Added information
• Support for partial tracing, which makes tracing information available even when the execution of a tracing program terminates prematurely. See Partial Tracing on page 58.
• Added information about annotations in inlined functions in Statement-level Annotations on page 32.
• New default mode for Apprentice2, which displays traps taken in system libraries, and new option --nosystem to turn off this mode.
• New trace profiling report (Tprof).

Contents

Introduction [1]
    1.1 The Performance Tool Set
    1.2 Prerequisites
        1.2.1 Module and Compiler Considerations
        1.2.2 Execution Considerations
        1.2.3 Data Conversion (pproc)
    1.3 Using Cray Apprentice2
        1.3.1 Modules
        1.3.2 Launching the Application
        1.3.3 Loading Data Files
        1.3.4 Basic Navigation
        1.3.5 Comparing Files
        1.3.6 Exiting from Cray Apprentice2
Compiler Analysis (Canal) [2]
    2.1 CLI Version of Canal
    2.2 GUI Version of Canal
        2.2.1 Canal Window Layout
        2.2.2 Browse Loops
        2.2.3 Statement-level Annotations
        2.2.4 Statement Remarks
        2.2.5 Loop-level Annotations
        2.2.6 Canal Configuration and Navigation Options
            2.2.6.1 Select Source
            2.2.6.2 Toolbars
            2.2.6.3 Show/Hide Data
            2.2.6.4 Change Font
            2.2.6.5 Panel Actions
Trace View (Tview) [3]
    3.1 CLI Version of Tview
    3.2 GUI Version of Tview
        3.2.1 Using Tview
        3.2.2 Traced Data
            3.2.2.1 Optional Data
            3.2.2.2 Zooming In
            3.2.2.3 Handling Large Trace Files
        3.2.3 Event and Trap Details
            3.2.3.1 Event Details
            3.2.3.2 Trap Details
        3.2.4 About System Library Traps
        3.2.5 Tview Configuration and Navigation Options
            3.2.5.1 Select Range
            3.2.5.2 Panel Actions
    3.3 Partial Tracing
    3.4 Tuning Tracing
        3.4.1 Changing the Persistent Buffer Size
        3.4.2 Changing the Frequency of Trace Buffer Flushing
        3.4.3 Resolving Tracing Failures
Block Profiling (Bprof) [4]
    4.1 CLI Version of Bprof
    4.2 GUI Version of Bprof
        4.2.1 Bprof Window Layout
        4.2.2 Function List
        4.2.3 Callers and Callees
        4.2.4 Bprof Configuration and Navigation Options
            4.2.4.1 Panel Actions
Trace Profiling (Tprof) [5]
Glossary

Procedures
    Procedure 1. Using Canal
    Procedure 2. Compiling and Linking for Tview
    Procedure 3. Using Bprof

Examples
    Example 1. Canal CLI output
    Example 2. Bprof CLI output – header
    Example 3. Bprof CLI output – call tree profile
    Example 4. Bprof CLI output – routine profile
    Example 5. Bprof CLI output – routine listing and index

Tables
    Table 1. Cray Apprentice2 Navigation Functions
    Table 2. File Menu
    Table 3. Help Menu
    Table 4. Canal Window Layout
    Table 5. Canal GUI Statement Annotations
    Table 6. Canal GUI Additional Annotations
    Table 7. Data Columns
    Table 8. Canal GUI Panel Actions
    Table 9. Tview Window Layout
    Table 10. Tview GUI Optional Data
    Table 11. Event Details
    Table 12. Trap Details
    Table 13. Tview GUI Configuration and Navigation Options
    Table 14. Tview GUI Panel Actions
    Table 15. Bprof CLI Section Data
    Table 16. Bprof CLI Line Data
    Table 17. Description of Block Profiling Report Window
    Table 18. Bprof GUI Report Data
    Table 19. Bprof GUI Caller Detail
    Table 20. Bprof GUI Callee Detail
    Table 21. Bprof GUI Configuration and Navigation Options
    Table 22. Bprof GUI Panel Actions

Figures
    Figure 1. Cray Apprentice2 File Selection Dialog
    Figure 2. Cray Apprentice2 Window
    Figure 3. Save Screendump Dialog Window
    Figure 4. About Dialog Window
    Figure 5. Online Help Window
    Figure 6. Comparison Report (Tdiff)
    Figure 7. Canal View in Cray Apprentice2
    Figure 8. Canal Source File Selection
    Figure 9. Canal Select Font Dialog
    Figure 10. Tview Window Layout
    Figure 11. Tview Event Details
    Figure 12. Tview Trap Details
    Figure 13. Select Range Dialog
    Figure 14. Full Trace of radixsort Application
    Figure 15. Partial Trace of radixsort Application
    Figure 16. Segment of Full Trace of radixsort Application
    Figure 17. Block Profiling Report Window
    Figure 18. Tprof Report

Introduction [1]

This guide is for application programmers and users of Cray XMT systems. It describes the Cray XMT performance analysis tools: Cray Apprentice2, Canal, Tview, Tprof, and Bprof, and the associated file conversion and viewing utilities, pproc and ap2view.
Use the information in this guide to examine the optimizations performed by the Cray XMT C and C++ compilers during compilation and the behavior of your program during execution.

The information provided in this guide assumes that you are already familiar with Cray XMT C and C++ compilers and are already able to compile, link, and execute your program successfully. For more information about the Cray XMT programming environment and compiler usage, see the Cray XMT Programming Environment User's Guide. For information about debugging programs, see the Cray XMT Debugger Reference Guide.

By default, the Cray XMT compilers produce highly optimized executable code. The information in this guide may help you to find additional opportunities to improve program performance.

1.1 The Performance Tool Set

The Cray XMT performance analysis tool set consists of seven components.

• Cray Apprentice2 is a cross-platform data visualization tool. It provides the graphical framework within which the GUI versions of the other performance analysis tools operate.

• Canal (compiler analysis) uses information captured during compilation to produce an annotated source code listing showing the optimizations performed automatically by the compiler. Use the Canal listing to identify and correct code that the compiler cannot optimize. Canal is also available in a command-line interface (CLI) version. This version is documented in the canal(1) man page.

• Tview (trace viewer) uses information captured during program execution to produce graphical displays showing performance metrics over time. Use the Tview graphs to identify when a program is running slowly. Tview is also available in a command-line interface (CLI) version. This version is documented in the tview(1) man page.

• Bprof (block profiling) uses information captured during program execution to identify which functions are performing what amounts of work.
  When used with Tview, Bprof can help you to identify the functions that consume the most time while producing the least work. Bprof is also available in a command-line interface (CLI) version. This version is documented in the bprof(1) man page.

• Tprof (trace profiling) is similar to Bprof, but it displays a profile of functions and parallel regions from traces. The Tprof report is generated when you run Apprentice2 in the default mode. There is no command-line interface for Tprof.

• Pproc is a post-processing data conversion tool. Use the pproc command to convert the data generated by the compiler and the application into a format that can be displayed within Cray Apprentice2. Alternatively, append the -pproc option to the mtarun command. When used in this way, the pproc conversion begins automatically upon the successful completion of program execution.

  Note: The data conversion that pproc performs is required only by the GUI versions of the performance tools. If you are using the CLI version of a tool, data conversion is not required.

  The pproc command is documented in the pproc(1) man page.

• Ap2view is a file viewing tool. Use the ap2view command to view the data file created by pproc as XML. The ap2view command is documented in the ap2view(1) man page.

The Canal tool can be used at any time after your program has been compiled and linked. The Tview and Bprof tools can be used only if your program has been compiled with the correct options and after your program has been executed successfully.

The CLI versions of Canal, Tview, and Bprof require no further file conversion after the program has been compiled and executed and the requisite data files generated. The ap2view command, and the GUI versions of Canal, Tview, and Bprof, require that the data files be in .ap2 format before they can be displayed in Cray Apprentice2. To generate .ap2 files, use the pproc command or the mtarun -pproc option.
Alternatively, you can use the -a option with the tview command to convert just the tracing information to .ap2 format.

1.2 Prerequisites

Use of the Cray XMT performance analysis tools is closely associated with use of the Cray XMT compilers. You must compile and link your program with the correct modules loaded and the correct compiler options invoked in order to generate an executable that can capture performance analysis data. The following sections discuss compiler, execution environment, and data conversion considerations.

1.2.1 Module and Compiler Considerations

The Cray XMT system uses modules in the user environment to support multiple versions of software and to create integrated software packages. Before you can use the performance analysis tools, you must have at least the following modules loaded.

• mta-pe (Programming Environment and Performance Tools)
• xmt-tools (mtarun)

You may have other local or system-specific requirements. For a complete discussion of the modules environment, see the Cray XMT Programming Environment User's Guide.

The performance analysis tools work with both the Cray XMT C and C++ compilers. To compile your program for use with the performance analysis tools, use a compiler command similar to one of the following examples.

Note: The following examples are all C commands. The C++ commands are identical, except that the cc command is replaced with c++.

users/smith> cc mysource.c
users/smith> cc -trace mysource.c
users/smith> cc -trace_level level mysource.c
users/smith> cc -profile mysource.c
users/smith> cc -trace -profile mysource.c
users/smith> cc -trace_level level -profile mysource.c

Compiling your program with no tracing or profiling options specified produces a valid Canal listing, but executing the resulting program does not produce Tview or Bprof data.

Use the compiler -trace option to prepare the program for tracing all functions larger than 50 source lines.
The -trace_level option is similar, but enables you to specify the minimum size in source lines of the functions to be traced. Additional tracing options are available and are described in the cc(1) man page.

Note: Programs compiled with the -trace option must be executed using the mtarun -trace option.

Use the compiler -profile option to prepare the program for block profiling. You may combine the trace and profile options.

In all cases, successful compilation and linking produces two files: a.out, which is the actual executable, and a.out.pl, which is a program library file. At this point you may either use Canal to examine the optimizations performed by the compiler, or execute the program and collect tracing and/or profiling data.

To produce a compiler analysis (Canal) data file, see Data Conversion (pproc) on page 15. To execute the program and collect tracing (Tview) and profiling (Bprof) data, see Section 1.2.2. For more information about compiler options, see the Cray XMT Programming Environment User's Guide or the cc(1) and c++(1) man pages.

1.2.2 Execution Considerations

On Cray XMT systems, all programs are executed using the mtarun command. To capture performance analysis data, you must have the xmt-tools module loaded before you begin program execution. For example, to execute the program a.out, type this command:

users/smith> mtarun a.out

Upon successful completion of program execution, one or more data files are created, depending on the tracing and profiling options you selected when you compiled and ran the program. For example, if you use these commands to prepare the program for both tracing and profiling, and then execute the program with tracing enabled, the data files trace.out and profile.out are created:

users/smith> cc -trace -profile myprogram.c
users/smith> mtarun -trace a.out

By default, tracing and profiling data files are created in the execution directory.
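
The relationship between compile options, run options, and the resulting data files can be summarized in a short Python sketch. The helper name expected_data_files is hypothetical and exists only for illustration. Two points in it are assumptions rather than documented behavior: that -trace_level, like -trace, requires mtarun -trace at run time, and that a traced program run without mtarun -trace yields no trace.out.

```python
# Illustrative sketch only; summarizes the rules in this section.
def expected_data_files(compile_flags, run_flags):
    """Return the default data files a successful run should create."""
    files = []
    # Tracing is requested at compile time with -trace or -trace_level.
    traced = any(f in ("-trace", "-trace_level") for f in compile_flags)
    # Assumption: without "mtarun -trace" no trace file is written.
    if traced and "-trace" in run_flags:
        files.append("trace.out")
    # Block profiling needs only the compile-time -profile option.
    if "-profile" in compile_flags:
        files.append("profile.out")
    return files

# cc -trace -profile myprogram.c ; mtarun -trace a.out
print(expected_data_files(["-trace", "-profile"], ["-trace"]))
# cc mysource.c ; mtarun a.out  (Canal listing only, no run-time data)
print(expected_data_files([], []))
```

The first call corresponds to the example above and lists both trace.out and profile.out; the second corresponds to a program compiled with no tracing or profiling options, for which only Canal data is available.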
If you prefer, you can set the MTA_TRACE_FILE or MTA_PROFILE_FILE environment variables before execution to specify other locations for the data files.

The Cray XMT system allows for concurrent execution of programs. Users must exercise caution when undertaking performance analysis on multiple programs running concurrently. Because tracing and profiling options produce output files, behavior is undefined when multiple executions attempt to write to the same file location, and data corruption can result. This situation can occur when multiple executions are launched simultaneously in the same directory, or when the environment variable that overrides the placement of output files is defined and executions are launched simultaneously. To ensure the integrity of the data being collected, execute only one program at a time from a given working directory, and only one program at a time when directing output to a given MTA_TRACE_FILE or MTA_PROFILE_FILE.

For more information about executing programs on Cray XMT systems, and in particular for information about improving performance when using output file redirection, see the mtarun(1) man page.

1.2.3 Data Conversion (pproc)

After you collect performance tool data, you must use the pproc utility to convert the data to .ap2 format before you can view it in Cray Apprentice2.

Note: If you are using the command-line text-only versions of the performance tools, it is not necessary to use data conversion.

Canal data can be converted and viewed at any time after the program is compiled. To convert Canal data, type the following command.

users/smith> pproc a.out

Tview data can be converted and viewed only after the program completes execution and generates a trace.out file. If Canal data exists, this command also converts and incorporates the Canal data into the resulting .ap2 output file. To convert Tview data, type the following command.
users/smith> pproc --mtatf=trace.out a.out

Bprof data can be converted and viewed only after the program completes execution and generates a profile.out file. If Canal data exists, this command also converts and incorporates the Canal data into the .ap2 output file. To convert Bprof data, type the following command.

users/smith> pproc --mtapf=profile.out a.out

To convert all data, type this command:

users/smith> pproc --mtatf=trace.out --mtapf=profile.out a.out

In all of the above examples, pproc creates the file a.out.ap2, which you can view with Cray Apprentice2.

Note: While the pproc utility does not support an option to specify the output file name, you can safely rename a.out.ap2 after it has been created, provided you keep the .ap2 suffix.

If you move files while building, installing, or executing your program, pproc may be unable to find some information. In this case, use the pproc --spath option to specify the directory containing the desired source files, or use the pproc --prompt option to enter an interactive mode in which you are prompted to enter paths for missing files.

When the trace file is large, pproc can take a long time to run, up to an hour in some cases. Using the --verbose option when running pproc from the command line displays additional information about which stage of processing pproc is in, the percentage of progress, and the number of descriptors and events being processed.

For more information about the pproc utility, see the pproc(1) man page, or type pproc --help at the command line.

1.3 Using Cray Apprentice2

Cray Apprentice2 is an interactive X Window System GUI tool for visualizing and manipulating performance analysis data.
Cray Apprentice2 can display a wide variety of reports and graphs, depending on the type of program being analyzed, the computer system on which the program was run, the software tools used to capture data, and the particular performance analysis experiments that were conducted during program execution.

Cray Apprentice2 is a platform-independent, post-processing data exploration tool. You do not set up and run performance analysis experiments from within Cray Apprentice2. Rather, you use the Cray Apprentice2 GUI after a performance analysis to examine results.

To use Cray Apprentice2 to view Cray XMT Canal, Tview, and Bprof data, you must first use the pproc utility to convert the output files to .ap2 format, as described in Data Conversion (pproc) on page 15. After you do so, you are ready to launch Cray Apprentice2 and explore the data.

Note: Alternatively, CLI versions of Canal, Tview, and Bprof are also available. If you choose to use the CLI version of a tool, data conversion is not necessary.

1.3.1 Modules

Cray Apprentice2 is included in the mta-pe module. If this module is not part of your default environment, you must load it before you can use Cray Apprentice2.

users/smith> module load mta-pe

1.3.2 Launching the Application

Use the app2 command to launch the Cray Apprentice2 application.

users/smith> app2 &

Alternatively, you can specify the data file to load when you launch Cray Apprentice2.

users/smith> app2 a.out.ap2 &

You can also specify the tool to use first when you launch Cray Apprentice2. For example, to begin with the Canal report, type this command:

users/smith> app2 a.out.ap2 --tool=canal &

Cray Apprentice2 supports other options related to loading data files. For more information, see the app2(1) man page.

Note: Cray Apprentice2 requires that X Window System forwarding be enabled in order to start the graphical display.
If the app2 command returns an X Window System error message, forwarding may be disabled or set incorrectly. If this happens, log into the Cray XMT login node, using the ssh -X option and try again. If this does not correct the problem, contact your system administrator for help in resolving your X Window System forwarding issues. 1.3.3 Loading Data Files After you launch Apprentice2, the report that the tool displays differs depending upon how you compiled your application. If you: • Compile with tracing and profiling—the Tview report displays. • Compile with profiling only—the Bprof report displays. • Compile with tracing only—the Tview report displays. • Compile with no report options—the Canal report displays. If you did not specify a data file on the command line, you are prompted to select a data file to display. S–2462–20 17Cray XMT™ Performance Tools User’s Guide Figure 1. Cray Apprentice2 File Selection Dialog You can use Cray Apprentice2 to simultaneously load multiple data files. For example, you may want to load multiple files in order to compare the results side-by-side. For each data file loaded, Cray Apprentice2 displays the file name and one or more icons representing the types of data included in that file. To view a report, click on an icon. Each icon spawns a separate window containing the selected report. The appearance and behavior of the Canal, Tview, and Bprof reports are specific to each tool and are discussed in the following chapters. 1.3.4 Basic Navigation Cray Apprentice2 displays a wide variety of reports, depending on the program being studied, the type of experiment performed, and the data captured during program execution. While the number and content of reports varies, all reports share the general navigation features described in Table 1. 18 S–2462–20Introduction [1] Figure 2. Cray Apprentice2 Window 1 2 3 4 5 6 Table 1. 
Cray Apprentice2 Navigation Functions

Callout   Description
1         The File and Help menus contain the following items, described in Table 2 and Table 3, respectively: Open, Comparison, Screendump, and Quit (File menu); About and Main Help (Help menu).
2         The Loaded File notebook is a tabbed notebook of all the files loaded into Apprentice2. Click a tab to bring a file to the foreground. Right-click a tab for additional report-specific options.
3         The Available Report toolbar shows the reports that can be displayed for the data currently selected. Hover the cursor over an individual report icon to display the report name. To view a report, click the icon.
4         The Open Report notebook shows the reports that have been displayed thus far for the data file currently selected. Click a tab to bring a report to the foreground. Right-click a tab for additional report-specific options.
5         The main display varies depending on the report selected and can be resized to suit your needs. However, most reports feature pop-up tips that appear when you allow the cursor to hover over an item, and active data elements that display additional information in response to left or right clicks.
6         The Status and Progress bar shows the progress of loading or plotting data.

Table 2. File Menu

Menu Option   Description
Open          Shows a dialog for selecting an .ap2 file that will be loaded and added to the file notebook described below.
Comparison    Compares the loaded files by adding a new Comparison tab to the file notebook and showing the Tdiff report for these files. This comparison can also be done by specifying --compare on the command line when running Apprentice2. Only one comparison can be done at a time. To add new files to this comparison, close the Comparison tab, load the file, and re-select this menu item. The Tdiff report is described below.
Screendump    Captures the current screen to an image. A dialog is shown to choose where to save the image.
Quit          Exits the application.

Figure 3. Save Screendump Dialog Window

Table 3. Help Menu

Menu Option   Description
About         Shows an about dialog.
Main Help     Loads the help documentation into a separate tab. The contents of this file are controlled by the environment variable APP2_HELPFILE, which should be set properly when the mta-pe module is loaded.

Figure 4. About Dialog Window

Figure 5. Online Help Window

Loaded file notebook
    This area is a tabbed notebook of all the files loaded into Apprentice2. Generally, these are all the .ap2 files loaded, but the notebook can also include a multifile comparison or help documentation, as described above.

Available report toolbar
    If the file selected in the loaded file notebook is an .ap2 file, this toolbar shows all the available reports for this file. In the image shown, going from left to right, the icons shown are for the Canal, Tview, and Bprof reports. When a comparison is done, only the Tdiff icon is shown, as it is the only report available.

Open report notebook
    When a report button is clicked in the above toolbar, the report is loaded into this notebook as a separate tab. If the report is already open, clicking on the toolbar button makes that report the frontmost tab. All report tabs feature right-click menus, which display both common options and additional report-specific options. For more information about specific options, see Canal Configuration and Navigation Options on page 43, Tview Configuration and Navigation Options on page 56, and Bprof Configuration and Navigation Options on page 72.

Status and progress bar
    The main purpose of this area is to show the progress when large .ap2 files are loading or when the plot in the Tview report is being recalculated.
1.3.5 Comparing Files

Selecting Comparison from the main menu creates a new Comparison file tab with a Tdiff report.

Figure 6. Comparison Report (Tdiff)

The Tdiff report shows a single metric for all the loaded files on the same plot. CpuUtil is shown by default. The Tdiff report has the same menu as the Tview report minus the Show Event Summaries and Show Trap Summaries options, which are meaningless in the context of multiple data files.

1.3.6 Exiting from Cray Apprentice2

To exit from an individual report, close the report window.

To close an individual data file, right-click on the file name in the Cray Apprentice2 base window and then select Close from the pop-up window.

To exit from Cray Apprentice2 and close all report windows and data files, open the base window File menu and select Quit. You are asked to confirm that you want to exit from Cray Apprentice2.

Compiler Analysis (Canal) [2]

The Canal report details the optimizations performed by the compiler. Canal reads the source file, along with information extracted from the object file or program library, and from this creates an annotated source code listing. This listing shows information and remarks about the implicit parallelism recognized and exploited by the compiler, as well as other loops that the compiler chose to execute serially, either because they lacked parallelism or because their parallelism could not be exploited profitably.

The Canal report is available at any time after the program has been compiled. You do not need to execute the program in order to produce Canal report data. Instead, depending on the compiler options you use, the remarks are saved in either a fat object (.o) or program library (.pl) file.

The Canal report is available in two forms: a text-only command-line interface (CLI) version, and a Cray Apprentice2 (GUI) version.
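As a rough illustration of the distinction the Canal report draws between loops that can be run in parallel and loops that are run serially, the following sketch shows one loop of each kind. The function names and the node type are hypothetical and do not come from this guide. Which annotations the compiler actually assigns to code like this depends on the compiler version and options, so treat the comments as expectations, not as Canal output.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical list node type, used only for this illustration. */
struct node {
    double value;
    struct node *next;
};

/* Each iteration reads b[i] and writes a[i] only, so the iterations are
   independent of one another. A parallelizing compiler can usually run a
   loop of this shape in parallel. */
void scale_array(double *a, const double *b, size_t n, double factor)
{
    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < n; ++i) {
        a[i] = factor * b[i];
    }
}

/* The trip count is not known on entry, and each step needs the pointer
   produced by the previous step. A loop of this shape is a likely
   candidate for serial execution. */
double list_sum(const struct node *head)
{
    double sum = 0.0;
    for (const struct node *p = head; p != NULL; p = p->next) {
        sum += p->value;
    }
    return sum;
}
```

A file containing functions like these could be compiled and then examined with either form of the Canal report to see how each loop was actually treated.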
2.1 CLI Version of Canal

To use the CLI version of Canal, type the canal command, followed by the name of the source file.

    users/smith> canal myprogram.c

If there is ambiguity about the source, you are prompted to use the -pl option to specify the program library. For example:

    users/smith> canal -pl a.out.pl myprogram.c

The variable myprogram.c is the C source file for which you are creating a program library.

Canal prints an annotated source code listing to stdout. This source listing is divided into two sections: the first reproduces the input source with some additional statement-level annotations at the beginning of each line, while the second provides detailed remarks about the loops in the program and how they were optimized. A column of vertical bar characters (|) separates the statement annotations from the source statements, as shown in the following example.

Example 1. Canal CLI output

    ********************************************************************************
    * Cray Compilation Report
    * Source File: radix.c
    * Program Library: radix.pl
    * Module: radix.o
    ********************************************************************************
             | unsigned* radix_sort(unsigned* array, unsigned size) {
    ** multiprocessor parallelization enabled (-par)
    ** expected to be called in a serial context
    ** fused mul-add allowed
    ** debug level: off
             | for (byte = 0; byte < sizeof(unsigned); ++byte) {
             |   for (i = 0; i < buckets; ++i) {
        2 Ss |     cnt[i] = 0;
             |   }
             |
             |   for (i = 0; i < size; ++i) {
      5 SP:$ |     cnt[MTA_BIT_PACK(~mask, src[i])]++;
             |   }
    ********************************************************************************
    * Additional Loop Details
    ********************************************************************************
    Loop 1 in radix_sort at line 28
        Expecting 8 iterations
    Loop 2 in radix_sort at line 21 in loop 1
        Expecting 256 iterations
        Loop summary: 0 loads, 1 stores, 0 floating point operations
        1 instructions, needs 50 streams for full utilization
        pipelined
    Parallel region 3 in radix_sort in loop 1
        Multiple processor implementation
        Requesting at least 45 streams
    Loop 4 in radix_sort in region 3
        In parallel phase 1
        Dynamically scheduled, variable chunks, min size = 7
        Compiler generated
    Loop 5 in radix_sort at line 25 in loop 4
        Loop summary: 1 loads, 1 stores, 0 floating point operations
        2 instructions, needs 45 streams for full utilization
        pipelined

Annotated statements consist of a number followed by a sequence of characters. The number is an identifier assigned to the innermost loop around a statement and serves as an index into the detailed loop information in the second section of the report. The absence of a number indicates that the compiler had no remark about the implementation.

The sequence of characters describes how the compiler restructured the loop. In nested loops, the left character corresponds to the outermost loop, the next character corresponds to the next loop within the nest, and so on.

The meanings of the various statement annotations and additional loop details are described in GUI Version of Canal on page 29.

For more information about canal command syntax, see the canal(1) man page. You can also type canal without a target file name to generate a usage summary statement.

2.2 GUI Version of Canal

Procedure 1. Using Canal

1. Compile and link your program.

       users/smith> cc mysource.c

2. Use the pproc utility to generate a .ap2-format data file from the compiled object code and program library.

       users/smith> pproc a.out

3. Open the resulting .ap2-format data file in Cray Apprentice2.

       users/smith> app2 --tool=canal a.out.ap2 &

   The Canal report window displays.

2.2.1 Canal Window Layout

The Canal report window is divided into three main sections.

Figure 7. Canal View in Cray Apprentice2

Table 4.
Canal Window Layout

Callout   Description
1         The Navigation toolbar shows which source code file is currently being viewed, along with the module and library in which that file appears. As inlined functions may be parallelized or optimized differently depending on where they are used, this location line also shows the calling context as a pair of numbers. When an inlined function is present, double-clicking on it in the source listing causes the source view to jump to that source location. The Back and Forward buttons are used to move between the original and jumped-to locations. For more information, see Statement-level Annotations on page 32. This toolbar is hidden by default.
2         The Source code pane shows annotated source code. Selecting a line causes the annotation detail area to be updated with any further notes regarding the selected line. If the loop browser area below is expanded, it is updated to show the current loop selected. Double-clicking on an inlined function jumps to the source location and annotations for that function as described above. The columns shown in this table are:
              Line      The source line number, shown every five lines.
              Notes     Compiler shorthand for the optimizations done on the source (hovering the mouse over a particular set of notes shows a tooltip defining all the characters).
              Code      The source code, which appears in blue if there is an inlined function at that location and red if there are traps associated with a memory allocation at that line.
              Issues    The number of instructions issued at this source line, available only if profiling was enabled.
              MemRefs   The number of memory references issued at this source line, available only if profiling was enabled.
              Counts    The number of times this source code line was tripped during execution, available only if profiling was enabled.
              Traps     The number of traps recorded at this location, available only if tracing was enabled (hovering the mouse over this value gives a breakdown of the kinds of traps that contributed to this total).
3         The Loop Browser pane is collapsed by default and shows the hierarchy of parallel regions and loops detected and parallelized by the compiler. Selecting a loop causes the source listing to shift to the line for that loop. This line is often not the same line where there are notes, as those are usually assigned to the body of the loop and not the entry point.
4         The Annotation Details pane is updated with further details about compiler optimizations done for a particular source line when that line is selected in the source listing.
5         The Search Toolbar searches in the source based on an arbitrary string, a line number, or a loop number. When a string is entered, the Next and Previous buttons jump to the next or previous match, provided there is more than one. The search is case insensitive unless that checkbox is marked. This toolbar is hidden by default.

2.2.2 Browse Loops

The Browse Loops window displays a hierarchical tree that lists all functions or procedures in the file that contain loops or parallel regions. To expand an entry and display an indented list of the loops contained within the parent loop or parallel region, click the arrow icon. To contract an indented list, click the arrow icon again. To jump to the area of interest in the source code listing, double-click on the item in the Browse Loops window.

In the Canal report, some lines of code are displayed in red or blue. Blue indicates that this is an inlined function. If you double-click on the blue text, it jumps directly to the inlined function. Red text indicates that there are traps associated with a memory allocation at that line.
2.2.3 Statement-level Annotations

Statement-level annotations are printed in the Notes column, specific to their context, and consist of alpha and numeric codes identifying the type of optimization performed and the innermost loop or parallel region within which the annotated line of code occurs. The leading number in an annotation identifies the loop or parallel region; this number is assigned by Canal and has no correspondence to line numbers or other identifiers in the source code.

Note: Functions that are always inlined are not compiled on their own, so the source of the function does not show any annotations. Instead, the annotations appear at the location where the function was inlined. Use the #pragma mta no inline directive to prevent inlining of functions and force their compilation. This causes the annotations to appear in the function source. Be aware, however, that this affects performance. Also, the annotations may not necessarily match what actually occurs when the function is inlined, as the context into which it is inlined can affect how and whether loops are parallelized.

If a loop is restructured by the compiler, the loop identifier is followed by one alpha character for each source loop within which the statement was nested before restructuring. If, in restructuring the code, the compiler has reordered the loop nest, the alpha character is followed by a numeric code indicating the loop's new position in the loop nest.

To view a tool-tip showing more information about an annotation, hover the mouse pointer over the annotation code. To see the full annotation or comment associated with an optimization, click on the line in the source code display.

The Back and Forward buttons are used to navigate inlined code. An inlined function may be optimized differently, depending on where it is inlined, and it appears in the Canal listing as blue text, which functions as a link.
Double-click on blue text to jump to the source file for the inlined function. After you have done so, the Back and Forward buttons become active. Use the Back button to return to the call site, or, when back at the call site, use the Forward button to return to the inlined function source.

Table 5. Canal GUI Statement Annotations

Code   Description
P      Indicates that the loop is executed in parallel. The exact scheduling mechanism used to implement this is described in the statement remarks.
p      Indicates that the loop is executed in parallel because of an assert parallel directive.
I      Indicates that the function has been inlined.
D      Indicates that the loop is executed concurrently due to an assert parallel directive, even though the marked statement appears to contain a dependency that would otherwise prevent parallel execution.
L      Indicates that the loop is a linear recurrence or reduction rewritten to be explicitly parallel using a cyclic-reduction technique.
-      Indicates that the loop is executed serially due to a compiler directive or flag.
S      Indicates that the loop is executed serially and that the marked statement inhibits parallelism.
s      Indicates that the loop is executed serially because the number of iterations in the loop is too small to warrant parallelization.
X      Indicates that the loop is executed serially because it is not structurally suitable (i.e., not an inductive loop).
U      Indicates that the loop is unrolled.
?      Indicates an error condition. If this occurs, please provide a test case demonstrating this behavior to Cray support.

Basic loop annotations can be followed by a colon (:) and then an additional character providing more information about the type of optimization performed. The additional character indicates a place where the compiler has performed a more complex optimization and may therefore have introduced more overhead.

Table 6.
Canal GUI Additional Annotations

Code t
    A triangular loop collapse was performed. Triangular loops have the following general form:

        for (i = 0; i < n; ++i) {
            for (j = 0; j < a*i + b; ++j) {
                A[i] = B[i][j];
            }
        }

    The variables a and b are integer expressions that are invariant with respect to the i loop. The nest is collapsed to a single suitable loop where the individual i and j values for an iteration are recovered directly from the resulting loop index. The compiler generally uses block scheduling on this loop to reduce the cost of this computation.

Code m
    A general loop collapse was performed. A general loop nest has the following form:

        for (i = 0; i < n; ++i) {
            for (j = 0; j < f(i); ++j) {
                ...
            }
        }

    Here f(i) is any expression involving the outer loop control variable and values that are invariant with respect to that loop. This loop is collapsed by first creating a temporary array t of the following form:

        t[0] = 0;
        for (i = 0; i < n; ++i) {
            t[i + 1] = t[i] + f(i);
        }

    Then the original loop nest is replaced by a single loop of the following form:

        for (k = 0; k < t[n]; ++k) {
            ...
        }

    The original i and j values are recovered by doing a binary search on the array t. The compiler generally uses block scheduling to reduce the cost of the binary search. If n is small and f(i) is large, a general loop collapse may not be the best solution. Instead, consider using a loop serial directive on the inner loop to improve performance in this case.

Code w
    The loop nest was wavefronted in one or more dimensions. A loop nest is wavefronted by adding synchronization to a sequentially executed inner loop, thereby allowing the execution of the outer loops in the nest to be staged. Staging the outer loops allows them to be executed in parallel by guaranteeing that no iteration of an inner loop in one thread will begin until all iterations on which it depends have completed, even if those iterations are being performed by other threads. For example, consider the following loop:

        for (i = 1; i < n; ++i) {
            for (j = 1; j < m; ++j) {
                a[i][j] = a[i - 1][j] + a[i][j - 1];
            }
        }

    In this example, the outer loop is parallelized while execution of the inner loop remains serial. To do this, the compiler transforms the code so that it is equivalent to the following loop:

        forall (i = 1; i < n; ++i) {
            for (j = 1; j < m; ++j) {
                if (i > 1) wait(i - 1, j);
                a[i][j] = a[i - 1][j] + a[i][j - 1];
                if (i < n) signal(i, j);
            }
        }

    Here forall indicates a loop done in parallel, and wait(i,j) delays execution until a corresponding signal(i,j) operation is performed. When n is small and m is large, wavefronting may not be the best solution. Instead, consider using a loop serial directive on the outer loop to improve performance by treating the loop nest as a series of linear recurrences.

Code e
    A scalar variable was expanded into a temporary variable to permit loop distribution. For example, consider the following loop:

        for (i = 0; i < n; ++i) {
            t = sqrt(a[i + 1]);
            a[i] = t + ...
        }

    In this example, the variable t might be expanded into a temporary array, so that the anti-dependence is preserved by distribution, as shown in the following example:

        for (i = 0; i < n; ++i) {
            t[i] = sqrt(a[i + 1]);
            a[i] = t[i] + ...
        }

Code $
    An associative operation was converted to an atomic form to allow parallelization. For example, consider the following loop:

        for (i = 0; i < m; ++i) {
            x[idx[i]] = x[idx[i]] + f(i);
        }

    In this example, the fetch, add, and store of the array element x[idx[i]] is turned into an atomic operation, which permits the loop to be parallelized by guaranteeing that no other thread may access the same array element until this operation completes. Atomic updates of floating point data may produce small differences in results. If these differences are significant to the computation, use the no recurrence directive to prevent this transformation.

2.2.4 Statement Remarks

In addition to statement-level annotations, statements may also have separate remarks. The presence of a remark is indicated by an asterisk (*) character at the end of the annotation.

The Canal listing may include the following remarks:

Function with unknown side effects: function_name
    The behavior of function_name is unknown to the compiler. This applies only to statements inside loops that are candidates for parallelization.

Indirect function inhibits parallelism
    There is an indirect function call through a pointer variable, and the compiler has no knowledge of the function's behavior. This applies only to statements inside loops that are candidates for parallelization.

Loop exit
    A secondary exit from the loop inhibits parallelization.

Loop rerolling applied
    A loop rerolling transformation was applied to the loop. For example, consider the following loop:

        for (i = 0; i < 300; i += 3) {
            a[i] = b[i];
            a[i + 1] = b[i + 1];
            a[i + 2] = b[i + 2];
        }

    In loop rerolling, the above loop is replaced with the following loop:

        for (i = 0; i < 300; ++i) {
            a[i] = b[i];
        }

Program with infinite loop
    The loop has no obvious exit and cannot terminate normally. This is not necessarily an error, but such a loop cannot ordinarily be parallelized.
Reduction moved out of number loops
    This remark identifies a statement that performs a data reduction inside a loop involving a single memory location. For example:

        for (i = 0; i < m; ++i) {
            a = a + x[i];
        }

    This loop performs a sum reduction of the elements x[0] through x[m - 1] into the location a. The compiler tries to change this loop so that each stream computes a partial sum, and these partial sums are combined into a complete sum after the loop finishes. The value of number is positive and indicates the number of loops that the combining stage of the reduction was moved out of.

Unreachable
    The statement in the code can never be executed and thus was removed by the compiler.

Unused or forward substituted
    The statement does not affect the behavior of the program. This remark is also used to identify definitions of variables when the defining expression is substituted for the variable throughout the program. This is done to eliminate unnecessary constraints on loop restructuring.

2.2.5 Loop-level Annotations

Annotations are generated for each loop in the optimized program. Annotations are also generated for parallel regions created by the compiler. Such parallel regions may contain one or more loops, which may in turn be nested within another loop.

Each loop or parallel region begins with a header line that provides the unique identifying number assigned to this loop or region, the name of the function in which this loop occurs, and optionally the unique identifying number assigned to the loop or region within which this loop or parallel region is nested. These unique identifying numbers correspond to those used in the statement-level annotations, although only the number corresponding to the innermost loop or region is used in statement-level annotations.

Each parallel region annotation can include information on the technique used to implement the region and the minimum number of streams per processor requested.
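The partial-sum scheme mentioned under the "Reduction moved out of number loops" remark can be sketched in plain C as follows. This is a serial stand-in written for illustration, not the code the compiler generates: NUM_WORKERS is an assumed constant standing in for the number of streams, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed worker count standing in for the number of streams. */
#define NUM_WORKERS 4

/* Sum x[0..m-1] in two stages: per-worker partial sums, then a combine. */
double partial_sum_reduction(const double *x, size_t m)
{
    double partial[NUM_WORKERS] = {0.0};
    assert(x != NULL || m == 0);

    /* Stage 1: worker w accumulates its own contiguous block of x.
       On a parallel machine the workers would run concurrently; here
       they are simply visited one after another. */
    size_t chunk = (m + NUM_WORKERS - 1) / NUM_WORKERS;
    for (size_t w = 0; w < NUM_WORKERS; ++w) {
        size_t begin = w * chunk;
        size_t end = (begin + chunk < m) ? begin + chunk : m;
        for (size_t i = begin; i < end; ++i) {
            partial[w] += x[i];
        }
    }

    /* Stage 2: the combining step, performed once after the loop. */
    double a = 0.0;
    for (size_t w = 0; w < NUM_WORKERS; ++w) {
        a += partial[w];
    }
    return a;
}
```

Because the additions are regrouped, a floating-point sum computed this way can differ in its last bits from the strictly left-to-right sum of the original loop.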
The Canal listing may include the following loop-level annotations:

block scheduled
    Block scheduling was used to implement a parallel loop.

Compiler generated
    Loop was created by the compiler as part of the optimization process.

Dependencies carried by: variable
    Loop parallelism was inhibited by assumed inter-iteration interactions involving variable.

dynamically scheduled
    Dynamic scheduling was used to implement a parallel loop. Iterations of the loop are assigned to individual threads one iteration at a time.

dynamically scheduled, chunk size = n
    A dynamically scheduled loop where threads schedule n iterations at a time.

dynamically scheduled, variable chunks, min size = size
    Dynamic scheduling was used to implement a parallel loop. Iterations of the loop are assigned to individual threads in blocks of variable numbers of iterations, beginning with large blocks and decreasing to blocks of size iterations.

Expecting size iterations
    The compiler assumed this loop executes for size number of iterations. This assumption affects the order of loops in the final loop nest and the choice of implementation techniques.

Expecting size iterations based on array bounds
    The compiler assumed that this loop executes for size number of iterations. The number of iterations was derived by examining the declared bounds of arrays referenced inside the loop.

Implemented with futures
    The parallel loop was implemented using threads created by the runtime using future statements.

in parallel phase number
    The loop was in phase number. Phases are numbered starting with 1. There are no barriers between loops in the same phase, while there are barriers between different phases. The number is also used to annotate trace information available in Tview.

Initial array value cache for recurrence
    A loop was created by the compiler to cache certain array values.
    These values are overwritten by later stages of a recurrence.

interleave scheduled
    Interleave scheduling was used to implement a parallel loop.

Loop moved from level n to level m
    The order of loops in a nest has been altered by moving the current loop from source level n to destination level m. The outermost loop in a nest is level 1.

Loop summary: details
    The details indicate the number of memory operations, floating-point operations, and instructions executed per iteration of the loop.

Loop not pipelined: reason
    An attempt was made to use the special loop scheduler for this loop, but the attempt failed for the listed reason and the standard instruction scheduler was used instead. Valid reasons include:

    Debugging level too high
        Loop scheduling is not applied for debugging levels -g1 and -g2.

    Loop too large
        The loop exceeds the size threshold above which loop scheduling is not attempted.

    Not structurally OK
        There are structural requirements such as control flow or function calls that inhibit loop scheduling.

    Too many condition codes
        Condition codes are used to implement test operations for comparisons. However, a large number of condition codes inhibits loop scheduling.

    Too many pseudo registers
        Pseudo registers are internal names for values. Using a large number of pseudo registers can exhaust the available supply and inhibit loop scheduling.

    Too many registers
        The scheduler was unable to find an acceptable schedule that fit in the available hardware registers.

Loop unrolled n times
    The loop was unrolled n times, so that there are n+1 copies of the original loop body. Unrolling is typically applied to an outer loop when the inner loops are fused together. This transformation is done only when the compiler expects to reduce the total number of memory operations for the loop nest.
n instructions added to satisfy recurrence
    This indicates that there is a cycle of interactions between statements in this loop, and that the compiler was unable to schedule the loop in the minimum number of instructions predicted from the simple set of operations. This recurrence may include false dependencies between memory operations, which can be eliminated by using a no dependence directive.

n instructions added to reduce register requirements
    The compiler was unable to pack the operations of this loop into the minimum number of instructions.

Needs number streams for full utilization
    Indicates that the compiler assumes this loop will achieve full processor utilization if the loop body is executed concurrently on number streams per processor. This annotation may also appear on loops that are not parallelized. In this case it indicates that the compiler assumes full utilization would be achieved if the serial loop was executed in a parallel context (e.g., inside another parallel loop or in a function called from a parallel loop) with at least number streams per processor.

Odd iterations for unrolled loop
    When a loop is unrolled and the amount of the unrolling is not known to be an exact divisor of the number of iterations of the loop, a copy of the original loop is created to handle the small number of extra iterations.

parallel region initialization
    A loop was added to initialize the full-empty bits. When a single-processor parallel region that includes a recurrence or reduction is implemented, it needs a block of memory with the full-empty bits set to empty.

pipelined
    A specialized instruction-scheduling technique was applied to the loop to increase memory concurrency and reduce loop overhead.

private variable: var
    For the variable var, a private copy was created for each stream working on the loop. These variables may have been asserted local or proven local by the compiler.
Recurrence control loop, chunk size = n
    Implementation of a recurrence may require caching of values from one stage to the next. In this case, each stream performs the loop in fixed-size chunks, and there is an outer control loop that implements the entire recurrence loop in batches of iterations. The number of iterations per chunk is n; thus the number of iterations per batch is n times the number of streams.

Recurrence control loop, non-iterating
    The outer control loop for a recurrence performs all iterations as a single batch and will not iterate.

Scheduled to minimize serial time
    The non-loop scheduled serial loop indicated was implemented so as to minimize time rather than instruction issues.

single processor implementation
    The parallel loop or region indicated was implemented to use only a single physical processor.

Stage n of recurrence
    This indicates a particular stage of a linear recurrence computation.

Stage n of recurrence communication
    This indicates a communication loop that follows a particular stage of a recurrence.

Using max concurrency c
    Indicates that the parallel region will use a maximum concurrency of c because the user specified the max concurrency c pragma on all parallel loops in this region. For single processor parallel regions this means the parallel loops will use at most c streams. For multiprocessor parallel regions this means at most max(1,c/num_streams) processors will be used, where num_streams is the number of streams the compiler requests for each processor. For loop future parallel regions this means that at most c futures will be created.

Using max n processors
    Indicates that the parallel region will use at most n processors because the user specified the max n processors pragma on all parallel loops in this region.

Note: See the note in Statement-level Annotations on page 32 for information about annotations of inlined functions.
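To make the "Loop unrolled n times" and "Odd iterations for unrolled loop" annotations more concrete, the following hand-written sketch unrolls a simple loop 3 times, giving 4 copies of the body, and adds a copy of the original loop for the leftover iterations. It is a simplification under stated assumptions: the function name is hypothetical, the unroll factor is arbitrary, and the compiler's own unrolling typically targets an outer loop of a nest, as described above, so its generated code will look different.

```c
#include <assert.h>
#include <stddef.h>

/* Adds b[i] into a[i] for i in [0, n), with the loop unrolled by hand. */
void add_arrays_unrolled(double *a, const double *b, size_t n)
{
    assert((a != NULL && b != NULL) || n == 0);
    size_t i = 0;

    /* Main loop: unrolled 3 times, so 4 copies of the original body
       execute per trip. */
    for (; i + 4 <= n; i += 4) {
        a[i]     += b[i];
        a[i + 1] += b[i + 1];
        a[i + 2] += b[i + 2];
        a[i + 3] += b[i + 3];
    }

    /* Odd iterations: a copy of the original loop handles the 0 to 3
       iterations left over when n is not a multiple of 4. */
    for (; i < n; ++i) {
        a[i] += b[i];
    }
}
```

With n = 6, for example, the main loop runs once and covers elements 0 through 3, and the second loop covers elements 4 and 5.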
2.2.6 Canal Configuration and Navigation Options
The Canal report provides a number of options for configuring the display and finding information. All of these options are accessed by right-clicking on the Canal tab in the upper-left corner of the window. When you do so, a pop-up menu displays, offering the following options.

2.2.6.1 Select Source
After you choose Select Source from the pop-up menu, the Select Source window displays.

Figure 8. Canal Source File Selection

Use this window to navigate to and select the source file you want to examine in the Canal report. To select a file, highlight it in this window and click the OK button. After you select a file, it is displayed in the Canal window.

2.2.6.2 Toolbars
The Canal window has two optional toolbars: Navigation and Search. Both toolbars are hidden by default. To show or hide a toolbar, select Toolbars from the pop-up menu, and then select the toolbar you want to show or hide.

The Search toolbar is used to find specific text, line numbers, or loops in the source code. When shown, it displays at the bottom of the window. To search for a text string, enter the text in the Find box and press Enter. To search for the next or previous occurrence of the same text, click the Next or Previous button. To match the case of the text string exactly, check the Match Case box. To search for a specific line of code by line number, enter the line number in the Line box and press Enter. To search for a loop by its unique sequence number, type the number in the Loop box and press Enter. There is no "clear" function. Only the search mode you are using is relevant; any text or values in the other entry fields are ignored.

The Navigation toolbar lists the files used to generate the Canal report and contains the Loops button.
This toolbar is not displayed by default and is discussed in Canal Window Layout on page 29.

2.2.6.3 Show/Hide Data
By default, the Canal report displays all information currently available. To reduce the amount of information displayed, select Columns from the pop-up menu, and then select the data column you want to show or hide. The columns and their contents are described in Table 7. You cannot choose to hide the source listing.

Table 7. Data Columns
Column Heading: Description
Line: The source code line number in increments of five.
Loop: The loop number and annotation codes. Hovering the mouse pointer over the Loop column causes a pop-up tool tip to display the meaning of the annotation code.
Issues: The number of machine instructions issued. Issues and Counts data are available only if profiling was done.
MemRefs: The number of memory references.
Counts: The number of times this line of code was executed.
Notes: Compiler shorthand for the optimizations done on the source. Hovering the mouse pointer over a particular set of notes causes a pop-up tool tip to display the meaning of each character.
Traps: The number of traps recorded at this line. Traps data is available only if tracing was done. Traps data is important for detecting hotspots in code. Hovering the mouse pointer over a value in the Traps column causes a pop-up tool tip to display the kinds and number of traps that contributed to this number. A high number of LATENCY_LIMIT traps may indicate a hotspot.

2.2.6.4 Change Font
To change the face, size, or style of the Canal window display font, select Change Font from the pop-up menu. The Select Font window displays.

Figure 9. Canal Select Font Dialog

Use the options on this window to select the font face, style, and size used in the Canal window. To accept your changes, click the OK button.

Note: This option affects only the Canal window. It does not affect the Tview or Bprof windows.
2.2.6.5 Panel Actions
To manipulate the Canal report window, select Panel Actions from the pop-up menu.

Table 8. Canal GUI Panel Actions
Action: Description
Detach Panel: Displays the report in a new window. The original window remains blank.
Remove Panel: Closes the report window.
Freeze Panel: Freezes the report as shown. Subsequent changes to other parameters do not change the appearance of the frozen report.

Trace View (Tview) [3]

The Tview report uses information captured during program execution to produce a whole-program view of performance metrics over time. When used with Bprof, the Tview report can help you to identify the functions that consume the greatest amount of execution time while producing the least amount of work. The Tview report is available in two forms: a text-only command-line interface (CLI) version, and a Cray Apprentice2 (GUI) version.

The compiler -trace option enables tracing for all functions larger than 50 source lines. The -trace_level option is similar, but allows you to specify the minimum size in source lines of the functions to be traced. Likewise, the -tracef option allows you to specify a comma-delimited list of function names to be traced. Additional tracing options are available and are described in the cc(1) and c++(1) man pages.

When a function is traced, calls to the event-tracing library are placed at the function's entry and exit points. In addition, any compiler-generated parallelism within the function has trace-library calls placed at its fork, join, and barrier portions. Inlined functions are never traced, regardless of the tracing level.

Because the trace file can grow very large, only the first 512 occurrences of each individual traced event are recorded in the trace file. This limit can be increased or decreased by calls to the runtime function mta_set_trace_limit, which is described in the mta_set_trace_limit(3) man page.
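The effect of the per-event limit can be illustrated with a small model. The following Python sketch is not the runtime's implementation: the function name is hypothetical, each "individual traced event" is modeled as a hashable key, and only the default limit of 512 comes from the text above.

```python
from collections import Counter

DEFAULT_TRACE_LIMIT = 512  # default per-event limit stated in the text


def throttle(events, limit=DEFAULT_TRACE_LIMIT):
    """Model of trace throttling.

    Keeps only the first `limit` occurrences of each distinct event key
    and counts the rest as lost. Returns (recorded_events, lost_count).
    """
    seen = Counter()
    recorded = []
    lost = 0
    for event in events:
        seen[event] += 1
        if seen[event] <= limit:
            recorded.append(event)
        else:
            lost += 1
    return recorded, lost


if __name__ == "__main__":
    # Hypothetical event stream: one hot function and one rarely called one.
    stream = ["hot_func entry"] * 600 + ["rare_func entry"] * 10
    recorded, lost = throttle(stream)
    print(len(recorded), "recorded,", lost, "lost")
```

In this model, raising the limit trades a larger trace file for fewer lost events, which is the tradeoff the mta_set_trace_limit function exposes.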
3.1 CLI Version of Tview
The CLI version of Tview produces the trace data in one of three formats: XML, Apprentice2, or compressed XML (gzip). By default the trace data is displayed as XML to stdout.

To use the CLI version of Tview, type the tview command. Given that trace files are typically fairly large, it is generally advisable to redirect the results to an output file or pipe them through the more command.

users/smith> tview | more

The contents of the trace.out file are displayed as XML code. Alternatively, you can create a compressed XML file by using the -z option.

users/smith> tview -z -o filename.gz

Finally, to create a file in Apprentice2 format, which you can view with the GUI version of Tview, use the -a option.

users/smith> tview -a -o filename.ap2

Note: An .ap2 file generated using the CLI version of Tview will contain only the Tview report. To generate .ap2 files that contain additional reports, use pproc, as described in Data Conversion (pproc) on page 15.

For more information about the tview command syntax, see the tview(1) man page. You can also type tview -h at the command line to generate a usage summary.

3.2 GUI Version of Tview
To use the GUI version of Tview, do the following:

Procedure 2. Compiling and Linking for Tview
1. Compile and link your program using the -trace option.
users/smith> cc -trace mysource.c
2. Execute your program using the mtarun -trace option.
users/smith> mtarun -trace a.out
Upon successful completion of program execution, mtarun generates a data file named trace.out.
3. Use the pproc utility with the --mtatf option to generate an .ap2-format data file from the binary executable and the trace data.
users/smith> pproc --mtatf=trace.out a.out
4. Open the resulting .ap2-format data file in Cray Apprentice2, or use the command-line interface.
users/smith> app2 a.out.ap2 &
The Tview report window displays.
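The four steps of the procedure can be collected into one sequence, for example when scripting repeated runs. The Python sketch below only assembles the command lines shown in the procedure and does not execute them. The helper name is hypothetical, and it assumes, as in the example, that the compiler output is a.out and that pproc names its output by appending .ap2 to the executable name.

```python
def tview_gui_commands(source="mysource.c", executable="a.out",
                       trace_file="trace.out"):
    """Return the command lines of the Tview GUI procedure, in order.

    Assumption: the .ap2 file is named <executable>.ap2, as in the
    app2 a.out.ap2 example.
    """
    return [
        ["cc", "-trace", source],                        # step 1: compile with tracing
        ["mtarun", "-trace", executable],                # step 2: run, producing trace.out
        ["pproc", f"--mtatf={trace_file}", executable],  # step 3: convert to .ap2
        ["app2", f"{executable}.ap2"],                   # step 4: open in Apprentice2
    ]


if __name__ == "__main__":
    for command in tview_gui_commands():
        print(" ".join(command))
```

Keeping each command as a list of arguments makes it straightforward to hand the steps to a process launcher on a system where the Cray tools are installed.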
3.2.1 Using Tview
The Tview report window is divided into three main sections.

Figure 10. Tview Window Layout

Table 9. Tview Window Layout
Callout: Description
1: Summary line of trace events and traps recorded at run time.
2: Performance metric plot of metrics derived during run time, plotted against execution time.

The Summary line describes how many trace events were recorded, how many were lost due to the throttling of the tracing system in the runtime, the number of CPUs and the clock speed, and the number of data_blocked and float_extension traps as recorded by the trap counters in the runtime.

The Performance Metric plot displays various performance metrics, derived from the hardware counters, plotted against the execution time. By default, Tview displays a graph showing processor utilization CpuUtil against memory concurrency MemConcur. Each of these metrics has a different unit, so they are shown on two separate y-axes. As can be seen in the screenshot, the label for each axis shows the units, and the scales differ accordingly. A horizontal dashed line at or near the top of the plot shows the system limit for any metrics that have a maximum value or the injection limit for processing references, beyond which a bottleneck will occur. These limits are defined in Table 10.

Use the Show Metric menu to hide or select additional metrics. If a newly selected metric has the same units as one of the metrics currently shown, it is added; if not, you will need to unselect one or more of the metrics shown to free up one of the y-axes.

The performance metric plot area is interactive. When the cursor is a crosshair (+), you can select an area of the plot by clicking on the plot with the mouse, holding down the button, and moving the mouse. When you release the button, the plot will zoom into this region. Repeat this action multiple times to zoom into an area of interest.
When you right-click on the plot, the view will return to the previous zoom level. The legend in the upper right corner shows which metrics are currently shown. Each title in the legend has a small box showing the color of its line. When the cursor is an arrow, clicking on one of these boxes brings up a dialog window allowing you to change the color of the line.

3.2.2 Traced Data
The Tview graph presents information from the trace file in a graphical format to simplify the analysis of performance data. The x-axis of the graph shows the time in seconds relative to the start of program execution, and the y-axis shows various performance metrics derived from the hardware counters. The availability of a second y-axis allows Tview to show metrics with two different scales.

When the event or trap detail pane is first opened, the first event or trap in the detail pane will be selected. A selection line appears on the plot, corresponding to the time when the selected trap or event was recorded. This selection line has a small handle in the middle. When the mouse pointer is over the handle and the cursor becomes an arrow, clicking and holding down the mouse button allows you to drag this selection line to a previous or subsequent event. Because there is not an event for every possible position of the selection line, it is possible to release the mouse button somewhere between two events. In this case, the line will "snap" to the closest event.

The Event and Trap Detail pane is not visible by default, but will appear if Show Event Summaries or Show Trap Summaries is selected from the Tview context menu. When both are shown, there are tabs at the top of the region allowing navigation between the two.

3.2.2.1 Optional Data
By default, Tview displays CpuUtil and MemConcur data. In addition, other types of data are available.
To display these values, right-click on the Tview tab in the upper-left corner of the window, and toggle the values that you want to show or hide.

Table 10. Tview GUI Optional Data
Metric (Unit): Description
CpuUtil (Processors): Shows processor utilization based on the instruction issue counter. The maximum value is the number of teams used.
CpuAvail (Processors): Shows processor availability based on the issues vs. issues and phantoms. The maximum value is the number of teams used.
StrmUtil (Streams): Shows average stream utilization based on the stream reservation counter. The maximum value is the maximum number of streams multiplied by the number of teams.
StrmReady (Streams): Shows streams ready to issue instructions but not currently executing, based on the stream ready counter.
MemRefs (References): Shows LOAD, STORE, INT_FETCH_ADD, and STATE operations issued, based on the memory reference counter. The maximum value is the number of teams used.
MemConcur (References): Shows memory references issued but not completed, based on the concurrency counter. The limit is the injection limit, which represents a bottleneck for processing the references over that limit. This limit is the number of teams multiplied by the network limit.
FloatOps (References): Shows floating point operations. Based on a programmable counter; not valid if changed.
Retries (Operations): Shows retried memory operations. Based on a programmable counter; not valid if changed.
Creates (Operations): Shows stream create operations. Based on a programmable counter; not valid if changed.
Traps (Traps): Shows traps taken. Based on a programmable counter; not valid if changed.

3.2.2.2 Zooming In
By default, Tview shows data for the entire length of the program run. To zoom in on a smaller span of time, hover the cursor over the graph until it changes to a + character, and then left-click and drag to define a bounding box. The graph is redrawn to show the selected time span.
To zoom out again, right-click anywhere on the graph. Alternatively, you can use the Select Range option to enter numeric values for the starting and ending times that define the range of data to be displayed. For more information about the Select Range and Clear Selection options, see Tview Configuration and Navigation Options on page 56. 3.2.2.3 Handling Large Trace Files The APP2_SWAPFILE environment variable is set when Apprentice2 needs to handle very large trace files. Set APP2_SWAPFILE to the root name of some temporary files that Apprentice2 creates to help offset memory usage on the XMT login nodes that lack swap. For example, export APP2_SWAPFILE=/mnt/lustre/users/app2 might be a reasonable choice for this variable. Apprentice2 then creates a couple of files with the name /mnt/lustre/users/app2.XXXXXX where XXXXXX is replaced by some random string. These files are cleaned up if Apprentice2 is exited properly. 3.2.3 Event and Trap Details This pane is not visible by default, but will appear if Show Event Summaries or Show Trap Summaries is selected from the Tview context menu. When both are shown, there are tabs at the top of the region allowing navigation between one detail or the other. Events and Traps are displayed in a tabular format. Click a column heading to sort the data by that type. If you zoom into a particular time range on the plot in the pane above, only the events or traps for that time range will be shown. Selecting an individual event or trap draws a line on the plot, showing the location of that event in the timeline. The line includes a handle, which you can use to drag the line around the plot. As the line moves, the event selected in the Event Detail will change. Double clicking on an event or trap will jump to that source location in the Canal report. 
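The range filtering and selection-line behavior described above, together with the snapping described in Traced Data, can be modeled in a few lines. This Python sketch is an illustration only, not Apprentice2 code: the function names are hypothetical, event times are assumed to be a sorted list of seconds, and breaking ties toward the earlier event is an assumption.

```python
from bisect import bisect_left


def events_in_range(times, start, end):
    """Event times that remain listed after zooming to [start, end]."""
    return [t for t in times if start <= t <= end]


def snap_to_closest(times, release_time):
    """Time of the event nearest to where the handle was released.

    `times` must be sorted in ascending order and non-empty.
    Assumption: on an exact tie, the earlier event wins.
    """
    if not times:
        raise ValueError("no events to snap to")
    i = bisect_left(times, release_time)
    if i == 0:
        return times[0]
    if i == len(times):
        return times[-1]
    before, after = times[i - 1], times[i]
    if release_time - before <= after - release_time:
        return before
    return after


if __name__ == "__main__":
    event_times = [0.5, 2.0, 3.5, 9.0]  # hypothetical timestamps in seconds
    print(events_in_range(event_times, 1.0, 4.0))
    print(snap_to_closest(event_times, 3.0))
```

A binary search keeps the lookup fast even when the detail pane holds many events, which matters for the large trace files discussed in Handling Large Trace Files.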
3.2.3.1 Event Details
The Events tab displays the timestamp, type of event, team performing the event, and function name for every traced event within the range currently displayed on the Tview graph. Click the expandable area below the table, labeled Filter, to filter events by kind. To disable filtering, collapse this area.

Figure 11. Tview Event Details

Table 11. Event Details
Heading: Description
Time: The time at which the event occurred.
Kind: The kind of event: FUNCTION_ENTRY, FUNCTION_EXIT, PAR_REGION_ENTRY, PAR_REGION_EXIT, PAR_REGION_BARRIER, START_FUTURE, or USER_SPECIFIED.
Proc: The processor on which the event occurred.
Name: The name of the event.
Streams: The number of streams requested at PAR_REGION_ENTRY.
25% Done: The time at which 25% of the streams in a region reached a PAR_REGION_BARRIER or PAR_REGION_EXIT.
50% Done: The time at which 50% of the streams in a region reached a PAR_REGION_BARRIER or PAR_REGION_EXIT.
75% Done: The time at which 75% of the streams in a region reached a PAR_REGION_BARRIER or PAR_REGION_EXIT.
100% Done: The time at which 100% of the streams in a region reached a PAR_REGION_BARRIER or PAR_REGION_EXIT.

3.2.3.2 Trap Details
The Traps tab shows all the traps recorded into the trace during execution. A checkbox below the table can be used for collating or grouping the traps by their program counter. When collated, several of the columns will change as they are not relevant to this summarized view.

Traps data is useful for determining the reasons for certain types of poor program performance, such as memory hotspotting. During program execution, if the rate of traps exceeds a certain threshold, the Cray XMT runtime generates a trace event providing information about the range of traps that were encountered. The number of traps listed in the detail will almost always be less than those shown in the summary at the top.
The difference is that all the traps handled by the runtime are captured by the counters, but only those that occur at a rate exceeding a given threshold will cause an event. This threshold is controlled by the MTA_PARAMS environment variable. The rate is equal to the minimum dump threshold over the frequency of event sampling. Specify the threshold by setting MTA_PARAMS to PC_HASH n, m, l, where n, m, l are the hash size, age threshold, and dump threshold, respectively. Events are hashed based on pc and event type, so the hash size determines how often the event hash will have to wait for a free row. The age threshold determines the frequency of trap event sampling, as well as when a trap event is considered stale. The dump threshold determines the minimum number of events that must be hashed before an event is generated. The default values for n, m, l are 1009, 30000000, and 5, respectively.

Note: The number of traps in the summary includes traps taken in the system libraries. The default behavior of the app2 command is to capture all of the traps and events that occur, whether they are in the user code or the system code. To hide the system traps, start Apprentice2 with the --nosystem flag. This flag is documented in the app2(1) man page.

Figure 12. Tview Trap Details

Table 12. Trap Details
Heading: Description
Kind: The type of trap, either DATA_BLOCKED or FLOAT_EXTENSION.
Data Result Code: The result code or subtype of DATA_BLOCKED traps.
Retry Op Code: The machine operation which caused the trap, either LOAD, STORE, or INT_FETCH_ADD.
Count: The number of traps that occurred in the sample period, or the total number when collated.
Rate: The rate at which the traps occurred in the sample period. This detail is absent when collated.
Time: The time at which the trap event was recorded, which is not necessarily the time of the trap. This detail is absent when collated.
Destination Register: The destination register of the memory operation for DATA_BLOCKED traps. This detail is absent when collated.
Data Address: The data address of the memory operation for DATA_BLOCKED traps. This detail is absent when collated.
Program Counter: The program counter at which the trap was taken. Typically this is the instruction immediately after the one that caused the trap.
Library: The library in which the traps occurred. This detail is visible when collated.
Module: The module in which the traps occurred. This detail is visible when collated.
Source: The source file in which the traps occurred. This detail is visible when collated.
Line: The source line number at which the traps occurred. This detail is visible when collated.

3.2.4 About System Library Traps
Effective with Cray XMT 2.0, Tview shows not only the traps and events that occurred within your program, but also the traps that occurred inside system code. Previously this information was available only when you invoked Apprentice2 with the --system option.

3.2.5 Tview Configuration and Navigation Options
The Tview report provides a number of options for configuring the display. All of these options are accessed by right-clicking on the Tview tab in the upper-left corner of the window.

Table 13. Tview GUI Configuration and Navigation Options
Option: Description
Select Range: Opens a window that enables you to zoom in on a portion of the data by selecting the beginning and ending time points. For more information, see Select Range on page 57.
Clear Selection: Resets the range to zero and the end of the program execution.
Show Metric: Enables you to show or hide the StrmUtil, StrmReady, MemRefs, MemConcur, FloatOps, Traps, Retries, or Creates data. For more information, see Optional Data on page 51.
Show Details: Shows/hides tracing details.
For more information, see Event and Trap Details on page 52.
Position Legend: Positions the graph legend at the left edge or right edge of the window, or hides it altogether.
Change Font: Changes the font for text displayed in the window.
Panel Actions: Performs the standard Cray Apprentice2 actions: detach, remove, or freeze a panel. For more information, see Panel Actions on page 58.
Panel Help: Displays panel-specific help, if available.

3.2.5.1 Select Range
By default, Tview shows data for the entire length of the program run. To zoom in on a smaller span of time, use the Select Range option to enter numeric values for the starting and ending times that define the range of data you want displayed.

Figure 13. Select Range Dialog

Alternatively, you can hover the cursor over the graph until it changes to a + character, and then left-click and drag to define a bounding box. After you either enter range values or draw a bounding box, the graph is redrawn to show only the selected time span. To undo a zoom-in, either use the Clear Selection option, or right-click anywhere on the graph.

3.2.5.2 Panel Actions
To manipulate the Tview report window, select Panel Actions from the pop-up menu.

Table 14. Tview GUI Panel Actions
Action: Description
Detach Panel: Displays the report in a new window. The original window remains blank.
Remove Panel: Closes the report window.
Freeze Panel: Freezes the report as shown. Subsequent changes to other parameters do not change the appearance of the frozen report.

3.3 Partial Tracing
If the execution of a tracing program terminates prematurely, tracing information may still be available. If so, the trace.out file will still be produced in the same directory as would be expected for a successfully completed execution. The data may vary slightly, depending on the reason for the termination.
In general, however, the output of a premature termination will be the same as what would have been seen up to that point in a full execution. For example, consider this trace from a full execution of the radixsort application.

Figure 14. Full Trace of radixsort Application

Figure 15 shows a partial trace of the same application.

Figure 15. Partial Trace of radixsort Application

By zooming in on the same segment of the program in the full trace as is shown in the partial trace (Figure 16), we can see that the two executions show similar plots up to 88 seconds, which is when the program was terminated with a SIGINT. After that the plot tapers off in the partial trace, but continues as expected in the full trace.

Figure 16. Segment of Full Trace of radixsort Application

Partial tracing is available for any execution that terminates prematurely, provided tracing was initialized and tracing data was gathered prior to termination. However, tracing data is gathered and stored in runtime trace buffers. Only three termination signals will initiate flushing of these buffers to the persistent mmapped buffers that are shared between the runtime and mtarun. Those signals are SIGINT, SIGQUIT, and SIGTERM. All other causes of termination will leave the data in the runtime trace buffers and output only what was already written to the trace.out file, and what remains in the persistent mmapped buffers.

It is possible to tune the frequency with which the trace buffers are flushed to the persistent buffers, thus making the trace buffer data more accessible. This tuning is described in Changing the Frequency of Trace Buffer Flushing on page 62.
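The rule about which termination signals flush the runtime trace buffers can be written down as a simple lookup. The Python sketch below is a model of the rule stated above, not part of the Cray runtime; the names are hypothetical, and it assumes a POSIX system where Python's signal module defines SIGQUIT.

```python
import signal

# The three signals that, per the text, trigger a flush of the runtime
# trace buffers to the persistent mmapped buffers.
FLUSHING_SIGNALS = frozenset({signal.SIGINT, signal.SIGQUIT, signal.SIGTERM})


def flushes_trace_buffers(signum):
    """True if termination by `signum` flushes the runtime trace buffers."""
    return signum in FLUSHING_SIGNALS


if __name__ == "__main__":
    for sig in (signal.SIGINT, signal.SIGTERM, signal.SIGSEGV):
        outcome = "flushed" if flushes_trace_buffers(sig) else "not flushed"
        print(signal.Signals(sig).name, outcome)
```

When a run must be stopped early and the trace matters, this suggests preferring one of the three flushing signals over others.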
3.4 Tuning Tracing

3.4.1 Changing the Persistent Buffer Size
As described in Partial Tracing, tracing data is gathered during program execution and stored in runtime trace buffers. Periodically these buffers are dumped to persistent buffers, which are shared between the runtime and mtarun. The size of the persistent buffers determines how much tracing data can be gathered before requiring a dump of the gathered data to the trace.out file. The default size of these buffers is 16,777,216 words (16 MB), which is also the maximum size. This default provides the lowest overhead in writing to the trace file. Depending on the requirements of your application, you may want to change the size of these buffers to free up memory. To do this, use the MTA_PARAM mmap_buffer_size to specify the desired size in words.

MTA_PARAM="mmap_buffer_size 8192"

3.4.2 Changing the Frequency of Trace Buffer Flushing
Data that is held in the runtime trace buffers is dumped periodically to the persistent buffers. It is the data in the persistent buffers that is output upon termination of a program. This means that if a program is terminated prematurely, there may be data in the runtime trace buffers that was not yet dumped to the persistent buffers. To minimize this data loss, you can use the MTA_PARAM must_dump_size to reduce the size of the trace buffer from the default size of 512 words. Again, the tradeoff is that the runtime trace buffers will be dumped more frequently during program execution, which can have an impact on performance.

MTA_PARAM="must_dump_size 256"

On the other hand, when an application requires a large number of streams, fewer streams may be available for tracing. This can cause a bottleneck in tracing because the teams have to wait for streams in order to dump their data to the persistent buffers.
If a large number of traps are being taken due to tracing at larger scales, raising the value of must_dump_size can alleviate the bottleneck.

3.4.3 Resolving Tracing Failures
Tracing failures generally are caused by one of the following issues:
• When tracing fails to initialize, program execution continues without tracing. To override this default behavior and force your program to exit if tracing fails, use the MTA_PARAM exit_on_trace_fail:
MTA_PARAM="exit_on_trace_fail"
• A trace.out file can be empty when program execution is terminated prematurely by any signal other than SIGINT, SIGQUIT, or SIGTERM, preventing data in the runtime trace buffers from being dumped to the persistent buffers. If your trace file is empty, try increasing the frequency with which the trace buffers are dumped, as described in Changing the Frequency of Trace Buffer Flushing on page 62.

Block Profiling (Bprof) [4]

The Bprof report uses information captured during program execution to provide a function-level view of program performance. When combined with Tview, it can help you to identify the functions that consume the greatest amount of execution time while producing the least amount of work.

To produce the Bprof report, you must first compile the program using the compiler's -profile option, and then execute the program using mtarun. For example:

users/smith> cc -profile myprogram.c
users/smith> mtarun a.out

In this example, myprogram.c is the name of the source file that is being compiled. Running the program produces a profile data file, profile.out, which is saved in either the execution directory or the directory specified in the MTA_PROFILE_FILE environment variable.

Note: If the executable binary file for a program is not altered between executions, the profile data file is updated rather than removed and rewritten each time the program is run.
This allows you to generate profile reports that reflect the typical performance of your program over many runs, rather than the unique and perhaps exceptional performance of a single run.

When a program is profiled, the system records the number of instructions issued by instrumented routines during program execution, but not the amount of time spent executing any given routine. The compiler -profile option enables block profiling for all routines compiled and linked using the -profile flag, as well as all routines inlined into a routine that was compiled and linked with the -profile flag. However, any routine called by a profiled routine, but not inlined into that routine, shows up in the Bprof output as having generated no instruction issues.

The Bprof report is available in two forms: a text-only command-line interface (CLI) version, and a Cray Apprentice2 (GUI) version.

4.1 CLI Version of Bprof
The CLI version of Bprof displays the profile data as formatted text. To run this version, use the bprof command. The command defaults to using a.out as the name of the executable and profile.out as the name of the profile data file. Given that profile data files are typically fairly large, it is generally advisable to redirect the results to an output file or pipe them through more. For example:

users/smith> bprof | more

The text report generated by bprof consists of a header followed by three sections. The header contains a summary of total instructions issued for profiled routines, as well as a list of various sources of program overhead.

Example 2.
Bprof CLI output – header

Approximate total: 133256589 issues, profiled: 166411 issues
Approximate amount of the program that was profiled: 0.1%
Total function call overheads: 42 issues (0.0%)
Total parallel overheads: 14729 issues (8.9%)
Total profiling overheads: 11239 issues (6.8%)
Total unknown overheads: 0 issues (0.0%)

The first section of the report provides a profile of the program execution in terms of instructions issued for each call tree branch. This section is broken down into subsections, each of which provides information about one routine, along with its parent and child routines. These subsections are organized within the first section based on the number of instructions issued by the routine and all of its descendants, and each subsection provides the following information.

Table 15. Bprof CLI Section Data
Data Tag: Description
% Issues: Percent of total profiled instructions issued by the routine and all its children combined.
% MemRefs: Percent of total memory references.
Self: Instruction counts for the routine itself and for each individual parent or child of the routine, in units of 100M.
Total: Instruction counts for all descendants of the routine and for the descendants of each individual parent or child of the routine, in units of 100M.
% Calls / Calls: For the routine, the total number of times the routine was called; for a parent, the number of times it called the routine out of the total number of times the routine was called; for a child, the number of times it was called by the routine out of the total number of times it was called.
Name: Name of the routine.
Parents / Children: Name of the parents and children of the routine.
Index: The index number assigned to the routine in the second section.

Example 3.
Bprof CLI output – call tree profile

    Call graph:

                                          % Calls  Parents
    Index  % Issues   Self  Total  Calls           Name
                                          % Calls    Children
    ---------------------------------------------------
    [2]       100.0    10M   166M      1           main
                      155M   156M           100.0    radix [1]
                         3      3           100.0    atoi [6]
                       n/a    n/a           100.0    prand_int [22]
                       n/a    n/a            25.0    malloc [8]
                       n/a    n/a            25.0    free [14]
    ---------------------------------------------------
                      155M   156M           100.0    main [2]
    [1]        93.8   155M   156M      1           radix
                       n/a    n/a            75.0    malloc [8]
                       n/a    n/a            75.0    free [14]
    ---------------------------------------------------
    (example truncated for length)

The second section of the report provides a profile of the program execution in terms of instructions issued per individual routine. This section is organized in descending order, from greatest number of instructions issued to least. Each line provides the following information.

Table 16. Bprof CLI Line Data

Data Tag      Description
% Issues      Percent of total profiled instructions issued by the individual routine.
Cumul         The total of the instructions issued by this routine and all routines above it in this section, in units of 100M.
Self          Number of instructions issued by this routine, in units of 100M.
Calls         Number of times this routine was called.
Self/Call     Issues that result from one call to this routine (not counting descendants).
Total/Call    Issues that result from one call to this routine (counting descendants).
Name          Name of the routine being profiled in this line, followed by an index number that provides a numbering of the profiled routine from largest number of instructions issued to smallest.

The second section looks like this example.

Example 4.
Bprof CLI output – routine profile

    Flat profile:

    % Issues  Cumul   Self  Calls  Self/Call  Total/Call  Name
        93.6   155M   155M      1       155M        156M  radix [1]
         6.2   166M    10M      1        10M        166M  main [2]
         0.0   166M      3      1          3           3  atoi [6]
         0.0   166M    n/a      4          0           0  malloc [8]
         0.0   166M    n/a      1          0           0  strtol [9]
         0.0   166M    n/a      4          0           0  free [14]
         0.0   166M    n/a      1          0           0  prand_int [22]
    (example truncated for length)

The third section of the Bprof report provides an alphabetic listing of the routines and their associated index number from the second section.

Example 5. Bprof CLI output – routine listing and index

    Function index:

    [6] atoi
    [14] free
    [2] main
    [8] malloc
    [22] prand_int
    [1] radix
    [9] strtol
    (example truncated for length)

For more information about bprof command syntax, see the bprof(1) man page. You can also type bprof -h to generate a usage statement.

4.2 GUI Version of Bprof

To use the GUI version of Bprof, you must do the following.

Procedure 3. Using Bprof

1. Compile and link your program using the compiler -profile option.

       users/smith> cc -profile mysource.c

2. Execute your program using mtarun.

       users/smith> mtarun a.out

   Upon successful completion of program execution, a data file named profile.out is generated.

3. Use the pproc utility with the --mtapf option to generate an .ap2-format data file from the binary executable and the profiling data.

       users/smith> pproc --mtapf=profile.out a.out

4. Open the resulting .ap2-format data file in Cray Apprentice2.

       users/smith> app2 a.out.ap2 &

   The Bprof report window displays.

4.2.1 Bprof Window Layout

The Bprof report window is divided into three main sections.

Figure 17. Block Profiling Report Window

Table 17. Description of Block Profiling Report Window

Callout  Description
1        The Summary line displays a summary of the profiled routines, including profiling and programming overhead.
         A variety of configuration options are provided on a pop-up menu that displays when you right-click on the Bprof tab in the upper-left corner of the window. These are discussed in greater detail in Bprof Configuration and Navigation Options.
2        The Function pane displays the functions that have been profiled, along with all data collected about each function. This section is discussed in more detail in Function List.
3        The Callers and Callees pane displays the names of and data about the functions that call and are called by the selected function. This section is discussed in more detail in Callers and Callees.

4.2.2 Function List

The Detail area makes up the majority of the Bprof display. It presents in tabular format all of the data collected during program execution.

Note: If the executable binary file for a program is not altered between executions, the profile data file is updated rather than removed and rewritten each time the program is run. This allows you to generate profile reports that reflect the typical performance of your program over many runs, rather than the unique and perhaps exceptional performance of a single run.

Each column header is an active button. Click on the column header to sort the report by the data in that column, and click again to toggle between sorting in ascending and descending order.

On the Bprof report window, you can toggle between views of issues and memory reference information. On the Bprof tab, right-click the blue arrow to display the options menu. You can switch the display between the default Issues view and the MemRefs view.

Note: The following table describes each column displayed when you use the Issues option. For the MemRefs option, the report displays the same type of information, but in this context it pertains to memory references rather than issues.

Table 18.
Bprof GUI Report Data

Name               Description
Function           The name of the profiled function.
% Issues           The percent of total profiled instructions issued by the routine and all of its children combined.
Total Issues       The total of the instructions issued by this routine and all routines above it in the calling tree.
Issues             The total number of instruction issues that the profiled function is responsible for.
Calls              The total number of calls to the profiled function.
Issues/Call        The ratio of issues to calls.
Total Issues/Call  The ratio of cumulative issues to calls.

To view detailed caller and callee information for a specific function, click on the function name.

4.2.3 Callers and Callees

If you click on a function name in the Profiling Detail section of the window, more information is displayed in the Callers and Callees section of the report window.

The Caller detail lists the functions that call the profiled function.

Note: The following table describes each column displayed when you use the Issues option. For the MemRefs option, the report displays the same type of information shown in the following table, but in this context it pertains to memory references rather than issues.

Table 19. Bprof GUI Caller Detail

Name      Description
Function  The name of the function that called the profiled function.
% Issues  The percentage of the total number of issues for which this caller's descendants are responsible that originated from the profiled function.
Issues    The total number of instructions issued by this function.
Calls     The number of times that this caller called the profiled routine.

The Callee detail lists the functions that were called by the profiled function.

Note: The following table describes each column displayed when you use the Issues option. For the MemRefs option, the report displays the same type of information shown in the following table, but in this context it pertains to memory references rather than issues.

Table 20.
Bprof GUI Callee Detail

Name      Description
Function  The name of the function called by the profiled function.
% Issues  The percentage of the total number of issues that this callee and its descendants are responsible for.
Issues    The total number of instructions issued by this function.
Calls     The number of times that this callee was called by the profiled routine.

The Callers and Callees sections of the report window are displayed by default, but can be hidden or shown independently of each other. To hide or show either the Callers or Callees section, right-click on the Bprof tab in the upper-left corner of the window, and then select the desired hide or show option from the pop-up menu that displays.

4.2.4 Bprof Configuration and Navigation Options

The Bprof report provides a number of options for configuring the display. All of these options are accessed by right-clicking on the Bprof tab in the upper-left corner of the window.

Table 21. Bprof GUI Configuration and Navigation Options

Option          Description
Issues/Memrefs  Toggles between showing issues versus memory references.
Hide Callers    Shows/hides the Callers section of the report. For more information, see Callers and Callees.
Hide Callees    Shows/hides the Callees section of the report. For more information, see Callers and Callees.
Panel Actions   Performs the standard Cray Apprentice2 actions: detach, remove, or freeze a panel. For more information, see Panel Actions.
Panel Help      Displays panel-specific help, if available.

4.2.4.1 Panel Actions

To manipulate the Bprof report window, select Panel Actions from the pop-up menu.

Table 22. Bprof GUI Panel Actions

Action        Description
Detach Panel  Displays the report in a new window. The original window remains blank.
Remove Panel  Closes the report window.
Freeze Panel  Freezes the report as shown. Subsequent changes to other parameters do not change the appearance of the frozen report.
Trace Profiling (Tprof) [5]

The Tprof report is a simple profile of the functions and parallel regions in the code, based on traces. This sample report shows each function entry/exit pair and each parallel region entry/exit pair. The entry events are marked Exclusive and show the amount of time spent in that function or region, less the time spent in any child functions or regions. The exit events are marked Inclusive and show the time spent in that function or region plus any time spent in any child functions or regions.

Figure 18. Tprof Report

The Tprof report was originally created for debugging operating system traces and is generally not of use to the typical user. Note that the Tprof report is generated only when Apprentice2 is running in system mode (the default).

Glossary

barrier
In code, a barrier is used after a phase. The barrier delays the streams that were executing parallel operations in the phase until all the streams from the phase reach the barrier. Once all the streams reach the barrier, the streams begin work on the next phase.

block scheduling
A method of loop scheduling used by the compiler, where contiguous blocks of loop iterations are divided equally and assigned to available streams. For example, if there are 100 loop iterations and 10 streams, the compiler assigns 10 contiguous iterations to each stream. The advantages to this method are that data in registers can be reused across adjacent iterations, and there is no overhead due to accessing a shared iteration counter.

dynamic scheduling
In a dynamic schedule, the compiler does not bind iterations to streams at loop startup. Instead, streams compete for each iteration using a shared counter.

fork
Occurs when processors allocate additional streams to a thread at the point where it is creating new threads for a parallel loop operation.
inductive loop
An inductive loop is one that contains no loop-carried dependencies and has the following characteristics: it has a single entrance at the top of the loop; it is controlled by an induction variable; and it has a single exit that is controlled by comparing the induction variable against an invariant.

join
The point where threads that have previously forked to perform parallel operations join back together into a single thread.

linear recurrence
A special type of recurrence that can be parallelized. See the Cray XMT Programming Environment User's Guide.

phase
A set of one or more sections of code that the program may execute in parallel. The code in a section may consist of either a parallel loop or a serial block of code. No barriers are inserted between sections of a phase; however, barriers are inserted between different phases of a region.

recurrence
Occurs when a loop uses values computed in one iteration in subsequent iterations. These subsequent uses of the value imply loop-carried dependences and thus usually prevent parallelization. To increase parallelization, use linear recurrence.

reduction
A simple form of recurrence that reduces a large amount of data to a single value. It is commonly used to find the minimum and maximum elements of a vector. Although similar to a recurrence, it is easier to parallelize and uses less memory.

region
An area in code where threads are forked in order to perform a parallel operation. The region ends at the point where the threads join back together at the end of the parallel operation.

Optimizing Loop-Level Parallelism in Cray XMT™ Applications

Abstract

In this paper, we describe how to write efficient, parallel codes for the Cray XMT system, a massively multithreaded, shared memory computer.
To achieve good performance on Cray XMT systems, programs must exploit not only coarse-grained parallelism at the algorithmic level, but also fine-grained parallelism at the loop level. While the Cray XMT compiler is capable of performing the loop-level optimizations required to expose fine-grained parallelism, it can often do a better job when additional information is provided by the programmer via pragmas and language constructs. These hints enable the compiler to perform sophisticated transformations that ultimately result in highly parallel codes. The Canal tool, part of the Cray Apprentice2 performance tool suite, can be used to guide the programmer through the process of tuning a program. When properly optimized, programs written for Cray XMT systems can achieve significant speed up on problems that have never been shown to attain speed up on conventional multiprocessor systems.

© 2009 Cray Inc. All Rights Reserved. This document or parts thereof may not be reproduced in any form unless permitted by contract or by written permission of Cray Inc.

U.S. GOVERNMENT RESTRICTED RIGHTS NOTICE

The Computer Software is delivered as "Commercial Computer Software" as defined in DFARS 48 CFR 252.227-7014. All Computer Software and Computer Software Documentation acquired by or for the U.S. Government is provided with Restricted Rights. Use, duplication or disclosure by the U.S. Government is subject to the restrictions described in FAR 48 CFR 52.227-14 or DFARS 48 CFR 252.227-7014, as applicable. Technical Data acquired by or for the U.S. Government, if any, is provided with Limited Rights. Use, duplication or disclosure by the U.S. Government is subject to the restrictions described in FAR 48 CFR 52.227-14 or DFARS 48 CFR 252.227-7013, as applicable.

BSD Licensing Notice: Copyright (c) 2008, Cray Inc. All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
* Neither the name Cray Inc. nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Your use of this Cray XMT release constitutes your acceptance of the License terms and conditions.
Cray, LibSci, and UNICOS are federally registered trademarks and Active Manager, Cray Apprentice2, Cray Apprentice2 Desktop, Cray C++ Compiling System, Cray CX, Cray CX1, Cray CX1-iWS, Cray CX1-LC, Cray Fortran Compiler, Cray Linux Environment, Cray SeaStar, Cray SeaStar2, Cray SeaStar2+, Cray SHMEM, Cray Threadstorm, Cray X1, Cray X1E, Cray X2, Cray XD1, Cray XMT, Cray XR1, Cray XT, Cray XT3, Cray XT4, Cray XT5, Cray XT5h, Cray XT5m, CrayDoc, CrayPort, CRInform, Cray ECOphlex, Libsci, NodeKARE, RapidArray, UNICOS/lc, UNICOS/mk, and UNICOS/mp are trademarks of Cray Inc.

AMD, Opteron and AMD Opteron are trademarks of Advanced Micro Devices, Inc. Linux is a trademark of Linus Torvalds. UNIX, the “X device,” X Window System, and X/Open are trademarks of The Open Group in the United States and other countries. All other trademarks are the property of their respective owners.

Version 1.0 Published December 2009 Supports Cray XMT systems.

Table of Contents

Introduction
Overview of the Cray XMT System
The Cray XMT Programming Environment and Tools
Language Extensions
Loop-level Parallelism and Canal
Identifying Parallelism
Conditions for Safe Parallelization
Compiler Transformations
Scalar Expansion
Reductions
Linear Recurrences
Nested Parallelism and Loop Collapse
Pragmas
Implementing Parallelism
Parallel Regions
Styles of Parallelism
Loop Scheduling
An Example Loop
Summary
Acknowledgments
About the Authors
Selected Bibliography

S–2487–14 Optimizing Loop-Level Parallelism in Cray XMT™ Applications

Introduction

The Cray XMT system is a massively multithreaded shared memory system purpose-built for parallel applications that require a large shared address space. Examples of such applications include graph analysis, database queries, and other problems where partitioning the data precludes any potential gain from adding more processors. Such applications do not typically run well on conventional distributed memory systems due to the irregular nature of memory access patterns. The Cray XMT architecture enables many threads to be running concurrently (up to 128 per processor) so that the memory accesses from individual threads can be overlapped with those of other threads, effectively hiding some or all of the actual memory latency.

Programs that exploit both coarse- and fine-grained parallelism are able to take advantage of this latency hiding capability and perform well on Cray XMT systems. Such programs draw on parallelism at the algorithmic level as well as at the loop and instruction level. Programmers are typically better at identifying high-level parallelism at the algorithmic level. Such parallelism requires an insider’s understanding of the problem space and objectives. In contrast, compilers are typically better at performing lower level optimizations such as loop parallelization.
Often this requires the compiler to perform other transformations such as loop collapse and scalar expansion. Unfortunately, for languages like C, which were not originally intended to be parallel, compilers have a difficult time drawing the conclusions required to perform such transformations and efficiently parallelize programmer-written loops. At the same time, programmers have a difficult time understanding what exactly the compiler needs to know in order to best optimize a loop. For the most effective optimization, the programmer and the compiler must work together to create an optimized program that runs well on Cray XMT hardware.

Overview of the Cray XMT System

A Cray XMT system is a distributed, global shared memory machine built to leverage Cray’s MPP system design. A Cray XMT system includes a compute partition consisting of Cray Threadstorm processors, and a service partition comprising AMD Opteron™ processors that can be configured for I/O, login, or system functions. The network topology architecturally resembles a torus, similar to the Cray MPP systems. Both partitions are made up of blades that hold four processors each. The processors on a blade are identical and may be either Cray Threadstorm processors or AMD Opteron processors. The compute partition runs the Cray MTK operating system, a single system image operating system based on BSD. The processors on the service partition use the Cray Linux Environment (CLE).

A single Cray Threadstorm processor has 128 hardware contexts (streams) and a 64 KB, four-way associative instruction cache shared by all the streams. Each stream consists of 32 general purpose registers, eight target registers, and a status word that includes the program counter (PC). The processor will issue an instruction on every cycle in round-robin fashion to the streams that are ready to execute an instruction. If no stream is ready to execute an instruction, the processor will stall.
A single instruction word can encode up to three operations: one memory operation, one arithmetic operation, and one control flow operation.

The memory system is a global shared memory address space accessible by all Cray Threadstorm processors in the compute partition. Each processor can have up to 8 GB of memory associated with it, all of which is accessible by every other processor in the system. Each eight-byte word of memory has two additional bits associated with it. One of the two additional bits is called the full-empty bit, used to associate memory location state with the eight-byte word. A memory location state is considered full if the bit is set to one, empty if set to zero. The other bit is used by the runtime libraries to service and implement user traps, some of which are associated with the full-empty bit. For more details on the Cray XMT architecture, see Proceedings of the 2nd Conference on Computing Frontiers (May 2005): 28–34.

The Cray XMT Programming Environment and Tools

The Cray XMT Programming Environment includes a C and C++ compiler, a standard runtime library including support for multithreaded execution, a multithreaded debugger (mdb), the Cray Apprentice2 tool kit, and several auxiliary libraries such as a parallel random number generator libprand, RPC library libluc, and memory snapshot/restore facilities libsnapshot. For the purposes of this paper, we will focus on the compiler and the Canal report of the Cray Apprentice2 suite.

The Cray XMT compiler supports C and C++ codes with extensions for parallelization and multithreaded execution. These XMT extensions include language extensions and compiler directives. In addition, the compiler can detect loop-level parallelism for multithreaded execution.
Language Extensions

The Cray XMT compiler recognizes two type qualifiers, sync and future, that are used to indicate that the full-empty bits should be used whenever the variable is accessed. For sync qualified variables, a use, or load, of the variable can only proceed if the variable’s state is full; upon completion of the use, its state is set to empty. An assignment, or store, to a sync variable can only proceed if the variable’s state is empty; upon completion of the assignment, its state is set to full. Loads and stores that are issued when the variable is not in the required state will be blocked until the variable’s state changes. State changes occur atomically with the operation. Future variables are similar to sync variables, except that a load will only proceed if the state is full (and leaves the state full) and a store will also only proceed if the state is full (and leaves the state full).

Future variables are typically used with future statements. Future statements define a block of work, referred to as a future, that is to be executed by some thread. A future variable can be associated with a future statement to guarantee that the future has been executed. Upon completion of the future, the future variable associated with the future statement is set to full. Any uses of the future variable will be blocked until the future has been executed. Future statements describe what we refer to as explicit parallelism, because the parallelism is made explicit by the programmer.

Loop-level Parallelism and Canal

In addition to future statements, the Cray XMT compiler supports implicit parallelism in the form of compiler-generated loop-level parallelism, the focus of this paper. If the compiler can determine that a loop can be safely executed in parallel, it will generate loop parallelism by distributing the iterations of a loop across multiple software threads.
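As a rough picture of what distributing iterations means, the sketch below divides a loop's iterations into contiguous, near-equal blocks, in the manner of block scheduling (for example, 100 iterations on 10 streams gives 10 contiguous iterations each). The type iter_range, the helper block_range, and the handling of a remainder are illustrative assumptions; the compiler's actual partitioning may differ.

```c
#include <assert.h>

/* Half-open range [begin, end) of loop iterations (hypothetical type). */
typedef struct {
    int begin;
    int end;
} iter_range;

/* Contiguous block of iterations for worker k out of `workers`.
 * Assumption: any remainder is spread over the lowest-numbered workers,
 * one extra iteration each. */
iter_range block_range(int n_iters, int workers, int k)
{
    assert(n_iters >= 0 && workers > 0 && k >= 0 && k < workers);
    int base  = n_iters / workers; /* iterations every worker receives */
    int extra = n_iters % workers; /* leftover iterations to hand out */
    iter_range r;
    r.begin = k * base + (k < extra ? k : extra);
    r.end   = r.begin + base + (k < extra ? 1 : 0);
    return r;
}
```

Each worker would then run the original loop body for i from begin up to, but not including, end.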
The runtime library assigns these threads to hardware streams, which then execute them in parallel. The compiler also inserts calls to create (fork) and terminate (join) threads.

The results of loop parallelization and other transformations can be seen in Canal. The Canal report has two parts: an annotated view of the compiled source code, and a report containing additional information about loops in the code. The annotated source view contains annotations for each loop in a user's code, as well as a selection of messages about the transformations that were applied to the code. For example, the following snippet tells us that two nested loops were parallelized, and that reduction and a manhattan loop collapse, designated by the m, were used.

    | for (int i=0; i [...]

Limiting Loop Parallelism in Cray XMT™ Application
S–0027–14 Cray Inc.

[...] where [...] is the number of streams the compiler requests.

• Limits the number of processors used by a multiprocessor parallel loop to max(1, c divided by the number of streams the compiler requests for each processor used by the parallel loop).
• If c is larger than or equal to that per-processor stream count, the total number of streams used by the parallel loop will be at most c.
• If c is less than that per-processor stream count, one processor will be used and [...] streams will be requested by the compiler.
• Limits the number of futures created for a loop that uses loop future parallelism to c.
• If multiple max concurrency c pragmas are specified on one loop, the value of c specified by the last pragma will be used.
• For collapsible loop nests, the max concurrency value specified by the outer loop (if any) will be used for the collapsed loop.
• The max concurrency c pragma is not allowed to be used on a loop that also uses the use n streams pragma.

Examples

The following example illustrates using the max concurrency c pragma on a single processor parallel loop.

    /* Use at most 95 streams. */
    #pragma mta loop single processor
    #pragma mta max concurrency 95
    for(i = 0; i < size; i++) {
        array[i] += array[i] + (size + i);
    }

The following example illustrates using the max concurrency c pragma on a multiprocessor parallel loop.

    /* Use at most 512 streams across all processors. */
    #pragma mta max concurrency 512
    for(i = 0; i < size; i++) {
        array[i] += array[i] + (size + i);
    }

The following example illustrates using the max concurrency c pragma on a loop that uses loop future parallelism.

    /* Create at most 512 futures. */
    #pragma mta loop future
    #pragma mta max concurrency 512
    for(i = 0; i < size; i++) {
        array[i] += array[i] + (size + i);
    }

Multiprocessor parallel loops are allowed to use both the max n processors and max concurrency c pragmas, and can use both on a single loop. In cases where both pragmas are used, the smaller of the two processor counts implied by the limits is the limit applied to the loop. For example, the following code illustrates the use of both pragmas on one multiprocessor parallel loop.

    /* Use at most 512 streams across all processors or
     * at most 8 processors, whichever is smaller. */
    #pragma mta max concurrency 512
    #pragma mta max 8 processors
    for(i = 0; i < size; i++) {
        array[i] += array[i] + (size + i);
    }

In the above example, if the compiler were to request 64 streams per processor, then the max concurrency 512 would estimate that 8 processors should be used for the loop (i.e., 512/64). The max 8 processors has the same limit on the number of processors, so the loop would be limited to 8 processors. If the compiler instead requested 32 streams per processor, then the max concurrency 512 would estimate that 16 processors should be used, which is more than the limit of 8 specified by the max 8 processors, so the loop would be limited to 8 processors.
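The arithmetic in this example can be sketched in plain C as follows. This is only an illustration of the stated rules, not compiler code: the function names and the parameter streams_per_proc (standing for the number of streams the compiler requests per processor) are hypothetical, and the use of integer division is an assumption.

```c
#include <assert.h>

/* Processor limit implied by `max concurrency c`: at least 1, otherwise
 * c divided by the per-processor stream count.
 * Assumption: the division truncates, as C integer division does. */
int procs_from_concurrency(int c, int streams_per_proc)
{
    assert(c > 0 && streams_per_proc > 0);
    int p = c / streams_per_proc;
    return p < 1 ? 1 : p;
}

/* When `max n processors` is also given, the smaller limit applies. */
int effective_proc_limit(int c, int streams_per_proc, int max_n_procs)
{
    assert(max_n_procs > 0);
    int by_concurrency = procs_from_concurrency(c, streams_per_proc);
    return by_concurrency < max_n_procs ? by_concurrency : max_n_procs;
}
```

With c = 512 and a limit of 8 processors, both cases from the text (64 or 32 streams per processor) come out at 8 processors; without the processor pragma, the 32-stream case would allow 16.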
Because the use n streams pragma cannot be used on the same loop as a max concurrency c pragma, the loop will use the default number of streams determined by the compiler. The user will need to look at the Canal details for a loop to determine the default number of streams being requested by the compiler.

Effect of Pragmas on Loop Fusion and Parallel Region Merging

The new pragmas can prevent the compiler from fusing loops if the loops involved do not have the same limits for the max processors and max concurrency. This is because the compiler will need to put the loops into different parallel regions in order to limit the processors and/or concurrency as requested by the user. This could potentially have a negative impact on the performance of a user's application, so users may need to look at the Canal output to see which loops the compiler fused.

The pragmas could also prevent the compiler from merging the parallel regions for different loops into a single parallel region. The limitation for concurrency or processors specified by the new pragmas applies to the current parallel region that contains the loop with the pragmas. The compiler must ensure that all loops in a parallel region have the same limits for max processors and max concurrency. If the loops do not have matching limits, the compiler will put them in different parallel regions to ensure the user's limits on processors and/or concurrency can be correctly applied. This could potentially have a negative impact on the performance of a user's application because more time will be spent tearing down and starting new parallel regions.

In the case of nested parallel regions, any limitations for concurrency or processors specified with the pragmas on either region do not affect the other region.
For example, if the outer parallel region has a max 8 processors, that pragma will not affect the inner parallel region because the pragmas apply to the current parallel region only.

The user can determine what loops the compiler placed in a parallel region by looking at the canal output. The “Additional Loop Details” shows which parallel region a loop is in, and the details for parallel regions state what limits for processors or concurrency (if any) are being applied to the region.

The following is an example of two loops that have matching limits for max n processors that could be fused and placed into one parallel region by the compiler.

    #pragma mta max 64 processors
    for(i = 0; i < size; i++)
        array[i] = i;

    #pragma mta max 64 processors
    for(i = 0; i < size; i++) {
        array[i] += array[i] + (size + i);
    }

The following is an example of two loops that cannot be fused or put into one parallel region because the loops specify different limits for the max processors.

    #pragma mta max 256 processors
    for(i = 0; i < size; i++)
        array[i] = i;

    #pragma mta max 512 processors
    for(i = 0; i < size; i++) {
        array[i] += array[i] + (size + i);
    }

The following is another example of two loops that cannot be fused or put into one parallel region because the loops specify different limits for the max processors. The first loop does not use the max n processors pragma, which implies there is no user specified limit.

    for(i = 0; i < size; i++)
        array[i] = i;

    #pragma mta max 512 processors
    for(i = 0; i < size; i++) {
        array[i] += array[i] + (size + i);
    }

Use Case: Applying Max Processors Pragma to GraphCT

An example application that uses nested parallelism to improve system utilization and reduce contention on shared data structures is GraphCT (Graph Characterization Toolkit) [1]. GraphCT consists of multiple kernels that perform operations on a graph and the kernel focused on in this example is betweenness centrality.
The betweenness centrality kernel of GraphCT is executed concurrently by a small number of threads using loop future parallelism, and each thread uses multiprocessor parallelism to compute the betweenness centrality of a node. The betweenness centrality kernel of GraphCT can see significant variance in performance due to issues with load balancing across the threads. The max n processors pragma can be used to help improve load balancing and increase utilization by evenly distributing the processors across the threads.

The betweenness centrality kernel of GraphCT consists of two functions, kcentrality and kcent_core. The kcentrality function creates a small number of threads using loop future parallelism, and each of those threads calls kcent_core to compute the betweenness centrality for the nodes in the graph. Both of these functions were updated to make use of the new max n processors pragma.

The changes to kcent_core are limited to applying the max n processors pragma to each parallel loop in the function. The limit for the number of processors to use per thread was determined experimentally based on the default number of threads created in kcentrality in the release version 0.4 of GraphCT, which is 20. This would give each thread approximately 6 processors on a 128P XMT system if each thread got the same number of processors. This led to trying a limit of 8 processors per thread in kcent_core. Experiments showed that using 8 processors per thread performed better than the release version of GraphCT with 20 threads and no max n processors pragmas. A power of two was chosen so the number of processors in the system could be easily divided by the number of processors used per thread. A limit of 16 processors per thread was also tested and was shown to have reasonable performance that could be very similar to the performance with a limit of 8, especially for larger graphs (scale >= 28).
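The arithmetic behind these limits can be sketched in C. This is only an illustration: the function names (even_share, round_up_pow2, thread_count) are hypothetical, and rounding the even share up to a power of two is an assumption that happens to reproduce the limit of 8 that the text reports was chosen experimentally.

```c
#include <assert.h>

/* Processors each thread would get if they were shared evenly
   (integer division, so 128 processors over 20 threads gives 6). */
int even_share(int total_procs, int threads)
{
    assert(total_procs > 0 && threads > 0);
    return total_procs / threads;
}

/* Smallest power of two that is >= n. Assumption for illustration only:
   the source picked its limit by experiment, not by this rule. */
int round_up_pow2(int n)
{
    assert(n >= 1);
    int p = 1;
    while (p < n)
        p *= 2;
    return p;
}

/* Number of threads that fit when each thread is limited to
   procs_per_thread processors; never fewer than one thread. */
int thread_count(int total_procs, int procs_per_thread)
{
    assert(total_procs > 0 && procs_per_thread > 0);
    int n = total_procs / procs_per_thread;
    return n < 1 ? 1 : n;
}
```

For a 128-processor system and the default of 20 threads, even_share gives 6, round_up_pow2 turns that into 8, and thread_count(128, 8) gives 16 threads. With a limit of 16 processors per thread the same system would run 8 threads.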
The following code snippets show how the max n processors pragma was used for each loop in kcent_core. In these examples, MAX_PROCS is a preprocessor macro that has been defined as 8.

    <...>
    #pragma mta max MAX_PROCS processors
    #pragma mta assert nodep
    for (j = 0; j < NV; j++) {marks[j] = sigma[NV*(K+1) + j] = 0;}
    <...>
    #pragma mta max MAX_PROCS processors
    #pragma mta assert nodep
    for (j = 0; j < (K+1)*NV; j++) {
        dist[j] = -1;
        sigma[j] = child_count[j] = 0;
    }
    <...>
    #pragma mta max MAX_PROCS processors
    #pragma mta assert no dependence
    #pragma mta block dynamic schedule
    #pragma mta use 100 streams
    for (j = Qstart; j < Qend; j++) {
    <...>
    #pragma mta max MAX_PROCS processors
    #pragma mta assert nodep
    #pragma mta assert no alias *sigma *Q *child *start *QHead
    #pragma mta use 100 streams
    for (n = QHead[p]; n < QHead[p+1]; n++) {
    <...>
    #pragma mta max MAX_PROCS processors
    for (j=0; j<(K+1)*NV; j++) delta[j] = 0.0;
    <...>
    #pragma mta max MAX_PROCS processors
    #pragma mta assert nodep
    #pragma mta block dynamic schedule
    #pragma mta assert no alias *sigma *Q *BC *delta *child *start *QHead
    #pragma mta use 100 streams
    for (n = Qstart; n < Qend; n++) {
    <...>

The pragma was used on all parallel loops in the function to ensure that each thread that calls kcent_core is limited to the desired number of processors, which is 8 in this case. Also, because all of the parallel loops in kcent_core have the same limit for the max processors, the compiler will not need to put the loops into different parallel regions because of a mismatch in limits. Grouping the loops into one region can help reduce the cost of going parallel and improve performance by avoiding starting and tearing down multiple parallel regions.

The kcentrality function was modified to compute the number of threads at runtime based on the number of processors used by the application and the number of processors used per thread in kcent_core.
The number of threads, INC, is a preprocessor macro in version 0.4 of GraphCT. However, the modifications to kcentrality changed INC to a variable that is computed at runtime. The following code snippet shows the changes made to kcentrality. Again, MAX_PROCS used in the example below has been defined as 8.

    <...>
    /* Compute INC based on the number of processors we're using and
       limiting each thread to MAX_PROCS processors (in kcent_core()). */
    int INC;
    INC = mta_get_max_teams();
    INC = INC / MAX_PROCS;
    INC = MTA_INT_MAX(1, INC);
    <...>
    #pragma mta loop future
    for(x=0; x
        for (int claimedk = int_fetch_add (&k, 1); claimedk < Vs; claimedk = int_fetch_add (&k, 1)) {
            <...>
            kcent_core(G, BC, K, s, Q, dist, sigma, marks, QHead, child, child_count);
            <...>
        }
    }
    <...>

These changes to GraphCT helped the betweenness centrality kernel have better load balancing across the threads and achieve higher system utilization, which improved the performance and scalability of the kernel.

References

[1] “GraphCT – Streaming Graph Analysis”, http://trac.research.cc.gatech.edu/graphs/wiki/GraphCT, May 4, 2010.

Cray DVS Installation and Configuration
Private
S–0005–10

© 2008 Cray Inc. All Rights Reserved. This manual or parts thereof may not be reproduced in any form unless permitted by contract or by written permission of Cray Inc.

U.S. GOVERNMENT RESTRICTED RIGHTS NOTICE

The Computer Software is delivered as "Commercial Computer Software" as defined in DFARS 48 CFR 252.227-7014. All Computer Software and Computer Software Documentation acquired by or for the U.S. Government is provided with Restricted Rights. Use, duplication or disclosure by the U.S. Government is subject to the restrictions described in FAR 48 CFR 52.227-14 or DFARS 48 CFR 252.227-7014, as applicable. Technical Data acquired by or for the U.S. Government, if any, is provided with Limited Rights. Use, duplication or disclosure by the U.S.
Government is subject to the restrictions described in FAR 48 CFR 52.227-14 or DFARS 48 CFR 252.227-7013, as applicable.

Cray, LibSci, and UNICOS are federally registered trademarks and Active Manager, Cray Apprentice2, Cray Apprentice2 Desktop, Cray C++ Compiling System, Cray Fortran Compiler, Cray Linux Environment, Cray SeaStar, Cray SeaStar2, Cray SeaStar2+, Cray SHMEM, Cray Threadstorm, Cray X1, Cray X1E, Cray X2, Cray XD1, Cray XMT, Cray XR1, Cray XT, Cray XT3, Cray XT4, Cray XT5, Cray XT5h, CrayDoc, CrayPort, CRInform, Libsci, RapidArray, UNICOS/lc, UNICOS/mk, and UNICOS/mp are trademarks of Cray Inc.

Linux is a trademark of Linus Torvalds. NFS is a trademark of Sun Microsystems, Inc. in the United States and other countries. UNIX, the “X device,” X Window System, and X/Open are trademarks of The Open Group in the United States and other countries. All other trademarks are the property of their respective owners.

The UNICOS, UNICOS/mk, and UNICOS/mp operating systems are derived from UNIX System V. These operating systems are also based in part on the Fourth Berkeley Software Distribution (BSD) under license from The Regents of the University of California.

Abstract

Cray DVS Installation and Configuration S–0005–10

This paper provides instructions for installing and configuring the Cray Data Virtualization Service (Cray DVS) on Cray XT systems running UNICOS/lc 2.0. The paper does not describe the design or internal workings of Cray DVS.

Record of Revision

Version   Description
1.0       January 2008
          Supports limited availability versions of Cray DVS for the UNICOS/lc 2.0 release running on Cray XT systems.

Contents

Introduction [1]
Prerequisites [2]
Cray DVS Installation [3]
    Installing the Cray DVS RPMs
    Creating the node-map Files
Cray DVS Configuration [4]
    Creating fstab Entries and Mount Points
    Creating the Boot Image
    Configuring Boot Automation
dvs(5) Man Page [5]
    NAME
    SYNOPSIS
    IMPLEMENTATION
    DESCRIPTION
    OPTIONS
    EXAMPLES
    FILES
    SEE ALSO

Introduction [1]

The Cray Data Virtualization Service (Cray DVS) is a distributed network service that provides transparent access to NFS file systems residing on the service I/O (SIO) nodes. Cray DVS provides a service analogous to NFS. The key difference is that Cray DVS provides I/O performance and scalability to large numbers of nodes, far beyond the typical number of clients supported by a single NFS server.

The limited availability release of Cray DVS running on the UNICOS/lc 2.0 release provides support for access to NFS file systems. This allows applications running on the compute nodes to read and write data files in the user's home directory. Figure 1 presents a typical Cray DVS use case.

[Figure 1. Cray DVS Typical Use Case. Labels: Compute Nodes, User Application, DVS; /home (NFS): input files, small data files, applications; Lustre: input files, large data files, applications.]

For users who are migrating from Catamount to CNL, Cray DVS provides functionality similar to yod NFS access on Catamount compute nodes. Normal system calls such as open(), read() and write() work without modification. Impact on compute node memory resources, as well as operating system jitter, is minimized in the Cray DVS configuration.
DVS-specific options to the mount command enable client access to a network file system being projected by DVS server nodes. See the mount(8) and dvs(5) man pages for more information.

Figure 2 illustrates the system administrator's view of Cray DVS. Administration of Cray DVS is very similar to configuring and mounting any Linux file system.

[Figure 2. System Administrator's View of Cray DVS. Labels: Compute Nodes (/home, DVS Client), HSN, SIO Node (/home, DVS Server, NFS client), NFS Server, CRAY XT.]

Prerequisites [2]

Before you begin installing and configuring Cray DVS:

• Obtain the Cray DVS RPMs from your Cray representative.

• Your Cray XT system must be running the UNICOS/lc 2.0 release. Verify that the Cray DVS RPMs being installed were generated for the UNICOS/lc 2.0 update level currently running on your system.

  Warning: Cray DVS RPMs must be generated to specifically match the UNICOS/lc release level. Customers running the limited availability release of Cray DVS for UNICOS/lc 2.0 will need to request and install updated Cray DVS RPMs each time a new UNICOS/lc update package is installed. Contact your Cray representative for more information.

• Determine which network file systems will be supported using Cray DVS.

• Determine which SIO node will be configured as the DVS server for each network file system. Verify connections to the network file systems on the DVS servers.

Cray DVS Installation [3]

Follow these procedures to install Cray DVS and create a node map file for CNL compute nodes and service nodes.
Cray DVS uses the node map file to determine which nodes are participating in DVS communication and where these nodes are located in the mesh.

3.1 Installing the Cray DVS RPMs

These steps assume that you have obtained two RPM files called dvs-ss*.rpm and dvs-cnl*.rpm and copied them to the System Management Workstation (SMW) in a directory called /tmp/dvs.

Install the dvs-ss RPM on the shared root using the following commands:

    smw:~> scp -p /tmp/dvs/dvs-ss*.rpm root@boot:/rr/current/software/
    smw:~> ssh root@boot
    boot001:~# xtopview
    default/:/ # rpm -ivh /software/dvs-ss*.rpm
    default/:/ # exit
    boot001:~# exit
    smw:~>

Install the dvs-cnl RPM to your CNL image using the following commands, where xthostname-XT_version is the name of your CNL image:

    smw:~# cd /opt/xt-images
    smw:/opt/xt-images # xtclone xthostname-XT_version xthostname-XT_version-dvs
    smw:/opt/xt-images # rpm -ivh --nodeps --root /opt/xt-images/xthostname-XT_vers /tmp/dvs/dvs-cnl*.rpm

3.2 Creating the node-map Files

Once the Cray DVS RPMs have been installed, create the node-map file using the make-nodemap.sh script. The make-nodemap.sh script creates a node mapping for each node in the Cray XT system, starting at node 0 and moving upward. Two node-map files are created: node-map.ss for the DVS SeaStar IPC interface, and node-map.socket for non-XT systems or systems configured to run DVS over TCP/IP.

Run the make-nodemap.sh script on the SMW to create the node-map files for the CNL image.
    smw:~# scp -p root@boot:/rr/current/opt/dvs/XT_version/usr/sbin/make-nodemap.sh \
        /opt/xt-images/xthostname-XT_version-dvs/etc/dvs
    smw:~# cd /opt/xt-images/xthostname-XT_version-dvs/etc/dvs
    smw:/opt/xt-images/xthostname-XT_version-dvs/etc/dvs # make-nodemap.sh
    smw:/opt/xt-images/xthostname-XT_version-dvs/etc/dvs # ls -l node-map*
    lrwxrwxrwx 1 root root   13 Sep 8 05:49 node-map -> ./node-map.ss
    -rw-r--r-- 1 root root 9012 Sep 8 05:49 node-map.socket
    -rw-r--r-- 1 root root 5435 Sep 8 05:49 node-map.ss
    smw:/opt/xt-images/xthostname-XT_version-dvs/etc/dvs #

Install the node-map.ss file on the shared root file system.

    smw:/opt/xt-images/xthostname-XT_version-dvs/etc/dvs # scp -p node-map.ss \
        root@boot:/rr/current/software/
    smw:/opt/xt-images/xthostname-XT_version-dvs/etc/dvs # ssh root@boot
    boot001:~# xtopview
    default/:/ # mkdir /etc/dvs
    default/:/ # cp /software/node-map.ss /etc/dvs/
    default/:/ # ln -s /etc/dvs/node-map.ss /etc/dvs/node-map
    default/:/ # exit
    boot001:~# exit

Cray DVS Configuration [4]

Follow these steps to configure Cray DVS on your system.

1. For each NFS file system being projected, verify that the DVS server node is running the NFS file system client.

2. Verify that the same directory path exists on the DVS server node that will serve the DVS file system, and that the same directory path exists on the DVS client nodes.

3. Ensure that all DVS server and client nodes have access to an identical (or shared) copy of the /etc/dvs/node-map file. This file should include a line for each DVS server and client node.

4. Configure the system to mount the DVS file system on the DVS client nodes by completing the steps in the section below, entitled Creating fstab Entries and Mount Points. See the dvs(5) man page for details and examples.

5. Configure a CNL boot image for DVS following the steps in the section below entitled Creating the Boot Image.

6. Start the DVS service on all DVS server and client nodes by rebooting the system.
   The section entitled Configuring Boot Automation describes how to start DVS services automatically.

4.1 Creating fstab Entries and Mount Points

After Cray DVS software has been successfully installed on both the service and compute nodes, you can mount a network file system on the compute nodes that require access. When a client mounts the file system, all of the information needed is specified on the mount command. Follow the steps in this section to configure your Cray XT system to mount a network file system using Cray DVS. See the dvs(5) man page for more information regarding Cray DVS mount options.

To allow the compute nodes to mount their DVS partitions, you'll need to add appropriate fstab entries. Add a line similar to this example to the /opt/xt-images/xthostname-XT_version-dvs/etc/fstab file on the SMW. This example uses DVS to mount /ufs/home from node c0-0c0s1n0 to /ufs/home on the client node.

    smw:~# vi /opt/xt-images/xthostname-XT_version-dvs/etc/fstab
    /ufs/home /ufs/home dvs path=/ufs/home,nodename=c0-0c0s1n0

Create mount point directories in the compute image for each DVS mount in the /etc/fstab file. For the fstab example shown above, enter the following command:

    smw:~ # mkdir -p /opt/xt-images/xthostname-XT_version-dvs/ufs/home

Optionally, create any symbolic links that will be used in the compute node images. For example:

    smw:~ # cd /opt/xt-images/xthostname-XT_version-dvs
    smw:~ # ln -s /ufs/home home

4.2 Creating the Boot Image

Create the CNL boot image, where parameters is the path to the parameters list and BOOTIMAGE is either a raw device or the boot image file.

    smw:~ # xtpackage /opt/xt-images/xthostname-XT_version-dvs
    smw:~ # xtbootimg -L /opt/xt-images/xthostname-XT_version-dvs/CNL0.load \
        -P parameters -c BOOTIMAGE

If BOOTIMAGE is a boot image file and not a raw device, update the boot image configuration.
    smw:~# xtcli boot_cfg update -i BOOTIMAGE

The new boot image you have created takes effect when the CNL compute nodes are rebooted.

4.3 Configuring Boot Automation

The xtbootsys -a command (see the xtbootsys(8) man page) enables you to specify a file to control automated system boot. You can configure this script to start DVS on the SIO nodes responsible for serving DVS file systems. For example, if login nodes are being used to serve the /ufs/home file system, edit the automation file as follows, where auto.xthostname is the name of the site-specific automation file:

    smw:~> cd /opt/cray/etc
    smw:~> vi auto.xthostname
    lappend actions { crms_exec_via_bootnode "login" "root" "/etc/init.d/dvs start" }

Note: DVS should be started on service nodes before booting compute nodes.

After you have configured boot automation, start Cray DVS services by rebooting the Cray XT system.

dvs(5) Man Page [5]

5.1 NAME

dvs - Cray DVS fstab format and options

5.2 SYNOPSIS

/etc/fstab

5.3 IMPLEMENTATION

UNICOS/lc operating system: supported for Cray XT CNL compute nodes

5.4 DESCRIPTION

The fstab file contains information about which file systems to mount where and with what options. For Cray DVS mounts, the fstab line contains the server's exported mountpoint path in the first field, the local mountpoint path in the second field, and the file system type dvs in the third field. The fourth field contains comma-separated DVS-specific mount options described below.

5.5 OPTIONS

path=/pathname
    Set pathname to the mountpoint on the DVS server node. The pathname should be an absolute path, and must exist on the DVS server node. This is a required argument on the options field.
nodename=node
    Specify the DVS server node name that will provide service to the file system specified by the path argument. The path name must exist on the server node specified. Specify the physical ID for the node, for example c0-0c0s0n0, which maps to an entry in the node-map file where it is translated to a node ordinal. This is a required argument on the options field.

blksize=n
    Sets the DVS block size to n bytes.

cache
    Enables client-side read caching. The client node will perform caching of reads from the DVS server node and provide data to user applications from the page cache if possible, instead of performing a data transfer from the DVS server node.

    Note: Cray DVS is not a clustered file system; no coherency is maintained between multiple DVS client nodes reading and writing to the same file. If cache is enabled and data consistency is required, applications must take care to synchronize their accesses to the shared file.

nocache
    Disables client-side read caching. This is the default behavior.

datasync
    Enables data synchronization. The DVS server node will wait until data has been written to the underlying media before indicating that the write has completed.

nodatasync
    Disables data synchronization. The DVS server node will return from a write request as soon as the user's data has been written into the page cache on the server node. This is the default behavior.

retry
    Enables the retry option, which affects how a DVS client node behaves in the event of a DVS server node going down. If retry is specified, any user I/O request is retried until it succeeds, receives an error other than a node down indication, or receives a signal to interrupt the I/O operation. This is the default behavior.

noretry
    Disables the retry option. An I/O that failed due to a DVS server node failure will return an EHOSTDOWN error to the user application without attempting the operation again.
clusterfs
    Set the clusterfs option when the DVS servers are providing access to an underlying file system that is shared or clustered. File I/O to DVS clusterfs file systems will go to a single shared file. This is currently the only supported mode for Cray DVS; the clusterfs option is set by default.

maxnodes=n
    Deferred implementation: this option is not currently supported. If the clusterfs option was specified, limit the I/O to a subset of n DVS server nodes out of the list of nodes provided. This allows the administrator to mount a DVS file system that is accessible to a large number of nodes, but have I/O only go to a smaller number of nodes out of the possible set. If one of the in-use set of nodes fails, DVS on the client node may choose a replacement node from the larger set.

5.6 EXAMPLES

Here is an example /etc/fstab file entry for a DVS client to mount /dvs-shared on the DVS server node as /dvs.

    /dvs /dvs dvs noauto,path=/dvs-shared,nodename=c0-0c0s1n0

5.7 FILES

/etc/fstab
    Static information about file systems

/etc/dvs/node-map
    Mapping of node IDs to node ordinals for DVS

5.8 SEE ALSO

fstab(5), mount(8), umount(8)

Application Cleanup by ALPS and Node Health Monitoring

Abstract

This paper describes the boundaries of responsibility for the separate ALPS and node health monitoring products. These two products cooperate in performing application cleanup following an application unorderly exit. This paper also provides some technical details and troubleshooting hints related to this activity for Cray Linux Environment (CLE) 2.1 and 2.2 Cray XT systems.

© 2009 Cray Inc. All Rights Reserved. This document or parts thereof may not be reproduced in any form unless permitted by contract or by written permission of Cray Inc.

U.S. GOVERNMENT RESTRICTED RIGHTS NOTICE

The Computer Software is delivered as "Commercial Computer Software" as defined in DFARS 48 CFR 252.227-7014.
All Computer Software and Computer Software Documentation acquired by or for the U.S. Government is provided with Restricted Rights. Use, duplication or disclosure by the U.S. Government is subject to the restrictions described in FAR 48 CFR 52.227-14 or DFARS 48 CFR 252.227-7014, as applicable. Technical Data acquired by or for the U.S. Government, if any, is provided with Limited Rights. Use, duplication or disclosure by the U.S. Government is subject to the restrictions described in FAR 48 CFR 52.227-14 or DFARS 48 CFR 252.227-7013, as applicable.

Cray, LibSci, and UNICOS are federally registered trademarks and Active Manager, Cray Apprentice2, Cray Apprentice2 Desktop, Cray C++ Compiling System, Cray CX1, Cray Fortran Compiler, Cray Linux Environment, Cray SeaStar, Cray SeaStar2, Cray SeaStar2+, Cray SHMEM, Cray Threadstorm, Cray X1, Cray X1E, Cray X2, Cray XD1, Cray XMT, Cray XR1, Cray XT, Cray XT3, Cray XT4, Cray XT5, Cray XT5h, Cray XT5m, CrayDoc, CrayPort, CRInform, ECOphlex, Libsci, NodeKARE, RapidArray, UNICOS/lc, UNICOS/mk, and UNICOS/mp are trademarks of Cray Inc.

Linux is a trademark of Linus Torvalds. UNIX, the “X device,” X Window System, and X/Open are trademarks of The Open Group in the United States and other countries. All other trademarks are the property of their respective owners.

Version 2.2 Published July 2009
Supports general availability (GA) release of the Cray Linux Environment (CLE) 2.2 operating system running on Cray XT systems.

Table of Contents

Overview
aprun
apinit
apsys
apmgrcleanup
Node Health Monitoring
Cray Linux Environment 2.1 Node Health Checker Monitoring
CLE 2.2 Node Health Checker Monitoring
Conclusion

S–0014–22

Overview

During normal Cray XT operations, applications are run on a set of nodes and complete successfully, and then those node resources are reallocated for other applications.

When an application exit is considered orderly, a set of up to four unique application process exit codes and exit signals is gathered and consolidated by ALPS on each compute node within the application placement list. Once all of the application processes on a compute node have exited, that compute node adds its local exit information to this consolidated list of exit data. The exit information is sent to aprun over the ALPS application-specific TCP fan-out tree control network. All of the application processes must have completely exited before this exit information is received by aprun. aprun forwards the compiled exit information to apsys just before aprun itself exits.

Once all exit information has been received from the compute nodes, the application exit is considered orderly. An orderly exit does not necessarily mean that the application completed successfully. An orderly exit means that exit information about the application was received by aprun and forwarded to apsys. apsys sends an exit message to apsched, which releases the reserved resources for another application.

An unorderly exit means that exit information has not been received by apsys prior to an aprun exit. A typical occurrence of an unorderly exit consists of a SIGKILL signal being sent to aprun by the batch system after the application's wall time limit is exceeded. Since there is no exit information available to apsys during an unorderly exit, apsys does not know the true state of the application processes on the compute nodes.
Therefore, ALPS must perform application cleanup on each of the assigned compute nodes before it is safe to free those application resources for another application.

Application cleanup begins with ALPS contacting each assigned compute node and sending a SIGKILL signal to any remaining application processes. Node health monitoring checks compute node conditions and will mark a compute node admindown for reasons described in later sections. ALPS cannot free the application resources for reallocation until all of the application processes have exited or node health monitoring has marked applicable compute nodes admindown or suspect. Until that time, the application will continue to be shown in apstat displays.

aprun

The aprun command is the ALPS application launch command on login nodes and the SDB node. aprun has a persistent TCP connection to a local apsys. aprun also has a persistent TCP connection to an apinit daemon child on the first compute node within the assigned placement list, but not to an apinit on each assigned compute node.

After receiving a placement list from apsched, aprun writes information into the syslog as in the example below.

    Apr 13 06:40:47 nid00016 aprun[12911]: apid=821502, Starting, user=1356, cmd_line="aprun -a XT -n 40 /ostest/rel.22/xtcnl/apps/ROOT.latest/shmem_ISU/src/sma1/RUN/shmem_lock_test_clear.c.x ", num_nodes=5, node_list=583,587,591,772,776

In a typical case of an orderly exit, aprun receives application exit information over the connection from that apinit. aprun then forwards the exit information over the connection to apsys. The ordering of application exit signals and exit codes is arbitrary. aprun displays any nonzero application exit information and uses the application exit information to determine its own exit code:

    Application 284004 exit signals: Terminated

In the case of an unorderly exit, aprun exits without receiving application exit information.
When aprun exits, its TCP connections are closed. The socket closes trigger application cleanup activity by both apinit and apsys as described in following sections.

An unorderly exit may occur for various reasons. The usual causes of an unorderly exit include the following cases:

• The batch system sends a SIGKILL signal to aprun due to the application wall time expiring

• apkill or kill are used to send a SIGKILL signal to aprun

• aprun receives a fatal message from apinit due to some fatal error during launch or at other points during the application lifetime, causing aprun to write the message to stderr and exit

• aprun receives a fatal read, write or unexpected close error on the TCP socket it uses to communicate with apinit

apinit

apinit is the ALPS privileged daemon that launches and manages applications on compute nodes. For each application, the apinit daemon forks a child apshepherd process. Within ps displays, the child apshepherd processes retain the name "apinit".

The per-application TCP fan-out control tree has aprun as the root. Each compute node apshepherd within this control tree has a parent controller and may have a set of controlling nodes. Whenever a parent controller socket connection closes, the local apshepherd attempts to kill any application processes still executing and then will exit. This socket closing process results in a ripple effect through the fan-out control tree, resulting in automatic application tear down.

Whenever the aprun TCP connection to the apshepherd on the first compute node within the placement list closes, the tear down process begins. During an application orderly exit, the exit information is sent to aprun, followed by the aprun closure of the socket connection, resulting in the exit of the apshepherd. The apshepherd exit causes its controlling socket connections to close as well.
Each of those apshepherds will exit, and the application-specific fan-out tree shuts down in an orderly fashion.

When the aprun TCP socket closure is not expected and the application processes are still executing, the apshepherd will send a SIGKILL signal to each local application process and then exit. There can be local delays in kernel delivery of the SIGKILL signal to the application processes due to application I/O activity. The application process will receive the SIGKILL signal after the I/O completes. The apinit daemon is then responsible for monitoring any remaining application processes.

This kill and exit process ripples throughout the control tree. However, if any compute node within the control tree is unresponsive, the ripple effect will stop for any compute nodes beyond that branch portion of the tree. In response to this situation, ALPS must take action independent of the shutdown of the control tree to ensure all of the application processes have exited or that compute nodes are marked either admindown or suspect by node health monitoring. The apsys daemon is involved in invoking the independent action.

apsys

apsys is a local privileged ALPS daemon that runs on each login node and the SDB node. When contacted by aprun, the apsys daemon forks a child agent process to handle that specific local aprun. The apsys agent provides a privileged communication path between aprun and apsched for placement and exit information exchanges. The apsys agent name remains "apsys" within ps displays.

During an orderly application exit, the apsys agent receives exit information from aprun and forwards that information to apsched. However, during an unorderly exit, when the aprun socket connection closes prior to receipt of exit information, the apsys agent is responsible for starting application cleanup on the assigned compute nodes. To begin application cleanup, the apsys agent invokes apmgrcleanup, and the apsys agent blocks until apmgrcleanup completes.
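The reason this independent cleanup path is needed can be illustrated with a small C model of the fan-out tree teardown described in the apinit section. It is a sketch only, not ALPS code: the function name simulate_teardown, the array layout, and the responsive flags are all hypothetical. A node is torn down when its parent connection closes and the node itself responds; an unresponsive node stops the ripple for everything below it.

```c
#include <assert.h>

/* Hypothetical model of the teardown ripple.
   parent[i]     : index of node i's parent controller, or -1 if the
                   parent is aprun. Nodes are ordered so parent[i] < i.
   responsive[i] : 1 if node i reacts to its parent socket closing,
                   0 if the node is hung.
   torn_down[i]  : output, 1 if the ripple reached and cleaned node i.
   Returns the number of nodes the ripple did NOT reach. */
int simulate_teardown(const int *parent, const int *responsive,
                      int *torn_down, int n)
{
    int missed = 0;
    for (int i = 0; i < n; i++) {
        assert(parent[i] < i);
        /* aprun's socket is closed from the start; otherwise the
           parent must itself have exited for its socket to close. */
        int parent_closed = (parent[i] < 0) ? 1 : torn_down[parent[i]];
        torn_down[i] = parent_closed && responsive[i];
        if (!torn_down[i])
            missed++;
    }
    return missed;
}
```

In a seven-node binary tree where only node 1 is unresponsive, nodes 1, 3, and 4 are never reached, so the function returns 3. Those are the nodes that the cleanup started by apsys, together with node health monitoring, still has to deal with.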
At the start of application cleanup, the /var/log/alps/apsysMMDD log file will display data similar to the following messages:

• On CLE 2.1

    09:58:53: [5237] Agent unexpected close of peer connection 6, apid 1950499
    09:58:53: [5237] Agent invoking apmgrcleanup for apid 1950499
    logger: /opt/xt-service/default/bin/snos64/xtok2 -f /tmp/apsysLJludO -- see /var/log/xtoklog

• On CLE 2.2

    14:00:20: [32606] Agent unexpected close of peer connection 6, apid 227061
    14:00:22: [32606] Agent invoking apmgrcleanup for apid 227061
    Mon Feb 2 14:00:22 CST 2009 (xtcleanup_after): Starting /opt/xt-service/default/bin/snos64/xtcheckhealth /tmp/apsys3LbdsN 227061 0 1 < /etc/sysconfig/nodehealth

After apmgrcleanup returns, the apsys log file will contain something similar to the sample messages below:

    14:02:30: [32606] Agent sending ALPSMSG_EXIT message to apsched fd 7, apid 227061
    14:02:30: [32606] Agent received ALPSMSG_EXITCONFIRM from apsched fd 7, apid 227061

In the above example, apsched has been told that the resources assigned to that aprun can now be reallocated to another application. The apstat display will no longer show information about this application.

apmgrcleanup

apsys invokes apmgrcleanup for each application unorderly exit. apmgrcleanup is a shell script that is invoked to do application cleanup for a specific application. As part of this cleanup activity, apmgrcleanup calls another script, which may invoke node health monitoring. apmgrcleanup executes with the permissions of the apsys caller, which runs as root. You must be root to edit the apmgrcleanup file.

apmgrcleanup works with a placement list of assigned compute nodes for a specific application. This application cleanup activity will guarantee that a new application is not placed on this set of compute nodes prematurely.
A new application placed on these compute nodes prematurely will fail, because compute node core and/or memory resources are still assigned to the current application.

apmgrcleanup contacts every node in the placement list supplied to it. apmgrcleanup first uses apmgr to send a kill request message for a specific application to each node on the placement list, and then requests status information about the application on that compute node. apmgrcleanup uses apmgr to send status request messages to the apinit on that set of compute nodes to find out when all of the local application processes have exited. The kernel may not immediately deliver a SIGKILL signal to application processes if those processes are involved in I/O activity.

apmgrcleanup begins by calling apmgr to send a ping kill message to the apinit daemon on each compute node in the placement list for the given application. If there are more than 500 nodes in the list, apmgrcleanup uses nway to perform eight apmgr invocations at a time, in a sliding window fashion, for parallelization. apmgrcleanup continues to loop until the list of nodes is empty.

apmgr writes messages to the syslog after each successfully sent ping kill message. These messages mean only that the message was received by the compute node apinit daemon. The application processes may still exist if the SIGKILL delivery to an application process remains pending due to I/O activity.
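The sliding-window parallelization described above can be sketched as follows. This is a minimal illustration, not the nway tool or apmgrcleanup itself: `send_ping_kill` is a hypothetical stand-in for one apmgr invocation, and handling smaller lists one node at a time is an assumption. Only the 500-node threshold and the window of eight come from the text.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

WINDOW = 8                # simultaneous apmgr invocations (from the text)
PARALLEL_THRESHOLD = 500  # node count above which the window is used


def send_ping_kill(nid: int) -> int:
    """Hypothetical stand-in for one apmgr ping kill invocation."""
    return nid


def kill_all(nids: List[int],
             send: Callable[[int], int] = send_ping_kill) -> List[int]:
    """Contact every node and return the nids contacted, in input order."""
    if len(nids) <= PARALLEL_THRESHOLD:
        # Assumed serial path for small placement lists.
        return [send(nid) for nid in nids]
    # A pool of WINDOW workers behaves like a sliding window: as soon as
    # one invocation finishes, the next node in the list is started.
    with ThreadPoolExecutor(max_workers=WINDOW) as pool:
        return list(pool.map(send, nids))


contacted = kill_all(list(range(600)))
print(len(contacted))
```

A worker pool is used here instead of fixed batches of eight so that one slow node does not hold up the seven invocations that would otherwise wait for its batch to finish.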
Below is a sample of ping kill messages written to the syslog:

    Apr 13 06:55:31 nid00016 apmgr[20277]: apid=821502, killed on nid=587
    Apr 13 06:55:31 nid00016 apmgr[20279]: apid=821502, killed on nid=591
    Apr 13 06:55:31 nid00016 apmgr[20281]: apid=821502, killed on nid=772
    Apr 13 06:55:31 nid00016 apmgr[20283]: apid=821502, killed on nid=776

Inside its main loop, apmgrcleanup calls the xtcleanup_after script with the initial (full) placement list of compute nodes for the application. Each invocation includes a randomly generated filename (/tmp/apsysXXXX) that holds the node list and an invocation count.

    Apr 13 06:55:31 nid00016 06:55:31: /usr/bin/apmgrcleanup [18964] invoking /opt/xt-service/default/bin/snos64/xtcleanup_after /tmp/apsysdbajiE 821502 0 with 5 entries

The invocation count tells xtcleanup_after whether this is the first or a subsequent call of the script. The xtcleanup_after script typically calls node health monitoring. The script is site configurable; however, modifying this script is not recommended.

On return from xtcleanup_after, apmgrcleanup waits one or more seconds, depending on machine size, to avoid looping too quickly, and then rechecks the list of nodes. First, apmgrcleanup invokes apstat and checks for compute nodes that are not marked up, removing them from the /tmp/apsysXXXX file. Then, it calls apmgr to send a ping status request to the apinit daemon on the remaining compute nodes. A compute node is removed from the /tmp/apsysXXXX file whenever the apinit on that compute node responds to the ping status request stating that no application processes remain on that compute node, or when the node is no longer marked up. The ping status request has a five-second time limit. Any remaining nodes (that is, nodes not heard from and still marked up) stay in the file of nodes for the next iteration of the apmgrcleanup loop.
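The loop described above can be summarized in a short sketch. This is an illustrative model only, not the apmgrcleanup shell script: the callables `is_marked_up`, `has_no_app_processes`, and `run_node_health` are hypothetical stand-ins for the apstat check, the apmgr ping status request, and the xtcleanup_after call. The fixed delay and passing only the remaining nodes to the health step are simplifying assumptions.

```python
import time
from typing import Callable, Dict, Set


def cleanup_loop(
    placement: Set[int],
    is_marked_up: Callable[[int], bool],
    has_no_app_processes: Callable[[int], bool],
    run_node_health: Callable[[Set[int], int], None],
    delay_s: float = 1.0,
) -> int:
    """Loop until every node has been removed; return the iteration count."""
    remaining = set(placement)
    count = 0
    while remaining:
        # Stand-in for xtcleanup_after, given the node list and a count.
        run_node_health(set(remaining), count)
        count += 1
        time.sleep(delay_s)  # avoid looping too quickly
        # Drop nodes that are no longer marked up (apstat check).
        remaining = {nid for nid in remaining if is_marked_up(nid)}
        # Drop nodes whose apinit reports no application processes.
        # A request that times out should be reported as False here.
        remaining = {nid for nid in remaining
                     if not has_no_app_processes(nid)}
    return count


polls: Dict[int, int] = {}


def app_gone(nid: int) -> bool:
    """Hypothetical status reply: node 591 needs a second poll."""
    polls[nid] = polls.get(nid, 0) + 1
    return nid == 587 or polls[nid] >= 2


iterations = cleanup_loop(
    {587, 591, 772},
    is_marked_up=lambda nid: nid != 772,  # node 772 is not marked up
    has_no_app_processes=app_gone,
    run_node_health=lambda nodes, count: None,
    delay_s=0.0,
)
print(iterations)
```

In this run, node 772 leaves the list because it is not marked up, node 587 leaves on its first status reply, and node 591 keeps the loop alive for one more iteration.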
When the /tmp/apsysXXXX file is empty, apmgrcleanup will exit. Then, apsys writes a message into the syslog and can tell apsched to release the aprun claim for that set of compute nodes.

    Apr 13 07:18:51 nid00016 apsys[6891]: apid=821502, Finishing, user=1356

After the initial apmgr ping kill messages are sent to the apinit daemon on the set of compute nodes within the /tmp/apsysXXXX file, apmgrcleanup calls the xtcleanup_after script to invoke node health monitoring. If node health monitoring is enabled, compute nodes may be marked admindown or suspect by node health monitoring as described in the following sections.

Node Health Monitoring

Node health monitoring is different on CLE 2.1 and CLE 2.2. The following sections describe the behavior for each release. The CLE 2.1 node health checker uses apmgr ping for node health monitoring. The CLE 2.2 node health checker is much faster because it has its own TCP fan-out tree for communication with a compute node resident daemon.

Cray Linux Environment 2.1 Node Health Checker Monitoring

The CLE 2.1 node health checker comprises xtcleanup_after and xtok2, which work together to diagnose the health of all compute nodes in an aprun placement list. xtcleanup_after is a bash script which receives a list of nodes from apmgrcleanup and uses xtok2 to check the health of nodes. Nodes are considered healthy if they respond to an apmgr status query of the apinit daemon running on the compute node, and if apinit confirms that the user application is no longer running on the compute node.

The CLE 2.1 node health checker logs its behavior in the /var/log/xtoklog file on the service/login node on which it executes. A message from xtcleanup_after shows the time at which the process begins and the apid.

    03/03 14:30:37 Checking for ill nodes using /opt/xt-service/default/bin/snos64/xtok2, apid 22800

xtcleanup_after initially runs xtok2 four times.
Any time that a node responds to the status query and confirms the user application has exited, that node is removed from the list of nodes that are being scanned. Nodes that fail one of these tests are marked as suspect. If a suspect node passes these tests in a subsequent run of xtok2, it is returned to the up state.

    xtok2 - node 1436, marked suspect.
    xtok2 - node 1431, marked up.

If nodes still fail these tests at the end of the four passes, xtok2 forks off a background scanner. The background scanner continues to evaluate these nodes, while xtcleanup_after returns an empty node list to apmgrcleanup. This allows the aprun claim on those nodes to be removed, and all healthy nodes are made available for new user applications.

The background xtok2 then checks the list of remaining suspect nodes every fifteen minutes, over a two-hour period. As with other calls to xtok2, any suspect nodes that start passing both tests are marked up. After eight test runs, all nodes that remain suspect are marked admindown, and xtok2 exits.

Suspect mode and the background xtok2 are new features in CLE 2.1UP01. In previous versions, xtcleanup_after called xtok2 six times in total, but never created the background xtok2. All nodes that did not respond to an apinit status query within these six calls of xtok2 were marked admindown.

    03/03 14:30:37 Marking unresponsive node(s): suspect, forking off background rescan. Healthy nodes back in usable node list.
    1438 1436
    xtok2 backgroundscan apid 22800/i 0.
    xtok2 backgroundscan apid 22800/i 1.
    xtok2 backgroundscan apid 22800/i 2.
    xtok2 backgroundscan apid 22800/i 3.
    xtok2 backgroundscan apid 22800/i 4.
    xtok2 backgroundscan apid 22800/i 5.
    xtok2 backgroundscan apid 22800/i 6.
    xtok2 - node 1436, marked admindown.
    xtok2 - node 1437, marked admindown.
    xtok2 - node 1438, marked admindown.
    xtok2 backgroundscan 22800/7 timeout.
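The CLE 2.1 state transitions described above (up, suspect, admindown) can be modeled in a few lines. This is a hedged sketch and not xtok2 itself: `passes_tests` is a hypothetical callable standing in for the apinit status query plus the application-exited check, the run index simply counts foreground and background runs together, and the fifteen-minute waits between background runs are omitted.

```python
from typing import Callable, Dict, Iterable

FOREGROUND_PASSES = 4  # initial xtok2 runs (from the text)
BACKGROUND_PASSES = 8  # one every fifteen minutes over two hours


def classify(nodes: Iterable[int],
             passes_tests: Callable[[int, int], bool]) -> Dict[int, str]:
    """Return the final state of each node after all scan runs."""
    state: Dict[int, str] = {}
    pending = list(nodes)
    for run in range(FOREGROUND_PASSES + BACKGROUND_PASSES):
        still_failing = []
        for nid in pending:
            if passes_tests(nid, run):
                state[nid] = "up"       # removed from the scan list
            else:
                state[nid] = "suspect"  # rescanned on the next run
                still_failing.append(nid)
        pending = still_failing
        if not pending:
            break
    for nid in pending:
        state[nid] = "admindown"  # still suspect after the final run
    return state


# Hypothetical behavior: 1431 is healthy, 1436 recovers on run 6
# (a background run), and 1438 never responds.
recovery_run = {1431: 0, 1436: 6}
final = classify(
    [1431, 1436, 1438],
    lambda nid, run: nid in recovery_run and run >= recovery_run[nid],
)
print(final)
```

A node that recovers during a background run ends in the up state without manual intervention, which is the case that Suspect Mode is meant to cover.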
Any compute node marked admindown by the background xtok2 requires manual intervention to reboot the node or to change its state to another value, as appropriate, after investigation of the cause.

CLE 2.2 Node Health Checker Monitoring

On the service/login node, the CLE 2.2 node health checker comprises the xtcleanup_after script, xtcheckhealth, and the /etc/sysconfig/nodehealth configuration file. xtcleanup_after is a bash script that receives a placement list of compute nodes assigned to a specific application from apmgrcleanup. xtcheckhealth is a binary that checks the health of the nodes in this list. Parameters can be set in the /etc/sysconfig/nodehealth configuration file to change the behavior of the node health checker. This configuration file is located on the shared root and is available on all of the login nodes.

For more information about the CLE 2.2 node health checker, see the intro_NHC(8), xtcheckhealth, xtcleanup_after, and xtok2(8) man pages and Cray System Management, which are provided with the CLE 2.2 release. The NHC configuration file is self-documented.

Unlike the CLE 2.1 node health checker, the CLE 2.2 node health checker has an xtnhd daemon that runs locally on the compute nodes. When invoked, xtcheckhealth sends the /etc/sysconfig/nodehealth configuration file to xtnhd on each compute node within the specific application placement list. Based on the parameters set in the node health configuration file, xtnhd launches certain tests on the designated compute nodes.

Two tests are enabled by default in the /etc/sysconfig/nodehealth configuration file: the ALPS test and the Application test. These tests perform functions similar to the checks performed by the CLE 2.1 node health checker.

The ALPS test uses xtnhd locally to query the status of the apinit daemon on each compute node. apmgr ping from a login node is not used for this test. If the apinit daemon does not respond to the query, then this test fails.
The Application test checks locally to see if there are processes running under the apid of the application. If there are processes running, then the node health checker waits a period of time (set in the configuration file) to determine if the application processes properly exit. If the processes do not exit within this time, then the test fails.

Suspect Mode is a configurable option in the CLE 2.2 node health checker. If Suspect Mode is enabled, then nodes that fail tests are put into the suspect state. These nodes may be returned to the up state if they recover within the window of time given to them for Suspect Mode. Otherwise, they are marked admindown at the end of this window of time. The entry time and duration of Suspect Mode are configurable in the node health configuration file. The node health checker calls xtok2 to implement Suspect Mode. If Suspect Mode is not enabled, then nodes that fail tests are immediately set to admindown.

The node health checker sends its output to the /opt/craylog/bootlogs/console.YYMMDDHHMM log file on the SMW. The output includes data from the individual tests that were run on each compute node as well as output from xtcheckhealth, which is run on a login node. The following are some sample console messages:

    [2009-04-13 06:57:57][c0-0c0s4n0] APID:821502 (xtcheckhealth) WARNING: Could not set 583 to admindown because its state is down.
    [2009-04-13 06:58:57][c0-0c0s4n0] APID:821502 (xtcheckhealth) WARNING: Node: 587 didn't respond to query.
    [2009-04-13 06:59:31][c4-0c2s1n3] APID:821502 (launch_tests) WARNING: Warning timeout expired: 240 seconds. Current process: 7029 (check_apid). Test type: APID. Cfg Line: 137
    [2009-04-13 07:00:31][c4-0c2s1n3] APID:821502 (check_apid) WARNING: Failure: File /dev/cpuset/821502/tasks exists and is not empty.
    The following processes are running under expired APID 821502:
    [2009-04-13 07:00:31][c4-0c2s1n3] APID:821502 (check_apid) WARNING: Pid: 7008 Name: (shmem_lock_test) State: S
    [2009-04-13 07:00:31][c4-0c2s1n3] APID:821502 (launch_tests) WARNING: Sent SIGTERM to process group 7029 (check_apid).

xtcleanup_after writes its output to both the configured syslog location and the xtcheckhealth_log file. That file is located in /var/log/xtcheckhealth_log on the service/login node that is executing the cleanup. The output from xtcleanup_after includes the call to xtcheckhealth and its parameters. xtcleanup_after will also output any errors that it encounters while trying to launch xtcheckhealth.

    Mon Apr 13 06:55:31 CDT 2009 (xtcleanup_after) /opt/xt-service/default/bin/snos64/xtcheckhealth /tmp/apsysdbajiE 821502 0 1 < /etc/sysconfig/nodehealth

xtok2 writes output to the console log file on the SMW. It also writes output to the /var/log/xtoklog file located on the executing service/login node, which was the default location for the CLE 2.1 node health checker.

    04/13 07:18:49 Checking for ill nodes using /opt/xt-service/default/bin/snos64/xtok2, apid 821502
    xtok2 node 772 already down.
    xtok2 node 587 already down.

Conclusion

There are two distinct and unexpected situations that may occur during application cleanup following an unorderly exit. The addition of node health monitoring Suspect Mode removes or minimizes the impact of these two cases.

• There is a noticeable delay in completion of application cleanup, which delays the apsched removal of a batch resource reservation or an aprun claim.
• Some compute node has been marked admindown, but some period of time later that node seems to be functioning correctly.

There are a number of circumstances that can delay completion of application cleanup after an unorderly exit.
This delay is often detected through apstat displays that still show the application and the resource reservation for that application. Please provide the following information, plus the applicable compute node /var/log/alps/apinitMMDD log files, as supporting information for any bug reports. As described in previous sections, check the various log files to understand what activity has taken place for a specific application.

• Check the /var/log/alps/apsysMMDD log files for that apid; verify that apmgrcleanup has been invoked.
• On that same login node, use ps to check if apmgrcleanup is still executing.
• Check the applicable node health monitoring log files (/var/log/xtoklog, /var/log/xtcheckhealth_log, and the syslog) for that apid.
• Check the SMW /opt/craylog/bootlogs/console.YYMMDDHHMM log file for that apid.

Application Programmer’s I/O Guide
S–3695–36

© 1994, 1995, 1997-1999, 2001, 2002 Cray Inc. All Rights Reserved. This manual or parts thereof may not be reproduced in any form unless permitted by contract or by written permission of Cray Inc.

U.S. GOVERNMENT RESTRICTED RIGHTS NOTICE

The Computer Software is delivered as "Commercial Computer Software" as defined in DFARS 48 CFR 252.227-7014. All Computer Software and Computer Software Documentation acquired by or for the U.S. Government is provided with Restricted Rights. Use, duplication or disclosure by the U.S. Government is subject to the restrictions described in FAR 48 CFR 52.227-14 or DFARS 48 CFR 252.227-7014, as applicable. Technical Data acquired by or for the U.S. Government, if any, is provided with Limited Rights. Use, duplication or disclosure by the U.S. Government is subject to the restrictions described in FAR 48 CFR 52.227-14 or DFARS 48 CFR 252.227-7013, as applicable.
Autotasking, CF77, Cray, Cray Ada, Cray Channels, Cray Chips, CraySoft, Cray Y-MP, Cray-1, CRInform, CRI/TurboKiva, HSX, LibSci, MPP Apprentice, SSD, SuperCluster, UNICOS, UNICOS/mk, and X-MP EA are federally registered trademarks and Because no workstation is an island, CCI, CCMT, CF90, CFT, CFT2, CFT77, ConCurrent Maintenance Tools, COS, Cray Animation Theater, Cray APP, Cray C90, Cray C90D, Cray CF90, Cray C++ Compiling System, CrayDoc, Cray EL, Cray Fortran Compiler, Cray J90, Cray J90se, Cray J916, Cray J932, CrayLink, Cray MTA, Cray MTA-2, Cray MTX, Cray NQS, Cray/REELlibrarian, Cray S-MP, Cray SSD-T90, Cray SV1, Cray SV1ex, Cray SV2, Cray SX-5, Cray SX-6, Cray T90, Cray T94, Cray T916, Cray T932, Cray T3D, Cray T3D MC, Cray T3D MCA, Cray T3D SC, Cray T3E, CrayTutor, Cray X-MP, Cray XMS, Cray-2, CSIM, CVT, Delivering the power . . ., DGauss, Docview, EMDS, GigaRing, HEXAR, IOS, ND Series Network Disk Array, Network Queuing Environment, Network Queuing Tools, OLNET, RQS, SEGLDR, SMARTE, SUPERLINK, System Maintenance and Remote Testing Environment, Trusted UNICOS, and UNICOS MAX are trademarks of Cray Inc. CDC is a trademark of Control Data Systems, Inc. DEC, ULTRIX, VAX, and VMS are trademarks of Digital Equipment Corporation. ER90 is a trademark of EMASS, Inc. ETA is a trademark of ETA Systems, Inc. IBM is a trademark of International Business Machines Corporation. IRIX and SGI are trademarks of Silicon Graphics, Inc. MIPS is a registered trademark and MIPSpro is a trademark of MIPS Technologies, Inc. UNIX, the “X device,” X Window System, and X/Open are trademarks of The Open Group in the United States and other countries. All other trademarks are the property of their respective owners. The UNICOS operating system is derived from UNIX System V. 
The UNICOS operating system is also based in part on the Fourth Berkeley Software Distribution (BSD) under license from The Regents of the University of California.

New Features

Application Programmer’s I/O Guide S–3695–36

This version of the manual supports only Cray T3E and Cray SV1 systems. This revision contains editorial changes throughout.

Record of Revision

Version   Description

1.0     May 1994. Original Printing. This document incorporates information from the I/O User’s Guide, publication SG-3075, and the Advanced I/O User’s Guide, publication SG-3076.
1.2     October 1994. Revised for the Programming Environment 1.2 release.
2.0     November 1995. Revised for the Programming Environment 2.0 release.
3.0     May 1997. Revised for the Programming Environment 3.0 release.
3.0.1   August 1997. Revised for the Programming Environment 3.0.1 release and the MIPSpro 7 Fortran 90 compiler release.
3.0.2   March 1998. Revised for the Programming Environment 3.0.2 release and the MIPSpro 7 Fortran 90 compiler release.
3.1     August 1998. Revised for the Programming Environment 3.1 release.
3.2     January 1999. Revised for the Programming Environment 3.2 release.
3.3     July 1999. Revised for the Programming Environment 3.3 release.
3.5     January 2001. Revised for the Programming Environment 3.5 release.
36      May 2002. Revised for the Programming Environment 3.6 release.

Contents

Preface
  Related Publications
  Ordering Documentation
  Conventions
  Reader Comments

Introduction [1]
  The Message System

Standard Fortran I/O [2]
  Files
  Internal Files
  External Files
  Fortran Unit Identifiers
  Data Transfer Statements
  Formatted I/O
  Edit-Directed I/O
  Procedure 1: Optimization technique: using single statements
  Procedure 2: Optimization technique: using longer records
  Procedure 3: Optimization technique: using repeated edit descriptors
  Procedure 4: Optimization technique: using data edit descriptors
  List-Directed I/O
  Unformatted I/O
  Auxiliary I/O
  File Connection Statements
  The INQUIRE Statement
  File Positioning Statements
  Private I/O on Cray T3E systems
  Multithreading and Standard Fortran I/O

Fortran I/O Extensions [3]
  BUFFER IN/BUFFER OUT Routines
  The UNIT Intrinsic
  The LENGTH Intrinsic
  Positioning
  Random Access I/O Routines
  Example 1: MS package use
  Example 2: DR package use
  Word-Addressable I/O Routines
  Example 3: WA package use
  Asynchronous Queued I/O (AQIO) Routines
  Error Detection by Using AQIO
  Example 4: AQIO routines: compound read operations
  Example 5: AQIO routines: error detection
  Logical Record I/O Routines

Tape and Named Pipe Support [4]
  Tape Support
  User EOV Processing
  Handling Bad Data on Tapes
  Positioning
  Named Pipes
  Piped I/O Example Without End-of-File Detection
  Example 6: No EOF detection: writerd
  Example 7: No EOF detection: readwt
  Detecting End-of-File on a Named Pipe
  Piped I/O Example With End-of-File Detection
  Example 8: EOF detection: writerd
  Example 9: EOF detection: readwt

System and C I/O [5]
  System I/O
  Synchronous I/O
  Asynchronous I/O
  listio I/O
  Unbuffered I/O
  C I/O
  C I/O from Fortran
  Example 10: C I/O from Fortran
  C I/O on Cray T3E systems

The assign Environment [6]
  assign Basics
  Open Processing
  The assign Command
  Related Library Routines
  assign and Fortran I/O
  Alternative File Names
  File Structure Selection
  Buffer Size Specification
  Foreign File Format Specification
  File Space Allocation
  Device Allocation
  Direct-Access I/O Tuning
  Fortran File Truncation
  The assign Environment File
  Local assign
  Example 11: Local assign mode

File Structures [7]
  Unblocked File Structure
  assign -s unblocked File Processing
  assign -s sbin File Processing (Not Recommended)
  assign -s bin File Processing (Not Recommended)
  assign -s u File Processing
  Text File Structure
  COS or Blocked File Structure
  Tape and Bmx File Structure
  Library Buffers

Buffering [8]
  Buffering Overview
  Types of Buffering
  Unbuffered I/O
  Library Buffering
  System Cache
  Restrictions on Raw I/O
  Logical Cache Buffering
  Default Buffer Sizes
  UNICOS and UNICOS/mk Default Buffer Sizes

Devices [9]
  Tape
  Tape I/O Interfaces
  Tape Subsystem Capabilities
  SSD
  SSD File Systems
  Secondary Data Segments (SDS)
  Logical Device Cache (ldcache)
  Disk Drives
  Main Memory

Introduction to FFIO [10]
  Layered I/O
  Using Layered I/O
  I/O Layers
  Layered I/O Options
  Setting FFIO Library Parameters

Using FFIO [11]
  FFIO and Common Formats
  Reading and Writing Text Files
  Reading and Writing Unblocked Files
  Reading and Writing Fixed-length Records
  Reading and Writing COS Blocked Files
  Enhancing Performance
  Buffer Size Considerations
  Removing Blocking
  The bufa and cachea Layers
  The sds Layer (Available Only on UNICOS Systems)
  The mr Layer
  The cache Layer
  Sample Programs for UNICOS Systems
  Example 12: sds using buffer I/O
  Example 13: Unformatted sequential sds example
  Example 14: sds and mr with WAIO
  Example 15: Unformatted direct sds and mr example
  Example 16: sds with MS package example
  Example 17: mr with buffer I/O example
  Example 18: Unformatted sequential mr examples
  Example 19: mr and MS package example

Foreign File Conversion [12]
  Conversion Overview
  Transferring Data
  Using fdcp to Transfer Files
  Example 20: Copy VAX/VMS tape file to disk
  Example 21: Copy unknown tape type to disk
  Example 22: Creating files for other systems
  Example 23: Copying to UNICOS text files
  Moving Data Between Systems
  Station Conversion Facilities
  Magnetic Tape
  TCP/IP and Other Networks
  Data Item Conversion
  Explicit Data Item Conversion
  Implicit Data Item Conversion
  Choosing a Conversion Method
  Station Conversion
  Explicit Conversion
  Implicit Conversion
  Disabling Conversion Types
  Foreign Conversion Techniques
  CDC CYBER NOS (VE and NOS/BE 60-bit) Conversion
  COS Conversions
  CDC CYBER 205 and ETA Conversion
  CTSS Conversion
  IBM Overview
  Using the MVS Station
  Data Transfer between UNICOS and VM
  Workstation and IEEE Conversion
  VAX/VMS Conversion
  Implicit Numeric Conversions (UNICOS systems Only)

I/O Optimization [13]
  Overview
  An Overview of Optimization Techniques
  Evaluation Tools
  Optimizations Not Affecting Source Code
  Optimizations that Affect Source Code
  Optimizing I/O Speed
  Determining I/O Activity
  Checking Program Execution Time
  Optimizing System Requests
  The MR Feature
  Using Faster Devices
  Using MR/SDS Combinations
  Using a Cache Layer
  Preallocating File Space
  User Striping
  Optimizing File Structure Overhead
  Scratch Files
  Alternate File Structures
  Using the Asynchronous COS Blocking Layer
  Using Asynchronous Read-Ahead and Write-Behind
  Using Simpler File Structures
  Minimizing Data Conversions
  Minimizing Data Copying
  Changing Library Buffer Sizes
  Bypassing Library Buffers
  Other Optimization Options
  Using Pipes
  Overlapping CPU and I/O
  Optimization on UNICOS/mk Systems

FFIO Layer Reference [14]
  Characteristics of Layers
  Individual Layers
  The blankx Expansion/Compression Layer
  The bmx/tape Layer
  The bufa Layer
  The CYBER 205/ETA (c205)
  The cache Layer
  The cachea Layer
  The cdc Layer
  The cos Blocking Layer
. . . . . . . . . . . . . . . . . . . . 194 The er90 Layer (Available Only on UNICOS Systems) . . . . . . . . . . . . 196 The event Layer . . . . . . . . . . . . . . . . . . . . . . . . 197 The f77 Layer . . . . . . . . . . . . . . . . . . . . . . . . 198 The fd Layer . . . . . . . . . . . . . . . . . . . . . . . . . 200 The global Layer . . . . . . . . . . . . . . . . . . . . . . . 200 The ibm Layer . . . . . . . . . . . . . . . . . . . . . . . . 202 The mr Layer . . . . . . . . . . . . . . . . . . . . . . . . . 205 The nosve Layer . . . . . . . . . . . . . . . . . . . . . . . . 208 The null Layer . . . . . . . . . . . . . . . . . . . . . . . . 210 The sds Layer (Available Only on UNICOS Systems) . . . . . . . . . . . . . 211 The syscall Layer . . . . . . . . . . . . . . . . . . . . . . . 214 The system Layer . . . . . . . . . . . . . . . . . . . . . . . 215 The text Layer . . . . . . . . . . . . . . . . . . . . . . . . 216 The user and site Layers . . . . . . . . . . . . . . . . . . . . 217 The vms Layer . . . . . . . . . . . . . . . . . . . . . . . . 218 Creating a user Layer [15] 221 Internal Functions . . . . . . . . . . . . . . . . . . . . . . . . 221 x S–3695–36Contents Page The Operations Structure . . . . . . . . . . . . . . . . . . . . . 222 FFIO and the Stat Structure . . . . . . . . . . . . . . . . . . . . . 223 user Layer Example . . . . . . . . . . . . . . . . . . . . . . . 224 Appendix A Older Data Conversion Routines 245 Old IBM Data Conversion Routines . . . . . . . . . . . . . . . . . . . 245 Old CDC Data Conversion Routines . . . . . . . . . . . . . . . . . . . 246 Old VAX/VMS Data Conversion Routines . . . . . . . . . . . . . . . . . 246 Glossary 249 Index 253 Figures Figure 1. Access methods and default buffer sizes (UNICOS systems) . . . . . . . . 68 Figure 2. Typical data flow . . . . . . . . . . . . . . . . . . . . . 95 Figure 3. I/O layers . . . . . . . . . . . . . . . . . . . . . . . 156 Figure 4. I/O data movement . . . . . . . . . . . . 
. . . . . . . . 162 Figure 5. I/O data movement (current) . . . . . . . . . . . . . . . . . 169 Figure 6. I/O processing with library processing eliminated . . . . . . . . . . . 171 Tables Table 1. Fortran access methods and options . . . . . . . . . . . . . . . . 72 Table 2. Disk information . . . . . . . . . . . . . . . . . . . . . 92 Table 3. I/O Layers available on all hardware platforms . . . . . . . . . . . . 99 Table 4. HARDREF Directives . . . . . . . . . . . . . . . . . . . . 102 Table 5. Conversion routines for Cray PVP systems . . . . . . . . . . . . . . 128 Table 6. Conversion routines for Cray MPP systems . . . . . . . . . . . . . 128 Table 7. Conversion routines for Cray T90 systems . . . . . . . . . . . . . . 129 Table 8. Conversion types on Cray PVP systems . . . . . . . . . . . . . . 130 Table 9. Conversion types on Cray MPP systems . . . . . . . . . . . . . . 131 Table 10. Conversion types on Cray T90/IEEE systems . . . . . . . . . . . . 131 S–3695–36 xiApplication Programmer’s I/O Guide Page Table 11. Supported foreign I/O formats and default data types . . . . . . . . . 133 Table 12. Data manipulation: blankx layer . . . . . . . . . . . . . . . . 182 Table 13. Supported operations: blankx layer . . . . . . . . . . . . . . . 182 Table 14. -T specified on tpmnt . . . . . . . . . . . . . . . . . . . 184 Table 15. Data manipulation: bmx/tape layer . . . . . . . . . . . . . . . 184 Table 16. Supported operations: bmx/tape layer . . . . . . . . . . . . . . 184 Table 17. Data manipulation: bufa layer . . . . . . . . . . . . . . . . . 186 Table 18. Supported operations: bufa layer . . . . . . . . . . . . . . . . 186 Table 19. Data manipulation: c205 layer . . . . . . . . . . . . . . . . . 187 Table 20. Supported operations: c205 layer . . . . . . . . . . . . . . . . 188 Table 21. Data manipulation: cache layer . . . . . . . . . . . . . . . . 189 Table 22. Supported operations: cache layer . . . . . . . . . . . . . . . 190 Table 23. 
Data manipulation: cachea layer . . . . . . . . . . . . . . . . 192 Table 24. Supported operations: cachea layer . . . . . . . . . . . . . . . 192 Table 25. Data manipulation: cdc layer . . . . . . . . . . . . . . . . . 193 Table 26. Supported operations: cdc layer . . . . . . . . . . . . . . . . 194 Table 27. Data manipulation: cos layer . . . . . . . . . . . . . . . . . 195 Table 28. Supported operations: cos layer . . . . . . . . . . . . . . . . 195 Table 29. Data manipulation: er90 layer . . . . . . . . . . . . . . . . . 196 Table 30. Supported operations: er90 layer . . . . . . . . . . . . . . . . 196 Table 31. Data manipulation: f77 layer . . . . . . . . . . . . . . . . . 199 Table 32. Supported operations: f77 layer . . . . . . . . . . . . . . . . 199 Table 33. Data manipulation: global layer . . . . . . . . . . . . . . . . 201 Table 34. Supported operations: global layer . . . . . . . . . . . . . . . 202 Table 35. Values for maximum record size on ibm layer . . . . . . . . . . . . 204 Table 36. Values for maximum block size in ibm layer . . . . . . . . . . . . . 204 Table 37. Data manipulation: ibm layer . . . . . . . . . . . . . . . . . 204 Table 38. Supported operations: ibm layer . . . . . . . . . . . . . . . . 205 Table 39. Data manipulation: mr layer . . . . . . . . . . . . . . . . . . 207 Table 40. Supported operations: mr layer . . . . . . . . . . . . . . . . . 207 xii S–3695–36Contents Page Table 41. Values for maximum record size . . . . . . . . . . . . . . . . 209 Table 42. Values for maximum block size . . . . . . . . . . . . . . . . . 209 Table 43. Data manipulation: nosve layer . . . . . . . . . . . . . . . . 210 Table 44. Supported operations: nosve layer . . . . . . . . . . . . . . . 210 Table 45. Data manipulation: sds layer . . . . . . . . . . . . . . . . . 214 Table 46. Supported operations: sds layer . . . . . . . . . . . . . . . . 214 Table 47. Data manipulation: syscall layer . . . . . . . . . . . . . . . 215 Table 48. 
Supported operations: syscall layer . . . . . . . . . . . . . . . 215 Table 49. Data manipulation: text layer . . . . . . . . . . . . . . . . . 216 Table 50. Supported operations: text layer . . . . . . . . . . . . . . . . 217 Table 51. Values for record size: vms layer . . . . . . . . . . . . . . . . 219 Table 52. Values for maximum block size: vms layer . . . . . . . . . . . . . 219 Table 53. Data manipulation: vms layer . . . . . . . . . . . . . . . . . 220 Table 54. Supported operations: vms layer . . . . . . . . . . . . . . . . 220 S–3695–36 xiiiPreface This publication describes Fortran input/output (I/O) techniques for use on the UNICOS and UNICOS/mk. It also contains information about advanced I/O topics such as asynchronous queued I/O and logical record I/O. Information about the interaction of the I/O library and the Fortran compiler is also discussed. This document also serves as an I/O optimization guide for Fortran programmers. It describes the types of I/O that are available, including insight into the efficiencies and inefficiencies of each, the ways to speed up various forms of I/O, and the tools used to extract statistics from the execution of a Fortran program. 
Related Publications

The following documents contain additional information that may be helpful:

• Application Programmer’s Library Reference Manual
• Cray T3E Fortran Optimization Guide
• UNICOS Performance Utilities Reference Manual
• UNICOS System Calls Reference Manual
• UNICOS System Libraries Reference Manual
• CF90 Ready Reference
• CF90 Commands and Directives Reference Manual
• Fortran Language Reference Manual, Volume 1
• Fortran Language Reference Manual, Volume 2
• Fortran Language Reference Manual, Volume 3

Ordering Documentation

To order software documentation, contact the Cray Software Distribution Center in any of the following ways:

E-mail: orderdsk@cray.com

Web: http://www.cray.com/craydoc/ (click on the Cray Publications Order Form link)

Telephone (inside U.S., Canada): 1–800–284–2729 (BUG CRAY), then 605–9100

Telephone (outside U.S., Canada): Contact your Cray representative, or call +1–651–605–9100

Fax: +1–651–605–9001

Mail:
    Software Distribution Center
    Cray Inc.
    1340 Mendota Heights Road
    Mendota Heights, MN 55120–1128
    USA

Conventions

The following conventions are used throughout this document:

Convention    Meaning

command       This fixed-space font denotes literal items, such as file names, pathnames, man page names, command names, and programming language elements.

variable      Italic typeface indicates an element that you will replace with a specific value. For instance, you may replace filename with the name datafile in your program. It also denotes a word or concept being defined.

[ ]           Brackets enclose optional portions of a syntax representation for a command, library routine, system call, and so on.

Reader Comments

Contact us with any comments that will help us to improve the accuracy and usability of this document. Be sure to include the title and number of the document with your comments. We value your comments and will respond to them promptly.
Contact us in any of the following ways:

E-mail: swpubs@cray.com

Telephone (inside U.S., Canada): 1–800–950–2729 (Cray Customer Support Center)

Telephone (outside U.S., Canada): Contact your Cray representative, or call +1–715–726–4993 (Cray Customer Support Center)

Mail:
    Software Publications
    Cray Inc.
    1340 Mendota Heights Road
    Mendota Heights, MN 55120–1128
    USA

Introduction [1]

This manual introduces standard Fortran and supported Fortran extensions, and it provides a discussion of flexible file input/output (FFIO) and other input/output (I/O) methods for UNICOS and UNICOS/mk systems.

This manual is for Fortran programmers who need general I/O information or who need information on how to optimize their I/O.

Some information in this manual addresses usage information for UNICOS or UNICOS/mk systems only. When this occurs, the information is flagged as applicable only to the intended system.

This manual contains the following chapters:

• Standard Fortran I/O, Chapter 2, page 5, discusses elements of the Fortran 95 standard that relate to I/O.

• Fortran I/O Extensions, Chapter 3, page 21, discusses extensions to the Fortran standard.

• Tape and Named Pipe Support, Chapter 4, page 41, discusses tape handling and FIFO special files.

• System and C I/O, Chapter 5, page 49, discusses system calls and Fortran callable entry points to C library routines.

• The assign Environment, Chapter 6, page 55, discusses the use of the assign(1) command to access and update advisory information from the I/O library and how to create an I/O environment.

• File Structures, Chapter 7, page 71, discusses native file structures.

• Buffering, Chapter 8, page 79, discusses file buffering as it applies to I/O.

• Devices, Chapter 9, page 87, discusses types of storage devices.

• Introduction to FFIO, Chapter 10, page 95, provides an overview of the Flexible File I/O system.
• Using FFIO, Chapter 11, page 103, describes how to use FFIO with common file structures, and how to use FFIO to enhance program performance.

• Foreign File Conversion, Chapter 12, page 121, discusses how to convert data from one file structure to another.

• I/O Optimization, Chapter 13, page 155, discusses methods to speed up I/O processing.

• FFIO Layer Reference, Chapter 14, page 179, provides details about individual FFIO layers.

• Creating a user Layer, Chapter 15, page 221, provides an example of how to create an FFIO layer.

• Older Data Conversion Routines, Appendix A, page 245, lists outdated data conversion routines.

1.1 The Message System

The UNICOS operating system contains an error message system that consists of commands, library routines, and files that allow error messages to be retrieved from message catalogs and formatted at run time. The user who receives a message can request more information by using the explain(1) user command.

The explain command retrieves a message explanation from an online explanation catalog and displays it on the standard output device. The msgid argument to the explain command is the message ID string that appears when an error message is written. The ID string contains a product group code and the message number.

The product group code or product code is a string that identifies the product issuing the message. The product code for the Fortran libraries and for the I/O libraries is lib.

The number specifies the number of the message. The following list describes the categories of message numbers:

• All Fortran library errors on UNICOS and UNICOS/mk systems are within the range of 1000 to 2000. Libraries may also return system error numbers (the sys product code) in the range of 1 to the first library error number.

• Flexible file I/O (FFIO) returns error values that are in the range of 5000 to 6000 and have a product code of lib.
• On UNICOS systems, the tape system returns error numbers that are in the range of 90000 through 90500. The Tape Subsystem User’s Guide lists tape system error messages.

Both of the following are variations of the explain command used with a msgid from the Fortran I/O library:

    explain lib1100
    explain lib-1100

The previous explain command produces the following description on a standard output file:

    explain lib-1100
    lib-1100: A READ operation tried to read a nonexistent record.

    On a Fortran READ statement, the REC (record) specifier was larger than the largest record number for that direct-access file. Check the value of the REC specifier to ensure that it is a valid record number. Check the file being read to ensure that it is the correct file.

    Also see the description of input/output statements in your Fortran reference manual.

    The class of the error is unrecoverable (issued by the Fortran run-time library).

There are two classes of Fortran library error messages: UNRECOVERABLE and WARNING. The following is an example of a warning message:

    lib-1951 a.out: At line in Fortran routine "", in dimension , extents and are not equal.

    When bounds checking is enabled, this message is issued if an array assignment exceeds the bounds of the result array. The line number in the Fortran routine is where the two array extents ( and ) did not match. Modify the program so as not to exceed the bounds of the array, or ensure that the array extents are equal.

    Also see the description of array operations in your Fortran reference manual. Note that this message is issued as a warning. Execution of the program will continue.

If the message number is not valid, a message similar to the following appears:

    explain: no explanation for lib-3000

Standard Fortran I/O [2]

The Fortran standard describes program statements that you can use to transfer data between external media and internal files or between internal files and internal storage.
It describes auxiliary I/O statements that can be used to change the position in the external file or to write an end-of-file record. It also describes auxiliary I/O statements that describe properties of the connection to a file or that inquire about the properties of that connection.

2.1 Files

The Fortran standard specifies the form of the input data that a Fortran program processes and the form of output data resulting from a Fortran program. It does not specifically describe the physical properties of I/O records, files, and units. This section provides a general overview of files, records, and units.

Standard Fortran has two types of files: external and internal. An external file is any file that is associated with a unit number. An internal file is a character variable that is used as the unit specifier in a READ or WRITE statement. A unit is a means of referring to an external file. A unit is connected or linked to a file through the OPEN statement in standard Fortran. An external unit identifier refers to an external file, and an internal file identifier refers to an internal file. See Section 2.2, page 8, for more information about unit identifiers.

A file can have a name that can be specified through the FILE= specifier in a Fortran OPEN statement. If no explicit OPEN statement exists to connect a file to a unit, and if assign(1) was not used, the I/O library uses a form of the unit number as the file name.

2.1.1 Internal Files

Internal files provide a means of transferring and converting text stored in character variables. An internal file must be a character variable or character array. If the file is a variable, the file can contain only one record. If the file is a character array, each element within the array is a record. On output, the record is filled with blanks if the number of characters written to a record is less than the length of the record. An internal file is always positioned at the beginning of the first record prior to data transfer.
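The record rules for internal files described above can be sketched as follows. This is a minimal free-form illustration and is not taken from this manual; the variable names and lengths are assumptions chosen only for the example.

```fortran
program internal_file_demo
  implicit none
  character(len=12) :: line       ! scalar internal file: exactly one record
  character(len=8)  :: table(3)   ! array internal file: one record per element
  integer :: n, k

  ! Convert an integer to text. Only 5 characters are written, so the
  ! remaining 7 characters of the record are filled with blanks.
  write (line, '(I5)') 42
  print '(3A)', '[', line, ']'

  ! Convert the text back to an integer. The internal file is positioned
  ! at the beginning of its first record before the transfer starts.
  read (line, '(I5)') n
  if (n /= 42) stop 1

  ! With a character array, format reversion starts a new record for
  ! each item, so each element of table receives one value.
  write (table, '(I3)') (k*k, k = 1, 3)
  do k = 1, 3
    print '(3A)', '[', table(k), ']'
  end do
end program internal_file_demo
```

Both transfers use edit-directed formats, which keeps the sketch within the restrictions on internal files stated in this section.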
Internal files can contain only formatted records.

When reading and writing to an internal file, only sequential formatted data transfer statements that do not specify list-directed formatting can be used. Only sequential formatted READ and WRITE statements can specify an internal file.

2.1.2 External Files

In standard Fortran, one external unit can be connected to a file. Cray allows more than one external unit to be connected to the standard input, standard output, or standard error files if the files were assigned with the assign -D command. More than one external unit can be connected to a terminal.

External files have properties of form, access, and position as described in the following text. You can specify these properties explicitly by using an OPEN statement on the file. The Fortran standard provides specific default values for these properties.

• Form (formatted or unformatted): external files can contain formatted or unformatted records. Formatted records are read or written by formatted I/O data transfer statements. Unformatted records are accessed through unformatted I/O data transfer statements. If the default does not match the form needed, you can specify the form by using an OPEN statement.

• File access (sequential or direct access): external files can be accessed through sequential or direct access methods. The file access method is determined when the file is connected to a unit.

  – Sequential access does not require an explicit open of a file by using an OPEN statement. When connected for sequential access, the external file has the following properties:

    • The records of the file are either all formatted or all unformatted, except that the last record of the file may be an end-of-file record.

    • The records of the file must not be read or written by direct-access I/O statements when the file is opened for sequential access.
    • If the file is created with sequential access, the records are stored in the order in which they are written (that is, sequentially).

    To use sequential access on a file that was created as a formatted direct-access file, open the file as sequential. To use sequential access on a file that was created as an unformatted direct-access file, open the file as sequential, and use the assign command on the file as follows:

        assign -s unblocked ...

    The assign command is required to specify the type of file structure. The I/O libraries need this information to access the file correctly.

    Buffer I/O files are unformatted sequential access files.

  – Direct access does require an explicit open of a file by using an OPEN statement. If a file is accessed through a sequential access READ or WRITE statement, the I/O library implicitly opens the file. During an explicit or implicit open of a file, the I/O library tries to access information generated by the assign(1) command for the file.

    Direct access can be faster than sequential access when a program must access a set of records in a nonsequential manner.

    When connected for direct access, an external file has the following properties:

    • The records of the file are either all formatted or all unformatted. If the file can be accessed as a sequential file, the end-of-file record is not considered part of the file when it is connected for direct access. Some sequential files do not contain a physical end-of-file record.

    • The records of the file must not be read or written by sequential-access I/O statements while the file is opened for direct access.

    • All records of the file have the same length, which is specified in the RECL specifier of the OPEN statement.

    • Records do not have to be read or written in the order of their record numbers.

    • The records of the file must not be read or written using list-directed or namelist formatting.
    • The record number (a positive integer) uniquely identifies each record.

    If all of the records in the file are the same length and if the file is opened as direct access, a formatted sequential-access file can be accessed as a formatted direct-access file on UNICOS and UNICOS/mk systems.

    Unformatted sequential-access files can be accessed as unformatted direct-access files if all of the records are the same length and if the file is opened as direct access, but only if the sequential-access file was created with an unblocked file structure. The following assign commands create these file structures:

        assign -s unblocked ...
        assign -s u ...
        assign -F system ...

    For more information about the assign environment and about default file structures, see Chapter 6, page 55.

• File position: a file connected to a unit has a position property, which can be either an initial point or a terminal point. The initial point of a file is the position just before the first record, and the terminal point is the position just after the last record. If a file is positioned within a record, that record is considered to be the current record; otherwise, there is no current record.

  During an I/O data transfer statement, the file can be positioned within a record as each individual input/output list (iolist) item is processed. The use of a dollar sign ($) or a backslash (\) as a carriage control edit descriptor in a format may cause a file to be positioned within a record.

  In standard Fortran, the end-of-file (EOF) record is a special record in a sequential access file; it denotes the last record of a file. A file can be positioned after an EOF, but only CLOSE, BACKSPACE, or REWIND statements are then allowed on the file in standard Fortran. Other I/O operations are allowed after an EOF to provide multiple-file I/O if a file is assigned to certain devices or is assigned with a certain file structure.
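The direct-access properties described in this section can be sketched as follows. This is a minimal free-form illustration and is not taken from this manual; the unit number, the file name direct_demo.dat, and the record length are assumptions chosen for the example.

```fortran
program direct_access_demo
  implicit none
  integer, parameter :: lun = 21      ! hypothetical unit number
  integer, parameter :: reclen = 24   ! record length in characters (formatted file)
  integer :: ios, a, b

  ! Direct access requires an explicit OPEN, and RECL fixes the length
  ! of every record in the file.
  open (unit=lun, file='direct_demo.dat', access='direct', &
        form='formatted', recl=reclen, status='replace', iostat=ios)
  if (ios /= 0) stop 'open failed'

  ! Records do not have to be written in record-number order;
  ! the REC= specifier selects the record.
  write (lun, '(I10,1X,I10)', rec=3, iostat=ios) 300, 400
  if (ios /= 0) stop 'write of record 3 failed'
  write (lun, '(I10,1X,I10)', rec=1, iostat=ios) 100, 200
  if (ios /= 0) stop 'write of record 1 failed'

  ! Read record 3 directly, without reading records 1 and 2 first.
  read (lun, '(I10,1X,I10)', rec=3, iostat=ios) a, b
  if (ios /= 0) stop 'read of record 3 failed'
  print '(A,I0,1X,I0)', 'record 3: ', a, b

  close (lun, status='delete')
end program direct_access_demo
```

The format (I10,1X,I10) produces 21 characters, which fits within the assumed 24-character record length. Edit-directed formats are used because list-directed and namelist formatting are not permitted on direct-access files.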
2.2 Fortran Unit Identifiers

A Fortran unit identifier is required for Fortran READ or WRITE statements to uniquely identify the file. A unit identifier can be one of the following:

• An integer variable or expression whose value is greater than or equal to 0. Each integer unit identifier i is associated with the fort.i file, which may exist (except as noted in the following text). For example, unit 10 is associated with the fort.10 file in the current directory.

• An asterisk (*) is allowed only on READ and WRITE statements. It identifies a particular file that is connected for formatted, sequential access. On READ statements, an asterisk refers to unit 100 (standard input). On WRITE statements, an asterisk refers to unit 101 (standard output).

• A Hollerith (integer) variable consisting of 1 to 8 left-justified, blank-filled or zero-filled ASCII characters. Each Hollerith unit identifier is associated with the file of the same name, which may exist. For example, unit 'red'L is associated with the red file in the current working directory. The use of uppercase and lowercase characters is significant for file names. This extension is supported only on 64-bit systems.

Certain Fortran I/O statements have an implied unit number. The PRINT statement always refers to unit 101 (standard output), and the outmoded PUNCH statement always refers to unit 102 (standard error).

Fortran INQUIRE and CLOSE statements may refer to any valid or invalid unit number (if referring to an invalid unit number, no error is returned). All other Fortran I/O statements may refer only to valid unit numbers. For the purposes of an executing Fortran program, all unit numbers in use or available for use by that program are valid; that is, they exist. All unit numbers not available for use are not valid; that is, they do not exist.

Valid unit numbers are all nonnegative numbers except 100 through 102.
Unit numbers 0, 5, and 6 are associated with the standard error, standard input, and standard output files; any unit can also refer to a pipe. All other valid unit numbers are associated with the fort.i file, or with the file name implied in a Hollerith unit number.

Use the INQUIRE statement to check the validity (existence) of any unit number prior to using it, as in the following example:

    logical UNITOK, UNITOP
    ...
    inquire (unit=I,exist=UNITOK,opened=UNITOP)
    if (UNITOK .and. .not. UNITOP) then
       open (unit = I, ...)
    endif

All valid units are initially closed. A unit is connected to a file as the result of one of three methods of opening a file or a unit:

• An implicit open occurs when the first reference to a unit number is an I/O statement other than OPEN, CLOSE, INQUIRE, BACKSPACE, ENDFILE, or REWIND. The following example shows an implicit open:

      WRITE (4) I,J,K

  If unit number 4 is not open, the WRITE statement causes it to be connected to the associated file fort.4, unless overridden by an assign command that references unit 4. The BACKSPACE, ENDFILE, and REWIND statements do not perform an implicit OPEN. If the unit is not connected to a file, the requested operation is ignored.

• An explicit unnamed open occurs when the first reference to a unit number is an OPEN statement without a FILE specifier. The following example shows an explicit unnamed open:

      OPEN (7, FORM='UNFORMATTED')

  If unit number 7 is not open, the OPEN statement causes it to be connected to the associated file fort.7, unless an assign(1) command that references unit 7 overrides the default file name.

• An explicit named open occurs when the first reference to a unit number is an OPEN statement with a FILE specifier. The following is an example:

      OPEN (9, FILE='blue')

  If unit number 9 is not open, the OPEN statement causes it to be connected to file blue, unless overridden by an assign command that references the file named blue.
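The three methods of opening a unit can be compared in one short program. The following is a minimal free-form sketch, not taken from this manual; the report subroutine is a hypothetical helper, and the file names that INQUIRE reports for the unnamed cases depend on the compiler and run-time library (this manual gives fort.4 and fort.7 for UNICOS and UNICOS/mk systems).

```fortran
program open_methods_demo
  implicit none
  character(len=256) :: fname
  logical :: is_open, has_name

  ! Implicit open: the first reference to unit 4 is a WRITE statement.
  write (4, '(A)') 'implicit open'
  call report(4)

  ! Explicit unnamed open: OPEN without a FILE specifier.
  open (7, form='unformatted')
  call report(7)

  ! Explicit named open: OPEN with a FILE specifier.
  open (9, file='blue')
  call report(9)

  ! Remove the files created by this sketch.
  close (4, status='delete')
  close (7, status='delete')
  close (9, status='delete')

contains

  ! Hypothetical helper: print the connection state of one unit.
  subroutine report(lun)
    integer, intent(in) :: lun
    inquire (unit=lun, opened=is_open, named=has_name, name=fname)
    if (is_open .and. has_name) then
      print '(A,I0,A,A)', 'unit ', lun, ' is connected to ', trim(fname)
    else if (is_open) then
      print '(A,I0,A)', 'unit ', lun, ' is connected to an unnamed file'
    else
      print '(A,I0,A)', 'unit ', lun, ' is not connected'
    end if
  end subroutine report
end program open_methods_demo
```

An assign command that references units 4 or 7, or the file blue, would change the outcome as described in the list above; the sketch assumes that no such assign environment is in effect.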
Unit numbers 100, 101, and 102 are permanently associated with the standard input, standard output, and standard error files, respectively. These files can be referenced on READ and WRITE statements. A CLOSE statement on these unit numbers has no effect. An INQUIRE statement on these unit numbers indicates they are nonexistent (not valid).

These unit numbers exist to allow guaranteed access to the standard input, standard output, and standard error files without regard to any unit actions taken by an executing program. Thus, a READ or WRITE I/O statement with an asterisk unit identifier (which is equivalent to unit 100 on READ statements and unit 101 on WRITE statements) or a PRINT statement always works. Nonstandard I/O operations such as BUFFER IN and BUFFER OUT, READMS, and WRITMS on these units are not supported.

Fortran applications or library subroutines that must access the standard input, standard output, and standard error files can be certain of access by using unit numbers 100 through 102, even if the user program closes or reuses unit numbers 0, 5, and 6.

For all unit numbers associated with the standard input, standard output, and standard error files, the access mode and form must be sequential and formatted. The standard input file is read only, and the standard output and standard error files are write only. REWIND and BACKSPACE statements are permitted on terminal files but have no effect. ENDFILE statements are permitted on terminal files unless they are read only. The ENDFILE statement writes a logical end-of-file record.

The REWIND statement is not valid for any unit numbers associated with pipes. The BACKSPACE statement is not valid if the device on which the file exists does not support repositioning. BACKSPACE after a logical end-of-file record does not require repositioning because the end-of-file record is only a logical representation of an end-of-file record.

2.3 Data Transfer Statements

The READ statement is the data transfer input statement.
The WRITE and PRINT statements are the data transfer output statements. If the data transfer statement contains a format specifier, the data transfer statement is a formatted I/O statement. If the data transfer statement does not contain a format specifier, the data transfer statement is an unformatted I/O statement.

The time required to convert input or output data to the proper form adds to the execution time for formatted I/O statements. Unformatted I/O maintains binary representations of the data. Very little CPU time is required for unformatted I/O compared to formatted I/O.

2.3.1 Formatted I/O

In formatted I/O, data is transferred with editing. Formatted I/O can be edit-directed, list-directed, or namelist I/O. If the format identifier is an asterisk, the I/O statement is a list-directed I/O statement. All other format identifiers indicate edit-directed I/O.

Formatted I/O should be avoided when I/O performance is important. Unformatted I/O is faster and it avoids potential inaccuracies due to conversion. However, there are occasions when formatted I/O is necessary. The advantages of formatted I/O are as follows:

• Formatted data can be interpreted by humans.

• Formatted data can be readily used by programs and utilities not written in Fortran, or otherwise unable to process Fortran unformatted files.

• Formatted data can be readily exchanged with other computer systems where the structure of Fortran unformatted files may be different.

See the Fortran Language Reference manuals for more information about formatted I/O statements.

2.3.1.1 Edit-Directed I/O

The format used in an edit-directed I/O statement provides information that directs the editing between internal representation and the character strings of a record (or sequence of records) in the file.
An example of a sequential access, edit-directed WRITE statement follows:

C     Sequential edit-directed WRITE statement
C
      WRITE (10,10,ERR=101,IOSTAT=IOS) 100,200
   10 FORMAT (TR2,I10,1X,I10)

An example of a sequential access, edit-directed READ statement follows:

C     Sequential edit-directed READ statement
C
      READ (10,11,END=99,ERR=102,IOSTAT=IOS) IVAR
   11 FORMAT (BN,TR2,I10:1X,I10)

An example of a direct access, edit-directed I/O statement follows:

      OPEN (11,ACCESS='DIRECT',FORM='FORMATTED',
     +      RECL=24)
C
C     Direct edit-directed READ and WRITE statements
C
      WRITE (11,10,REC=3,ERR=103,IOSTAT=IOS) 300,400
      READ (11,11,REC=3,ERR=104,IOSTAT=IOS) IVAR

There are four general optimization techniques that you can use to improve the efficiency of edit-directed formatted I/O.

Procedure 1: Optimization technique: using single statements

Read or write as much data as possible with a single READ, WRITE, or PRINT statement. The following is an example of an inefficient way to code a WRITE statement:

      DO J=1,M
         DO I=1,N
            WRITE (42, 100) X(I,J)
  100       FORMAT (E25.15)
         ENDDO
      ENDDO

It is better to write the entire array with a single WRITE statement, as is done in the following two examples:

      WRITE (42, 100) ((X(I,J),I=1,N),J=1,M)
  100 FORMAT (E25.15)

or

      WRITE (42, 100) X
  100 FORMAT (E25.15)

Each of these three code fragments produces exactly the same output, although the latter two are about twice as fast as the first. Note that the format can be used to control how much data is written per record. Also, the last two cases are equivalent only if the implied DO loops write out the entire array, in order and without omitting any items.

Procedure 2: Optimization technique: using longer records

Use longer records if possible. Because a certain amount of processing is necessary to read or write each record, it is better to write a few longer records than many shorter records.
For example, changing the statement from Example 1 to Example 2 causes the resulting file to have one fifth as many records and, more importantly, causes the program to execute faster:

Example 1: (Not recommended)

      WRITE (42, 100) X
  100 FORMAT (E25.15)

Example 2: (Recommended)

      WRITE (42,101) X
  101 FORMAT (5E25.15)

You must make sure that the resultant file does not contain records that are too long for the intended application. Certain text editors and utilities, for example, cannot process lines that are longer than a predetermined limit. Generally, lines of 128 characters or fewer are safe to use in most applications.

Procedure 3: Optimization technique: using repeated edit descriptors

Use repeated edit descriptors whenever possible. Instead of using the format in Example 1, use the format in Example 2 for integers that fit in four digits (that is, less than 10,000 and greater than -1,000).

Example 1: (Not recommended)

  200 FORMAT (16(X,I4))

Example 2: (Recommended)

  201 FORMAT (16(I5))

Procedure 4: Optimization technique: using data edit descriptors

Character data should be read and written using data edit descriptors that are the same width as the character data. For CHARACTER*n variables, the optimal data edit descriptor is A (or An). For Hollerith data in INTEGER variables, the optimal data edit descriptor is A8 (or R8).

2.3.1.2 List-Directed I/O

If the format specifier is an asterisk, list-directed formatting is specified. The REC= specifier must not be present in the I/O statement.

In list-directed I/O, the I/O records consist of a sequence of values separated by value separators such as commas or spaces. A tab is treated as a space in list-directed input, except when it occurs in a character constant that is delimited by apostrophes or quotation marks.
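The role of value separators can be illustrated with a short, self-contained sketch. It is written in free-form standard Fortran rather than the fixed-form style used elsewhere in this guide, and it reads from an internal file (a character variable) so that it needs no external unit. The program name, the variable names, and the record contents are illustrative assumptions and do not come from this guide.

```fortran
program list_directed_read
  implicit none
  character(len=24) :: record
  integer :: a, b, c, ios

  ! One record holding three values: the first separator is a comma,
  ! the second is a run of blanks.
  record = '10, 20   30'

  ! An asterisk as the format selects list-directed formatting.
  read (record, *, iostat=ios) a, b, c

  if (ios /= 0) then
     print *, 'list-directed READ failed, IOSTAT =', ios
  else
     print *, 'sum of the three values:', a + b + c
  end if
end program list_directed_read
```

Both separator styles deliver the same values to the input list, which is why list-directed input is convenient for hand-edited data files.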
List-directed and namelist output of real values uses either an F or an E format with a number of decimal digits of precision that assures full-precision printing of the real values. This allows formatted, list-directed, or namelist input of real values to result later in the generation of bit-identical binary floating-point representation. Thus, a value may be written and then reread without changing the stored value.

The LISTIO_PRECISION and LISTIO_OUTPUT_STYLE environment variables can be used to control list-directed output, as discussed in the following paragraphs.

You can set the LISTIO_PRECISION environment variable to control the number of digits of precision printed by list-directed or namelist output. The following values can be assigned to LISTIO_PRECISION:

FULL
    Prints full precision (this is the default value).

PRECISION
    Prints x or x+1 decimal digits, where x is a value of the Fortran 95 PRECISION() intrinsic function for a given real value. This is a smaller number of digits that usually ensures that the last decimal digit is accurate to within 1 unit.

YMP80
    Causes list-directed and namelist output of real values to be of the format used in Cray's UNICOS 8.0 release and previous library versions on UNICOS systems.

LISTIO_OUTPUT_STYLE provides a compatibility mode for the CrayLibs 2.0 release and later versions. When set to OLD, this environment variable causes three effects:

• Repeated list-directed output values closely resemble those printed by the CrayLibs 1.2 and prior releases. In these prior releases, the repeat counts never spanned vector array extents passed to the library from the compiler. In the current version of CrayLibs, the libraries coalesce repeat counts as much as possible to compress output and to ensure that compiler optimization does not affect the format of list-directed output. To suppress repeat counts in list-directed output, set the assign -y option to on.
• Value separators are not printed between adjacent, nondelimited character values and noncharacter values printed by list-directed output in Fortran 95 files. In CrayLibs 2.0, the libraries produce one blank character as a value separator to comply with the ANSI Fortran 95 standard. No value separator is printed between adjacent, nondelimited character values and noncharacter values in FORTRAN 77 files because the ANSI FORTRAN 77 standard requires that none be printed.

• A blank character will not be printed in column 1 when a list-directed statement with no I/O list items is executed. In the CrayLibs 2.0 release, the libraries started printing a blank character in column 1 to comply with the ANSI FORTRAN 77 and ANSI Fortran 95 standards.

An example of a list-directed WRITE statement follows:

C     Sequential list-directed WRITE statement
      WRITE (10,*,ERR=101,IOSTAT=IOS) 100,200

An example of a list-directed READ statement follows:

C     Sequential list-directed READ statement
      READ (10,*,END=99,ERR=102,IOSTAT=IOS) IVAR

2.3.1.2.1 Namelist I/O

Namelist I/O is similar to list-directed I/O, but it allows you to group variables by specifying a namelist group name. On input, any namelist item within that list may appear in the input record with a value to be assigned. On output, the entire namelist is written.

The namelist item name is used in the namelist input record to indicate the namelist item to be initialized or updated. During list-directed input, the input records must contain a value or placeholder for all items in the input list. Namelist does not require that a value be present for each namelist item in the namelist group.

You can specify a namelist group name in READ, WRITE, and PRINT statements.
The following is an example of namelist I/O:

      NAMELIST/GRP/T,I
      READ(5,GRP)
      WRITE(6,GRP)

2.3.2 Unformatted I/O

During unformatted I/O, binary data is transferred without editing between the current record and the entities specified by the I/O list. Exactly one record is read or written. The unit must be an external unit.

The following is an example of a sequential access, unformatted I/O WRITE statement:

C     Sequential unformatted WRITE statement
      WRITE (10,ERR=101,IOSTAT=IOS) 100,200

The following is an example of a sequential access, unformatted I/O READ statement:

C     Sequential unformatted READ statement
      READ (10,END=99,ERR=102,IOSTAT=IOS) IVAR

The following is an example of a direct access, unformatted I/O statement:

      OPEN (11,ACCESS='DIRECT',FORM='UNFORMATTED',RECL=24)
C     Direct unformatted READ and WRITE statements
      WRITE (11,REC=3,ERR=103,IOSTAT=IOS) 300,400
      READ (11,REC=3,ERR=103,IOSTAT=IOS) IVAR

2.4 Auxiliary I/O

The auxiliary I/O statements consist of the OPEN, CLOSE, INQUIRE, BACKSPACE, REWIND, and ENDFILE statements. These types of statements specify file connections, describe files, or position files. See the Fortran Language Reference manual for your compiler system for more details about auxiliary I/O statements.

2.4.1 File Connection Statements

The OPEN and CLOSE statements specify an external file and how to access the file.

An OPEN statement connects an existing file to a unit, creates a file that is preconnected, creates a file and connects it to a unit, or changes certain specifiers of a connection between a file and a unit.

The following are examples of the OPEN statement:

      OPEN (11,ACCESS='DIRECT',FORM='FORMATTED',RECL=24)
      OPEN (10,ACCESS='SEQUENTIAL', FORM='UNFORMATTED')
      OPEN (9,BLANK='NULL')

The CLOSE statement terminates the connection of a particular file to a unit. A unit that does not exist or has no file connected to it may appear within a CLOSE statement; this would not affect any files.
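As a minimal sketch of how the OPEN and CLOSE statements fit together with a direct access, unformatted transfer, the following free-form standard Fortran program connects a unit, writes one record, reads it back, and closes the unit. Several details are assumptions made for illustration and are not taken from this guide: the file is a scratch file, the program and variable names are invented, and the record length comes from INQUIRE with the IOLENGTH= specifier (a standard Fortran 90 feature) rather than a fixed RECL=24, because the unit in which RECL is measured differs between compilers.

```fortran
program direct_unformatted
  implicit none
  integer :: pair(2), back(2), reclen, ios

  pair = (/ 300, 400 /)
  back = 0

  ! Let the runtime report the record length this I/O list needs;
  ! the unit of RECL (bytes or words) depends on the compiler.
  inquire (iolength=reclen) pair

  ! Connect unit 11 to a scratch file for direct, unformatted access.
  open (11, status='scratch', access='direct', form='unformatted', &
        recl=reclen, iostat=ios)
  if (ios /= 0) then
     print *, 'OPEN failed, IOSTAT =', ios
     stop
  end if

  ! Write record 3, then read the same record back.
  write (11, rec=3, iostat=ios) pair
  if (ios == 0) read (11, rec=3, iostat=ios) back

  if (ios /= 0) then
     print *, 'direct-access transfer failed, IOSTAT =', ios
  else
     print *, 'record 3 holds', back
  end if

  ! CLOSE ends the connection; a scratch file is deleted at this point.
  close (11)
end program direct_unformatted
```

Checking IOSTAT after each statement is used here in place of the ERR= labels shown in the fragments above; either approach keeps an I/O failure from terminating the program.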
2.4.2 The INQUIRE Statement

The INQUIRE statement describes the connection to an external file. This statement can be executed before, during, or after a file is connected to a unit. All values that the INQUIRE statement assigns are current at the time that the statement is executed.

You can use the INQUIRE statement to check the properties of a specific file or check the connection to a particular unit. The two forms of the INQUIRE statement are INQUIRE by file and INQUIRE by unit.

The INQUIRE by file statement retrieves information about the properties of a particular file.

The INQUIRE by unit statement retrieves the name of a file connected to a specified unit if the file is a named file.

The standard input, standard output, and standard error files are unnamed files. An INQUIRE on a unit connected to any of these files indicates that the file is unnamed.

An INQUIRE by unit on any unit connected by using an explicitly named OPEN statement indicates that the file is named, and returns the name that was present in the FILE= specifier in the OPEN statement.

An INQUIRE by unit on any unit connected by using an explicitly unnamed OPEN statement, or an implicit open, may indicate that the file is named. A name is returned only if the I/O library can ensure that a subsequent OPEN statement with a FILE= name will connect to the same file.

2.4.3 File Positioning Statements

The BACKSPACE and REWIND statements change the position of the external file. The ENDFILE statement writes the last record of the external file.

You cannot use file positioning statements on a file that is connected as a direct access file. The REC= record specifier is used for positioning in a READ or WRITE statement on a direct access file.

The BACKSPACE statement causes the file connected to the specified unit to be positioned to the preceding record.
The following are examples of the BACKSPACE statement:

      BACKSPACE 10
      BACKSPACE (11, IOSTAT=ios, ERR=100)
      BACKSPACE (12, ERR=100)
      BACKSPACE (13, IOSTAT=ios)

The ENDFILE statement writes an end-of-file record as the next record of the file. The following are examples of the ENDFILE statement:

      ENDFILE 10
      ENDFILE (11, IOSTAT=ios, ERR=100)
      ENDFILE (12, ERR=100)
      ENDFILE (13, IOSTAT=ios)

The REWIND statement positions the file at its initial point. The following are examples of the REWIND statement:

      REWIND 10
      REWIND (11, IOSTAT=ios, ERR=100)
      REWIND (12, ERR=100)
      REWIND (13, IOSTAT=ios)
      REWIND (14)

2.5 Private I/O on Cray T3E systems

Private I/O consists of the READ, WRITE, OPEN, CLOSE, REWIND, ENDFILE, BACKSPACE, and INQUIRE statements. A private READ or WRITE statement is executed by the processing element (PE) that encounters it with no communication or coordination with other PEs.

At program start, unit numbers 0, 5, 6, and 100 through 102 are associated with stdin, stdout, and stderr. If stdin or stdout is not associated with a terminal, it is buffered. Results are unpredictable if more than one PE tries to read from units 5 or 100, or tries to write to units 6 or 101.

2.6 Multithreading and Standard Fortran I/O

Multithreading is the concurrent use of multiple threads of control that operate within the same address space. On UNICOS systems, multithreading is available through macrotasking, Autotasking, and the Pthreads interface. On UNICOS/mk systems, multithreading is available through the Pthreads interface.

Standard Fortran I/O is thread-safe on UNICOS systems. Standard Fortran I/O is not thread-safe on UNICOS/mk systems.

On UNICOS systems, the runtime I/O library performs all the needed locking to permit multiple threads to concurrently execute Fortran I/O statements.
The result is proper execution of all Fortran I/O statements and the sequential execution of I/O statements issued across multiple threads to files opened for sequential access.

On UNICOS/mk systems (where Fortran I/O is not thread-safe), threaded programs must use locks or other synchronization around Fortran I/O statements to prevent concurrent execution of I/O statements on multiple threads. Failure to do so causes unpredictable results.

Fortran I/O Extensions [3]

This chapter describes additional I/O routines and statements available with the CF90 compiler. These additional routines, known as Fortran extensions, perform unformatted I/O.

For details about the routines discussed in this chapter, see the compiler reference manuals.

3.1 BUFFER IN/BUFFER OUT Routines

BUFFER IN and BUFFER OUT statements initiate a data transfer between the specified file or unit at the current record and the specified area of program memory. To allow maximum asynchronous performance, all BUFFER IN and BUFFER OUT operations should begin and end on a sector boundary. See Chapter 9, page 87, for more information about sector sizes.

The BUFFER IN and BUFFER OUT statements can perform sequential asynchronous unformatted I/O if the files are assigned as unbuffered files. You must declare the BUFFER IN and BUFFER OUT files as unbuffered by using one of the following assign(1) commands:

      assign -s u ...
      assign -F system ...

If the files are not declared as unbuffered, the BUFFER IN and BUFFER OUT statements may execute synchronously.

For tapes, BUFFER IN and BUFFER OUT operate synchronously; when you execute a BUFFER statement, the data is placed in the buffer before you execute the next statement in the program. Therefore, for tapes, BUFFER IN has no advantage over a READ statement or a CALL READ statement; however, the library code is doing asynchronous read-aheads to fill its own buffer.
The COS blocked format is the default file structure on UNICOS and UNICOS/mk systems for files (not tapes) that are opened explicitly as unformatted sequential or implicitly by a BUFFER IN or BUFFER OUT statement.

The BUFFER IN and BUFFER OUT statements decrease the overhead associated with transferring data through library and system buffers. These statements also offer the advantages of asynchronous I/O. I/O operations for several files can execute concurrently and can also execute concurrently with CPU instructions. This can decrease overall wall-clock time.

In order for this to occur, the program must ensure that the requested asynchronous data movement was completed before accessing the data. The program must also be able to do a significant amount of CPU-intensive work or other I/O during asynchronous I/O to increase the program speed.

Buffer I/O processing waits until any previous buffer I/O operation on the file completes before beginning another buffer I/O operation.

Use the UNIT(3F) and LENGTH(3F) functions with BUFFER IN and BUFFER OUT statements to delay further program execution until the buffer I/O statement completes.

For details about the routines discussed in this section, see the individual man pages for each routine.

3.1.1 The UNIT Intrinsic

The UNIT intrinsic routine waits for the completion of the BUFFER IN or BUFFER OUT statement. A program that uses asynchronous BUFFER IN and BUFFER OUT must ensure that the data movement completes before trying to access the data. The UNIT routine can be called when the program wants to delay further program execution until the data transfer is complete. When the buffer I/O operation is complete, UNIT returns a status indicating the outcome of the buffer I/O operation.

The following is an example of the UNIT routine:

      STATUS=UNIT(90)

3.1.2 The LENGTH Intrinsic

The LENGTH intrinsic routine returns the length of transfer for a BUFFER IN or a BUFFER OUT statement.
If the LENGTH routine is called during a BUFFER IN or BUFFER OUT operation, the execution sequence is delayed until the transfer is complete. LENGTH then returns the number of words successfully transferred. A 0 is returned for an end-of-file (EOF).

The following is an example of the LENGTH routine:

      LENG=LENGTH(90)

3.1.3 Positioning

The GETPOS(3F) and SETPOS(3F) file positioning routines change or indicate the position of the current file. The GETPOS routine returns the current position of a file being read. The SETPOS routine positions a tape or mass storage file to a previous position obtained through a call to GETPOS.

You can use the GETPOS and SETPOS positioning statements on buffer I/O files. These routines can be called for random positioning for BUFFER IN and BUFFER OUT processing. These routines can be used with COS blocked files on disk, but not with COS blocked files on tape.

You can also use these routines with the standard Fortran READ and WRITE statements. The direct-access mode of standard Fortran is an alternative to the GETPOS and SETPOS functionality.

3.2 Random Access I/O Routines

The record-addressable random-access file I/O routines let you generate variable-length, individually addressable records. The I/O library updates indexes and pointers.

Each record in a random-access file has a 1-word (64-bit) key or number indicating its position in an index table of records for the file. This index table contains a pointer to the location of the record on the device and can also contain a name of each record within the file.

Alphanumeric record keys increase CPU time compared to sequential integer record keys because the I/O routines must perform a sequential lookup in the index array for each alphanumeric key. Each record should be named a numeric value n; n is the integer that corresponds to the nth record created on the file.
The following two sets of record-addressable random-access file I/O routines are available:

• The Mass Storage (MS) package provides routines that perform buffered, record-addressable file I/O with variable-length records. It contains the OPENMS, READMS, WRITMS, CLOSMS, WAITMS, FINDMS, SYNCMS, ASYNCMS, CHECKMS, and STINDX routines.

• The Direct Random (DR) package provides routines that perform unbuffered, record-addressable file I/O. It contains the OPENDR, READDR, WRITDR, CLOSDR, WAITDR, SYNCDR, ASYNCDR, CHECKDR, and STINDR routines. The amount of data transferred for a record is rounded up to a multiple of 512 words, because this improves I/O performance on many disk devices.

Both synchronous and asynchronous MS and DR I/O can be performed on a random-access file. You can use these routines in the same program, but they must not be used on the same file simultaneously. The MS and DR packages cannot be used for tape files.

If a program uses asynchronous I/O, it must ensure that the data movement is completed before trying to access the data. Because asynchronous I/O has a larger overhead in CPU time than synchronous I/O, only very large data transfers should be done with asynchronous I/O. To increase program speed, the program must be able to do a significant amount of CPU-intensive work or other I/O while the asynchronous I/O is executing.

The MS library routines are used to perform buffered record-addressable random-access I/O. The DR library routines are used to perform unbuffered record-addressable random-access I/O. These library routines are not internally locked to ensure single-threading; a program must lock each call to the routine if the routine is called from more than one task.

The following list describes these two packages in more detail. For details about the routines discussed in this section, see the individual man pages for each routine.
• OPENMS(3F) and OPENDR(3F) open a file and specify the file as a random-access file that can be accessed by record-addressable random-access I/O routines. These routines must be used to open a file before the file can be accessed by other MS or DR package routines. OPENMS sets up an I/O buffer for the random-access file. These routines read the index array for the file into the array provided as an argument to the routine. CLOSMS or CLOSDR must close any files opened by the OPENMS or OPENDR routine. The following are examples of these two routines:

      CALL OPENMS(80,intarr,len,it,ierr)
      CALL OPENDR(20,inderr,len,itflg,ierr)

• READMS(3F) performs a read of a record into memory from a random-access file. READDR reads a record from a random-access file into memory. If READDR is used in asynchronous mode and the record size is not a multiple of 512 words, user data can be overwritten and not restored. You can use SYNCDR to switch to a synchronous read; the data is copied and restored after the read has completed. The following are examples of these routines:

      CALL READMS(80,ibuf,nwrd,irec,ierr)
      CALL READDR(20,iloc,nwrd,irec,ierr)

• WRITMS(3F) writes to a random-access file on disk from memory. WRITDR writes data from user memory to a record in a random-access file on disk. Both routines update the current index. The following are examples of these routines:

      CALL WRITMS(20,ibuf,nwrd,irec,irflg,isflag,ierr)
      CALL WRITDR(20,ibuf,nwrd,irec,irflag,isflg,ierr)

• The CLOSMS(3F) and CLOSDR routines write the master index specified in the call to OPENMS or OPENDR from the array provided in the user program to the random-access file and then close the file. These routines also write statistics about the file to the stderr file. The following are examples of these routines:

      CALL CLOSMS(20,ierr)
      CALL CLOSDR(20,ierr)

• ASYNCMS(3F) and ASYNCDR set the I/O mode for the random-access routines to asynchronous.
I/O operations can be initiated and subsequently proceed simultaneously with the actual data transfer. If the program uses READMS, precede asynchronous reads with calls to FINDMS. The following are examples of these routines:

      CALL ASYNCMS(20,ierr)
      CALL ASYNCDR(20,ierr)

• CHECKMS(3F) and CHECKDR check the status of the asynchronous random-access I/O operation. The following are examples of these routines:

      CALL CHECKMS(20,istat,ierr)
      CALL CHECKDR(20,istat,ierr)

• WAITMS(3F) and WAITDR wait for the completion of an asynchronous I/O operation. They return a status flag indicating if the I/O on the specified file completed without error. The following are examples of these routines:

      CALL WAITMS(20,istat,ierr)
      CALL WAITDR(20,istat,ierr)

• SYNCMS(3F) and SYNCDR set the I/O mode for the random-access routines to synchronous. All future I/O operations wait for completion. The following are examples of these routines:

      CALL SYNCMS(20,ierr)
      CALL SYNCDR(20,ierr)

• STINDX(3F) and STINDR allow an index to be used as the current index by creating a subindex. These routines reduce the amount of memory needed by a file that contains a large number of records. They also maintain a file containing records logically related to each other. Records in the file, rather than records in the master index area, hold secondary pointers to records in the file. These routines allow more than one index to manipulate the file. Generally, STINDX or STINDR toggle the index between the master index maintained by OPENMS/OPENDR and CLOSMS/CLOSDR and the subindex supplied by the Fortran program. The following are examples of these routines:

      CALL STINDX(20,inderr,len,itflg,ierr)
      CALL STINDR(20,inderr,len,itflg,ierr)

• FINDMS(3F) asynchronously reads the desired record into the data buffers for the specified file. The next READMS or WRITMS call waits for the read to complete and transfers data appropriately.
An example of a call to FINDMS follows:

      CALL FINDMS(20,inwrd,irec,ierr)

The following program example uses the MS package:

Example 1: MS package use

      program msio
      dimension r(512)
      dimension idx(512)
      data r/512*2.0/
      irflag=0
      call openms(1,idx,100,0,ier)
C     Write 100 records of 512 words each
      do 100 i=1,100
         call writms(1,r,512,i,irflag,0,ier)
         if(ier.ne.0)then
            print *,"error on writms=",ier
            goto 300
         end if
  100 continue
C     Read the same 100 records back
      do 200 i=1,100
         call readms(1,r,512,i,irflag,0,ier)
         if(ier.ne.0)then
            print *,"error on readms=",ier
            goto 300
         end if
  200 continue
  300 continue
      call closms(1,ier)
      end

The following program uses the DR package:

Example 2: DR package use

      program daio
      dimension r(512)
      dimension idx(512)
      data r/512*2.0/
      irflag=0
      ierrs=0
      call assign('assign -R',ier1)
      call asnunit(1,'-F mr.save.ovf1:10:200:20',ier2)
      if(ier1.ne.0.or.ier2.ne.0)then
         print *,"assign error=",ier1,ier2
         ierrs=ierrs+1
      end if
      call opendr(1,idx,100,0,ier)
      if(ier.ne.0)then
         print *,"error on opendr=",ier
         ierrs=ierrs+1
      end if
C     Write 100 records of 512 words each
      do 100 i=1,100
         call writdr(1,r,512,i,irflag,0,ier)
         if(ier.ne.0)then
            print *,"error on writdr=",ier
            ierrs=ierrs+1
         end if
  100 continue
C     Read the same 100 records back
      do 200 i=1,100
         call readdr(1,r,512,i,irflag,0,ier)
         if(ier.ne.0)then
            print *,"error on readdr=",ier
            ierrs=ierrs+1
         end if
  200 continue
  300 call closdr(1,ier)
      if(ier.ne.0)then
         print *,"error on closdr=",ier
         ierrs=ierrs+1
      end if
  400 continue
      if(ierrs.eq.0)then
         print *,"daio passed"
      else
         print *,"daio failed"
      end if
      end

3.3 Word-Addressable I/O Routines

A word-addressable (WA) random-access file consists of an adjustable number of contiguous words. The WA package performs unformatted, buffered I/O; the WA routines perform efficiently when the I/O buffers are set to a size large enough to hold several records that are frequently read or written. When a WA read operation is executed, the I/O buffers are searched to see if the data that will be read is already in the buffers.
If the data is found in the I/O buffers, I/O speedup is achieved, since a system call is not needed to retrieve the data.

A program using the package can access a word or a contiguous sequence of words from a WA random-access file. The WA package cannot be used for tape files.

Although the WA I/O routines provide greater control over I/O operations than the record-addressable routines, they require that the user track information that the system would usually maintain when other forms of I/O are used. The program must keep track of the word position of each record in a file that it will read or write with WA I/O. This is easiest to do with fixed-length records; with variable-length records, the program must store record lengths for the file so they can be retrieved when the file is accessed. When variable-length records are used, the program should use record-addressable I/O.

The WA package allows both synchronous and asynchronous I/O. To increase program speed, the program must be able to do a significant amount of CPU-intensive work or other I/O while the asynchronous I/O is executing.

These library routines are not internally locked to ensure single-threading; a program must lock each call to the routine if the routine is called from more than one task.

The following list briefly describes the routines in this package; for details about the routines discussed in this section, see the individual man pages.

• WOPEN(3F) opens a file and specifies it as a word-addressable, random-access file. WOPEN must be called before any other WA routines are called, because it creates the I/O buffer for the file by using blocks. By using WOPEN, you can combine synchronous and asynchronous I/O to a file while the file is open. The following is an example of a call to WOPEN:

      CALL WOPEN(30,iblks,istat,err)

• GETWA(3F) synchronously reads data from a buffered, word-addressable, random-access file.
SEEK(3F) is used with GETWA to provide more efficient I/O; the SEEK routine performs an asynchronous pre-fetch of data into a buffer. The following is an example of a call to GETWA:

      CALL GETWA(30,iloc,iadr,icnt,ierr)

• SEEK(3F) asynchronously reads data from the word-addressable file into a buffer. A subsequent GETWA call will deliver the data from the buffer to the user data area. This provides a way for the user to do asynchronous read-ahead. The following is an example of a call to SEEK:

      CALL SEEK(30,iloc,iadr,icnt,ierr)

• PUTWA(3F) synchronously writes from memory to a word-addressable, random-access file. The following is an example of a call to PUTWA:

      CALL PUTWA(30,iloc,iadr,icnt,ierr)

• APUTWA(3F) asynchronously writes from memory to a word-addressable, random-access file. The following is an example of a call to APUTWA:

      CALL APUTWA(30,iloc,iadr,icnt,ierr)

• WCLOSE(3F) finalizes changes and additions to a WA file and closes it. The following is an example of a call to WCLOSE:

      CALL WCLOSE(30,ierr)

The following is an example of a program that uses the WA I/O routines:

Example 3: WA package use

      program waio
      dimension r(512), r1(512)
      iblks=10          !use a 10 block buffer
      istats=1          !print out I/O Stats
      call wopen(1,iblks,0,ier)
      if(ier.ne.0)then
         print *,"error on wopen=",ier
         goto 300
      end if
      iaddr=1
      do 100 k=1,100
         do 10 j=1,512
   10    r(j)=j+k
         call putwa(1,r,iaddr,512,ier)
         if(ier.ne.0)then
            print *,"error on putwa=",ier," rec=",k
            goto 300
         end if
         iaddr=iaddr+512
  100 continue
      iaddr=1
      do 200 k=1,100
         call getwa(1,r1,iaddr,512,ier)
         if(ier.ne.0)then
            print *, "error on getwa=",ier," rec=",k
            goto 300
         end if
         iaddr=iaddr+512
  200 continue
  300 continue
      call wclose(1)
      end

3.4 Asynchronous Queued I/O (AQIO) Routines

The asynchronous queued I/O (AQIO) routines perform asynchronous, queued I/O operations.
Asynchronous I/O allows your program to continue executing while an I/O operation is in progress, and it allows several I/O requests to be active concurrently. AQIO further refines asynchronous I/O by allowing a program to queue several I/O requests and to issue one request to the operating system to perform all I/O operations. When queuing I/O requests, the overhead associated with calling the operating system is incurred only once per group of I/O requests rather than once per request, as with other forms of I/O.

AQIO also offers options for streamlining I/O operations that involve fixed-length records with a fixed-skip increment through the user file and a fixed-skip increment through program memory. A form of this is a read or write that involves contiguous fixed-length records. Such an operation is called a compound AQIO request or a compound AQIO operation. AQIO provides separate calls for compound operations so that a program can specify multiple I/O operations in one call, thus saving I/O time.

Asynchronous I/O has a larger overhead in system CPU time than synchronous I/O; therefore, only large data transfers should be done using asynchronous I/O. To speed up the program, it must be able to do a significant amount of CPU-intensive work or other I/O while the asynchronous I/O is executing.

The value of the queue argument on the AQWRITE/AQWRITEC(3F) or AQREAD/AQREADC(3F) call controls when the operating system is called to process the request. If queue is nonzero, packets are queued in the AQIO buffer and the operating system is not called to start packet processing until the buffer is full. For example, to queue 20 packets, the program would issue 19 AQWRITE calls with queue set to a nonzero value and then set it to 0 on the twentieth call.

When a program opens a file using AQOPEN on a Cray T3E system, a file handle is returned.
The library associates this handle with information in the processing element's (PE's) local memory; therefore, the file handle should not be used by other PEs. More than one PE can open a file with AQOPEN, but the user must coordinate access by using synchronization routines such as AQWAIT.

The following list briefly describes the AQIO routines; for details about the routines discussed in this section, see the individual man pages for each routine.

• AQOPEN(3F) opens a file for AQIO. The AQOPEN call must precede all other AQIO requests in a Fortran program.

• AQCLOSE(3F) closes an AQIO file.

• AQREAD queues a simple asynchronous read request.

• AQREADC(3F) lets you use a compound AQIO request call to transfer fixed-length records repeatedly. You must provide values for the repeat count, memory skip increment, and disk increment arguments. AQREADC transfers the first record from disk and then increments the starting disk block and the starting user memory address by the amounts you specify. To transfer data to a contiguous array in memory, set the memory skip increment value to the record length in words. To transfer data sequentially from disk, set the disk increment value to the record length in blocks. See Example 4 for an example of a program using AQIO read routines.

• AQWRITE queues a simple asynchronous write request.

• AQWRITEC provides a compound AQIO request call for repeatedly transferring fixed-length records. The program supplies the repetition count, the disk skip increment, and the memory skip increment on these compound AQIO calls. AQIO then transfers the first record to disk and increments the starting disk block and the starting user memory address. To transfer data from a contiguous array in memory, set the memory skip increment value to the record length in words. To transfer data sequentially to disk, set the disk increment value to the record length in blocks.
• AQSTAT checks the status of AQIO requests.

• AQWAIT forces the program to wait until all queued entries are completed.

After queuing an AQWRITE or AQREAD request and calling the operating system, you may need to monitor the completion status of the request to know when it is safe to use the data or to reuse the buffer area. AQSTAT returns information about an individual AQIO request. The reqid argument of AQREAD/AQREADC and AQWRITE/AQWRITEC is stored in the packet buffer and can be used in an AQSTAT call to monitor the completion status of a particular transfer.

The aqpsize argument to AQOPEN sets the size of the packet buffer and therefore how many requests can have their status monitored. Because each request buffer is reused, a request ID can be deleted after the request completes but before its status is checked. This can happen, for example, if you set the aqpsize argument in AQOPEN to 20 and issue 30 requests. If you then request the status of the first request, AQSTAT returns 0, indicating that the requested ID was not found.

3.4.1 Error Detection by Using AQIO

Because of the asynchronous nature of AQIO, error detection and reporting may not occur immediately on return from a call to an asynchronous queued I/O subroutine. If one of the queued I/O requests causes an error when the operating system tries to do the I/O, the error is returned in a subsequent AQIO request.

For example, if a program issues an AQWRITE with queue set to 0, I/O is initiated. If no previous errors occurred, a 0 status is returned from this statement, even though this request may ultimately fail. If the request fails, perhaps because it tried to exceed the maximum allowed file size, the error is returned to the user in the subsequent AQIO statement that coincides with its detection. If the next AQIO statement is AQWAIT, the error is detected and returned to the user. If the next AQIO statement is AQSTAT, the error is detected and reported only if the requested ID failed.
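The queuing and deferred error reporting described in this section can be sketched with a small model. This is an illustrative Python simulation, not the AQIO library: the class QueuedIO, its methods, the block limit, and the error value -1 are hypothetical names and values chosen for the sketch. It assumes that requests issued with a nonzero queue flag are held until a request with queue set to 0 arrives, and, as a simplification, that only the wait call reports a deferred failure.

```python
class QueuedIO:
    """Toy model of queued asynchronous writes with deferred error reporting."""

    def __init__(self, max_blocks):
        self.max_blocks = max_blocks  # hypothetical file-size limit, in blocks
        self.pending = []             # requests queued but not yet submitted
        self.system_calls = 0         # how many times the "OS" was invoked
        self.error = 0                # deferred error, reported at most once

    def write(self, block, reqid, queue):
        """Queue one request; submit the whole batch when queue == 0."""
        self.pending.append((block, reqid))
        if queue == 0:
            self._submit()
        return 0                      # the queuing call itself reports success

    def _submit(self):
        self.system_calls += 1        # one call covers the whole batch
        for block, _reqid in self.pending:
            if block >= self.max_blocks and self.error == 0:
                self.error = -1       # remember the first failure (made-up code)
        self.pending.clear()

    def wait(self):
        """Return the deferred error, if any, and clear it."""
        status, self.error = self.error, 0
        return status


if __name__ == "__main__":
    aq = QueuedIO(max_blocks=20)
    for rnum in range(1, 21):
        aq.write(rnum - 1, rnum, 0 if rnum == 20 else 1)
    print("system calls:", aq.system_calls, "wait status:", aq.wait())
    aq.write(25, 21, 0)               # beyond the limit; still returns 0
    print("wait status after bad write:", aq.wait())
```

In this model, twenty writes cost a single submission, and the out-of-range write is only visible when wait is called, which is why the text recommends checking the status after every AQIO statement.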
After an error is reported to the user, it is not reported again. Checking the status after each AQIO statement ensures that the program detects all errors.

Example 4: AQIO routines: compound read operations

      PROGRAM AQIO1
      IMPLICIT INTEGER(A-Z)
      PARAMETER (TOTREQ=20)
      PARAMETER (AQPSIZE=20)
      INTEGER AQP
      INTEGER BUFFER (TOTREQ*512)
      INTEGER EVNBUF (TOTREQ/2*512)
      INTEGER ODDBUF (TOTREQ/2*512)
      CALL AQOPEN (AQP,AQPSIZE,'FILE4'H,STAT)
      IF (STAT.NE.0) THEN
         PRINT *,'AQOPEN FAILED, STATUS= ',STAT
         CALL ABORT()
      ENDIF
C     INITIALIZE DATA
      DO 10 I=1,TOTREQ*512
         BUFFER(I) = I
 10   CONTINUE
      DO 50 RNUM=1,TOTREQ
C        QUEUE THE REQUESTS
C        INITIATE I/O ON THE LAST REQUEST
C        THE DATA FROM BUFFER IS WRITTEN IN A SEQUENTIAL
C        FASHION TO DISK
         QUEUE=1
         IF (RNUM.EQ.TOTREQ) QUEUE=0
         OFFSET= (RNUM-1)*512+1
         CALL AQWRITE(
     &        AQP,
     &        BUFFER(OFFSET),   !start address
     &        RNUM-1,           !block address
     &        1,                !number of blocks
     &        RNUM,             !request id
     &        QUEUE,            !queue request or start I/O
     &        STAT)             !return status
         IF (STAT.NE.0) THEN
            PRINT *,'AQWRITE FAILED, STATUS= ',STAT
            CALL ABORT()
         ENDIF
 50   CONTINUE
C     WAIT FOR I/O TO COMPLETE
      CALL AQWAIT (AQP,STAT)
      IF (STAT.LT.0) THEN
         PRINT *,'AQWAIT AFTER AQWRITE FAILED, STATUS=',STAT
         CALL ABORT()
      ENDIF
C     NOW ISSUE TWO COMPOUND READS. THE FIRST READ
C     GETS THE ODD SECTORS AND THE SECOND GETS THE
C     EVEN SECTORS.
C
      INCS=TOTREQ/2-1
      CALL AQREADC(
     &     AQP,
     &     ODDBUF(1),           ! start address
     &     512,                 ! mem stride
     &     1,                   ! block number
     &     1,                   ! number of blocks
     &     2,                   ! disk stride
     &     INCS,                ! incs
     &     1,                   ! request id
     &     1,                   ! queue request
     &     STAT1)               ! return status
      CALL AQREADC(
     &     AQP,
     &     EVNBUF(1),           ! start address
     &     512,                 ! mem stride
     &     0,                   ! block number
     &     1,                   ! number of blocks
     &     2,                   ! disk stride
     &     INCS,                ! incs
     &     2,                   ! request id
     &     0,                   ! start request
     &     STAT2)               ! return status
      IF ((STAT1.NE.0).OR.(STAT2.NE.0)) THEN
         PRINT *,'AQREADC FAILED, STATUS= ',STAT1,STAT2
         CALL ABORT()
      ENDIF
      CALL AQWAIT (AQP,STAT)
      IF (STAT.LT.0) THEN
         PRINT *,'AQWAIT FAILED, STATUS= ',STAT
         CALL ABORT()
      ENDIF
C     VERIFY THAT THE DATA READ WAS CORRECT
      K = 1
      DO 90 I = 1,TOTREQ,2
         DO 80 J = 1,512
            IF (EVNBUF(J+(K-1)*512).NE.J+(I-1)*512) THEN
               PRINT *,'BAD DATA EVN',EVNBUF(J+(K-1)*512),J,I,K
               CALL ABORT()
            ENDIF
 80      CONTINUE
         K=K+1
 90   CONTINUE
      K = 1
      DO 99 I = 2,TOTREQ,2
         DO 95 J = 1,512
            IF (ODDBUF(J+(K-1)*512).NE.J+(I-1)*512) THEN
               PRINT *,'BAD DATA ODD',ODDBUF(J+(K-1)*512),J,I,K
               CALL ABORT()
            ENDIF
 95      CONTINUE
         K=K+1
 99   CONTINUE
      CALL AQCLOSE(AQP,STAT)
      IF (STAT.NE.0) THEN
         PRINT *,'AQCLOSE FAILED, STATUS= ',STAT
         CALL ABORT()
      ENDIF
      END

Example 5: AQIO routines: error detection

      PROGRAM AQIO2
      IMPLICIT INTEGER(A-Z)
      PARAMETER (TOTREQ=20)
      PARAMETER (AQPSIZE=20)
      INTEGER AQP
      INTEGER BUFFER (TOTREQ*512)
      INTEGER INBUF (512)
      CALL AQOPEN (AQP,AQPSIZE,'FILE4'H,STAT)
      IF (STAT.NE.0) THEN
         PRINT *,'AQOPEN FAILED, STATUS=',STAT
         CALL ABORT()
      ENDIF
      DO 50 RNUM=1,TOTREQ
C        QUEUE THE REQUESTS
C        INITIATE I/O ON THE LAST REQUEST
C        THE DATA FROM BUFFER WILL BE WRITTEN IN A
C        SEQUENTIAL FASHION TO DISK
         QUEUE=1
         IF (RNUM.EQ.TOTREQ) QUEUE=0
         OFFSET= (RNUM-1)*512+1
         CALL AQWRITE (
     &        AQP,
     &        BUFFER (OFFSET),  ! start address
     &        RNUM-1,           ! block number
     &        1,                ! number of blocks
     &        RNUM,             ! request id
     &        QUEUE,            ! queue request or start I/O
     &        STAT)             ! return status
         IF (STAT.NE.0) THEN
            PRINT *,'AQWRITE FAILED, STATUS=',STAT
            CALL ABORT ()
         ENDIF
 50   CONTINUE
C     WAIT FOR I/O TO COMPLETE
      CALL AQWAIT (AQP,STAT)
      IF (STAT.LT.0) THEN
         PRINT *,'AQWAIT AFTER AQWRITE FAILED, STATUS= ',STAT
         CALL ABORT ()
      ENDIF
C     NOW ISSUE A READ. TO ILLUSTRATE ERROR DETECTION
C     ATTEMPT TO READ BEYOND THE END OF THE FILE
      CALL AQREAD (
     &     AQP,
     &     INBUF(1),            ! start address
     &     TOTREQ+1,            ! block number
     &     1,                   ! number of blocks
     &     TOTREQ+1,            ! request id
     &     0,                   ! start I/O
     &     STAT)                ! return status
      IF (STAT.NE.0) THEN
         PRINT *,'AQREAD FAILED, STATUS=',STAT
         CALL ABORT()
      ENDIF
      CALL AQWAIT (AQP,STAT)
C     BECAUSE WE ATTEMPTED TO READ BEYOND THE END
C     OF THE FILE, AQWAIT WILL RETURN A NEGATIVE
C     VALUE IN "STAT", AND THE PROGRAM WILL ABORT IN
C     THE FOLLOWING STATEMENT
      IF (STAT.LT.0) THEN
         PRINT *,'AQWAIT AFTER AQREAD FAILED, STATUS= ',STAT
         CALL ABORT()
      ENDIF
      CALL AQCLOSE (AQP,STAT)
      IF (STAT.NE.0) THEN
         PRINT *,'AQCLOSE, STATUS= ',STAT
         CALL ABORT()
      ENDIF
      END

The following is the output from running this program:

      AQWAIT AFTER AQREAD FAILED, STATUS= -1202

3.5 Logical Record I/O Routines

The logical record I/O routines provide word or character granularity during read and write operations on full or partial records. The read routines move data from an external device to a user buffer. The write routines move data from a user buffer to an external device.

The following list briefly describes these routines; for details about the routines discussed in this section, see the individual man pages.

• READ and READP move words of data from disk or tape to a user data area. READ(3F) reads words in full-record mode. READP reads words in partial-record mode. After a READ, the file is positioned at the beginning of the next record. After a READP, the file is positioned at the beginning of the next word in the current record. Even if foreign record translation is enabled for the specified unit, the bits from the foreign logical records are moved without conversion (see the following two bullets for routines that translate). Therefore, if the file contains IBM data, that data is not converted before it is stored. The following are examples of calls to READ and READP:

      CALL READ (7,ibuf,icnt,istat,iubc)
      CALL READP(8,ibuf,icnt,istat,iubc)

• READC(3F) reads characters in full-record mode. READCP reads characters in partial-record mode.
Characters are moved to the user area with only one character per word and are right-justified in the word. The bits from foreign logical records are moved after conversion when foreign record translation is enabled for the specified unit. The following are examples of calls to READC and READCP:

      CALL READC (9,ibuf,icnt,istat)
      CALL READCP (10,ibuf,icnt,istat)

• READIBM(3F) reads IBM 32-bit floating-point words that are converted to Cray 64-bit words. The IBM 32-bit format is converted to the equivalent Cray 64-bit value and the result is stored. A conversion routine, IBM2CRAY(3F), converts IBM data to Cray format. A preferred method to obtain the same result is to read the file with an unformatted READ statement and then convert the data through a call to IBM2CRAY. The following is an example of a call to READIBM:

      CALL READIBM (7,ibuf,ileng,incr)

• WRITE(3F) writes words in full-record mode. WRITEP writes words in partial-record mode. WRITE and WRITEP move words of data from the user data area to an I/O buffer area. Even if foreign record translation is enabled, no data conversion occurs before the words are stored in the I/O buffer area (see the following bullet for routines that translate). The following are examples of calls to WRITE and WRITEP:

      CALL WRITE (8,ibuf,icnt,iubc,istat)
      CALL WRITEP (9,ibuf,icnt,iubc,istat)

• WRITEC(3F) writes characters in full-record mode. WRITECP writes characters in partial-record mode. Characters are packed into the buffer for the file. If foreign record translation is enabled, the characters are converted and then packed into the buffer. The following are examples of calls to WRITEC and WRITECP:

      CALL WRITEC (10,icbuf,iclen,istat)
      CALL WRITECP (11,icbuf,iclen,istat)

• WRITIBM(3F) writes Cray 64-bit values as IBM 32-bit floating-point words. The Cray 64-bit values are converted to IBM 32-bit format, using a conversion routine, CRAY2IBM(3F).
After this conversion, you can use an unformatted WRITE statement to write the file. The following is an example of the call to WRITIBM:

      CALL WRITIBM (12,ibuf,ilen,incr)

Tape and Named Pipe Support [4]

Tape handling is usually provided through the tape subsystem with a minimum of user intervention. However, user end-of-volume (EOV) processing, bad data handling, and some tape positioning actions require additional support routines.

Named pipes, or UNIX FIFO special files for I/O requests, are created with the mknod(2) system call; these special files allow any two processes to exchange information. The system call creates an inode for the named pipe and establishes it as a read/write named pipe. It can then be used by standard Fortran I/O or C I/O. Piped I/O is faster than normal I/O; it requires less memory than memory-resident files.

The er90 layer is not available on Cray T3E systems.

4.1 Tape Support

You can write and read from a tape using formatted or unformatted I/O statements. You can also use BUFFER IN and BUFFER OUT statements and the logical record routines (READC, READP, WRITEC, and WRITEP) to access the tape file from a Fortran program.

For complete details about using tape files in Fortran programs on UNICOS and UNICOS/mk platforms, see the Tape Subsystem User's Guide.

4.1.1 User EOV Processing

Several library routines assist users with EOV processing from a Fortran program. Tape-volume switching is usually handled by the tape subsystem and is transparent to the user. However, when a user requests EOV processing, the program gains control at the end of tape, and the program may perform special processing.

The following library routines can be used with tape processing:

• CHECKTP(3F) checks the tape position.

• CLOSEV(3F) closes the volume and mounts the next volume in a volume identifier list.

• ENDSP(3F) disables special tape processing.

• SETSP(3F) enables and disables EOV processing.

• STARTSP(3F) enables special tape processing.
4.1.2 Handling Bad Data on Tapes

The SKIPBAD(3F) and ACPTBAD(3F) routines can be called from a Fortran program to handle bad data on tape files.

• SKIPBAD skips bad data; it does not write it to the buffer.

• ACPTBAD makes bad data available by transferring it to the user-specified buffer. It allows a program to read beyond bad data within a file by moving it into the buffer and positioning past the bad data.

4.1.3 Positioning

The GETTP(3F) and SETTP(3F) file positioning routines change or indicate the position of the current file.

• GETTP gets information about an opened tape file.

• SETTP positions a tape file at a tape block or a tape volume.

4.2 Named Pipes

After a named pipe is created, Fortran programs can access that pipe almost as if it were a typical file; the differences between process communication using named pipes and process communication using normal files are discussed in the following list. The examples show how a Fortran program can use standard Fortran I/O on pipes.

• A named pipe must be created before a Fortran program opens it. The following is the syntax for the command to create a named pipe called fort.13:

      /etc/mknod fort.13 p

A named pipe can be created from within a Fortran program by using the ISHELL(3F) function or by using the C language library interface to the mknod(2) system call; either of the following examples creates a named pipe:

      CALL ISHELL('/etc/mknod fort.13 p')
      I = MKNOD ('fort.13',010600B,0)

• Fortran programs can communicate using two named pipes: one to read and one to write. A Fortran program must either read from or write to any named pipe, but it cannot do both at the same time. This is a Fortran restriction on pipes, not a system restriction. It occurs because Fortran does not allow read and write access at the same time.

• I/O transfers through named pipes use memory for buffering.
A separate buffer is created for each named pipe that is created. The PIPE_BUF parameter in the /sys/param.h parameter file defines the kernel buffer size. The default value of PIPE_BUF is 8 blocks (8 * 512 words), but the full size may not be needed or used. I/O to named pipes does not transfer to or from a disk. However, if I/O transfers fill the buffer, the writing process waits for the receiving process to read the data before refilling the buffer. If the size of the PIPE_BUF parameter is increased, I/O performance may decrease because of buffer contention. If memory has already been allocated for buffers, more space will not be allocated.

• Binary data transferred between two processes through a named pipe must use the correct file structure. An undefined file structure (specified by assign -s u) should be specified for a pipe by the sending process. An unblocked structure (specified by assign -s unblocked) should be specified for a pipe by the receiving process. You can also select a file specification of system (assign -F system) for the sending process.

The file structure of the receiving or read process can be set to either an undefined or an unblocked file structure. However, if the sending process writes a request that is larger than PIPE_BUF, it is essential for the receiving process to read the data from a pipe set to an unblocked file structure. A read of a transfer larger than PIPE_BUF on an undefined file structure yields only the amount of data specified by PIPE_BUF. The receiving process does not wait to see whether the sending process is refilling the buffer. The pipe may be less than the value of PIPE_BUF.

For example, the following assign commands specify that the file structure of the named pipe (unit 13, file name pipe) for the sending process should be undefined (-s u). The named pipe (unit 15, file name pipe) is type unblocked (-s unblocked) for the read process.
      assign -s u -a pipe u:13
      assign -s unblocked -a pipe u:15

• A read from a pipe that is closed by the sender causes an end-of-file (EOF). To detect EOF on a named pipe, the pipe must be opened as read-only by the receiving process.

4.2.1 Piped I/O Example Without End-of-File Detection

In this example, two Fortran programs communicate without end-of-file (EOF) detection. Program writerd generates an array that contains the elements 1 to 3 and writes the array to named pipe pipe1. Program readwt reads the three elements from named pipe pipe1, prints out the values, adds 1 to each value, and writes the new elements to named pipe pipe2. Program writerd reads the new values from named pipe pipe2 and prints them. The -a option of the assign(1) command allows the two processes to access the same file with different assign characteristics.

Example 6: No EOF detection: writerd

      program writerd
      parameter(n=3)
      dimension ia(n)
      do 10 i=1,n
         ia(i)=i
 10   continue
      write (10) ia
      read (11) ia
      do 20 i=1,n
         print*,'ia(',i,') is ',ia(i),' in writerd'
 20   continue
      end

Example 7: No EOF detection: readwt

      program readwt
      parameter(n=3)
      dimension ia(n)
      read (15) ia
      do 10 i=1,n
         print*,'ia(',i,') is ',ia(i),' in readwt'
         ia(i)=ia(i)+1
 10   continue
      write (16) ia
      end

The following commands execute the programs:

      f90 -o readwt readwt.f
      f90 -o writerd writerd.f
      /etc/mknod pipe1 p
      /etc/mknod pipe2 p
      assign -s u -a pipe1 u:10
      assign -s unblocked -a pipe2 u:11
      assign -s unblocked -a pipe1 u:15
      assign -s u -a pipe2 u:16
      readwt &
      writerd

The following is the output of the two programs:

      ia(1) is 1 in readwt
      ia(2) is 2 in readwt
      ia(3) is 3 in readwt
      ia(1) is 2 in writerd
      ia(2) is 3 in writerd
      ia(3) is 4 in writerd

4.2.2 Detecting End-of-File on a Named Pipe

The following conditions must be met to detect end-of-file on a read from a named pipe within a Fortran program:

• The program that sends data must open the pipe in a specific way, and the program that receives the data must open the pipe as read-only.

• The program that sends or writes the data must open the named pipe as read and write or write-only. This is the default because the /etc/mknod command creates a named pipe with read and write permission.

• The program that receives or reads the data must open the pipe as read-only. A read from a named pipe that is opened as read and write waits indefinitely for the data.

4.2.3 Piped I/O Example With End-of-File Detection

This example uses named pipes for communication between two Fortran programs with end-of-file detection. The programs in this example are similar to the programs used in the preceding section. This example shows that program readwt can detect the EOF.

Program writerd generates array ia and writes the data to the named pipe pipe1. Program readwt reads the data from the named pipe pipe1, prints the values, adds one to each value, and writes the new elements to named pipe pipe2. Program writerd reads the new values from pipe2 and prints them. Finally, program writerd closes pipe1 and causes program readwt to detect the EOF.
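The read-only rule for EOF detection is a property of the underlying FIFO rather than of Fortran, so it can be sketched outside the Cray library. The following is a minimal Python illustration for a POSIX system; the function names writer, read_until_eof, and demo are hypothetical, and the payload is arbitrary. It assumes one writer and one reader on a FIFO created with os.mkfifo, which creates the same kind of object as mknod with the p argument.

```python
import os
import tempfile
import threading


def writer(path, payload):
    """Open the FIFO write-only, send the payload, then close it."""
    with open(path, "wb") as fifo:      # blocks until a reader opens the FIFO
        fifo.write(payload)
    # Closing the only write end is what produces EOF for the reader.


def read_until_eof(path):
    """Open the FIFO read-only and read until EOF is reported."""
    chunks = []
    with open(path, "rb") as fifo:      # read-only open, as the text requires
        while True:
            chunk = fifo.read(4096)
            if not chunk:               # empty read: every writer has closed
                break
            chunks.append(chunk)
    return b"".join(chunks)


def demo():
    """Run a writer thread and a reader against a temporary FIFO."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "pipe1")
        os.mkfifo(path)
        t = threading.Thread(target=writer, args=(path, b"1 2 3\n"))
        t.start()
        data = read_until_eof(path)
        t.join()
    return data


if __name__ == "__main__":
    print(demo())
```

If the reader opened the FIFO for both reading and writing, it would itself hold a write end, so the system would never see all writers closed and the final read would wait indefinitely, which matches the behavior described in section 4.2.2.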
The following commands execute these programs:

      f90 -o readwt readwt.f
      f90 -o writerd writerd.f
      assign -s u -a pipe1 u:10
      assign -s unblocked -a pipe2 u:11
      assign -s unblocked -a pipe1 u:15
      assign -s u -a pipe2 u:16
      /etc/mknod pipe1 p
      /etc/mknod pipe2 p
      readwt &
      writerd

Example 8: EOF detection: writerd

      program writerd
      parameter(n=3)
      dimension ia(n)
      do 10 i=1,n
         ia(i)=i
 10   continue
      write (10) ia
      read (11) ia
      do 20 i=1,n
         print*,'ia(',i,') is ',ia(i),' in writerd'
 20   continue
      close (10)
      end

Example 9: EOF detection: readwt

      program readwt
      parameter(n=3)
      dimension ia(n)
C     open the pipe as read-only
      open(15,form='unformatted', action='read')
      read (15,end = 101) ia
      do 10 i=1,n
         print*,'ia(',i,') is ',ia(i),' in readwt'
         ia(i)=ia(i)+1
 10   continue
      write (16) ia
      read (15,end = 101) ia
      goto 102
 101  print *,'End of file detected'
 102  continue
      end

The output of the two programs is as follows:

      ia(1) is 1 in readwt
      ia(2) is 2 in readwt
      ia(3) is 3 in readwt
      ia(1) is 2 in writerd
      ia(2) is 3 in writerd
      ia(3) is 4 in writerd
      End of file detected

System and C I/O [5]

This chapter describes system calls used by the I/O library to perform asynchronous or synchronous I/O. This chapter also describes Fortran callable entry points to several C library routines and describes C I/O on UNICOS/mk systems.

5.1 System I/O

The I/O library and programs use the system calls described in this chapter to perform synchronous and asynchronous I/O, to queue a list of distinct I/O requests, and to perform unbuffered I/O without system buffering. For more information about the system calls described in this chapter, see the UNICOS System Calls Reference Manual, the UNICOS/mk System Calls Reference Manual, or the individual man pages.

5.1.1 Synchronous I/O

With synchronous I/O, an executing program relinquishes control during the I/O operation until the operation is complete. An operation is not complete until all data is moved.
The read(2) and write(2) system calls perform synchronous reads and writes. The READ(3F) and WRITE(3F) functions provide a Fortran interface to the read and write system calls. The read system call reads a specified number of bytes from a file into a buffer. The write system call writes from a buffer to a file.

5.1.2 Asynchronous I/O

Asynchronous I/O lets the program use the time that an I/O operation is in progress to perform other operations that do not involve the data in the I/O operation. In asynchronous I/O operations, control is returned to the calling program after the I/O is initiated. The program may perform calculations unrelated to the previous I/O request, or it may issue another unrelated I/O request while waiting for the first I/O request to complete.

The asynchronous I/O routines provide functions that let a program wait for a particular I/O request to complete. The asynchronous form of the BUFFER IN and BUFFER OUT statements, used with the UNIT and LENGTH routines, provides this type of I/O.

On both UNICOS and UNICOS/mk systems, the READA(3F) and WRITEA(3F) functions provide a Fortran interface to the reada(2) and writea(2) system calls. The reada system call reads a specified number of bytes from a file into a buffer. It returns immediately, even if the data cannot be delivered until later. The writea system call writes from a buffer to a file.

5.1.3 listio I/O

Use the listio(2) system call to initiate a list of distinct I/O requests and, optionally, wait for all of them to complete. No subroutine or function interface to listio exists in Fortran. The AQIO package provides an indirect Fortran interface to listio.

5.1.4 Unbuffered I/O

The open(2) system call opens a file for reading or writing. If the I/O request is well formed and the O_RAW flag is set, the read(2) or write(2) system call reads or writes whole blocks of data directly into user space, bypassing system cache.
Doing asynchronous system-buffered I/O (that is, not using O_RAW) can cause performance problems because of system caching.

5.2 C I/O

This section describes C library I/O from Fortran and C library I/O on Cray T3E systems.

5.2.1 C I/O from Fortran

The C library provides a set of routines that constitute a user-level I/O buffering scheme to be used by C programmers. UNICOS and UNICOS/mk systems also provide Fortran callable entry points to many of these routines. For more information about the C library functions, see the UNICOS System Libraries Reference Manual.

The getc(3C) and putc(3C) inline macros process characters. The getchar and putchar macros, and the higher-level routines fgetc, fgets, fprintf, fputc, fputs, fread, fscanf, fwrite, gets, getw, printf, puts, putw, and scanf all use or act as if they use getc and putc. They can be intermixed.

A file with this associated buffering is called a stream and is associated with a pointer to a defined type FILE. The fopen(3C) routine creates descriptive data for a stream and returns a pointer to designate the stream in all further transactions. Three open streams with constant pointers are usually declared in the stdio.h header file and are associated with stdin, stdout, and stderr.

Three types of buffering are available with functions that use the FILE type: unbuffered, fully buffered, and line buffered, as described in the following list:

• If the stream is unbuffered, no library buffer is used.

• For a fully buffered stream, data is written from the library buffer when it is filled, and read into the library buffer when it is empty.

• If the stream is line buffered, the buffer is flushed when a new line character is written, when the buffer is full, or when input is requested.

The setbuf and setvbuf functions let you change the type and size of the buffers.
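The practical difference between the three buffering types is when written data becomes visible in the file. The following Python sketch illustrates this by analogy only: it uses the buffering argument of Python's built-in open() as a stand-in for setvbuf, and the helper size_after_write is a hypothetical name. The exact flush rules of a given C library may differ from what this model shows.

```python
import os
import tempfile


def size_after_write(mode, buffering, data):
    """Write data without flushing and report how many bytes reached the file."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "stream.dat")
        f = open(path, mode, buffering=buffering)
        try:
            f.write(data)
            return os.path.getsize(path)   # bytes visible before any flush
        finally:
            f.close()


if __name__ == "__main__":
    # Unbuffered: every write goes straight to the file.
    print("unbuffered:", size_after_write("wb", 0, b"abc"))
    # Fully buffered: a small write stays in the library buffer.
    print("fully buffered:", size_after_write("wb", 4096, b"abc"))
    # Line buffered: nothing is written until a newline arrives.
    print("line buffered, no newline:", size_after_write("w", 1, "abc"))
    print("line buffered, newline:", size_after_write("w", 1, "abc\n"))
```

This is also why the text later warns about mixing C I/O and Fortran I/O on the same file: data held in one library's buffer is invisible to the other until it is flushed.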
By default, output to a terminal is line buffered, output to stderr is unbuffered, and all other I/O is fully buffered. See the setbuf(3C) man page for details.

Fortran interfaces exist for the following C routines that use the FILE type:

      FCLOSE    FPUTS
      FDOPEN    FREAD
      FGETS     FREOPEN
      FILENO    FSEEK
      FOPEN     FWRITE

Mixing the use of C I/O functions with Fortran I/O on the same file may have unexpected results. If you want to do this, ensure that the Fortran file structure chosen does not introduce unexpected control words and that library buffers are flushed properly before switching between types of I/O.

The following example illustrates the use of some C routines. The assign environment does not affect these routines.

Example 10: C I/O from Fortran

      PROGRAM STDIOEX
      INTEGER FOPEN, FCLOSE, FWRITE, FSEEK
      INTEGER FREAD, STRM
      CHARACTER*25 BUFWR, BUFRD
      PARAMETER(NCHAR=25)
C     Open the file /tmp/mydir/myfile for update
      STRM = FOPEN('/tmp/mydir/myfile','r+')
      IF (STRM.EQ.0) THEN
         STOP 'ERROR OPENING THE FILE'
      ENDIF
C     Write
      I = FWRITE(BUFWR, 1, NCHAR, STRM)
      IF (I.NE.NCHAR*1) THEN
         STOP 'ERROR WRITING FILE'
      ENDIF
C     Rewind and read the data
      I = FSEEK(STRM, 0, 0)
      IF (I.NE.0) THEN
         STOP 'ERROR REWINDING FILE'
      ENDIF
      I = FREAD(BUFRD, 1, NCHAR, STRM)
      IF (I.NE.NCHAR*1) THEN
         STOP 'ERROR READING FILE'
      ENDIF
C     Close the file
      I = FCLOSE(STRM)
      IF (I.NE.0) THEN
         STOP 'ERROR CLOSING THE FILE'
      ENDIF
      END

5.2.1.1 C I/O on Cray T3E systems

When using system calls on Cray T3E systems, if more than one processing element (PE) opens the same file with an open(2) system call, distinct file descriptors are returned. If each PE uses its file descriptor to perform a read(2) operation on the file, each PE reads the entire file. If each PE uses its file descriptor to perform a write operation to the file, the results are unpredictable.

When a program opens a stream with fopen(3C), a pointer to the stdio.h file structure associated with the stream is returned.
This stream pointer points to a structure contained in local memory on a PE; therefore, the stream pointer may not be used from another PE. If a stream is buffered, its buffer is contained in memory local to the PE that opened it, and it is unknown to other PEs.

At program startup, each PE has an open stdio stream pointer for stdin, stdout, and stderr; stderr is usually not fully buffered, and stdin and stdout are fully buffered only if they do not refer to an interactive device. Buffers associated with stdin, stdout, and stderr are local to a PE.

Results are unpredictable if stdin is buffered and more than one PE attempts to read from it, or if stdout is buffered and more than one PE attempts to write to it.

The file descriptor for any of these streams is shared across all PEs; therefore, applying an fclose(3C) operation to stdin, stdout, or stderr on any PE closes that stream on all PEs.

When a program opens a file for flexible file input/output (FFIO) with ffopen(3C) or ffopens(3C), the library associates the value returned to the user with a structure that contains descriptive data and is local to the PE. Therefore, the value returned by ffopen may not be used from another PE. The FFIO processing layers may also contain buffering that is local to the PE. Attempting to perform an ffopen operation and do I/O to the same file from more than one PE may produce unpredictable results.

The assign Environment [6]

Fortran programs require the ability to alter many details of a Fortran file connection. You may need to specify device residency, an alternative file name, a file space allocation scheme, file structure, or data conversion properties of a connected file.

This chapter describes the assign(1) command and the assign(3F) library routine, which are used for these purposes. The ffassign command provides an interface to assign processing from C. See the ffassign man page for details about its use.
6.1 assign Basics

The assign(1) command passes information to Fortran open statements and to the ffopen(3C), aqopen(3F), wopen(3F), opendr(3F), and openms(3F) routines. This information is called the assign environment; it consists of the following elements:

• A list of unit numbers

• File names

• File name patterns that have attributes associated with them

Any file name, file name pattern, or unit number to which assign options are attached is called an assign_object. When the unit or file is opened from Fortran, the options are used to set up the properties of the connection.

6.1.1 Open Processing

The I/O library routines apply options to a file connection for all related assign_objects.

If the assign_object is a unit, the application of options to the unit occurs whenever that unit becomes connected. If the assign_object is a file name or pattern, the application of options to the file connection occurs whenever a matching file name is opened from a Fortran program.

When any of the previously listed library I/O routines open a file, they use assign options for any assign_objects that apply to this open request. Any of the following assign_objects or categories might apply to a given open request:

• g:all options apply to any open request.

• g:su, g:sf, g:du, g:aq, and g:ff all apply to types of open requests (for example, sequential unformatted, sequential formatted, and so on).

• u:unit_number applies whenever unit_number is opened.

• p:pattern applies whenever a file whose name matches pattern is opened. The assign environment can contain only one p: assign_object that matches the current open file. The exception is that the p:%pattern (which uses the % wildcard character) is silently ignored if a more specific pattern also matches the current file name being opened.

• f:filename applies whenever a file with the name filename is opened.
Options from the assign objects in these categories are collected to create the complete set of options used for any particular open. The options are collected in the listed order, with options collected later in the list of assign objects overriding those collected earlier.

6.1.2 The assign Command

The following is the syntax for the assign command:

    assign [-I] [-O] [-a actualfile] [-b bs] [-c] [-d bdr] [-f fortstd]
           [-l buflev] [-m setting] [-n sz[:st]] [-p partlist] [-q ocblks]
           [-r setting] [-s ft] [-t] [-u bufcnt] [-w setting] [-x setting]
           [-y setting] [-C charcon] [-D fildes] [-F spec[,specs]]
           [-L setting] [-N numcon] [-P scope] [-R] [-S setting]
           [-T setting] [-U setting] [-V] [-W setting] [-Y setting]
           [-Z setting] assign_object

The following specifications cannot be used with any other options:

    assign -R [assign_object]
    assign -V [assign_object]

The following is a summary of the assign command options. For details, see the assign(1) and intro_ffio(3F) man pages. The assign command is implemented through the assign(3F), asnfile(3F), and asnunit(3F) routines for Programming Environment releases prior to 1.2.

The following are the assign command control options:

-I
    Specifies an incremental assign. All attributes are added to the attributes already assigned to the current assign_object. This option and the -O option are mutually exclusive.
-O
    Specifies a replacement assign. This is the default control option. All currently existing assign attributes for the current assign_object are replaced. This option and the -I option are mutually exclusive.
-R
    Removes all assign attributes for assign_object. If assign_object is not specified, all currently assigned attributes for all assign_objects are removed.
-V
    Views attributes for assign_object. If assign_object is not specified, all currently assigned attributes for all assign_objects are printed.
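The collection order from Section 6.1.1 and the -I and -O control options can be modelled in a few lines of C. This is a minimal sketch of the rules as stated above, not the I/O library's implementation: the AttrSet type and the attr_set, attr_get, assign_apply, and collect_for_open names are hypothetical, and the sketch assumes that a later value for the same option letter replaces the earlier one.

```c
#include <stddef.h>
#include <string.h>

#define MAX_OPTS 8

/* One attribute such as "-b 336": an option letter and its value. */
typedef struct {
    char letter;
    const char *value;
} Attr;

/* Attributes attached to one assign_object (hypothetical model). */
typedef struct {
    Attr attrs[MAX_OPTS];
    size_t count;
} AttrSet;

/* Set one attribute, replacing any existing value for the same letter. */
static void attr_set(AttrSet *set, char letter, const char *value)
{
    for (size_t i = 0; i < set->count; i++) {
        if (set->attrs[i].letter == letter) {
            set->attrs[i].value = value;
            return;
        }
    }
    if (set->count < MAX_OPTS) {
        set->attrs[set->count].letter = letter;
        set->attrs[set->count].value = value;
        set->count++;
    }
}

/* Look up an attribute; returns NULL when the option is not set. */
static const char *attr_get(const AttrSet *set, char letter)
{
    for (size_t i = 0; i < set->count; i++)
        if (set->attrs[i].letter == letter)
            return set->attrs[i].value;
    return NULL;
}

/* Model of -O (incremental == 0, the default) and -I (incremental != 0):
 * -O discards the existing attributes first, -I adds to them. */
static void assign_apply(AttrSet *object, const AttrSet *update, int incremental)
{
    if (!incremental)
        object->count = 0;
    for (size_t i = 0; i < update->count; i++)
        attr_set(object, update->attrs[i].letter, update->attrs[i].value);
}

/* Merge the assign_objects that match an open request. The caller passes
 * them in the listed order (g:all, g:type, u:, p:, f:), so attributes
 * from later objects override those collected earlier. */
static AttrSet collect_for_open(const AttrSet *const *matching, size_t n)
{
    AttrSet result = { .count = 0 };
    for (size_t i = 0; i < n; i++)
        assign_apply(&result, matching[i], 1);
    return result;
}
```

With g:all carrying -b 8 -s cos and u:1 carrying -b 336, collecting both for an open of unit 1 yields -b 336 -s cos in this model: the unit-level buffer size wins and the global file structure is kept.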
The following are the assign command attribute options:

-a actualfile
    The file= specifier or the actual file name.
-b bs
    Library buffer size in 4096-byte blocks.
-c
    Contiguous storage. Must be used with the -n option.
-d bdr
    Online tape bad data recovery. Specify either skipbad or acptbad for bdr.
-f fortstd
    Specify 90 to be compatible with the Fortran 90 and Fortran 95 standards and Cray’s Fortran compiling system.
-l buflev
    Kernel buffering. Specify none, ldcache, or full for buflev. If this is not set, the level of buffering depends on the type of open operation performed.
-m setting
    Special handling of a direct access file that will be accessed concurrently by several processes or tasks. Special handling includes skipping the check that only one Fortran unit be connected to a unit, suppressing file truncation to true size by the I/O buffering routines, and ensuring that the file is not truncated by the I/O buffering routines. Enter either on or off for setting.
-n sz[:st]
    Amount of system file space to reserve for a file. This is a number of 4096-byte blocks. Used by Fortran I/O, FFIO, and auxiliary I/O (aqio, waio, drio, and msio). The optional st value is an obsolete way to specify the -q assign attribute. Use of -q is preferable to using the st value on -n.
-p partlist
    File system partition list. Used by Fortran I/O, FFIO, and auxiliary I/O. partlist can be a single number, a range (m-n), a set (m:n), or a combination of ranges and sets separated by colons.
-q ocblks
    Number of 4096-byte blocks to be allocated per file system partition. Used by Fortran I/O, FFIO, and auxiliary I/O.
-r setting
    Activate or suppress the passing of the O_RAW flag to the open(2) system call. setting can be either on or off.
-s ft
    File type. Enter text, cos, blocked, unblocked, u, sbin, bin, bmx, or tape for ft.
-t
    Temporary file.
-u bufcnt
    Buffer count. Specifies the number of buffers to be allocated for a file.
-w setting
    Activate or suppress the passing of the O_WELLFORMED flag to the open(2) system call. Used by Fortran I/O and FFIO. setting may be on or off.
-x setting
    Activate or suppress the passing of the O_PARALLEL flag to the open(2) system call. setting can be either on or off.
-y setting
    Suppresses repeat counts in list-directed output. setting can be either on or off. The default setting is off.
-C charcon
    Character set conversion information. Enter ascii, ebcdic, or cdc for charcon. If you specify the -C option, you must also specify the -F option. ebcdic and cdc are not supported on UNICOS/mk.
-D fildes
    Specifies a connection to a standard file. Enter stdin, stdout, or stderr for fildes.
-F spec[,specs]
    Flexible file I/O (FFIO) specification. See the assign(1) man page for details about allowed values for spec and for details about hardware platform support. See the intro_ffio(3F) man page for details about specifying the FFIO layers.
-L setting
    Activates or suppresses the passing of the O_LDRAW flag to the open(2) system call. Enter either on or off for setting.
-N numcon
    Foreign numeric conversion specification. See the assign(1) man page for details about allowed values for numcon and for details about hardware platform support.
-P scope
    Specifies the scope of a Fortran unit and allows specification of private I/O on UNICOS systems. See the assign(1) man page for details about allowed values for scope.
-S setting
    Suppresses use of a comma as a separator in list-directed output. Enter either on or off for setting. The default setting is off.
-T setting
    Activates or suppresses truncation after write for sequential Fortran files. Enter either on or off for setting.
-U setting
    Produces a non-UNICOS form of list-directed output. This is a global setting which sets the value for the -y, -S, and -W options. Enter either on or off for setting. The default setting is off.
-W setting
    Suppresses compressed width in list-directed output. Enter either on or off for setting. The default setting is off.
-Y setting
    Skips unmatched namelist groups in a namelist input record. Enter either on or off for setting. The default setting is off.
-Z setting
    Recognizes -0.0 for IEEE floating point systems and writes the minus sign for edit-directed, list-directed, and namelist output. Enter either on or off for setting.
assign_object
    Specifies either a file name or a unit number for assign_object. The assign command associates the attributes with the file or unit specified. These attributes are used during the processing of Fortran open statements or during implicit file opens. Use one of the following formats for assign_object:

    • f:file_name (for example, f:file1)
    • g:io_type; io_type can be su, sf, du, df, ff, or aq (for example, g:ff)
    • p:pattern (for example, p:file%)
    • u:unit_number (for example, u:9)
    • file_name (for example, myfile)

    When the p:pattern form is used, the % and _ wildcard characters can be used. The % matches any string of 0 or more characters. The _ matches any single character. The % behaves like the * used for file name matching in shells, except that % also matches strings of characters containing the / character.

6.1.3 Related Library Routines

The assign(3F), asnunit(3F), asnfile(3F), and asnrm(3F) routines can be called from a Fortran program to access and update the assign environment. The assign routine provides an easy interface to assign processing from a Fortran program. The asnunit and asnfile routines assign attributes to units and files, respectively. The asnrm routine removes all entries currently in the assign environment.
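The % and _ wildcard rules given for the p:pattern form can be illustrated with a short recursive matcher in C. This is a sketch of the rules as described, not the library's own matching code; the function name assign_pattern_match is hypothetical.

```c
#include <stdbool.h>

/* Return true when name matches pattern under the p: wildcard rules:
 * '%' matches any run of zero or more characters (including '/'),
 * '_' matches exactly one character, anything else matches itself. */
static bool assign_pattern_match(const char *pattern, const char *name)
{
    if (*pattern == '\0')
        return *name == '\0';

    if (*pattern == '%') {
        /* Try every possible length for the run that '%' absorbs. */
        for (const char *rest = name; ; rest++) {
            if (assign_pattern_match(pattern + 1, rest))
                return true;
            if (*rest == '\0')
                return false;
        }
    }

    if (*name != '\0' && (*pattern == '_' || *pattern == *name))
        return assign_pattern_match(pattern + 1, name + 1);

    return false;
}
```

Under these rules, file% matches both file and file1, file_ matches file1 but not file, and %/data matches /tmp/run/data because % is allowed to span / characters, which a shell * would not do.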
The calling sequences for the assign library routines are as follows:

      call assign (cmd [,ier])
      call asnunit (iunit,astring,ier)
      call asnfile (fname,astring,ier)
      call asnrm (ier)

cmd
    Fortran character variable that contains a complete assign command in the format that is also acceptable to the ishell(3F) routine.
ier
    Integer variable that is assigned the exit status on return from the library interface routine.
iunit
    Integer variable or constant that contains the unit number to which attributes are assigned.
astring
    Fortran character variable that contains any attribute options and option values from the assign command. Control options -I, -O, and -R can also be passed.
fname
    Character variable or constant that contains the file name to which attributes are assigned.

A status of 0 indicates normal return and a status greater than 0 indicates a specific error status. Use the explain command to determine the meaning of the error status. For more information about the explain command, see the explain(1) man page.

The following calls are equivalent to the assign -s u f:file command:

      call assign('assign -s u f:file',ier)
      call asnfile('file','-s u',ier)

The following call is equivalent to executing the assign -I -n 2 u:99 command:

      iun = 99
      call asnunit(iun,'-I -n 2',ier)

The following call is equivalent to executing the assign -R command:

      call asnrm(ier)

6.2 assign and Fortran I/O

Assign processing lets you tune file connections. The following sections describe several areas of assign command usage and provide examples of each use.

6.2.1 Alternative File Names

The -a option specifies the actual file name to which a connection is made. This option allows files to be created in alternative directories without changing the FILE= specifier on an OPEN statement.
For example, consider the following assign command issued to open unit 1:

    assign -a /tmp/mydir/tmpfile u:1

The program then opens unit 1 with any of the following statements:

      WRITE(1) variable          ! implicit open
      OPEN(1)                    ! unnamed open
      OPEN(1,FORM='FORMATTED')   ! unnamed open

Unit 1 is connected to file /tmp/mydir/tmpfile. Without the -a attribute, unit 1 would be connected to file fort.1.

To allocate a file on an SSD-resident or memory-resident file system on a UNICOS system, you can use an assign command such as the following:

    assign -a /ssd/myfile u:1

When the -a attribute is associated with a file, any Fortran open that is set to connect to the file causes a connection to the actual file name. An assign command of the following form causes a connection to file $TMPDIR/joe:

    assign -a $TMPDIR/joe ftfile

This is true when any of the following statements are executed in a program:

      OPEN(IUN,FILE='ftfile')
      CALL AQOPEN(AQP,AQPSIZE,'ftfile',ISTAT)
      CALL OPENMS('ftfile',INDARR,LEN,IT)
      CALL OPENDR('ftfile',INDARR,LEN,IT)
      CALL WOPEN('ftfile',BLOCKS,ISTATS)
      WRITE('ftfile') ARRAY

If the following assign command is issued and is in effect, any Fortran INQUIRE statement whose FILE= specification is foo refers to the file named actual instead of the file named foo for purposes of the EXIST=, OPENED=, or UNIT= specifiers:

    assign -a actual f:foo

If the following assign command is issued and is in effect, the -a attribute does not affect INQUIRE statements with a UNIT= specifier:

    assign -a actual ftfile

When the following OPEN statement is executed, INQUIRE(UNIT=n,NAME=fname) returns a value of ftfile in fname, as if no assign had occurred:

      OPEN(n,file='ftfile')

The I/O library routines use only the actual file (-a) attributes from the assign environment when processing an INQUIRE statement.
During an INQUIRE statement that contains a FILE= specifier, the I/O library searches the assign environment for a reference to the file name that the FILE= specifier supplies. If an assign-by-filename exists for the file name, the I/O library determines whether an actual name from the -a option is associated with the file name. If the assign-by-filename supplied an actual name, the I/O library uses the name to return values for the EXIST=, OPENED=, and UNIT= specifiers; otherwise, it uses the file name. The name returned for the NAME= specifier is the file name supplied in the FILE= specifier. The actual file name is not returned.

6.2.2 File Structure Selection

Fortran I/O uses five different file structures: text structure, unblocked structure, bmx or tape, pure data structure, and COS blocked structure. By default, a file structure is selected for a unit based on the type of Fortran I/O selected at open time. If an alternative file structure is needed, the user can select a file structure by using the -s and -F options on the assign command.

No assign_object can have both -s and -F attributes associated with it. Some file structures are available as -F attributes but are not available as -s attributes. The -F option is more flexible than the -s option; it allows nested file structures and buffer size specifications for some attribute values. The following list summarizes how to select the different file structures with different options to the assign command:

    Structure      assign command
    COS blocked    assign -F cos
                   assign -s cos
    text           assign -F text
                   assign -s text
    unblocked      assign -F system
                   assign -s unblocked
                   assign -s u
    tape/bmx       assign -F tape
                   assign -F bmx
                   assign -s tape
                   assign -s bmx
    F77 blocked    assign -F f77

For more information about file structures, see Chapter 7, page 71.
The following are examples of file structure selection:

• To select unblocked file structure for a sequential unformatted file:

      IUN = 1
      CALL ASNUNIT(IUN,'-s unblocked',IER)
      OPEN(IUN,FORM='UNFORMATTED',ACCESS='SEQUENTIAL')

• You can use the assign -s u command to specify the unblocked file structure for a sequential unformatted file. When this option is selected, the I/O is unbuffered: each Fortran READ or WRITE statement results in a read(2) or write(2) system call. For example:

      CALL ASNFILE('fort.1','-s u',IER)
      OPEN(1,FORM='UNFORMATTED',ACCESS='SEQUENTIAL')

• Use the following command to assign unit 10 a COS blocked structure:

    assign -F cos u:10

6.2.3 Buffer Size Specification

The size of the buffer used for a Fortran file can have a substantial effect on I/O performance. A larger buffer size usually decreases the system time needed to process sequential files. However, large buffers increase a program’s memory usage; therefore, choosing the buffer size for each file on a case-by-case basis can help increase I/O performance while minimizing memory usage.

The -b option on the assign command specifies a buffer size, in blocks, for the unit. The -b option can be used with the -s option, but it cannot be used with the -F option. Use the -F option to provide I/O path specifications that include buffer sizes; the -b and -u options do not apply when -F is specified.

For more information about the selection of buffer sizes, see Chapter 8, page 79, and the assign(1) man page.
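Because -b counts 4096-byte blocks rather than bytes, it can help to convert between the two when choosing a value. The following C helpers are an illustrative sketch only; the names ASSIGN_BLOCK_BYTES, blocks_to_bytes, and bytes_to_blocks are hypothetical and are not part of any Cray library.

```c
#include <stddef.h>

/* Block size used by the assign -b, -n, and -q options (from the text). */
#define ASSIGN_BLOCK_BYTES 4096u

/* Memory consumed by a buffer given as a -b block count. */
static size_t blocks_to_bytes(size_t blocks)
{
    return blocks * ASSIGN_BLOCK_BYTES;
}

/* Smallest -b block count whose buffer holds at least `bytes` bytes. */
static size_t bytes_to_blocks(size_t bytes)
{
    return (bytes + ASSIGN_BLOCK_BYTES - 1) / ASSIGN_BLOCK_BYTES;
}
```

For instance, a buffer size of 336 blocks corresponds to 336 × 4096 = 1,376,256 bytes of buffer memory for that unit.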
The following are some examples of buffer size specification using the assign -b and assign -F options:

• If unit 1 is a large sequential file for which many Fortran READ or WRITE statements are issued, you can increase the buffer size to a large value, using the following assign command:

    assign -b 336 u:1

• If unit 1 is to be connected to a large sequential unformatted file with COS blocked structure on UNICOS or UNICOS/mk systems, enter either of the following assign commands to specify a buffer size of 336:

    assign -b 336 u:1
    assign -F cos:336 u:1

  The buffer size for the example was calculated by multiplying tracks-per-cylinder for one type of disk by the track size in sectors of that disk.

• If file foo is a small file or is accessed infrequently, minimize the buffer size using the following assign command:

    assign -b 1 f:foo

6.2.4 Foreign File Format Specification

The Fortran I/O library can read and write files with record blocking and data formats native to operating systems from other vendors. The assign -F command specifies a foreign record blocking; the assign -C command specifies the type of character conversion; the -N option specifies the type of numeric data conversion. When -N or -C is specified, the data is converted automatically during the processing of Fortran READ and WRITE statements. For example, assume that a record in file fgnfile contains the following character and integer data:

      character*4 ch
      integer int
      open(iun,FILE='fgnfile',FORM='UNFORMATTED')
      read(iun) ch, int

Use the following assign command to specify foreign record blocking and foreign data formats for character and integer data:

    assign -F ibm.vbs -N ibm -C ebcdic fgnfile

6.2.5 File Space Allocation

File allocation can be specified with the -n, -c, and -p options to the assign command. The -n option specifies the amount of disk space to reserve at the time of a Fortran open.
The -c and -p options specify the configuration of the allocated space: the -c option specifies contiguous allocation, and the -p option specifies striping (the file system partitions where file allocation will be tried) across disk devices.

There is no guarantee that blocks will actually be allocated on the specified partitions. The partlist argument can be one integer, a range of integers (m-n), a set of integers (m:n), or a combination of ranges and sets separated by colons. The partition numbers are submitted directly through the ialloc(2) system call. This option achieves file striping on the specified partitions.

You cannot specify the -c and -p options without the -n option. The I/O library issues ialloc system calls to preallocate file space and to process the -c and -p attributes. The ialloc system call requires the -n attribute to determine the amount of file space to reserve.

For example, to specify file allocation on partitions 0 through 2, partition 4, and partitions 6 through 8, contiguous allocation in each partition, and a total of 100 4096-byte blocks of file space preallocated, you would enter the following command:

    assign -p 0-2:4:6-8 -c -n 100 foo

6.2.6 Device Allocation

The assign -F command has two specifications that alter the device where a file is resident. If you specify -F sds, a file will be SDS-resident; if you specify -F mr, a file will be memory resident. Because the sds and mr flexible file I/O layers do not define a record-based file structure, they must be nested beneath a file structure layer when record blocking is needed.

Examples of device allocation follow:

• If unit 1 is a sequential unformatted file that is to be SDS-resident, the following Fortran statements connect the unit:

      CALL ASNUNIT(1,'-F cos,sds.scr.novfl:0:100',IER)
      OPEN(1,FORM='UNFORMATTED')

  The -F cos specification selects COS blocked structure.
  The sds.scr.novfl:0:100 specification indicates that the file should be SDS-resident, that it will not be kept when it is closed, and that it can grow in size to one hundred 4096-byte blocks.

• If unit 2 is a sequential unformatted file that is to be memory resident, the following Fortran statements connect the unit:

      CALL ASNUNIT(2,'-F cos,mr',IER)
      OPEN(2,FORM='UNFORMATTED')

  The -F cos,mr specification selects COS blocked structure with memory residency.

For more information about device allocation, see Chapter 9, page 87.

6.2.7 Direct-Access I/O Tuning

Fortran unformatted direct-access I/O supports tuning of both the number of buffers and the memory cache page size (buffer size); it also supports specification of the prevailing direction of file access. The assign -b command specifies the size of each buffer in 4096-byte blocks, and the -u option specifies the number of buffers maintained for the connected file.

To open unit 1 for direct unformatted access and to specify 10 separate regions of the file that will be heavily accessed, use the following assign command:

    assign -u 10 u:1

6.2.8 Fortran File Truncation

The assign -T option activates or suppresses truncation after the writing of a sequential Fortran file. The -T on option specifies truncation; this behavior is consistent with the Fortran standard and is the default setting for most assign -s ft specifications.

Use assign -T off to suppress truncation in applications in which GETPOS(3F) and SETPOS(3F) are used to simulate random access to a file that has sequential I/O.

The assign(1) man page lists the default setting of the -T option for each -s ft specification. It also indicates whether suppression or truncation is allowed for each of these specifications.

FFIO layers that are specified by using the -F option vary in their support for suppression of truncation with -T off.

The following figure summarizes the available access methods and the default buffer sizes for UNICOS systems.
[Figure 1. Access methods and default buffer sizes (UNICOS systems). For each assign option (blocked: -s cos; text: -s text; undef: -s u; binary: -s bin; unblocked: -s unblocked), the figure shows which access methods are valid or the default (formatted sequential I/O, formatted direct I/O, unformatted sequential I/O, unformatted direct I/O, and buffer in/buffer out), together with control words, library buffering, system caching (ldcache), BACKSPACE support, record size, and the default library buffer size. Buffer sizes are in units of 4096 bytes unless otherwise specified. Figure notes: † cached if not well-formed; †† no guarantee when physical size is not 512 words; ††† everything done to bin should be on word boundaries and in word sizes.]

6.3 The assign Environment File

The assign command information is stored in the assign environment file, which is named $TMPDIR/.assign by default. To change the location of the current assign environment file, assign the desired path name to the FILENV environment variable.

The format of the assign environment file is subject to change with each release.

6.4 Local assign

The assign environment information is usually stored in the assign environment file. Programs that do not require the use of the global assign environment file can activate local assign mode. If you select local assign mode, the assign environment is stored in memory, so other processes cannot adversely affect the assign environment used by the program.
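The rule in Section 6.3 for locating the global assign environment file (FILENV if set, otherwise $TMPDIR/.assign) can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the library's code: the function name assign_env_path is hypothetical, and treating an empty FILENV or an unset TMPDIR as described in the comments is an assumption that the text does not cover.

```c
#include <stdio.h>
#include <stdlib.h>

/* Write the path of the global assign environment file into buf.
 * Returns 0 on success, or -1 if the path cannot be determined or does
 * not fit. Assumption: an empty FILENV is treated like an unset one, and
 * an unset TMPDIR is reported as an error rather than guessed. */
static int assign_env_path(char *buf, size_t size)
{
    const char *filenv = getenv("FILENV");
    int n;

    if (filenv != NULL && filenv[0] != '\0') {
        /* FILENV overrides the default location. */
        n = snprintf(buf, size, "%s", filenv);
    } else {
        const char *tmpdir = getenv("TMPDIR");
        if (tmpdir == NULL || tmpdir[0] == '\0')
            return -1;
        /* Default location: $TMPDIR/.assign */
        n = snprintf(buf, size, "%s/.assign", tmpdir);
    }
    return (n >= 0 && (size_t)n < size) ? 0 : -1;
}
```

A program in local assign mode would not consult this file, which is what isolates it from assign commands issued by other processes.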
The ASNCTL(3F) routine selects local assign mode when it is called by using one of the following command lines:

      CALL ASNCTL('LOCAL',1,IER)
      CALL ASNCTL('NEWLOCAL',1,IER)

Example 11: Local assign mode

In the following example, a Fortran program activates local assign mode and then specifies an unblocked data file structure for a unit before opening it. The -I option is passed to ASNUNIT to ensure that any assign attributes continue to have an effect at the time of file connection.

C     Switch to local assign environment
      CALL ASNCTL('LOCAL',1,IER)
      IUN = 11
C     Assign the unblocked file structure
      CALL ASNUNIT(IUN,'-I -s unblocked',IER)
C     Open unit 11
      OPEN(IUN,FORM='UNFORMATTED')

If a program contains all necessary assign statements as calls to ASSIGN, ASNUNIT, and ASNFILE, or if a program requires total shielding from any assign commands, use the second form of a call to ASNCTL, as follows:

C     New (empty) local assign environment
      CALL ASNCTL('NEWLOCAL',1,IER)
      IUN = 11
C     Assign a large buffer size
      CALL ASNUNIT(IUN,'-b 336',IER)
C     Open unit 11
      OPEN(IUN,FORM='UNFORMATTED')

File Structures [7]

A file structure defines the way records are delimited and how the end-of-file is represented. Cray supports five distinct native file structures: unblocked, pure, text, cos blocked, and tape or bmx.

The I/O library provides four different forms of file processing to indicate an unblocked file structure by using the assign -s ft command: unblocked (unblocked), standard binary (sbin), binary (bin), and undefined (u). These alternative forms provide different types of I/O packages used to access the records of the file, different types of file truncation and data alignment, and different endfile record recognitions in a file.
The full set of options allowed with the assign -s ft command is the following:

• bin (not recommended)
• blocked
• cos
• sbin
• tape or bmx
• text
• u
• unblocked

For more information about valid arguments to the assign -F command, see Section 6.2.2, page 63.

Table 1 summarizes the Fortran access methods and options.

Table 1. Fortran access methods and options

    Access and form                      assign -s ft defaults   assign -s ft options
    Unformatted sequential,              blocked / cos*          bin, sbin, u, unblocked, bmx/tape
      BUFFER IN / BUFFER OUT
    Unformatted direct                   unblocked               bin, sbin, u, unblocked
    Formatted sequential                 text                    blocked, cos, sbin/text, bmx/tape
    Formatted direct on UNICOS systems   text                    sbin/text
    Any type of sequential, formatted,   bmx/tape                bmx/tape
      unformatted, or buffer I/O to tape

    * UNICOS systems only

7.1 Unblocked File Structure

A file with an unblocked file structure contains undelimited records. Because it does not contain any record control words, it does not have record boundaries. The unblocked file structure can be specified for a file that is opened with either unformatted sequential access or unformatted direct access. It is the default file structure for a file opened as an unformatted direct-access file.

If a file with unblocked file structure must be repositioned, a BACKSPACE statement should not be used. You cannot reposition the file to a previous record when record boundaries do not exist.

BUFFER IN and BUFFER OUT statements can specify a file that has an unbuffered and unblocked file structure. If the file is specified with assign -s u, BUFFER IN and BUFFER OUT statements can perform asynchronous unformatted I/O.

You can specify the unblocked data file structure by using the assign(1) command in several ways. All methods result in a similar file structure but with different library buffering styles, use of truncation on a file, alignment of data, and recognition of an endfile record in the file.
The following unblocked data file structure specifications are available:

    Specification         Structure
    assign -s unblocked   Library-buffered
    assign -F system      No library buffering
    assign -s u           No library buffering
    assign -s sbin        Standard-I/O-compatible buffering; for example, both library and system buffering

The type of file processing for an unblocked data file structure depends on the assign -s ft option declared or assumed for a Fortran file.

For more information on buffering, see Chapter 8, page 79.

7.1.1 assign -s unblocked File Processing

An I/O request for a file specified using the assign -s unblocked command does not need to be a multiple of a specific number of bytes. Such a file is truncated after the last record is written to the file. Padding occurs for files specified with the assign -s bin command and the assign -s unblocked command. Padding usually occurs when noncharacter variables follow character variables in an unformatted direct-access file.

No padding is done in an unformatted sequential access file. An unformatted direct-access file created by a Fortran program on a UNICOS or UNICOS/mk system contains records that are the same length. The endfile record is recognized in sequential-access files.

7.1.2 assign -s sbin File Processing (Not Recommended)

You can use an assign -s sbin specification for a Fortran file that is opened with either unformatted direct access or unformatted sequential access. The file does not contain record delimiters. The file created for assign -s sbin in this instance has an unblocked data file structure and uses unblocked file processing.

The assign -s sbin option can be specified for a Fortran file that is declared as formatted sequential access. Because the file contains records that are delimited with the new-line character, it is not an unblocked data file structure. It is the same as a text file structure.
The assign -s sbin option is compatible with the standard C I/O functions. See Chapter 5, page 49, for more details.

Note: Use of assign -s sbin is discouraged. Use assign -s text for formatted files, and assign -s unblocked for unformatted files.

7.1.3 assign -s bin File Processing (Not Recommended)

An I/O request for a file that is specified with assign -s bin does not need to be a multiple of a specific number of bytes. On UNICOS and UNICOS/mk systems, padding occurs when noncharacter variables follow character variables in an unformatted record.

The I/O library uses an internal buffer for the records. If opened for sequential access, a file is not truncated after each record is written to the file.

7.1.4 assign -s u File Processing

The assign -s u command specifies undefined or unknown file processing. An assign -s u specification can be used for a Fortran file that is declared as unformatted sequential or direct access. Because the file does not contain record delimiters, it has an unblocked data file structure. Both synchronous and asynchronous BUFFER IN and BUFFER OUT processing can be used with u file processing.

For best performance, a Fortran I/O request on a file assigned with the assign -s u command should be a multiple of a sector. I/O requests are not library buffered. They cause an immediate system call.

Fortran sequential files declared by using assign -s u are not truncated after the last word written. The user must execute an explicit ENDFILE statement on the file to get truncation.

7.2 Text File Structure

The text file structure consists of a stream of 8-bit ASCII characters. Every record in a text file is terminated by a newline character (\n, ASCII 012). Some utilities may omit the newline character on the last record, but the Fortran library will treat such an occurrence as a malformed record.
This file structure can be specified for a file that is declared as formatted sequential access or formatted direct access. It is the default file structure for formatted sequential access files. On UNICOS and UNICOS/mk systems, it is also the default file structure for formatted direct access files.

The assign -s text command specifies the library-buffered text file structure. Both library and system buffering are done for all text file structures (for more information about library buffering, see Chapter 8, page 79).

An I/O request for a file using assign -s text does not need to be a multiple of a specific number of bytes.

You cannot use BUFFER IN and BUFFER OUT statements with this structure. Use a BACKSPACE statement to reposition a file with this structure.

7.3 COS or Blocked File Structure

The cos or blocked file structure uses control words to mark the beginning of each sector and to delimit each record. You can specify this file structure for a file that is declared as unformatted sequential access. Synchronous BUFFER IN and BUFFER OUT statements can create and access files with this file structure. This file structure is the default structure for files declared as unformatted sequential access on UNICOS and UNICOS/mk systems.

You can specify this file structure with one of the following assign(1) commands:

    assign -s cos
    assign -s blocked
    assign -F cos
    assign -F blocked

These four assign commands result in the same file structure.

An I/O request on a blocked file is library buffered. For more information about library buffering, see Chapter 8, page 79.

In a COS file structure, one or more ENDFILE records are allowed. BACKSPACE statements can be used to reposition a file with this structure.

A blocked file is a stream of words that contains control words called Block Control Words (BCW) and Record Control Words (RCW) to delimit records. Each record is terminated by an EOR (end-of-record) RCW.
At the beginning of the stream, and every 512 words thereafter, (including any RCWs), a BCW is inserted. An end-of-file (EOF) control word marks a special record that is always empty. Fortran considers this empty record to be an endfile record. The end-of-data S–3695–36 75Application Programmer’s I/O Guide (EOD) control word is always the last control word in any blocked file. The EOD is always immediately preceded by an EOR, or an EOF and a BCW. Each control word contains a count of the number of data words to be found between it and the next control word. In the case of the EOD, this count is 0. Because there is a BCW every 512 words, these counts never point forward more than 511 words. A record always begins at a word boundary. If a record ends in the middle of a word, the rest of that word is zero filled; the ubc field of the closing RCW contains the number of unused bits in the last word. The following is a representation of the structure of a BCW: m unused bdf unused bn fwi (4) (7) (1) (19) (24) (9) Field Bits Description m 0-3 Type of control word; 0 for BCW bdf 11 Bad Data flag (1-bit). bn 31-54 Block number (modulo 2 24 ). fwi 55-63 Forward index; the number of words to next control word. The following is a representation of the structure of an RCW: m ubc tran bdf srs unused pfi pri fwi (4) (6) (1) (1) (1) (7) (20) (15) (9) Field Bits Description m 0-3 Type of control word; 10 8 for EOR, 16 8 for EOF, and 17 8 for EOD. ubc 4-9 Unused bit count; number of unused low-order bits in last word of previous record. tran 10 Transparent record field (unused). bdf 11 Bad data flag (unused). 76 S–3695–36File Structures [7] Field Bits Description srs 12 Skip remainder of sector (unused). pfi 20-39 Previous file index; offset modulo 2 20 to the block where the current file starts (as defined by the last EOF). pri 40-54 Previous record index; offset modulo 2 15 to the block where the current record starts. 
    fwi     55-63   Forward index; the number of words to next control word.

7.4 Tape and Bmx File Structure

The tape or bmx file structure is used for online tape access through the UNICOS tape subsystem. You can use any type of sequential, formatted, unformatted, or buffer I/O to read or write an online tape if this file structure was specified. Each read or write request results in the processing of one tape block.

This file structure is the default option for doing any type of Fortran I/O to an online tape file. The file structure can be specified with one of the following commands:

    assign -s bmx
    assign -s tape
    assign -F bmx
    assign -F tape

These assign(1) commands result in the same file structure.

This structure can be used only with online IBM-compatible tape files or with ER90 volumes mounted in blocked mode. See the Tape Subsystem User’s Guide for more information on library interfaces to ER90 volumes.

7.4.1 Library Buffers

When using Fortran I/O or FFIO for online tapes and the tape or bmx file structure, all of the user’s data passes through a library buffer. The size and number of buffers can affect performance. Each of the library’s buffers must be a multiple of the maximum block size (MBS) on the tape, as specified by the tpmnt -b command.

On IOS model D systems, one tape buffer is allocated by default. The buffer size is either MBS or (MBS × n), whichever is larger (n is the largest integer such that MBS × n ≤ 65536).

On IOS model E systems, the default is to allocate 2 buffers of 4 × MBS each, with a minimum of 65,536 bytes, provided that the total buffer size does not exceed a threshold defined within the library. If the MBS is too large to accommodate this formula, the size of the buffers is adjusted downward, and the number is adjusted downward to remain under the threshold.
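As an illustration only, the IOS model D default described above might be computed as in the following Python sketch. It assumes that the condition on n reads "MBS × n does not exceed 65536"; the function name model_d_buffer_bytes is hypothetical and is not part of any Cray library. The model E default is not sketched because it depends on a library-internal threshold that is not given here.

```python
def model_d_buffer_bytes(mbs):
    """Sketch of the IOS model D default tape buffer size, in bytes.

    mbs is the maximum block size from tpmnt -b. Assumption: n is the
    largest integer with mbs * n <= 65536 (n is 0 when mbs > 65536).
    """
    if mbs <= 0:
        raise ValueError("maximum block size must be positive")
    n = 65536 // mbs          # largest n such that mbs * n <= 65536
    return max(mbs, mbs * n)  # "either MBS or (MBS x n), whichever is larger"
```

With this reading, a 32768-byte MBS yields one 65536-byte buffer, while an MBS larger than 65536 bytes yields a buffer of exactly one MBS.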
In all cases, at least one buffer of at least the MBS in bytes is allocated.

During a write request, the library copies the user’s data to its buffer. Each of the user’s records must be placed on a 4096-byte boundary within the library buffer. After a user’s record is copied to the library buffer, the library checks the remaining free buffer space. If it is less than the maximum block size specified with the tpmnt -b command, the library issues an asynchronous write (writea(2)) system call. If the user requests that a tape mark be written, this also causes the library to issue a writea system call.

When using Fortran I/O or FFIO to read online tapes, the system determines how much data can be placed in the user’s buffers. Reading a user’s tape mark stops all outstanding asynchronous I/O to that file.

Buffering [8]

This chapter provides an overview of buffering and a description of file buffering as it applies to I/O.

8.1 Buffering Overview

I/O is the process of transferring data between a program and an external device. The process of optimizing I/O consists primarily of making the best possible use of the slowest part of the path between the program and the device.

The slowest part is usually the physical channel, which is often slower than the CPU or a memory-to-memory data transfer. The time spent in I/O processing overhead can reduce the amount of time that a channel can be used, thereby reducing the effective transfer rate. The biggest factor in maximizing this channel speed is often the reduction of I/O processing overhead.

A buffer is a temporary storage location for data while the data is being transferred. A buffer is often used for the following purposes:

• Small I/O requests can be collected into a buffer, and the overhead of making many relatively expensive system calls can be greatly reduced.
  A collection buffer of this type can be sized and handled so that the actual physical I/O requests made to the operating system match the physical characteristics of the device being used. For example, a 42-sector buffer, when read or written, transfers a track of data between the buffer and the DD-49 disk; a track is a very efficient transfer size.

• Many data file structures, such as cos, contain control words. During the write process, a buffer can be used as a work area where control words can be inserted into the data stream (a process called blocking). The blocked data is then written to the device. During the read process, the same buffer work area can be used to examine and remove these control words before passing the data on to the user (deblocking).

• When data access is random, the same data may be requested many times. A cache is a buffer that keeps old requests in the buffer in case these requests are needed again. A cache that is sufficiently large and/or efficient can avoid a large part of the physical I/O by having the data ready in a buffer. When the data is often found in the cache buffer, it is referred to as having a high hit rate. For example, if the entire file fits in the cache and the file is present in the cache, no more physical requests are required to perform the I/O. In this case, the hit rate is 100%.

• Running the disks and the CPU in parallel often improves performance; therefore, it is useful to keep the CPU busy while data is being moved. To do this when writing, data can be transferred to the buffer at memory-to-memory copy speed and an asynchronous I/O request can be made. The control is then immediately returned to the program, which continues to execute as if the I/O were complete (a process called write-behind). A similar process can be used while reading; in this process, data is read into a buffer before the actual request is issued for it.
  When it is needed, it is already in the buffer and can be transferred to the user at very high speed. This is another form or use of a cache.

Buffers are used extensively. Some of the disk controllers have built-in buffers. The kernel has a cache of buffers called the system cache that it uses for various I/O functions on a system-wide basis. The Cray IOS uses buffers to enhance I/O performance. The UNICOS logical device cache (ldcache) is a buffering scheme that uses a part of the solid-state storage device (SSD) or buffer memory resident (BMR) in the IOS as a large buffer that is associated with a particular file system. The library routines also use buffers.

The I/O path is divided into two parts. One part includes the user data area, the library buffer, and the system cache. The second part is referred to as the logical device, which includes the ultimate I/O device and all of the buffering, caching, and processing associated with that device. This includes any caching in the disk controller and the operating system.

Users can directly or indirectly control some buffers. These include most library buffers and, to some extent, system cache and ldcache. Some buffering, such as that performed in the IOS or the disk controllers, is not under user control.

A well-formed request is an I/O request that meets the criteria for UNICOS systems. A well-formed request for a disk file requires the following:

• The size of the request must be a multiple of the sector size in bytes. For most disk devices, this will be 4096 bytes.

• The data that will be transferred must be located on a word boundary.

• The file must be positioned on a sector boundary. This will be a 4096-byte sector boundary for most disks.

8.2 Types of Buffering

The following sections briefly describe unbuffered I/O, library buffering, system cache buffering, and ldcache.

8.2.1 Unbuffered I/O

The simplest form of buffering is none at all; this unbuffered I/O is known as raw I/O.
For sufficiently large, well-formed requests, buffering is not necessary; it can add unnecessary overhead and delay. The following assign(1) command specifies unbuffered I/O:

    assign -s u ...

Use the assign command to bypass library buffering and the UNICOS system cache for all well-formed requests. The data is transferred directly between the user data area and the logical device. Requests that are not well formed use system cache.

8.2.2 Library Buffering

The term library buffering refers to a buffer that the I/O library associates with a file. When a file is opened, the I/O library checks the access, form, and any attributes declared on the assign command to determine the type of processing that should be used on the file. Buffers are usually an integral part of the processing.

If the file is assigned with one of the following options, library buffering is used:

    -s blocked
    -s tape
    -s bmx
    -F spec (buffering as defined by spec)
    -s cos
    -s bin
    -s unblocked

The -F option specifies flexible file I/O (FFIO), which uses library buffering if the specifications selected include a need for some buffering. In some cases, more than one set of buffers might be used in processing a file. For example, the -F blankx,cos option specifies two library buffers for a read of a blank compressed COS blocked file. One buffer handles the blocking and deblocking associated with the COS blocked control words and the second buffer is used as a work area to process the blank compression. In other cases (for example, -F system), no library buffering occurs.

8.2.3 System Cache

The operating system uses a set of buffers in kernel memory for I/O operations. These are collectively called the system cache. The I/O library uses system calls to move data between the user memory space and the system buffer.
The system cache ensures that the actual I/O to the logical device is well formed, and it tries to remember recent data in order to reduce physical I/O requests. In many cases, though, it is desirable to bypass the system cache and to perform I/O directly between the user’s memory and the logical device.

If requests are well formed and the O_RAW flag is set by the libraries when the file is opened, system cache is bypassed, and I/O is done directly between the user’s memory space and the logical device.

On UNICOS systems, if the requests are not well formed, the system cache is used even if the O_RAW flag was selected at open time.

If UNICOS ldcache is present, and the request is well formed, I/O is done directly between the user’s memory and ldcache even if the O_RAW bit was not selected.

The following assign(1) command options do not set the O_RAW bit, and they can be expected to use system cache:

    -s sbin
    -F spec (FFIO, depends on spec)

The following assign command options set the O_RAW flag and bypass the system cache on UNICOS and UNICOS/mk systems:

    -r on
    -s unblocked
    -s cos (or -s blocked)
    -s bin
    -s u
    -F spec (FFIO, depends on spec)

See the Tape Subsystem User’s Guide for details about the use of system caching and tapes.

For the assign -s cos, assign -s bin, and assign -s bmx commands, a library buffer ensures that the actual system calls are well formed. This is not true for the assign -s u option. Even if you plan to bypass the system cache, only well-formed requests bypass it; all other requests go through the cache.

The assign -l buflev option controls kernel buffering. It is used by Fortran I/O, auxiliary I/O, and FFIO. The buflev argument can be any of the following values:

• none: sets O_RAW and O_LDRAW

• ldcache: sets O_RAW, clears O_LDRAW

• full: clears O_RAW and O_LDRAW

If this option is not set, the level of system buffering is dependent on the type of open operation being performed.
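The rules in this section can be summarized in a small Python model. This is only a sketch for reasoning about the text, not library code: the function names are hypothetical, the 8-byte word size is an assumption derived from 512-word sectors of 4096 bytes, and the ldcache path and the effect of O_LDRAW are deliberately not modeled.

```python
SECTOR_BYTES = 4096  # sector size for most disk devices, per the text
WORD_BYTES = 8       # assumption: 64-bit words (512 words = 4096 bytes)

# Flag settings for each assign -l buflev value, as listed above.
BUFLEV_FLAGS = {
    "none":    {"O_RAW": True,  "O_LDRAW": True},
    "ldcache": {"O_RAW": True,  "O_LDRAW": False},
    "full":    {"O_RAW": False, "O_LDRAW": False},
}

def is_well_formed(nbytes, address, file_offset,
                   sector=SECTOR_BYTES, word=WORD_BYTES):
    """Check the three well-formed criteria for a disk request."""
    return (nbytes % sector == 0          # size is a multiple of the sector size
            and address % word == 0       # data starts on a word boundary
            and file_offset % sector == 0)  # file positioned on a sector boundary

def uses_system_cache(well_formed, o_raw):
    """True if the request goes through the system cache.

    Only a well-formed request on a file opened with O_RAW bypasses it.
    """
    return not (well_formed and o_raw)
```

For example, an 8192-byte request at file offset 4096 from a word-aligned address is well formed, so with buflev set to none it would bypass the system cache in this model; a 100-byte request would use the cache regardless of O_RAW.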
8.2.3.1 Restrictions on Raw I/O

The conditions under which UNICOS/mk can perform raw I/O are different from the conditions under the UNICOS operating system. In order for raw I/O to be possible under UNICOS/mk, the starting memory address of the transfer must be aligned on a cache-line boundary. On Cray T3E systems, this means that the address must be 0 modulo 64 bytes.

A C program can cause static or stack data to be aligned correctly by using the following compiler directive:

    _Pragma("_CRI cache_align buff");

buff is the name of the data to be aligned. The malloc library memory allocation functions always return aligned pointers.

In most cases where raw I/O cannot be performed due to incorrect alignment, the system will perform buffered I/O instead. The O_WELLFORMED open flag causes the ENOTWELLFORMED error to be returned in such cases.

8.2.4 Logical Cache Buffering

On UNICOS systems, the following elements are part of the logical device: ldcache, IOS models B, C, and D, IOS buffer memory, and cache in the disk controllers. These buffers are connected to the file system on which the file resides.

8.2.5 Default Buffer Sizes

The Fortran I/O library automatically selects default buffer sizes. You can override the defaults by using the assign(1) command. The following subsections describe the default buffer sizes on various systems.

Note: One block is 4,096 bytes on UNICOS and UNICOS/mk systems.

8.2.5.1 UNICOS and UNICOS/mk Default Buffer Sizes

The default buffer sizes are as follows:

Access Type                       Buffer Size

Sequential access, formatted      The default buffer size is eight blocks.

Sequential access, unformatted    The default buffer size is the larger of the following:
                                  • The large I/O size.
                                  • The preferred I/O block size. For more information on this, see the stat(2) man page.
                                  • 48 blocks.
                                  If this results in a buffer larger than 64 blocks, then two buffers are allocated and the I/O is performed asynchronously.
                                  For more information, see the description of the cos layer in the INTRO_FFIO(3F) man page.

Direct access, formatted          The default buffer size is the smaller of the following:
                                  • The record length in bytes + 1
                                  • Eight blocks

Direct access, unformatted        The default buffer size is the larger of the following:
                                  • The record length
                                  • Eight blocks
                                  The maximum default buffer size is 100 blocks. Four buffers of this size are allocated. For more information, see the description of the cachea layer in the INTRO_FFIO(3F) man page.

Devices [9]

This chapter describes the types of storage devices available on UNICOS and UNICOS/mk systems, including tapes, solid-state storage device (SSD), disks, and main memory. The type of I/O device used affects the I/O transfer rate.

9.1 Tape

The UNICOS tape subsystem runs on all UNICOS systems and is designed for system users who have large-scale data handling needs. Users can read or write to a tape with formatted or unformatted sequential Fortran I/O statements, buffer I/O, and the READDC(3F), READP(3F), WRITEC(3F), and WRITEP(3F) I/O routines.

A Fortran program interfaces with the tape subsystem through the Fortran I/O statements and the I/O library. The Tape Subsystem User’s Guide describes the tape subsystem in detail.

9.1.1 Tape I/O Interfaces

There are two different types of tape I/O interfaces: the traditional read[a] and write[a] system calls and tapelist I/O, which is unique to magnetic tape processing on UNICOS and UNICOS/mk systems.

Tapelist I/O allows the user to make several I/O requests in one system exchange. It also allows processing of user tape marks, bad tape data, and end-of-volume (EOV) processing.

The system libraries provide the following four common ways to perform tape I/O:

• Through the use of the system calls.

• Through the stdio library, which is commonly used from C.
  This method provides no means to detect or regulate the positioning of tape block breaks on the tape.

• Through Fortran I/O (not fully supported on UNICOS/mk systems). This provides bad data handling, foreign data conversion, EOV processing, and high-performance asynchronous buffering. Only a subset of these functions is currently supported through Fortran I/O for the ER90 tape device.

• Through the Flexible File I/O (FFIO) system (not available on UNICOS/mk systems). FFIO is used by Fortran I/O and is also available to C users. It provides bad data handling, foreign data conversion, EOV processing, and asynchronous buffering. FFIO uses tapelist I/O. For more information about FFIO, see the INTRO_FFIO(3F) man page. Only a subset of these functions is currently supported through Fortran I/O for the ER90 tape device.

9.1.2 Tape Subsystem Capabilities

The tape subsystem provides the following capabilities:

• Label processing
• Reading and writing of tape marks
• Tape positioning
• Automatic volume recognition (AVR)
• Multivolume tape files
• Multifile volume allocation
• Foreign dataset conversion on UNICOS and UNICOS/mk systems
• User end-of-volume (EOV) processing
• Concatenated tape files

The tape subsystem supports the following user commands on UNICOS and UNICOS/mk systems:

    Command     Description
    rls(1)      Releases reserved tape resources
    rsv(1)      Reserves tape resources
    tpmnt(1)    Requests a tape mount for a tape file
    tprst(1)    Displays reserved tape status for the current session ID
    tpstat(1)   Displays current tape status

See the Tape Subsystem User’s Guide for more details about the tape subsystem.

9.2 SSD

The SSD is a high-performance device that is used for temporary storage. It is configured as a linear array of 4096-byte blocks. The total number of available blocks depends on the physical size of the SSD.
The data is transferred between the mainframe’s central memory and the SSD through special channels. The actual speed of these transfers depends on the SSD and the system configuration. The SSD Solid-state Storage Device Hardware Reference Manual, publication HR-0031, describes the SSD.

The SSD has a very fast transfer rate and a large storage capacity. It is ideal for large scratch files, out-of-core solutions, cache space for I/O transfers such as ldcache, and other high-volume, temporary uses.

You can configure the SSD for the following three different types of storage:

• SSD file systems
• Secondary data segments (SDS)
• ldcache

All three implementations can be used within the same SSD. The system administrator allocates a fixed amount of space to each implementation, based on system requirements. The following sections describe these implementations.

9.2.1 SSD File Systems

In the UNICOS operating system, file storage space is divided into file systems. A file system is a logical device made up of slices from various physical devices. A slice is a set of consecutive cylinders or blocks. Each file system is mounted on a directory name so that users can access the file system through the directory name. Thus, if a file system is composed of SSD slices, any file or its descendants that are written into the associated directory will reside on SSD.

To use an SSD file system from a Fortran program, users must ensure that the path name of the file contains the appropriate directory. For example, if an SSD resident file system is mounted on the /tmp directory, use the assign(1) command to assign a file to that directory and the file will reside on the SSD.

Example:

    assign -a /tmp/ssdfile u:10

Users can also use the OPEN statement in the program to open a file in the directory.

SSD file systems are useful for holding frequently referenced files such as system binary files and object libraries.
Some sites use an SSD file system for system swapping space such as /drop or /swapdev. Finally, SSD file systems can be used as a fast temporary scratch space.

9.2.2 Secondary Data Segments (SDS)

The secondary data segment (SDS) feature allows the I/O routines to treat part of the SSD like an extended or secondary memory. SDS allows I/O requests to move directly between memory and SSD; this provides sustained transfer rates that are faster than those of SSD file systems.

Users must explicitly request SDS space for a process, but the space is released automatically when the program ends. Users can request that several files reside in SDS space, but the total amount of SDS space requested for the files must be within the SDS allocation limit for the user.

To request SDS space for unit 11 from a Fortran program, use either of the following assign commands:

    assign -F cos,sds u:11

or

    assign -F cachea.sds u:11

The ssread(2) and sswrite(2) system calls can be called from a Fortran program to move data between a buffer and SDS directly. ssread, sswrite, and ssbreak should not be used in a Fortran program that accesses SDS through the assign command because the libraries use SDSALLOC(3F) to control SDS allocation. Using SSBREAK directly from Fortran conflicts with the SDS management provided by SDSALLOC. The UNICOS System Calls Reference Manual describes ssbreak, ssread, and sswrite.

On UNICOS/mk systems, the library does not handle allocation of SDS space from more than one processing element (PE). For files opened from different PEs, do not use SDSALLOC, assign -F sds, or the sds option of assign -F cache or assign -F cachea.

A Fortran programmer can use the CDIR$ AUXILIARY compiler directive to assign SDS space to the arrays specified on the directive line. The name of an auxiliary array or variable must not appear in an I/O statement. See the Fortran Language Reference manuals for your compiler system for a description of this feature.
The UNICOS File Formats and Special Files Reference Manual describes SDS.

9.2.3 Logical Device Cache (ldcache)

The system administrator can allocate a part of the SDS space as ldcache. ldcache is a buffer space for the most heavily used disk file systems. It is assigned one file system at a time. Allocation of the units within each assigned space is done on a least recently used basis. When a given file system’s portion of the ldcache is full, the least recently accessed units are flushed to disk.

You do not need to change a Fortran program to make use of ldcache. The program or operating system issues physical I/O requests to disk.

9.3 Disk Drives

Several permanent mass storage devices or disks are available with UNICOS and UNICOS/mk systems. A disk system for UNICOS and UNICOS/mk systems consists of I/O processors, disk controller units, and disk storage units.

A sector is the smallest unit of allocation for a file in the file system. It is also the smallest unit of transfer; all I/O is performed in sectors.

In each disk storage unit, the recording surface available to a read/write head group is called a disk track. Each track contains a number of sectors in which data can be recorded and read back. The data in one sector is called a data block; the size of the data block varies with the disk type. The number of sectors per track, the number of tracks per cylinder, and the number of cylinders per drive also vary according to the type of disk storage unit. For example, a DD-49 disk storage unit contains 886 cylinders with 8 tracks per cylinder and 42 sectors per track.

See the dsk(4), disksipn(7), disksfcn(7), and disksmpn(7) man pages for complete details.

The following table lists sector size, track size, and tracks per cylinder for a variety of disks:

Table 2. Disk information

    Disk type   Sector size   Track size     Tracks per
                (in words)    (in sectors)   cylinder
    DD-49       512           42             8
    DD-40       512           48             19
    DD-41       512           48             15
    DD-42       512           48             19
    DD-40r      512           48             19
    DD-60       2048          23             2
    DA-60       8192          23             2
    DD-61       512           11             19
    DD-62       512           28             9
    DA-62       2048          26             9
    DD-301      512           25             7
    DA-301      2048          25             7
    DD-302      4096          28             7
    DA-302      16384         28             7

This information is useful when you must determine an efficient buffer size.

Disk-based storage under the UNICOS operating system is divided into logical devices. A logical disk device is a collection of blocks on one or more physical disks or other logical disk devices. These blocks are collected into partitions to be used as file system entities. A block is a sector.

An optional striping capability exists for all disk drives. Striping allows a group of physical devices to be treated as one large device with a potential I/O rate of a single device multiplied by the number of devices in the striped group. Striped devices must consist of physical devices that are all of the same type. I/O requests using striping should be in multiples of n × ts bytes; n is the number of devices in the group and ts is the track size of the disk in bytes (not in words or sectors). For most disks this figure will be n × 4096 bytes. For DD-60 disks, n must be rounded to the nearest multiple of 4 because its sector size is 16 Kbytes.

Disk striping on some systems can enhance effective transfer rates to and from disks.

9.4 Main Memory

The assign(1) command provides an option to declare certain files to be memory resident. This option causes these files to reside within the field length of the user’s process; its use can result in very fast access times.

To be most effective, this option should be used only with files that will fit within the user’s field length limit. A program with a fixed-length heap and memory resident files may deplete memory during execution.
Sufficient space for memory resident files may exist but may not exist for other run-time library allocations.

See Chapter 6, page 55, for details about using the assign command.

Introduction to FFIO [10]

This chapter provides an overview of the capabilities of the flexible file I/O (FFIO) system, sometimes called layered I/O. The FFIO system is used to perform many I/O-related tasks. For details about each individual I/O layer, see Chapter 14, page 179.

10.1 Layered I/O

The FFIO system is based on the concept that for all I/O a list of processing steps must be performed to transfer the user data between the user’s memory and the desired I/O device. Computer manufacturers have always provided I/O options to users because I/O is often the slowest part of a computational process. In addition, it is extremely difficult to provide one I/O access method that works optimally in all situations.

The following figure depicts the typical flow of data from the user’s variables to and from the I/O device.

[Figure 2. Typical data flow. Labels recoverable from the figure: User’s job, System call, Kernel.]

It is useful to think of each of these boxes as a stopover for the data, and each transition between stopovers as a processing step.

Each transition has benefits and costs. Different applications might use the I/O process in different ways. For example, if I/O requests are large, the library buffer is unnecessary, because the buffer is used primarily to avoid making system calls for every small request. You can achieve better I/O throughput with large I/O requests by not using library buffering.

If you don’t include library buffering, I/O requests should be on sector boundaries; otherwise, I/O performance will be degraded. On the other hand, if all I/O requests are very small, the library buffer is essential to avoid making a costly system call for each I/O request.
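The trade-off between small requests and system calls can be made concrete with a toy model. The following Python sketch is not the Cray library: CollectingWriter is a hypothetical class, and its sink argument is an assumed stand-in for a system call. It collects small writes in a buffer and counts how many times the sink is actually invoked.

```python
class CollectingWriter:
    """Toy collection buffer: gathers small writes into large ones."""

    def __init__(self, sink, capacity):
        self.sink = sink          # callable standing in for a system call
        self.capacity = capacity  # buffer size in bytes
        self.pending = bytearray()
        self.calls = 0            # number of simulated system calls

    def write(self, data):
        self.pending += data
        # Emit only full buffers; the remainder waits for more data.
        while len(self.pending) >= self.capacity:
            self._emit(self.capacity)

    def flush(self):
        # Emit whatever is left, for example when the file is closed.
        if self.pending:
            self._emit(len(self.pending))

    def _emit(self, nbytes):
        self.sink(bytes(self.pending[:nbytes]))
        del self.pending[:nbytes]
        self.calls += 1


def count_calls(num_requests, request_bytes, capacity):
    """Return the simulated system call count for a run of equal writes."""
    chunks = []
    writer = CollectingWriter(chunks.append, capacity)
    for _ in range(num_requests):
        writer.write(b"x" * request_bytes)
    writer.flush()
    return writer.calls
```

In this model, 1024 eight-byte requests pass through a 4096-byte buffer as two sink calls instead of 1024. A single request that already matches the buffer size gains nothing from the extra copy, which is the situation in which bypassing the library buffer is attractive.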
It is useful to be able to modify the I/O process to prevent intermediate steps (such as buffering) for existing programs without requiring that the source code be changed. The assign(1) command lets you modify the total user I/O path by establishing an I/O environment.

The FFIO system lets you specify each stopover in Figure 2, page 95. You can specify a comma-separated list of one or more processing steps by using the assign -F command:

    assign -F spec1,spec2,spec3...

Each spec in the list is a processing step that requests one I/O layer, or logical grouping of layers. The layer specifies the operations that are performed on the data as it is passed between the user and the I/O device. A layer refers to the specific type of processing being done. In some cases, the name corresponds directly to the name of a layer. In other cases, specifying one layer invokes the routines used to pass the data through multiple layers. See the INTRO_FFIO(3F) man page for details about using the -F option to the assign command.

On a WRITE operation, processing steps for the -F option are ordered with the user specification listed first and the system or device last, as in the following example:

    assign -F user,blankx,system

With this specification, a WRITE operation first performs the user operation on the data, then performs the blankx operation, and then sends the data to the system. In a READ operation, the order is reversed; the process is performed from the end of the option arguments to the beginning, or from right to left. The data moves from the system to the user. The layers closest to the user are higher-level layers; those closer to the system are lower-level layers.

The FFIO system has an internal model of the world of data, which it maps to any logical file type. Four of these concepts are basic to understanding the inner workings of the layers.

    Concept             Definition
    Data                Data is a stream of bits.
    Record marks        End-of-record marks (EOR) are boundaries between logical records.
    File marks          End-of-file marks (EOF) are special types of record marks that separate files in some file formats.
    End-of-data (EOD)   An end-of-data (EOD) is a point immediately beyond the last data bit, EOR, or EOF in the file.

All files are streams of 0 or more bits that may contain record or file marks.

Individual layers have varying rules about which of these things can appear in a file and in which order they can appear.

Fortran programmers and C programmers can use the capabilities described in this document. Fortran users can use the assign(1) command to specify these FFIO options. For C users, the FFIO layers are available only to programs that call the FFIO routines directly (ffopen(3C), ffread(3C), and ffwrite(3C)).

You can use FFIO with the following Fortran I/O forms:

• Buffer I/O
• Unformatted sequential
• Unformatted direct access
• Word addressable
• Mass Storage (MS) and Direct Random (DR) packages
• Formatted sequential
• Namelist
• List-directed

The MS package and the DR package include the OPENMS, WRITMS, READMS, FINDMS, CHECKMS, WAITMS, ASYNCMS, SYNCMS, STINDX, CLOSMS, OPENDR, WRITDR, READDR, and CLOSDR library routines.

10.2 Using Layered I/O

The specification list on the assign -F command comprises all of the processing steps that the I/O system performs. If assign -F is specified, any default processing is overridden. For example, unformatted sequential I/O is assigned a default structure of cos. The -F cos option provides the same structure. The FFIO system provides detailed control over I/O processing requests. However, to effectively use the cos option (or any FFIO option), you must understand the I/O processing details.

As a very simple example, suppose you were making large I/O requests and did not require buffering or blocking on your data.
You could specify the following:

    assign -F system

The system layer is a generic system interface that chooses an appropriate layer for your file. If the file is on disk, it chooses the syscall layer, which maps each user I/O request directly to the corresponding system call. A Fortran READ statement is mapped to one or more read(2) system calls and a Fortran WRITE statement to one or more write(2) system calls. This results in almost the same processing as would be done if the assign -s u command was used.

If you want your file to be cos blocked (the default blocking for Fortran unformatted I/O), you can specify the following:

    assign -F cos,system

If you want your file to be F77 blocked, you can specify the following:

    assign -F f77,system

These two specs request that each WRITE request first be blocked (blocking adds control words to the data in the file to delimit records). The cos layer then sends the blocked data to the system layer. The system layer passes the data to the device.

The process is reversed for READ requests. The system layer retrieves blocked data from the file. The blocked data is passed to the next higher layer, the cos layer, where it is deblocked. The deblocked data is then presented to the user.

A cos blocked, blank-compressed file can also be read. The following are the processing steps necessary to do this:

1. Issue system calls to read data from the device.

2. Deblock the data and deliver blank-compressed characters.

3. Decompress the characters and deliver them to the user.

In this case, the spec with system is on the right end and would be as follows:

    assign -F blankx,cos,system

You do not need to specify the system spec; it is always implied on the right end. To read the cos blocked, blank-compressed file, use the following specification:

    assign -F blankx,cos

Because the system spec is assumed, it is never required.

10.2.1 I/O Layers

Several different layers are available for the spec argument.
Each layer invokes one or more layers, which then handle the data they are given in an appropriate manner. For example, the syscall layer essentially passes each request to an appropriate system call. The tape layer uses an array of more sophisticated system calls to handle magnetic tape I/O. The blankx layer passes all data requests to the next lower layer, but it transforms the data before it is passed. The mr layer tries to hold an entire file in a buffer that can change size as the size of the file changes; it also limits actual I/O to lower layers so that I/O occurs only at open, close, and overflow.

The following tables list the classes you can specify for the spec argument to the assign -F option:

Table 3. I/O Layers available on all hardware platforms

    Layer             Function
    blankx or blx     Both specify the blank compression or expansion layer.
    bufa              Asynchronous buffering layer
    c205 and eta      Both specify the CDC CYBER 205 and ETA record formats.
    cache             Memory cached I/O
    cachea            Asynchronous memory cached I/O
    cdc               CDC 60-bit NOS/SCOPE file formats
    cos or blocked    COS blocking
    event             Monitors I/O layers
    er90              ER90 handlers
    f77               Record blocking common to most UNIX Fortran implementations
    fd                File descriptor open
    global            Distributed cache layer
    ibm               IBM file formats
    mr                Memory-resident file handlers
    nosve             CDC NOS/VE file formats
    sds               SDS-resident file handlers
    tape or bmx       Online tape handling
    vms               VAX/VMS file formats
    null              Syntactic convenience for users (does nothing)
    site              Site-specific layer
    syscall           System call I/O
    system            Generic system interface
    text              Newline separated record formats
    user              User-written layer

10.2.2 Layered I/O Options

You can modify the behavior of each I/O layer. The following spec format shows how you can specify a class and one or more opt and num fields:

    class.opt1.opt2:num1:num2:num3

For class, you can specify one of the layers listed in the previous tables.
Each of the layers has a different set of options and numeric parameter fields that can be specified. This is necessary because each layer performs different duties. The following rules apply to the spec argument:

• The class and opt fields are case-insensitive. For example, the following two specs are identical:

    Ibm.VBs:100:200
    IBM.vbS:100:200

• The opt and num fields are usually optional, but sufficient separators must be specified as placeholders to eliminate ambiguity. For example, the following specs are identical:

    cos..::40
    cos.::40
    cos::40

  In this example, opt1, opt2, num1, and num2 assume default values. Similarly, the sds layer also allows optional opt and num fields, and it sets opt1, opt2, num1, num2, and num3 to default values as required.

• To specify more than one spec, use commas between specs. Within each spec, you can specify more than one opt and num. Use periods between opt fields, and use colons between num fields.

The following options all have the same effect. They all specify the sds layer and set the initial SDS allocation to 100 512-word sectors:

    -F sds:100
    -F sds.:100
    -F sds..:100

The following option contains one spec for an sds layer that has an opt field of scr, which requests scratch file behavior:

    -F sds.scr

The following option requests two classes with no opts:

    -F cos,sds

The following option contains two specs and requests two layers: cos and sds. The cos layer has no options; the sds layer has options scr and ovfl, which specify that the file is a scratch file that is allowed to overflow, and that the maximum SDS allocation is 1000 sectors:

    -F cos,sds.scr.ovfl::1000

When possible, the default settings of the layers are set so that optional fields are seldom needed.

10.3 Setting FFIO Library Parameters

The UNICOS operating system supports a number of library parameters that can be tuned.
Sites can use these parameters to change both the performance of the libraries and some of their limits. Through a similar technique, users can also change these parameters when linking an application.

When SEGLDR is invoked, one of its first actions is to read the /lib/segdirs file, which defines the parameters of SEGLDR; this file contains an LINCLUDE directive for the file /usr/lib/segdirs/def_lib, which by default is empty. An administrator can place directives in this file to modify the SEGLDR behavior.

The following HARDREF directives select optional capabilities of the FFIO package to include in the standard libraries compiled into user programs by default.

Table 4. HARDREF Directives

    HARDREF =        FFIO option
    _f_ffvect        F-type records, fixed length
    _v_ffvect        V-type records, variable length
    _x_ffvect        X-type records
    _cos_ffvect      COS-type records, COS blocking
    _tape_ffvect     Magnetic tape handlers
    _cdc_ffvect      CDC 60-bit record handlers
    _sds_ffvect      SDS-resident file handlers
    _mr_ffvect       Memory-resident file handlers
    _trc_ffvect      Trace layer
    _txt_ffvect      Text-type records, newline separated records
    _fd_ffvect       Specified file descriptor
    _blx_ffvect      Blank compression handlers
    _cch_ffvect      Cache layer

Each of these directives refers to a list of function pointers. Each function-pointer list represents the set of routines necessary to process one or more options on the assign(1) command. Some of these layers are tied to specific hardware, such as tape or SDS. Others are foreign conversion options such as ETA System V-format data.

Not all of these layers are loaded into user programs by default. As delivered, the UNICOS operating system can read and write data in many different ways; however, only a subset of these capabilities is loaded into user programs by default, so that user executables are smaller.
If UNICOS source code is available, it is better to change the switches in fdcconfig.h, rather than to use these HARDREF directives, primarily because assign still issues warnings to users who use layers disabled in fdcconfig.h. Also, changing fdcconfig.h is the only way to disable layers that are shipped enabled by default.

Using FFIO [11]

This chapter describes how you can use flexible file I/O (FFIO) with common file structures and how to enhance code performance without changing your source code.

11.1 FFIO and Common Formats

This section describes the use of FFIO with common file structures and describes the correlation between the common and/or default file structures and the FFIO usage that handles them.

11.1.1 Reading and Writing Text Files

Most human-readable files are in text format; this format contains records composed of ASCII characters with each record terminated by an ASCII line-feed character, which is the newline character in UNIX terminology. The FFIO specification that selects this file structure is assign -F text.

The FFIO package is seldom required to handle text files. In the following types of cases, however, using FFIO may be necessary:

• Optimizing text file access to reduce I/O wait time
• Handling multiple EOF records in text files
• Converting data files to and from other formats

I/O speed is important when optimizing text file access. Using assign -F text is expensive in terms of CPU time, but it lets you use memory-resident and SDS files, which can reduce or eliminate I/O wait time.

The FFIO system also can process text files that have embedded EOF records. The ~e string alone in a text record is used as an EOF record. Editors such as sed(1) or other standard utilities can process these files, but it is sometimes easier with the FFIO system.

On UNICOS and UNICOS/mk systems, the text layer is also useful in conjunction with the fdcp(1) command. The text layer provides a standard output format.
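The ~e convention can be illustrated outside of FFIO. The following Python sketch is not part of the FFIO package: the function name split_on_eof_records is hypothetical, and the sketch assumes that a record counts as an EOF record only when ~e is the entire record. It splits a sequence of newline-terminated text records into separate files at each EOF record.

```python
def split_on_eof_records(lines, eof_marker="~e"):
    """Split text records into files at records that consist solely of eof_marker.

    Hypothetical helper for illustration; it models the text-file EOF
    convention only and does not call any FFIO routine.
    """
    files = [[]]
    for line in lines:
        record = line.rstrip("\n")
        if record == eof_marker:
            files.append([])          # EOF record: the next record starts a new file
        else:
            files[-1].append(record)  # ordinary data record
    return files


records = ["alpha\n", "beta\n", "~e\n", "gamma\n", "~e\n"]
print(split_on_eof_records(records))
# [['alpha', 'beta'], ['gamma'], []]
```

The trailing empty list reflects the final EOF record, after which no further records follow. A record such as "x ~e" is treated as data because ~e is not alone in the record.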
Many forms of data that are not considered foreign are sometimes encountered in a heterogeneous computing environment. If a record format can be described with an FFIO specification, it can usually be converted to text format by using the following script:

    OTHERSPEC=$1
    INFILE=$2
    OUTFILE=$3
    assign -F ${OTHERSPEC} ${INFILE}
    assign -F text ${OUTFILE}
    fdcp ${INFILE} ${OUTFILE}

Use the fdcp command to copy files while converting record blocking.

11.1.2 Reading and Writing Unblocked Files

The simplest form of data file format is the simple binary stream or unblocked data. It contains no record marks, file marks, or control words. This is usually the fastest way to move large amounts of data, because it involves a minimal amount of CPU and system overhead.

The FFIO package provides several layers designed specifically to handle this binary stream of data. These layers are syscall, sds, and mr. These layers behave the same from the user’s perspective; they only use different system resources. The unblocked binary stream is usually used for unformatted data transfer. It is not usually useful for text files or when record boundaries or backspace operations are required. The complete burden is placed on the application to know the format of the file and the structure and type of the data contained in it.

This lack of structure also allows flexibility; for example, a file declared with one of these layers can be manipulated as a direct-access file with any desired record length.

In this context, fdcp can be called to do the equivalent of the cp(1) command only if the input file is a binary stream and to remove blocking information only if the output file is a binary stream.

11.1.3 Reading and Writing Fixed-length Records

The most common use for fixed-length record files is for Fortran direct access. Both unformatted and formatted direct-access files use a form of fixed-length records.
The simplest way to handle these files with the FFIO system is with binary stream layers, such as system, syscall, cache, cachea, sds, and mr. These layers allow any requested pattern of access and also work with direct-access files. The syscall and system layers, however, are unbuffered and do not give optimal performance for small records.

The FFIO system also directly supports some fixed-length record formats.

11.1.4 Reading and Writing COS Blocked Files

The COS blocking format is the default file structure for all Fortran sequential unformatted files on UNICOS and UNICOS/mk systems, except tape files. The cos layer is provided to handle these files. It provides for COS blocked files on disk and on magnetic tape and it supports multifile COS blocked datasets.

The cos layer must be specified for COS blocked files. If COS is not the default file structure, or if you specify another layer, such as sds, you may have to specify a cos layer to get COS blocking.

11.2 Enhancing Performance

FFIO can be used to enhance performance in a program without changing the source code or recompiling the code. This section describes some basic techniques used to optimize I/O performance. Additional optimization options are discussed in Chapter 13, page 155.

11.2.1 Buffer Size Considerations

In the FFIO system, buffering is the responsibility of the individual layers; therefore, you must understand the individual layers in order to control the use and size of buffers.

The cos layer has high payoff potential to the user who wants to extract top performance by manipulating buffer sizes. As the following example shows, the cos layer accepts a buffer size as the first numeric parameter:

    assign -F cos:42 u:1

The preceding example declares a working buffer size for the cos layer of forty-two 4096-byte blocks.
This is an excellent size for a file that resides on a DD-49 disk drive because a track on a DD-49 disk drive is comprised of forty-two 4096-byte blocks (sectors).

If the buffer is sufficiently large, the cos layer also lets you keep an entire file in the buffer and avoid almost all I/O operations.

11.2.2 Removing Blocking

I/O optimization usually consists of reducing overhead. One part of the overhead in doing I/O is the CPU time spent in record blocking. For many files in many programs, this blocking is unnecessary. If this is the case, the FFIO system can be used to deselect record blocking and thus obtain appropriate performance advantages.

The following layers offer unblocked data transfer:

    Layer      Definition
    syscall    System call I/O
    bufa       Buffering layer
    cachea     Asynchronous cache layer
    sds        SDS-resident I/O
    cache      Memory-resident buffer cache
    mr         Memory-resident (MR) I/O

You can use any of these layers alone for any file that does not require the existence of record boundaries. This includes any applications that are written in C that require a byte stream file.

The syscall layer offers a simple direct system interface with a minimum of system and library overhead. If requests are larger than approximately 32 Kbytes, this method can be appropriate, especially if the requests are a uniform multiple of 4096 bytes.

The other layers are discussed in the following sections.

11.2.3 The bufa and cachea Layers

The bufa layer and cachea layer permit efficient file processing. Both layers provide library-managed asynchronous buffering, and the cachea layer allows recently accessed parts of a file to be cached either in main memory or in a secondary data segment.

The number of buffers and the size of each buffer are tunable. In the bufa:bs:nbufs or cachea:bs:nbufs FFIO specifications, the bs argument specifies the size in 4096-byte blocks of each buffer.
The default on UNICOS systems and UNICOS/mk systems depends on the st_oblksize field returned from a stat(2) system call of the file; if this return value is 0, the default is 489 for ER90 files and 8 for all other files. The nbufs argument specifies the number of buffers to use.

11.2.4 The sds Layer (Available Only on UNICOS Systems)

The sds layer is available only on UNICOS systems; it is not available on UNICOS/mk systems.

The sds layer lets you use the secondary data segment (SDS) feature as an I/O device for almost any file. SDS is one use of the solid-state storage device (SSD). SDS as a device is described in the UNICOS File Formats and Special Files Reference Manual. If SDS is available, the sds layer can yield very high performance. The sds transfer rate can approach 2 Gbit/s.

Used in combination with the other layers, COS blocked files, text files, and direct-access files can reside in SDS without recoding. This can provide excellent performance for any file or part of a file that can reside in SDS.

The sds layer offers the capability to declare a file to be SDS resident. It features both scratch and save mode, and it performs overflow to the next lower layer (usually disk) automatically. You can declare that a file should reside in SDS to the extent possible. The simplest specification is assign -F sds fort.1. This specification assumes default values for all options on the sds layer. By default, the sds layer is in save mode, which makes the SDS appear like an ordinary file. Because save is the assumed mode, any existing file is loaded into SDS when the file is opened. When the file is closed, the data is written back to the disk if the data was changed.

The sds layer overflows if necessary. Data that does not fit in the SDS space overflows to the next lower-level layer. This happens regardless of the reason for insufficient SDS space.
For example, if you are not validated to use SDS, all of the files that are declared to be SDS-resident immediately overflow.

In the previous assign(1) example, the overflow goes to disk file fort.1. The part of the file that fits in SDS remains there until the file is closed, but the overflowed portion resides on disk.

The assign -F command specifies the entire set of processing steps that are performed when I/O is requested. You can use other layers in conjunction with the sds layer to produce the desired file structures.

In the previous example, no specification exists for blocking on the file. Therefore, the resulting file structure is identical to the following:

    assign -s u fort.1

This is also identical to the following:

    assign -F syscall fort.1

If a file is COS blocked, a specification must be used that handles block and record control words. The following three examples produce identical files:

    assign -s cos fort.1
    assign -F cos fort.1
    assign -F cos,sds fort.1

If the file is read or written more than once, adding sds to the assign command provides speed.

If SDS space is unlimited, almost any unformatted sequential file referenced from Fortran I/O can be declared by using the following command:

    assign -F cos,sds unf_seq

Any formatted sequential file could be declared by using the following command:

    assign -F text,sds fmt_seq

Record blocking is not required for unformatted direct-access files; therefore, any unformatted direct-access file can be declared by using the following command:

    assign -F sds fort.1

In many cases, the cos specification is not necessary, but that decision must be made based on the specifics of the particular file and program.

All SDS space that the sds layer uses is obtained from the sdsalloc(3) library routines. Parameters, environment variables, and rules that pertain to these routines are fully applicable to this I/O technique.
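The specifications used in this section follow the class.opt1.opt2:num1:num2:num3 syntax of Section 10.2.2. As an illustration only, the following Python sketch shows how such a specification list decomposes into layers. It is not the parser used by assign(1). The names parse_spec and parse_spec_list are hypothetical, and the sketch assumes that empty opt fields can be dropped, that num fields are integers, and that an empty num field means "use the default" (represented here as None).

```python
def parse_spec(spec):
    """Parse one 'class.opt1.opt2:num1:num2:num3' spec into (class, opts, nums).

    Hypothetical helper for illustration only. The class and opt fields are
    lowercased because the guide describes them as case-insensitive.
    """
    head, *num_fields = spec.strip().split(":")
    cls, *opt_fields = head.split(".")
    opts = tuple(o.lower() for o in opt_fields if o)          # drop empty placeholders
    nums = tuple(int(n) if n else None for n in num_fields)   # None = default value
    return cls.lower(), opts, nums


def parse_spec_list(spec_list):
    """Split a comma-separated -F argument into per-layer specs, leftmost first."""
    return [parse_spec(s) for s in spec_list.split(",")]


print(parse_spec_list("cos,sds.scr.ovfl::1000"))
# [('cos', (), ()), ('sds', ('scr', 'ovfl'), (None, 1000))]
```

Under these assumptions, the three equivalent forms sds:100, sds.:100, and sds..:100 all parse to the same result, as do Ibm.VBs:100:200 and IBM.vbS:100:200.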
For information about possible fragmentation with SDS, see the ldcache(8) man page. Section 11.3, page 111, contains several sds layer examples.

11.2.5 The mr Layer

The mr layer lets you use main memory as an I/O device for many files. Used in combination with the other layers, COS blocked files, text files, and direct-access files can all reside in memory without recoding. This can result in excellent performance for any file or part of a file that can reside in memory.

If the file is small enough to fit in memory and is traversed many times, the wall-clock time can be reduced dramatically by using the mr layer to keep the file entirely in memory.

The mr layer lets you declare that a file is memory resident. It features both scratch and save mode, and it performs overflow to the next lower layer (usually disk) automatically.

Memory-resident files can run either in interactive or batch mode. The format for the mr layer on the assign(1) command is as follows:

    assign -F mr.savscr.ovfopt:min:max:incr

The assign -F command specifies the entire set of processing steps that are performed when I/O is requested. If the mr layer is specified alone, the resulting file structure is identical to the following:

    assign -s unblocked fort.1

If a file is COS blocked, you must specify the handling of block and record control words as in the following example:

    assign -s cos fort.1

The previous assign specification is identical to both of the following:

    assign -F cos fort.1
    assign -F cos,mr fort.1

Section 11.3, page 111, contains several mr program examples.

11.2.6 The cache Layer

The cache layer permits efficient file processing for repeated access to one or more regions of a file. It is a library-managed buffer cache that contains a tunable number of pages of tunable size.

To specify the cache layer, use the following option:

    assign -F cache[:[bs][:[nbufs]]]

The bs argument specifies the size in 4096-byte blocks of each cache page; the default is 8.
The nbufs argument specifies the number of cache pages to use. The default is 4.

You can achieve improved I/O performance by using one or more of the following strategies:

• Use a cache page size (bs) that is a multiple of the disk sector or track size. This improves the performance when flushing and filling cache pages.
• Use a cache page size that is a multiple of the user’s record size. This ensures that no user record straddles two cache pages. If this is not possible or desirable, it is best to allocate a few additional cache pages (nbufs).
• Use a number of cache pages that is greater than or equal to the number of file regions the code accesses at one time.

If the number of regions accessed within a file is known, the number of cache pages can be chosen first. To determine the cache page size, divide the amount of memory to be used by the number of cache pages. For example, suppose a program uses direct access to read 10 vectors from a file and then writes the sum to a different file:

      integer VECTSIZE, NUMCHUNKS, CHUNKSIZE
      parameter(VECTSIZE=1000*512)
      parameter(NUMCHUNKS=100)
      parameter(CHUNKSIZE=VECTSIZE/NUMCHUNKS)
      real a(CHUNKSIZE), sum(CHUNKSIZE)
      open(11,access='direct',recl=CHUNKSIZE*8)
      call asnunit (2,'-s unblocked',ier)
      open (2,form='unformatted')
      do i = 1,NUMCHUNKS
         sum = 0.0
         do j = 1,10
            read(11,rec=(j-1)*NUMCHUNKS+i)a
            sum=sum+a
         enddo
         write(2) sum
      enddo
      end

If 4 Mbytes of memory are allocated for buffers for unit 11, 10 cache pages should be used, each of the following size:

    4 MB/10 = 400000 bytes ≈ 97 blocks

Make the buffer size an even multiple of the record length of 40960 bytes by rounding it up to 100 blocks (= 409600 bytes), then use the following assign command:

    assign -F cache:100:10 u:11

11.3 Sample Programs for UNICOS Systems

The following examples contain coding examples using the different layers that were discussed previously.
Example 12: sds using buffer I/O

The following is an example of a batch request shell script that uses an sds layer with buffer I/O. In the following example, a batch job named exam1 contains the following statements:

    #QSUB -r exam1 -lT 10 -lQ 500000
    #QSUB -eo -o exam1.out
    set -x
    cd $TMPDIR
    cat > ex1.f <

IBM implicit numeric conversion

    HARDREF=CRAY2VAX   HARDREF=VAX2CRAY   Cray<->VAX/VMS implicit numeric conversion
    HARDREF=CRAY2NVE   HARDREF=NVE2CRAY   Cray<->NOS/VE implicit numeric conversion
    HARDREF=CRAY2IEG   HARDREF=IEG2CRAY   Cray<->IEEE implicit numeric conversion
    HARDREF=CRAY2ETA   HARDREF=ETA2CRAY   Cray<->ETA implicit numeric conversion
    HARDREF=CRAY2CDC   HARDREF=CDC2CRAY   Cray<->CDC 60-bit implicit numeric conversion
    HARDREF=CRI2IBM    HARDREF=IBM2CRI    Cray IEEE<->IBM
    HARDREF=CRI2IEG    HARDREF=IEG2CRI    Cray IEEE<->Generic IEEE

I/O Optimization [13]

Although I/O performance is one of the strengths of supercomputers, speeding up the I/O in a program is an often neglected area of optimization. A small optimization effort can often produce a surprisingly large gain.

The run-time I/O library contains low overhead, built-in instrumentation that can collect vital statistics on activities such as I/O. This run-time library offers a powerful tool set that can analyze the program I/O without accessing the program source code.

A wide selection of optimization techniques is available through the flexible file I/O (FFIO) system. You can use the assign(1) command to invoke FFIO for these optimization techniques. This chapter stresses the use of assign and FFIO because these optimization techniques do not require program recompilation or relinking.

For information about other optimization techniques, see Optimizing Application Code on UNICOS Systems. For information about optimization techniques on UNICOS/mk systems, see either the Cray T3E C and C++ Optimization Guide or the Cray T3E Fortran Optimization Guide.
This chapter describes ways to identify code that can be optimized and the techniques that you can use to optimize the code.

13.1 Overview

I/O can be represented as a series of layers of data movement. Each layer involves some processing. Figure 3, page 156 shows typical output flow from the UNICOS system to disk.

Figure 3. I/O layers (stage timings shown in the figure: ~0.1 ms, ~1 ms, ~18 ms)

On output, data moves from the user space to a library buffer, where small chunks of data are collected into larger, more efficient chunks. When the library buffer is full, a system request is made and the kernel moves the data to a system buffer. From there, the data is sent through the I/O processor (IOP), perhaps through ldcache, to the device. On input, the path is reversed.

The times shown in Figure 3 may not be duplicated on your system because many variables exist that affect timing. These times do, however, give an indication of the times involved in each processing stage.

For optimization purposes, it is useful to differentiate between permanent files and temporary files. Permanent files are external files that must be retained after the program completes execution. Temporary files or scratch files are usually created and reused during the execution of the program, but they do not need to be retained at the end of the execution.

Permanent files must be stored on actual devices. Temporary files exist in memory and do not have to be written to a physical device. With temporary files, the strategy is to avoid using system calls (going to "lower layers" of I/O processing). If a temporary file is small enough to reside completely in memory, you can avoid using system calls.

Permanent files require system calls to the kernel; because of this, optimizing the I/O for permanent files is more complicated. I/O on permanent files may require the full complement of I/O layers.
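The effect of the library buffer on the number of system requests can be estimated with simple arithmetic. The following Python sketch is a rough model, not a measurement: it assumes one system request per full buffer, ignores record control words, and uses hypothetical figures (2000 records of 4000 bytes each and a buffer of eight 4096-byte blocks). The function name system_requests is likewise hypothetical.

```python
def system_requests(record_bytes, record_count, buffer_bytes=None):
    """Estimate the number of system requests needed to write a file.

    With no library buffer, every record is assumed to cost one request.
    With a buffer, records are collected and one request is assumed per
    full (or final partial) buffer. Illustrative model only.
    """
    if buffer_bytes is None:
        return record_count
    total_bytes = record_bytes * record_count
    return -(-total_bytes // buffer_bytes)  # ceiling division


unbuffered = system_requests(4000, 2000)
buffered = system_requests(4000, 2000, buffer_bytes=8 * 4096)
print(unbuffered, buffered)
# 2000 245
```

In this model, collecting records in the library buffer cuts the number of trips to the kernel by roughly a factor of eight, which is the kind of saving the layers in Figure 3 are meant to provide.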
The goal of I/O optimization is to move data to and from the devices as quickly as possible. If that is not fast enough, you must find ways to overlap I/O with computation.

13.2 An Overview of Optimization Techniques

This section briefly describes the optimization techniques that are discussed in the remainder of this chapter.

13.2.1 Evaluation Tools

Use the ja(1) tool to determine the initial I/O performance and to verify improvements in I/O performance after you try different optimization techniques. See Section 13.3.1, page 160, for details.

13.2.2 Optimizations Not Affecting Source Code

The following types of optimization may improve I/O performance:

• Use the type of storage devices that are effective for the types of I/O done by the program. Try the mr or sds layers (see Section 13.4.1, page 161, or Section 13.4.3, page 165).
• Specify the cache page size so that one or more records will fit on a cache page if the program is using unformatted direct access I/O (see Section 13.4.4, page 166, for details).
• Use file structures without record control information to bypass the overhead associated with records (see Section 13.5.5, page 173, for details).
• Choose file processing with appropriate buffering strategies. The cos, bufa, and cachea FFIO layers implement asynchronous write-behind (see Section 13.5.4, page 172, for details). The cos and bufa FFIO layers implement asynchronous read-ahead; this is available for the cachea layer through use of an assign option.
• Choose efficient library buffer sizes. Bypass the library buffers when possible by using the system or syscall layers (see Section 13.7.1, page 174, for details).
• Determine whether the use of striping, preallocation of the file, and the use of contiguous disk space would improve I/O performance (see Section 13.4.6, page 167, for details).
• Use the assign command to specify scratch files to prevent writes to disk and to delete the files when they are closed (see Section 13.5.1, page 168, for details).

Section 11.2, page 105, also provides further information about using FFIO to enhance I/O performance.

13.2.3 Optimizations that Affect Source Code

The following source program changes may affect the I/O performance of a Fortran program:

• Use unformatted I/O when possible to bypass conversion of data.
• Use whole array references in I/O lists where possible. The generated code passes the entire array to the I/O library as the I/O list item rather than pass it through several calls to the I/O library.
• Use special packages such as buffer I/O, random-access I/O, and asynchronous queued I/O.
• Overlap CPU time and I/O time by using asynchronous I/O.

13.2.4 Optimizing I/O Speed

I/O optimization can often be accomplished by simply addressing I/O speed. The following UNICOS storage systems are available, ranked in order of speed:

• CPU main memory
• Optional SSD
• Magnetic disk drives
• Optional magnetic tape drives

Fast storage systems are expensive and have smaller capacities. You can specify a fast device through FFIO layers and use several FFIO layers to gain the maximum performance benefit from each storage medium. The remainder of this chapter discusses many of these FFIO optimizations. These easy optimizations are frequently those that yield the highest payoffs.

13.3 Determining I/O Activity

Before you can optimize I/O, you must first identify the activities that use the most time. The most time-intensive I/O activities are the following:

• System requests
• File structure overhead
• Data conversion
• Data copying

This section describes different commands you can use to examine your programs and determine how much I/O activity is occurring. After you determine the amount of I/O activity, you can then determine the most effective way to optimize the I/O.
The sections that follow make frequent references to the following sample program:

      program t
      parameter (nrec=2000, ndim=500)
      dimension a(ndim)
      do 5 i=1,ndim
         a(i) = i
    5 continue
      istat = ishell('rm fort.1')
      call timef(t0)
      do 10 i=1,nrec
         write(1) a
   10 continue
c     rewind and read it 3 times
      do 30 i=1,3
         rewind(1)
         do 20 j=1,nrec
            read(1) a
   20    continue
   30 continue
      call timef(t1)
      nxfer = 8*nrec*ndim*(1+3)
      write(*,*) 'unit 1: ',
     +   nxfer/(1000*(t1-t0)),
     +   ' Mbytes/sec'
      stop
      end

13.3.1 Checking Program Execution Time

The ja(1) command is a job accounting command that can help you determine if optimizing your program will return any significant gain. For complete details about the ja command, see the ja man page.

To use ja(1), enter the following commands:

    ja
    a.out
    ja -ct

These commands produce the following program execution summary that indicates the time spent in I/O:

    Command   Started    Elapsed      User CPU    Sys CPU     I/O Wait   I/O Wait
    Name      At         Seconds      Seconds     Seconds     Sec Lck    Sec Unlck
    ========  ========   ===========  ==========  ==========  ========   ==========
    a.out     17:15:56   4.5314       0.2599      0.2242      3.9499     0.1711

This output indicates that this program has a large amount of I/O wait time. The following section describes how to obtain a profile of the I/O activity in the program.

13.4 Optimizing System Requests

In a busy interactive environment, queuing for service is time consuming. In tuning I/O, the first step is to reduce the number of physical delays and the queuing that results by reducing the number of system requests, especially the number of system requests that require physical device activity.

System requests are made by the library to the kernel. They request data to be moved between I/O devices. Physical device activity consumes the most time of all I/O activities. Typical requests are read, write, and seek. These requests may require physical device I/O.
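The figures in Section 13.3.1 give a sense of how much there is to gain from reducing system requests. The following Python sketch repeats the arithmetic; the function names are hypothetical, and the input values are copied from the sample program and from the ja summary shown earlier. The ratio is only an indication, because the ja columns are not guaranteed to add up to the elapsed time.

```python
def bytes_transferred(nrec, ndim, passes, word_bytes=8):
    """Bytes moved by the sample program: one write pass plus the read passes.

    Mirrors nxfer = 8*nrec*ndim*(1+3) from the sample program.
    """
    return word_bytes * nrec * ndim * passes


def io_wait_share(elapsed_seconds, io_wait_seconds):
    """Fraction of elapsed time that the job spent in (locked) I/O wait."""
    return io_wait_seconds / elapsed_seconds


nxfer = bytes_transferred(nrec=2000, ndim=500, passes=1 + 3)
share = io_wait_share(elapsed_seconds=4.5314, io_wait_seconds=3.9499)
print(nxfer)
# 32000000
print(f"{share:.0%}")
# 87%
```

With most of the elapsed time spent waiting for I/O, even a modest reduction in the number of requests that reach a physical device should be visible in the elapsed time.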
During physical device I/O, time is spent in the following activities:

• Transferring data between disk and memory.
• Waiting for physical operations to complete. For example, moving a disk head to the cylinder (seek time) and then waiting for the right sector to come under the disk head (latency time).

System requests can require substantial CPU time to complete. The system may suspend the requesting job until a relatively slow device completes a service.

Besides the time required to perform a request, the potential for congestion also exists. The system waits for competing requests for kernel, disk, IOP, or channel services. System calls to the kernel can slow I/O by one or two orders of magnitude.

The information in this section summarizes some ways you can optimize system requests.

13.4.1 The MR Feature

Main memory is extremely fast. There are many ways to use memory to avoid delays that are associated with transfers to and from physical devices.

The mr FFIO layer, which permits files to reside in main memory, is available on all UNICOS and UNICOS/mk systems. If the memory space is large enough, you can eliminate all system requests for I/O on a file.

To apply 8 Mbytes of memory to this file, use the following assign command and then rerun the job:

    assign -F blocked,mr::1961 u:1

The maximum size of 1961 is calculated by dividing the file size of 8,032,256 bytes by the sector size of 4096 bytes. The -F option invokes FFIO. The blocked,mr specification selects the blocked layer followed by the mr layer of FFIO. The u:1 argument specifies unit 1.

Figure 4 shows I/O data movement when you use the assign command.

Figure 4. I/O data movement

The data only moves to and from the buffer of the mr layer during the operation of the READ, WRITE, and REWIND I/O statements. It gets moved from disk during OPEN processing if it exists and when SCRATCH is not specified.
It gets moved to disk only during CLOSE processing when DELETE is not specified.

When the program is rerun under procview, the procview report is as follows:

    =======================================================================
    Fortran Unit Number          1
    File Name                    fort.1
    Command Executed             a.out
    Date/Time at Open            09/04/91 17:29:38
    Date/Time at Close           09/04/91 17:29:39
    System File Descriptor       4
    Type of I/O                  sequential unformatted
    File Structure               COS blocked
    File Size                    8032256 (bytes)
    Total data transferred       8032256 (bytes)
    Assign attributes            -F blocked, mr::1961

    Fortran I/O     Count of      Real
    Statement       Statements    Time
    ------------    ----------    --------------
    READ            6000          .1663
    WRITE           2000          .0880
    REWIND          3             .0005
    CLOSE           1             .9055

    1003.7 Bytes transferred per Fortran I/O statement
    99.99% Of Fortran I/O statements did not initiate a system request

    System I/O    # of      # Bytes     # Bytes     Wait Time (Clock Periods)
    Function      Calls     Processed   Requested   Max         Min         Total
    -----------   -------   ---------   ---------   ---------   ---------   ---------
    Write         1         8032256     8032256     150197242   150197242   150197242
    Seek          2         n/a         n/a         3655        3654        7309
    Truncate      1         n/a         n/a         5207        5207        5207

    System I/O      Avg Bytes     Percent of    Average I/O Rate
    Function        Per Call      File Moved    (MegaBytes/Second)
    ------------    -----------   ----------    -------------------
    Write           8032256.0     100.0         8.913
    Seek            n/a           n/a           n/a
    Truncate        n/a           n/a           n/a
    =======================================================================

In the new report, notice the following:

• Read time is 0 (no entry for Read exists under System I/O Function). All of the data that was read was moved from the MR buffer to user space. Data transferred is 0; consequently, the time spent in Read is reduced by more than one order of magnitude.
• Write time is reduced because the data is moved only to the MR buffer during Fortran writes.
• Total write time stays relatively unchanged because the file still has to be flushed to disk at CLOSE processing.
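The sizing rule behind the mr::1961 value can be checked with a few lines of arithmetic. The sketch below uses Python purely for illustration; the helper name mr_sectors is hypothetical and is not part of any Cray tool, and it assumes the 4096-byte sector size quoted in this section.

```python
SECTOR_BYTES = 4096  # sector size quoted in this section


def mr_sectors(file_bytes: int, sector_bytes: int = SECTOR_BYTES) -> int:
    """Return the number of whole sectors needed to hold file_bytes."""
    # Ceiling division with integers, so a partial sector still counts.
    return (file_bytes + sector_bytes - 1) // sector_bytes


# File size reported by procview for fort.1.
sectors = mr_sectors(8_032_256)
print(f"assign -F blocked,mr::{sectors} u:1")
```

Rounding up matters only when the file size is not an exact multiple of the sector size; for the sample file the division is exact.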
13.4.2 Using Faster Devices

The optional solid-state storage device (SSD) is the fastest I/O device. The SSD stores data in memory chips and operates at speeds about as fast as main memory or 10 to 50 times faster than magnetic disks.

Because SSD capacity is usually much larger than main memory, SSD is used when not enough main memory is available to store all of the possible data.

You can access the SSD through ldcache. The system uses SSD to cache the data from file systems that the system administrator selects. Caching is automatic for files in these file systems and their subdirectories.

You can also access the SSD with the FFIO sds layer. When this layer is present, library routines use the SSD to hold the file between open and close. You should use the FFIO sds layer for files that are larger than the amount of ldcache available for the file. The SDSLIMIT and SDSINCR environment variables may have significant impact if all subfields are not specified after the SDS keyword (use of these variables is not recommended).

The following timings from a Cray Y-MP/8 system show the typical effects of optimization on the program used in Section 13.4.1, page 161. In that example, the program writes a file and reads it three times. Because it is unnecessary to save the file afterward, the .scr type (scratch file) can be used. See Section 13.5.1, page 168, for more information about scratch files. Some of the following commands appear to produce a range because of the fluctuation in results.
    assign command                   I/O speed (relative)
    Default (no ldcache)             1
    Default (ldcache)                8

    (with no ldcache)                I/O speed (relative)
    Default                          1
    assign -F cos,sds                7
    assign -F cos.sync,sds:3000      9
    assign -F cos,sds.scr            10
    assign -F sds.scr:3000           9
    assign -F sds.scr                3-9

    (with ldcache)                   I/O speed (relative)
    Default                          1
    assign -F cos,sds                1.4
    assign -F cos.sync,sds:3000      1.2
    assign -F cos,sds.scr            1.2
    assign -F sds.scr:3000           1.2
    assign -F sds.scr                0.5-1.2

13.4.3 Using MR/SDS Combinations

You can use the sds layer and ldcache in conjunction with the mr layer. For example, to allocate 2 Mbytes (512 sectors of 4096 bytes) of main memory for the file, with the remainder on SSD, use the following assign(1) command:

    assign -F mr.scr:512:512:0,sds.scr

The first 512 blocks of the file reside in main memory and the remainder of the blocks reside on SSD. Generally, the combination of the mr and sds layers makes the maximum amount of high performance storage available to the program. The SSD is typically used in case the file size exceeds the estimated amount of main memory you can access.

The following timings from a Cray Y-MP/8 system show the typical effects of optimization on the program used in Section 13.4.1, page 161. The program writes a file and reads it three times. Because it is not necessary to save the file afterward, you can use the .scr (scratch file) type. See Section 13.5.1, page 168, for more information about scratch files.

    Command                                I/O speed (relative)
    (with no ldcache:)
    Default                                1
    assign -F sds.scr                      4
    assign -F mr.scr:512:512:0,sds.scr     4
    (with ldcache:)
    Default                                1
    assign -F cos,sds.scr                  1.2
    assign -F mr.scr:512:512:0,sds.scr     1.2

13.4.4 Using a Cache Layer

The FFIO cache layer keeps recently used data in fixed-size main memory or SDS buffers (cache pages) so that subsequent references can reuse the data directly from these buffers. It can be tuned by selecting the number of cache pages and the size of these pages.
The use of the cache layer is especially effective when access to a file is localized to some regions of the whole file. Well-tuned cached I/O can be an order of magnitude faster than the default I/O.

Even when access is sequential, the cache layer can improve the I/O performance. For good performance, use page sizes large enough to hold the largest records. The cache layers work with the standard Fortran I/O types and the Cray extensions of BUFFER IN/OUT, READMS/WRITMS, and GETWA/PUTWA.

The following assign command requests 100 pages of 42 blocks each:

    assign -F cache:42:100 f:filename

Specifying cache pages of 42 blocks matches the track size of a DD-49 disk.

13.4.5 Preallocating File Space

It is a good idea to preallocate space; this saves system overhead by making fewer system requests for allocation, and may reduce the number of physical I/O requests. You can allocate space by using the default value from the -A and -B options for the mkfs(8) command, or by using the assign(1) command with the -n option, as follows:

    assign -n sz[:st] -q ocblks

The sz argument specifies the decimal number of 512-word blocks reserved for the data file. If this option is used on an existing file, sz 512-word blocks are added to the end of the file. The -q ocblks option specifies the number of 512-word blocks to be allocated per file system partition. These options are generally used with the -p option to do user-level striping. The st (stride) argument to the -n option is obsolete and should not be used; it specifies the allocation increment when allocating sz blocks.

Note: For immediate preallocation, use the setf(1) command because assign does not preallocate space until the file is opened.

Use the -c option on the assign or setf command to get contiguous allocation of space so that disk heads do not have to reposition themselves as frequently.
Note that if contiguous allocation is unavailable, the request fails and the process might also abort.

Generally, most users should not do user-level striping (the -p option on the assign and setf commands), because it requires disk head seek operations on multiple devices. Only jobs performing I/O with large record lengths can benefit from user-level striping. Large records are those in excess of several times the size of IOS read-ahead/write-behind segments (this varies with the disk device, but it is usually at least 16 sectors), or several times the disk track size (this varies with the disk device). In addition, asynchronous I/O has a much higher payoff with user-level striping than synchronous I/O.

The assign and setf commands have a partition option, -p, that is very important for applications that perform multifile asynchronous I/O. By placing different files on different partitions (which must be on different physical devices), multiple I/O requests can be made from a job, thus increasing the I/O bandwidth to the job. The -c option has no effect without the -n option.

13.4.6 User Striping

When a file system is composed of partitions on more than one disk, major performance improvements can result from using the disks at the same time. This technique is called disk striping.

For example, if the file system spans three disks (partitions 0, 1, and 2), it may be possible to increase performance by spreading the file equally over all three. Although 300 sequential writes may be required, only 100 must go to each disk, and the disks may be writing simultaneously.

You can specify striping in the following two ways, using the assign command:

    assign -p 0-2 -n 300 -q 48 -b 144 f:filename
    assign -p 0:1:2 -n 300 -q 48 -F cos:144 f:filename

The previous example also specifies a larger buffer size (144), which is three tracks (one per disk) if there are 48 sectors per track.
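As a rough illustration of the three-disk example above, the following Python sketch models idealized striping. It assumes that every write takes the same time and that the disks overlap perfectly, which channel capacity and seek activity will not allow in practice; the function names are hypothetical.

```python
def writes_per_disk(total_writes: int, disks: int) -> int:
    """Largest number of writes any one disk must service."""
    return -(-total_writes // disks)  # ceiling division


def ideal_speedup(total_writes: int, disks: int) -> float:
    """Upper bound on speedup when all disks write simultaneously."""
    return total_writes / writes_per_disk(total_writes, disks)


def buffer_blocks(sectors_per_track: int, disks: int) -> int:
    """Buffer size, in sectors, that holds one track per striped disk."""
    return sectors_per_track * disks


print("writes per disk:", writes_per_disk(300, 3))
print("ideal speedup:  ", ideal_speedup(300, 3))
print("buffer size:    ", buffer_blocks(48, 3))
```

The model gives only an upper bound; measured gains depend on record size and on how contiguous the space on each partition is.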
Using the bufa layer enhances the usefulness of user striping because bufa issues asynchronous I/O system calls, which are handled more efficiently by the kernel for user-striped files. In addition, the double buffering helps load balance the CPU and I/O processing. Using the previous example, better performance could be obtained from the bufa layer by using the following:

    assign -p 0-2 -n 1000 -q 48 -F bufa:144:6

or

    assign -p 0-2 -n 1000 -q 16 -F bufa:48:6

See Section 11.2.3, page 106, for information about the bufa layers.

Other factors, such as channel capacity, may limit the benefit of striping. Disk space on each partition should be contiguous and preallocated for maximum benefit.

Use striping only for very large records because all of the disk heads must do seeks on every transfer. Use the df(1) command to list the partitions of a file system. For more information about the df command, see the UNICOS User Commands Reference Manual.

13.5 Optimizing File Structure Overhead

The Fortran standard uses the record concept to govern I/O. It allows you to skip to the next record after reading only part of a record, and you can backspace to a previous record. The I/O library implements Fortran records by maintaining an internal record structure.

In the case of a sequential unformatted file, it uses a COS blocked file structure, which contains control information that helps to delimit records. The I/O library inserts this control information on write operations and removes the information on read operations. This process is known as record translation, and it consumes time.

If the I/O performed on a file does not require this file structure, you can avoid using the blocked structure and record translation. However, if you must do positioning in the file, you cannot avoid using the blocked structure.

The information in this section describes ways to optimize your file structure overhead.
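To give a feel for how much control information record translation adds, the Python sketch below estimates the size of a COS blocked file. It assumes a layout in which each 4096-byte block begins with an 8-byte block control word, each record is followed by an 8-byte record control word, and the file ends with end-of-file and end-of-data control words. These layout details are assumptions for illustration and are not stated in this section; the function name is hypothetical.

```python
BLOCK_BYTES = 4096  # one 512-word block of 8-byte words
WORD_BYTES = 8      # assumed size of each control word


def cos_blocked_size(nrec: int, record_bytes: int) -> int:
    """Estimate the on-disk size of a COS blocked file (assumed layout)."""
    # Payload: user data, one record control word per record,
    # plus the assumed end-of-file and end-of-data control words.
    payload = nrec * (record_bytes + WORD_BYTES) + 2 * WORD_BYTES
    # Each block gives up one word to its block control word.
    usable = BLOCK_BYTES - WORD_BYTES
    blocks = -(-payload // usable)  # ceiling division
    return blocks * BLOCK_BYTES


data_bytes = 2000 * 500 * 8
size = cos_blocked_size(nrec=2000, record_bytes=500 * 8)
print("estimated file size:", size)
print("control overhead:   ", size - data_bytes)
```

For the sample program (2000 records of 500 words), this estimate comes to 8,032,256 bytes, which agrees with the File Size value in the procview report, so under these assumptions well under 1% of the file is control information. The cost of record translation is therefore mainly processing time rather than extra data.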
13.5.1 Scratch Files

Scratch files are temporary and are deleted when they are closed. To decrease I/O time, move applications' scratch files from user file systems to high-speed file systems, such as /tmp, secondary data segments (SDS), or /ssd.

When optimizing, you should avoid writing the data to disk. This is especially important if most of the data can be held in SDS or main memory.

Fortran lets you open a file with STATUS='SCRATCH'. It also lets you close temporary files by using STATUS='DELETE'. These files are placed on disk, unless the .scr specification for FFIO or the assign -t command is specified for the file. Files specified as assign -t or .scr are deleted when they are closed. The following assign commands are examples of using these options:

    assign -t f:filename
    assign -F mr.scr f:filename
    assign -F sds.scr f:filename
    assign -F cos,sds.scr f:filename

You can think of the program's file as a scratch file and avoid flushing it at CLOSE by using the following command, which matches the assign attributes shown in the procview report that follows:

    assign -F blocked,mr.scr u:1

Figure 5 shows the program's current data movement:

Figure 5. I/O data movement (current)

The following procview report shows the difference in I/O times; the last two lines of the report indicate that both the Fortran WRITE statement time and system I/O write() time were reduced to 0.
    ==================================================================
    Fortran Unit Number          1
    File Name                    fort.1
    Command Executed             a.out
    Date/Time at Open            09/04/91 17:31:38
    System File Descriptor       -1
    Type of I/O                  sequential unformatted
    File Structure               COS blocked - 'blocked'
    Assign attributes            -F blocked,mr.scr

    Fortran I/O     Count of      Real
    Statement       Statements    Time
    ------------    ----------    --------------
    READ            6000          .1622
    WRITE           2000          .0862
    REWIND          3             .0005
    CLOSE           1             .0000

    0 Bytes transferred per Fortran I/O statement
    100% Of Fortran I/O statements did not initiate a system request
    ==================================================================

If unit 1 is declared as a scratch file by using the assign command, fort.1 will no longer exist after program execution.

13.5.2 Alternate File Structures

Because the original procview report indicates that no BACKSPACE was done on the file, the program might not depend on the blocked structure. Perhaps the program reads all of the data that is on every record. If it does, you can avoid using the blocked structure and save more time. Even if you cannot be sure that you do not need the blocked structure, you can still try it by using this command:

    assign -F mr.scr u:1

The program will probably fail if it does require blocked structure. If it runs successfully, you will notice that it runs faster. The layer of library processing that does the record keeping was eliminated, and the program's memory use now looks like that in Figure 6.

Figure 6. I/O processing with library processing eliminated

The program is now much faster. The time saved by using the assign commands described in this section is as follows:

    Command                        Speed
    Default                        4.6 Mbyte/s
    assign -F blocked,mr::1961     27.7 Mbyte/s     × 6 speedup
    assign -F blocked,mr.scr       129.3 Mbyte/s    × 28 speedup

Total optimization impact is I/O that is 15 times faster.
You may not see these exact improvements because many variables (such as configurations) exist that affect timings.

13.5.3 Using the Asynchronous COS Blocking Layer

When writing a sequential COS blocked file, the library usually waits until its buffer is full before initiating a system request to write the data to the physical device. When the system request completes, the library resumes processing the user request.

The FFIO asynchronous COS layer divides this buffer in half and begins a write operation when the first half is full, but it continues processing the user request in the second half of the buffer while the system is writing data from the first half. When reading, the library tries to read ahead into the second half of the buffer to reduce the time the job must spend waiting for system requests. This can be twice as fast as sequential I/O requests.

The asynchronous COS layer is specified with the assign -F command, as follows:

    assign -F cos.async f:filename
    assign -F cos.async:96 f:filename

The second assign command specifies a larger buffer because the library requests (half the specified buffer size) should be the disk track size, which is assumed to be 48 sectors.

13.5.4 Using Asynchronous Read-Ahead and Write-Behind

Several FFIO layers automatically enhance I/O performance by performing asynchronous read-ahead and write-behind. These layers include:

• cos: default Fortran sequential unformatted file. Specified by assign -F cos.
• bufa: specified by assign -F bufa.
• cachea: default Fortran direct unformatted files. Specified by assign -F cachea. Default cachea behavior provides asynchronous write-behind. Asynchronous read-ahead is not enabled by default, but is available by an assign option.

If records are accessed sequentially, the cos and bufa layers will automatically and asynchronously pre-read data ahead of the file position currently being accessed.
This behavior can be obtained with the cachea layer with an assign option; in that case, the cachea layer will also detect sequential backward access patterns and pre-read in the reverse direction.

Many user codes access the majority of file records sequentially, even with ACCESS='DIRECT' specified. Asynchronous buffering provides maximum performance when:

• Access is mainly sequential, but the working area of the file cannot fit in a buffer or is not reused frequently.
• Significant CPU-intensive processing can be overlapped with the asynchronous I/O.

Use of automatic read-ahead and write-behind may decrease execution time by half because I/O and CPU processing occur in parallel.

The following assign command specifies a specific cachea layer with 10 pages, each the size of a DD-40 track. Three pages of asynchronous read-ahead are requested. The read-ahead is performed when a sequential read access pattern is detected.

    assign -F cachea:48:10:3 f:filename

This command would work for a direct access or sequential Fortran file which has unblocked file structure.

To utilize asynchronous read-ahead and write-behind with ER90 tape files, you can use the bufa and the er90 layers, as in the following example:

    assign -F bufa,er90 f:filename

The bufa layer must be used with the er90 layer because it supports file types that are not seekable. The bufa layer can also be used with disk files, as in the following example:

    assign -F bufa:48:10 f:filename

This command specifies the same buffer configuration as the previous cachea example. The bufa layer uses all its pages for asynchronous read-ahead and write-behind. When writing, each page is asynchronously flushed as soon as it is full.

13.5.5 Using Simpler File Structures

Marking records incurs overhead. If a program reads all of the data in any record it accesses and avoids the use of BACKSPACE, you can make some minor performance savings by eliminating the overhead associated with records.
This can be done in several ways, depending on the type of I/O and certain other characteristics. For example, the following assign statements specify the unblocked file structure:

    assign -s unblocked f:filename
    assign -s u f:filename
    assign -s bin f:filename

13.6 Minimizing Data Conversions

When possible, avoid formatted I/O. Unformatted I/O is faster, and it avoids potential inaccuracies due to conversion. Formatted Fortran I/O requires that the library interpret the FORMAT statement and then convert the data from an internal representation to ASCII characters. Because this must be done for every item generated, it can be very time-consuming for large amounts of data.

Whenever possible, use unformatted I/O to avoid this overhead. Do not use edit-directed I/O on scratch files. Major performance gains are possible.

You can explicitly request data conversions during I/O. The most common conversion is through Fortran edit-directed I/O. I/O statements using a FORMAT statement, list-directed I/O, and namelist I/O require data conversions.

Conversion between internal representation and ASCII characters is time-consuming because it must be performed for each data item. When present, the FORMAT statement must be parsed or interpreted. For example, it is very slow to convert a decimal representation of a floating-point number specified by an E edit descriptor to an internal binary representation of that number.

For more information about data conversions, see Chapter 12, page 121.

13.7 Minimizing Data Copying

The Fortran I/O libraries usually use main memory buffers to hold data that will be written to disk or was read from disk. The library tries to do I/O efficiently on a few large requests rather than in many small requests. This process is called buffering.

Overhead is incurred and time is spent whenever data is copied from one place to another.
This happens when data is moved from user space to a library buffer and when data is moved between buffers. Minimizing buffer movement can help improve I/O performance.

13.7.1 Changing Library Buffer Sizes

The libraries generally have default buffer sizes. The default is suitable for many devices, but major performance improvements can result from requesting an efficient buffer size.

The optimal buffer size for very large files is usually a multiple of a device allocation for the disk. This may be the size of a track on the disk. The df -p command lists thresholds for big file allocations. If optimal size buffers are used and the file is contiguous, disk operations are very efficient. Smaller sizes require more than one operation to access all of the information on the allocation or track. Performance does not improve much with buffers larger than the optimal size, unless striping is specified.

When enough main memory is available to hold the entire file, the buffer size can be selected to be as large as the file for maximum performance.

The maximum length of a formatted record depends on the size of the buffer that the I/O library uses for a file. The size of the buffer depends on the following:

• Hardware system and UNICOS level
• Type of file (external or internal)
• Type of access (sequential or direct)
• Type of formatted I/O (edit-directed, list-directed, or namelist)

On UNICOS systems, the RECL parameter on the OPEN statement is accepted by the Fortran library for sequential access files. For a sequential access file, RECL is defined as the maximum record size that can be read or written. Thus, the RECL parameter on the OPEN statement can be used to adjust the maximum length of formatted records that can be read or written for that file.
If RECL is not specified, the following default maximum record lengths apply:

                                   Input    Output
    Edit-directed formatted I/O    267      267
    List-directed formatted I/O    267      133
    Namelist I/O                   267      133
    Internal I/O                   none     none
    ENCODE/DECODE                  none     none

13.7.2 Bypassing Library Buffers

After a request is made, the library usually copies data between its own buffers and the user data area. For small requests, this may result in the blocking of many requests into fewer system requests, but for large requests when blocking is not needed, this is inefficient. You can achieve performance gains by bypassing the library buffers and making system requests to the user data directly.

To bypass the library buffers and to specify a direct system interface, use the assign -s u option or specify the FFIO system or syscall layer, as shown in the following assign command examples:

    assign -s u f:filename
    assign -F system f:filename
    assign -F syscall f:filename

The user data should be in multiples of the disk sector size (usually 4096 bytes) for best disk I/O performance. If library buffers are bypassed, the user data should be on a sector boundary to prevent I/O performance degradation.

13.8 Other Optimization Options

There are other optimizations that involve changing your program. The following sections describe these optimization techniques.

13.8.1 Using Pipes

When a program produces a large amount of output used only as input to another program, consider using pipes. If both programs can run simultaneously, data can flow directly from one to the next by using a pipe. It is unnecessary to write the data to the disk. See Chapter 4, page 41, for details about pipes.

13.8.2 Overlapping CPU and I/O

Major performance improvements can result from overlapping CPU work and I/O work. This approach can be used in many high-volume applications; it simultaneously uses as many independent devices as possible.
To use this method, start some I/O operations and then immediately begin computational work without waiting for the I/O operations to complete. When the computational work completes, check on the I/O operations; if they are not completed yet, you must wait. To repeat this cycle, start more I/O and begin more computations.

As an example, assume that you must compute a large matrix. Instead of computing the entire matrix and then writing it out, a better approach is to compute one column at a time and to initiate the output of each column immediately after the column is computed. An example of this follows:

      dimension a(1000,2000)
      do 20 jcol= 1,2000
         do 10 i= 1,1000
            a(i,jcol)= sqrt(exp(ranf()))
   10    continue
   20 continue
      write(1) a
      end

First, try using the assign -F cos.async f:filename command. If this is not fast enough, rewrite the previous program to overlap I/O with CPU work, as follows:

      dimension a(1000,2000)
      do 20 jcol= 1,2000
         do 10 i= 1,1000
            a(i,jcol)= sqrt(exp(ranf()))
   10    continue
         BUFFER OUT(1,0) (a(1,jcol),a(1000,jcol) )
   20 continue
      end

The following Fortran statements and library routines can return control to the user after initiating I/O without requiring the I/O to complete:

• BUFFER IN and BUFFER OUT statements (buffer I/O)
• Asynchronous queued I/O statements (AQIO)
• FFIO cos blocking asynchronous layer
• FFIO cachea layer
• FFIO bufa layer

13.9 Optimization on UNICOS/mk Systems

The information in this section describes some optimization guidelines for UNICOS/mk systems. For more information about optimization on UNICOS/mk systems, see the CRAY T3E Fortran Optimization Guide.

• Choose the largest possible transfer sizes: Using large transfer sizes alleviates the longer system call processing time.
• Check the MAXASYN settings: An application can become limited by the MAXASYN settings on the host machine.
The default value of 35 asynchronous I/O structures limits you to 17 outstanding asynchronous I/O requests. The system administrator can view the current settings by using the crash command. The values to be checked are in the var structure; the fields that may need to be changed are v_pbuf, v_asyn, and v_maxasyn. These values can be changed by changing the values for NPBUF, NASYN, and MAXASYN in config.h.
• Coordinate PEs performing I/O: When a UNICOS/mk application creates files and raw (unbuffered) I/O performance is expected, you must coordinate the PEs doing the I/O so that the write requests are issued sequentially. If the PEs issue the I/O at their own speed, the host will interpret this as a non-sequential extension of a file. When this occurs, the host uses the system buffer cache to zero the space between the old EOF and the new I/O request.
• Resequence I/O when converting applications: When converting sequential applications to run on the UNICOS/mk system, resequence the I/O (from a disk perspective) by user striping the file across N tracks with N PEs performing all of the I/O, where a single PE will stride through the file by N records. The following diagram shows how the record numbers are assigned to the disk slices of a file system and how each PE performs its I/O requests:

    Slice       Slice       ~    Slice
    A/PE-X      B/PE-Y           C/PE-Z
    1           2                N
    N+1         N+2              2N
    2N+1        2N+2             3N
    ~           ~           ~    ~
    K*N+1       K*N+2            (K+1)*N

• Use Fortran and IEEE data conversion facilities: When an unformatted Cray PVP data file is to be read on the Cray MPP system, write a conversion program to run on the Cray PVP system that uses the Fortran compiler and the T3D data conversion layer. For data files that have integer elements, no conversion is necessary. For data files that have real or logical elements, use an assign -N t3d statement for the output data file.
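The record-to-slice assignment in the resequencing diagram above can be written down as simple index arithmetic. The Python sketch below is illustrative only: it numbers slices and PEs from 0 and records from 1, and the function names are hypothetical.

```python
def slice_for_record(record: int, n: int) -> int:
    """0-based slice (and PE) that holds 1-based record number `record`."""
    return (record - 1) % n


def records_for_pe(pe: int, n: int, rows: int) -> list:
    """1-based record numbers that PE `pe` handles, striding by n records."""
    return [row * n + pe + 1 for row in range(rows)]


n = 4  # number of slices and PEs (example value)
for pe in range(n):
    print("PE", pe, "handles records", records_for_pe(pe, n, rows=3))
```

Each PE touches only its own slice, and consecutive record numbers fall on consecutive slices, which is the pattern the diagram describes.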
FFIO Layer Reference [14]

This chapter provides details about each of the following FFIO layers:

    Layer                 Definition
    blankx or blx         Blank compression/expansion layer
    bmx or tape           UNICOS online tape handling
    bufa *                Library-managed asynchronous buffering
    c205                  CDC CYBER 205 record formats
    cache *               cache layer
    cachea *              cachea layer
    cdc                   CDC 60-bit NOS/SCOPE file formats
    cos *                 COS blocking
    er90                  ER90 handling
    event *               I/O monitoring (not available on Cray T3E systems)
    f77 *                 UNIX record blocking
    fd *                  File descriptor
    global *              Cache distribution layer
    ibm                   IBM file formats
    mr                    Memory-resident file handlers
    nosve                 CDC NOS/VE file formats
    null *                The null layer
    sds                   SDS resident file handlers (not available on Cray T3E systems)
    syscall *             System call I/O
    system *              Generic system layer
    text *                Newline separated record formats
    user * and site *     Writable layer
    vms *                 VAX/VMS file formats

14.1 Characteristics of Layers

In the descriptions of the layers that follow, the data manipulation tables use the following categories of characteristics:

Characteristic
    Description

Granularity
    Indicates the smallest amount of data that the layer can handle. For example, some layers can read and write a single bit; other layers, such as the syscall layer, can process only 8-bit bytes. Still others, such as some CDC formats, process data in units of 6-bit characters in which any operation that is not a multiple of 6 bits results in an error.

Data model
    Indicates the data model. Three main data models are discussed in this section. The first type is the record model, which has data with record boundaries, and may have an end-of-file (EOF). The second type is stream (a stream of bits). None of these support the EOF. The third type is the filter, which does not have a data model of its own, but derives it from the lower-level layers. Filters usually perform a data transformation (such as blank compression or expansion).
Truncate on write
    Indicates whether the layer forces an implied EOD on every write operation (EOD implies truncation).

Implementation strategy
    Describes the internal routines that are used to implement the layer. The X-record type referred to under implementation strategy refers to a record type in which the length of the record is prepended and appended to the record. For f77 files, the record length is contained in 4 bytes at the beginning and the end of a record. The v type of NOS/VE and the w type of CYBER 205/ETA also prepend and append the length of the record to the record.

In the descriptions of the layers, the supported operations tables use the following categories:

Operation
    Lists the operations that apply to that particular layer. The following is a list of supported operations:

        ffopen      ffclose
        ffread      ffflush
        ffreada     ffweof
        ffreadc     ffweod
        ffwrite     ffseek
        ffwritea    ffpos
        ffwritec    ffbksp

Support
    Uses three potential values: Yes, No, or Passed through. "Passed through" indicates that the layer does not directly support the operation, but relies on the lower-level layers to support it.

Used
    Lists two values: Yes or No. "Yes" indicates that the operation is required of the next lower-level layer. "No" indicates that the operation is never required of the lower-level layer. Some operations are not directly required, but are passed through to the lower-level layer if requested of this layer. These are noted in the comments.

Comments
    Describes the function or support of the layer's function.

On many layers, you can also specify the numeric parameters by using a keyword. This functionality is available if your application is linked with CrayLibs 3.0 or later release. See the INTRO_FFIO(3F) man page for more details about FFIO layers.

14.2 Individual Layers

The remaining sections in this chapter describe the individual FFIO layers in more detail.
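To make the X-record type described in Section 14.1 concrete, the following Python sketch frames and unframes a record with a 4-byte length before and after the data, as described for f77 files. The big-endian byte order and the function names are assumptions made for illustration; the actual byte order depends on the system that wrote the file.

```python
import struct


def frame_record(data: bytes) -> bytes:
    """Wrap data with a 4-byte length before and after it."""
    length = struct.pack(">I", len(data))  # assumed big-endian
    return length + data + length


def unframe_record(buf: bytes, offset: int = 0):
    """Return (data, next_offset) for the record starting at offset."""
    (head,) = struct.unpack_from(">I", buf, offset)
    start = offset + 4
    data = buf[start:start + head]
    (tail,) = struct.unpack_from(">I", buf, start + head)
    if head != tail:
        raise ValueError("leading and trailing lengths disagree")
    return data, start + head + 4


stream = frame_record(b"first") + frame_record(b"second record")
record, position = unframe_record(stream)
print(record, position)
```

Carrying the length at both ends of the record is what lets a reader step backward over a record as easily as forward, which is presumably why this record type suits operations such as backspace.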
14.2.1 The blankx Expansion/Compression Layer The blankx or blx layer performs blank compression and expansion on a stream of 8-bit characters. The syntax for this layer is as follows: blankx[.type]:[num1]:[num2] blx[.type]:[num1]:[num2] S–3695–36 181Application Programmer’s I/O Guide The keyword specification for this layer is as follows: blankx.[type][.blxchr=num1][.blnk=num2] blx.[type][.blxchr=num1][.blnk=num2] The type field can have one of the following three values: Value Definition cos COS-style blank compression . (blxchr= 27 or 0x1D) ctss CTSS-style blank compression. (blxchr= 48 or 0x30) c205 CYBER 205–style blank compression. (blxchr= 48 or 0x30) The num1 field contains the decimal value of the ASCII character used as the escape code to control the blank compression. The num2 field contains the decimal value of the ASCII character that is the object of the compression. This is usually the ASCII blank (0x20). Table 12. Data manipulation: blankx layer Granularity Data model Truncate on write Implementation strategy 8 bits Filter. Takes characteristics of lower-level layer but does some data transformation. No blx specific Table 13. Supported operations: blankx layer Supported operations Required of next lower level? Operation Supported Comments Used Comments ffopen Yes Yes ffread Yes Yes ffreada Yes Always synchronous No ffreadc Yes No ffwrite Yes Yes 182 S–3695–36FFIO Layer Reference [14] Supported operations Required of next lower level? 
ffwritea   Yes             Always synchronous             No
ffwritec   Yes                                            No
ffclose    Yes                                            Yes
ffflush    Yes                                            No
ffweof     Passed through  Only in lower-level layer      Yes   Only if explicitly requested
ffweod     Passed through  Only in lower-level layer      Yes   Only if explicitly requested
ffseek     No              Only seek (fd,0,0) for rewind  Yes   Only on rewind
ffpos      Yes                                            NA    NA
ffbksp     Passed through  Only in lower-level layer      Yes   Only if explicitly requested

14.2.2 The bmx/tape Layer

The bmx or tape layer handles the interface to online magnetic tapes. The bmx layer uses the tape list I/O interface on Cray systems.

A magnetic tape does not use control words to delimit records and files; however, control words are part of the physical representation of the data on the medium. On a magnetic tape, each tape block is considered a record.

The following is the syntax for this layer:

    bmx:[num1]:[num2]
    tape:[num1]:[num2]

The keyword specification is as follows:

    bmx[.bufsize=num1][.num_buffers=num2]
    tape[.bufsize=num1][.num_buffers=num2]

The num1 argument specifies the size in 512-word blocks for each buffer. The num2 argument specifies the number of buffers to use.

The bmx layer may be used with ER90 files that have been mounted in blocked mode. The ER90 device places restrictions on the amount of data that can be written to a tape block; see the Tape Subsystem User's Guide for details.

Table 14 describes the EOF and EOD behavior of the bmx layer.

Table 14.

                          -T specified on tpmnt
Type of tapes   EOF/EOD   No                    Yes
Labeled         EOF       Never returned        At user tape marks
                EOD       At end-of-file        At label/end-of-file
Unlabeled       EOF       Never returned        At user tape marks
                EOD       At double tape mark   Never returned

The EOF label is always considered an EOD. For unlabeled tapes without the -T option specified, nothing can be considered an EOF. The double tape mark shows the EOD.
For unlabeled tapes specified with -T, nothing can be considered an EOD and every tape mark is returned as an EOF.

No optional fields are permitted.

Table 15. Data manipulation: bmx/tape layer

Granularity: 8 bits
Data model: Record with multiple EOF if users specify with tpmnt -T
Truncate on write: Yes
Implementation strategy: bmx specific

Table 16. Supported operations: bmx/tape layer

Operation  Supported  Comments
ffopen     Yes
ffread     Yes
ffreada    Yes        Always synchronous
ffreadc    Yes
ffwrite    Yes
ffwritea   Yes        Always synchronous
ffwritec   Yes
ffclose    Yes
ffflush    Yes
ffweof     Yes        Writes tape mark if allowed
ffweod     Yes
ffseek     No         seek (fd,0,0) only (equal to rewind)
ffpos      Yes
ffbksp     Yes

Lower-level layers are not allowed. Exact implementation depends on operating system and hardware platform.

14.2.3 The bufa Layer

The bufa layer provides library-managed asynchronous buffering. This can reduce the number of low-level I/O requests for some files. The syntax is as follows:

    bufa:[num1]:[num2]

The keyword syntax is as follows:

    bufa[.bufsize=num1][.num_buffers=num2]

The num1 argument specifies the size, in 4096-byte blocks, of each buffer. The default buffer size depends on the device on which your file is located. The maximum allowed value on UNICOS and UNICOS/mk systems is 1,073,741,823. You may not, however, be able to use a value this large because this much memory may not be available.

The num2 argument specifies the number of buffers. The default is 2.

Table 17. Data manipulation: bufa layer

Granularity                    Data model  Truncate on write
1 bit (UNICOS and UNICOS/mk)   Stream      No
8 bits                         Stream      No

Table 18. Supported operations: bufa layer

           Supported operations                      Required of next lower level?
Operation  Supported       Comments                  Used  Comments
ffopen     Yes                                       Yes
ffread     Yes                                       Yes
ffreada    Yes             Always synchronous        Yes
ffreadc    Yes                                       No
ffwrite    Yes                                       Yes
ffwritea   Yes             Always synchronous        Yes
ffwritec   Yes                                       No
ffclose    Yes                                       Yes
ffflush    Yes                                       Yes
ffweof     Passed through                            Yes   Only if explicitly requested
ffweod     Yes                                       Yes
ffseek     Yes             Only if supported by the  Yes   Only if explicitly requested
                           underlying layer
ffpos      Yes                                       Yes   Only if explicitly requested
ffbksp     No                                        No

14.2.4 The CYBER 205/ETA (c205)

The c205 layer performs blocking and deblocking of the native type for the CDC CYBER 205 or ETA computer systems. The general format of the specification follows:

    c205.w:[recsize]:[bufsize]

The keyword specification follows:

    c205.w[.bufsize=num2]

The w selects CYBER 205 W-type records and must be specified. The recsize field should not be specified because it is reserved for future use as a maximum record size. The bufsize field is the working buffer size for the layer and should be specified as a nonnegative decimal number (in bytes).

To achieve maximum performance, ensure that the working buffer size is large enough to completely hold any records that are written, plus the control words. Control words consist of 8 bytes per record. If a record plus its control words is larger than the buffer, the layer must perform some inefficient operations to do the write. If the buffer is large enough, these operations are avoided.

On reads, the buffer size is not as important, although larger sizes usually perform better.

If the next lower-level layer is magnetic tape, this layer does not support I/O.

Table 19. Data manipulation: c205 layer

Granularity: 8 bits
Data model: Record
Truncate on write: Yes. CDC end-of-group delimiter (EOG) maps to EOF, and CDC end-of-file (EOF) maps to EOD.
Implementation strategy: x records

Table 20. Supported operations: c205 layer

           Supported operations                               Required of next lower level?
Operation  Supported       Comments                           Used  Comments
ffopen     Yes                                                Yes
ffread     Yes                                                Yes
ffreada    Yes             Always synchronous                 No
ffreadc    Yes                                                No
ffwrite    Yes                                                Yes
ffwritea   Yes             Always synchronous                 No
ffwritec   Yes                                                No
ffclose    Yes                                                Yes
ffflush    Yes             No-op                              No
ffweof     Yes             Mapped to end-of-group             No
ffweod     Yes             Mapped to end-of-file              Yes
ffseek     Yes             seek(fd,0,0) only (equals rewind)  Yes   Requires that the underlying interface be a stream
ffpos      Yes                                                NA
ffbksp     No                                                 No

14.2.5 The cache Layer

The cache layer allows recently accessed parts of a file to be cached either in main memory or in a secondary data segment (SDS). This can significantly reduce the number of low-level I/O requests for some files that are accessed randomly. This layer also offers efficient sequential access when a buffered, unblocked file is needed. The syntax is as follows:

    cache[.type]:[num1]:[num2]:[num3]

The following is the keyword specification:

    cache[.type][.page_size=num1][.num_pages=num2[.bypass_size=num3]]

The type argument can be either mem or sds (sds is not allowed on Cray T3E systems). mem directs that cache pages reside in main memory; sds directs that the pages reside in secondary data segments (SDS).

num1 specifies the size, in 4096-byte blocks, of each cache page buffer. The default is 8. The maximum allowed value on UNICOS and UNICOS/mk systems is 1,073,741,823. You may not, however, be able to use a value this large because this much memory may not be available.

num2 specifies the number of cache pages. The default is 4.

num3 is the size, in 4096-byte blocks, at which the cache layer attempts to bypass cache layer buffering. If a user's I/O request is larger than num3, the request might not be copied to a cache page. On UNICOS and UNICOS/mk systems, the default is num3 = num1 × num2.

When a cache page must be preempted to allocate a page to the currently accessed part of a file, the least recently accessed page is chosen for preemption.
Every access stores a time stamp with the accessed page so that the least recently accessed page can be found at any time.

Table 21. Data manipulation: cache layer

Granularity                            Data model                           Truncate on write
1 bit (UNICOS and UNICOS/mk systems)   Stream (mimics UNICOS system calls)  No
8 bit                                  Stream                               No
512 words (cache.sds)                  Stream                               No

Table 22. Supported operations: cache layer

           Supported operations                Required of next lower level?
Operation  Supported       Comments            Used  Comments
ffopen     Yes                                 Yes
ffread     Yes                                 No
ffreada    Yes             Always synchronous  Yes
ffreadc    Yes                                 No
ffwrite    Yes                                 No
ffwritea   Yes             Always synchronous  Yes
ffwritec   Yes                                 No
ffclose    Yes                                 Yes
ffflush    Yes                                 No
ffweof     No                                  No
ffweod     Yes                                 Yes
ffseek     Yes                                 Yes   Requires underlying interface to be a stream
ffpos      Yes                                 NA
ffbksp     No                                  NA

14.2.6 The cachea Layer

The cachea layer allows recently accessed parts of a file to be cached either in main memory or in a secondary data segment (SDS). This can significantly reduce the number of low-level I/O requests for some files that are accessed randomly.

This layer can provide high write performance by asynchronously writing out selective cache pages. It can also provide high read performance by detecting sequential read access, both forward and backward. When sequential access is detected and when read-ahead is chosen, file page reads are anticipated and issued asynchronously in the direction of file access. The syntax is as follows:

    cachea[type]:[num1]:[num2]:[num3]:[num4]

The keyword syntax is as follows:

    cachea[type][.page_size=num1][.num_pages=num2][.max_lead=num3][.shared_cache=num4]

type
Directs that cache pages reside in main memory (mem) or SDS (sds). SDS is available only on UNICOS systems.

num1
Specifies the size, in 4096-byte blocks, of each cache page buffer. Default is 8. The maximum allowed value on UNICOS and UNICOS/mk systems is 1,073,741,823.
You may not, however, be able to use a value this large because this much memory may not be available.

num2
Specifies the number of cache pages to be used. Default is 4.

num3
Specifies the number of cache pages to asynchronously read ahead when sequential read access patterns are detected. Default is 0.

num4
Specifies a cache number in the range of 1 to 15. Cache number 0 is a cache which is private to the current FFIO layer. Any cache number larger than 0 is shared with any other file using a cachea layer with the same number.

Multiple cachea layers in a chain may not contain the same nonzero cache number. On UNICOS and UNICOS/mk systems, stacked shared layers are supported, but in multitasked programs, different files must not mix the order of the shared caches. The following examples demonstrate this functionality:

• The following specifications cannot both be used by a multitasked program, because the two files stack shared caches 1 and 2 in opposite orders:

    assign -F cachea::::1,cachea::::2 u:1
    assign -F cachea::::2,cachea::::1 u:2

• The following specifications can both be used by a multitasked program on UNICOS systems, because both files stack the shared caches in the same order:

    assign -F cachea::::1,cachea::::2 u:1
    assign -F cachea::::1,cachea::::2 u:2

Table 23. Data manipulation: cachea layer

Granularity                            Data model                           Truncate on write
1 bit (UNICOS and UNICOS/mk systems)   Stream (mimics UNICOS system calls)  No
8 bit                                  Stream (mimics UNICOS system calls)  No

Table 24. Supported operations: cachea layer

           Supported operations      Required of next lower level?
Operation  Supported       Comments  Used  Comments
ffopen     Yes                       Yes
ffread     Yes                       No
ffreada    Yes                       Yes
ffreadc    Yes                       No
ffwrite    Yes                       No
ffwritea   Yes                       Yes
ffwritec   Yes                       No
ffclose    Yes                       Yes
ffflush    Yes                       No
ffweof     No                        No
ffweod     Yes                       Yes
ffseek     Yes                       Yes   Requires that the underlying interface be a stream
ffpos      Yes                       NA
ffbksp     No                        NA

14.2.7 The cdc Layer

The cdc layer handles record blocking for four common record types from the 60-bit CYBER 6000 and 7000 series computer systems, which run the CDC 60-bit NOS, NOS/VE, or SCOPE operating system. The general format of the specification follows:

    cdc[.recfmt].[tpfmt]

There is no alternate keyword specification for this layer.

The supported recfmt values are as follows:

Value  Definition
iw     I-type blocks, W-type records
cw     C-type blocks, W-type records
cs     C-type blocks, S-type records
cz     C-type blocks, Z-type records

The tpfmt field can have one of the following three values that indicate the presence of block trailers and other low-level characteristics.

Field  Definition
disk   Disk type structure, for use with station transfers of CYBER data
i      NOS internal tape format
si     System internal or SCOPE internal tape format

Note: The i and si fields require a lower-level layer that handles records. A stream model in the lower-level layers does not work. The disk field requires a lower-level layer that handles records when endfile marks exist prior to the end of data.

Table 25. Data manipulation: cdc layer

Granularity: 6 bits for cz records, 1 bit for iw records, and 60 bits for cs and cw records.
Data model: Record
Truncate on write: Yes
Implementation strategy: cdc specific

Table 26. Supported operations: cdc layer

           Supported operations                               Required of next lower level?
Operation  Supported       Comments                           Used  Comments
ffopen     Yes                                                Yes
ffread     Yes                                                Yes
ffreada    Yes             Always synchronous                 No
ffreadc    Yes                                                No
ffwrite    Yes                                                Yes
ffwritea   Yes             Always synchronous                 No
ffwritec   Yes                                                No
ffclose    Yes                                                Yes
ffflush    Yes             No-op                              No
ffweof     Yes                                                No
ffweod     Yes                                                Yes
ffseek     Yes             seek(fd,0,0) only (equals rewind)  Yes   seek(fd,0,0) only
ffpos      Yes                                                NA
ffbksp     No                                                 No

14.2.8 The cos Blocking Layer

The cos layer performs COS blocking and deblocking on a stream of data. The general format of the cos specification follows:

    cos:[.type][.num1]

The format of the keyword specification follows:

    cos[.type][.bufsize=num1]

The num1 argument specifies the working buffer size in 4096-byte blocks. If not specified, the default buffer size is the largest of the following: the large I/O size (UNICOS and UNICOS/mk systems only), the preferred I/O block size (see the stat(2) man page for details), or 48 blocks. See the INTRO_FFIO(3F) man page for more details.

When writing, full buffers are written in full record mode, specifically so that the magnetic tape bmx layer can be used on the system side to read and write COS transparent tapes. To choose the block size of the tape, select the buffer size.

Reads are always performed in partial read mode; therefore, you do not have to know the block size of a tape to read it (if the tape block size is larger than the buffer, partial mode reads ensure that no parts of the tape blocks are skipped).

Table 27. Data manipulation: cos layer

Granularity: 1 bit
Data model: Records with multi-EOF capability
Truncate on write: Yes
Implementation strategy: cos specific

Table 28. Supported operations: cos layer

           Supported operations                                  Required of next lower level?
Operation  Supported       Comments                              Used  Comments
ffopen     Yes                                                   Yes
ffread     Yes                                                   Yes
ffreada    Yes             Always synchronous                    Yes
ffreadc    Yes                                                   No
ffwrite    Yes                                                   Yes
ffwritea   Yes             Always synchronous                    Yes
ffwritec   Yes                                                   No
ffclose    Yes                                                   Yes
ffflush    Yes             No-op                                 Yes
ffweof     Yes                                                   No
ffweod     Yes                                                   Yes   Truncation occurs only on close
ffseek     Yes             Minimal support (see following note)  Yes
ffpos      Yes                                                   NA
ffbksp     Yes             No records                            No

Note: seek operations are supported only to allow for rewind (seek(fd,0,0)), seek-to-end (seek(fd,0,2)), and a GETPOS(3F) or SETPOS(3F) operation, where the position must be on a record boundary.

14.2.9 The er90 Layer (Available Only on UNICOS Systems)

The er90 layer handles the interface to the ER90 files. No arguments are accepted. The er90 layer is not supported on Cray T3E systems. It is available only on UNICOS systems.

Table 29. Data manipulation: er90 layer

Granularity: 8 bits
Data model: Stream
Truncate on write: Yes

Table 30. Supported operations: er90 layer

Operation  Supported  Comments
ffopen     Yes
ffread     Yes
ffreada    Yes
ffreadc    No
ffwrite    Yes
ffwritea   Yes
ffwritec   Yes
ffclose    Yes
ffflush    Yes
ffweof     No
ffweod     Yes
ffseek     Yes        ffseek(fd,0,0) only (equals rewind)
ffbksp     No

Lower-level layers are not allowed.

14.2.10 The event Layer

The event layer monitors I/O activity (on a per-file basis) that occurs between two I/O layers. It generates statistics as an ASCII log file and reports information such as the number of times an event was called, the event wait time, the number of bytes requested, and so on. You can request the following types of statistics:

• A list of all event types
• Event types that occur at least once
• A single-line summary of activities that shows information such as the amount of data transferred and the data transfer rate

Statistics are reported to stderr by default.
The FF_IO_LOGFILE environment variable can be used to name a file to which statistics are written by the event layer. The default action is to overwrite the existing statistics file if it exists. You can append reports to the existing file by specifying a plus sign (+) before the file name, as in this example:

    setenv FF_IO_LOGFILE +saveIO

This layer reports counts for read, reada, write, and writea. These counts represent the number of calls made to an FFIO layer entry point. In some cases, the system layer may actually use a different I/O system call, or multiple system calls.

On Cray T3E systems, if more than one PE is using the event layer and you set the FF_IO_LOGFILE environment variable, you must use the plus sign (+) before the file name to prevent PE a from overwriting the information written by PE b. Using the plus sign also means that the information will be appended to an existing file.

On Cray T3E systems, you can also use the FF_IO_LOGFILEPE environment variable to name a file to which statistics are written. The file name will be x.n, where x is the name specified by the environment variable and n is the number of the PE that wrote the file. The default action is to overwrite the existing file. To append information to an existing file, specify a plus sign (+) before the file name.

The event layer is enabled by default and is included in the executable file; you do not have to relink to study the I/O performance of your program. To obtain event statistics, rerun your program with the event layer specified on the assign command, as in this example:

    assign -F bufa,event,cachea,event,system

The syntax for the event layer is as follows:

    event[.type]

There is no alternate keyword specification for this layer.
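The log-file naming rules described earlier in this section (a leading plus sign selects append mode, and FF_IO_LOGFILEPE adds a per-PE suffix of the form x.n) can be summarized in a few lines. This is only a sketch of the documented conventions; resolve_log_target is a hypothetical helper, not a library routine, and how the library itself parses these variables is not described here.

```python
def resolve_log_target(value, pe=None):
    """Return (file_name, mode) for an FF_IO_LOGFILE-style setting.

    value: the environment variable contents, for example "+saveIO".
    pe:    PE number for the FF_IO_LOGFILEPE naming scheme, or None.
    """
    append = value.startswith("+")
    name = value[1:] if append else value
    if pe is not None:
        # FF_IO_LOGFILEPE convention from the text: file name is x.n
        name = f"{name}.{pe}"
    # "a" appends to an existing file; "w" overwrites it (the default action)
    return name, ("a" if append else "w")
```

With a single shared FF_IO_LOGFILE name, every PE resolves to the same file, which is why the text requires the plus sign there; the per-PE form avoids the collision by giving each PE its own file.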
The type argument selects the level of performance information to be written to the ASCII log file; it can have one of the following values:

Value    Definition
nostat   No statistical information is reported.
summary  Event types that occur at least once are reported.
brief    A one-line summary for layer activities is reported.

14.2.11 The f77 Layer

The f77 layer handles blocking and deblocking of the f77 record type, which is common to most UNIX Fortran implementations. The syntax for this layer is as follows:

    f77[.type]:[num1]:[num2]

The following is the syntax of the keyword specification:

    f77[.type][.recsize=num1][.bufsize=num2]

The type argument specifies the record type and can take one of the following two values:

Value   Definition
nonvax  Control words in a format common to large machines such as the MC68000; default.
vax     VAX format (byte-swapped) control words.

The num1 field refers to the maximum record size. The num2 field refers to the working buffer size.

To achieve maximum performance, ensure that the working buffer size is large enough to hold any records that are written plus the control words (control words consist of 8 bytes per record). If a record plus its control words is larger than the buffer, the layer must perform some inefficient operations to do the write. If the buffer is large enough, these operations can be avoided.

On reads, the buffer size is not as important, although larger sizes will usually perform better.

If the next lower-level layer is magnetic tape, this layer does not support I/O.

Table 31. Data manipulation: f77 layer

Granularity: 8 bits
Data model: Record
Truncate on write: Yes
Implementation strategy: x records

Table 32. Supported operations: f77 layer

           Supported operations                           Required of next lower level?
Operation  Supported       Comments                       Used  Comments
ffopen     Yes                                            Yes
ffread     Yes                                            Yes
ffreada    Yes             Always synchronous             No
ffreadc    Yes                                            No
ffwrite    Yes                                            Yes
ffwritea   Yes             Always synchronous             No
ffwritec   Yes                                            No
ffclose    Yes                                            Yes
ffflush    Yes                                            No
ffweof     Passed through                                 Yes   Only if explicitly requested
ffweod     Yes                                            Yes
ffseek     Yes             ffseek(fd,0,0) equals rewind;  Yes
                           ffseek(fd,0,2) seeks to end
ffpos      Yes                                            NA
ffbksp     Yes             Only in lower-level layer      No

14.2.12 The fd Layer

The fd layer allows connection of an FFIO file to a system file descriptor. You must specify the fd layer, as follows:

    fd:[num1]

The keyword specification is as follows:

    fd[.file_descriptor=num1]

The num1 argument must be a system file descriptor for an open file. The ffopen or ffopens request opens an FFIO file descriptor that is connected to the specified file descriptor. The file connection does not affect the file whose name is passed to ffopen.

All other properties of this layer are the same as the system layer. See Section 14.2.20 for details.

14.2.13 The global Layer

The global layer is a caching layer that distributes data across multiple SHMEM or MPI processes. Open and close operations require participation by all processes that access the file; all other operations are independently performed by one or more processes.

The following is the syntax for the global layer:

    global[.type]:[num1]:[num2]

The following is the syntax for the keyword specification:

    global[.type][.page_size=num1][.num_pages=num2]

The type argument can be privpos (default), in which the file position is private to a process, or globpos (deferred implementation), in which the file position is global to all processes.

The num1 argument specifies the size, in 4096-byte blocks, of each cache page. num2 specifies the number of cache pages to be used on each process.
If there are n processes, then n × num2 cache pages are used.

num2 buffer pages are allocated on every process that shares access to a global file. File pages are direct-mapped onto processes such that page n of the file will always be cached on process (n mod NPES), where NPES is the total number of processes sharing access to the global file. Once the process is identified where caching of the file page will occur, a least-recently-used method is used to assign the file page to a cache page within the caching process.

Table 33. Data manipulation: global layer

Granularity: 8 bits
Data model: Stream
Truncate on write: No

Table 34. Supported operations: global layer

           Supported operations                Required of next lower level?
Operation  Supported       Comments            Used  Comments
ffopen     Yes                                 Yes
ffread     Yes                                 No
ffreada    Yes             Always synchronous  Yes
ffreadc    Yes                                 No
ffwrite    Yes                                 No
ffwritea   Yes             Always synchronous  Yes
ffwritec   Yes                                 No
ffclose    Yes                                 Yes
ffflush    Yes                                 No
ffweof     No                                  No
ffweod     Yes                                 Yes
ffseek     Yes                                 Yes   Requires underlying interface to be a stream
ffpos      Yes                                 NA
ffbksp     No                                  NA

14.2.14 The ibm Layer

The ibm layer handles record blocking for seven common record types on IBM operating systems. The general format of the specification follows:

    ibm.[type]:[num1]:[num2]

The keyword specification follows:

    ibm[.type][.recsize=num1][.mbs=num2]

The supported type values are as follows:

Value  Definition
u      IBM undefined record type
f      IBM fixed-length records
fb     IBM fixed-length blocked records
v      IBM variable-length records
vb     IBM variable-length blocked records
vbs    IBM variable-length blocked spanned records

The f format is fixed-length record format. For fixed-length records, num1 is the fixed record length (in bytes) for each logical record. Exactly one record is placed in each block.

The fb format records are the same as f format records except that you can place more than one record in each block.
num1 is the length of each logical record. num2 must be an exact multiple of num1.

The v format records are variable-length records. num1 (recsize) is the maximum number of bytes in a logical record. num2 must exceed num1 by at least 8 bytes. Exactly one logical record is placed in each block.

The vb format records are variable-length blocked records. This means that you can place more than one logical record in a block. num1 and num2 are the same as with v format.

The vbs format records have no limit on record size. Records are broken into segments, which are placed into one or more blocks. num1 should not be specified. When reading, num2 must be at least large enough to accommodate the largest physical block expected to be encountered.

The num1 field is the maximum record size that may be read or written. The vbs record type ignores it.

The num2 (maximum block size) field is the maximum block size that the layer uses on reads or writes.

Table 35. Values for maximum record size on ibm layer

Field  Minimum  Maximum  Default  Comments
u      1        32,760   32,760
f      1        32,760   None     Required
fb     1        32,760   None     Required
v      5        32,756   32,752   Default is num2-8 if not specified
vb     5        32,756   32,752   Default is num2-8 if not specified
vbs    1        None     None     No maximum record size

Table 36. Values for maximum block size in ibm layer

Field  Minimum  Maximum  Default  Comments
u      1        32,760   32,760   Should be equal to num1
f      1        32,760   num1     Must be equal to num1
fb     1        32,760   num1     Must be multiple of num1
v      9        32,760   32,760   Must be >= num1 + 8
vb     9        32,760   32,760   Must be >= num1 + 8
vbs    9        32,760   32,760

Table 37. Data manipulation: ibm layer

Granularity: 8 bits
Data model: Record
Truncate on write: No for f and fb records. Yes for v, vb, and vbs records.
Implementation strategy: f records for f and fb. v records for u, v, vb, and vbs.

Table 38. Supported operations: ibm layer

           Supported operations                                 Required of next lower level?
Operation  Supported       Comments                             Used  Comments
ffopen     Yes                                                  Yes
ffread     Yes                                                  Yes
ffreada    Yes             Always synchronous                   No
ffreadc    Yes                                                  No
ffwrite    Yes                                                  Yes
ffwritea   Yes             Always synchronous                   No
ffwritec   Yes                                                  No
ffclose    Yes                                                  Yes
ffflush    Yes                                                  No
ffweof     Passed through                                       Yes
ffweod     Yes                                                  Yes
ffseek     Yes             seek(fd, 0, 0) only (equals rewind)  Yes   seek(fd,0,0) only
ffpos      Yes                                                  NA
ffbksp     No                                                   No

14.2.15 The mr Layer

The memory-resident (mr) layer lets users declare that a file should reside in memory. The mr layer tries to allocate a buffer large enough to hold the entire file.

The options are as follows:

    mr[.type[.subtype]]:num1:num2:num3

The keyword specification is as follows:

    mr[.type[.subtype]][.start_size=num1][.max_size=num2][.inc_size=num3]

The type field specifies whether the file in memory is intended to be saved or is considered a scratch file. This argument accepts the following values:

Value  Definition
save   Loads (reads) as much of the file as possible into memory when the file is opened (if it exists). If the data in memory is changed, the file data is written back to the next lower layer at close time. The save option also modifies the behavior of overflow processing. save is the default.
scr    Does not try to load at open and discards data on close (scratch file). The scr option also modifies the behavior of overflow processing.

The subtype field specifies the action to take when the data can no longer fit in the allowable memory space. It accepts the following values:

Value  Definition
ovfl   Excess data that does not fit in the specified medium is written to the next lower layer. ovfl is the default value.
novfl  When the memory limit is reached, any further operations that try to increase the size of the file fail.
The num1, num2, and num3 fields are nonnegative integer values that state the number of 4096-byte blocks to use in the following circumstances:

Field  Definition
num1   When the file is opened, this number of blocks is allocated for the file. Default: 0.
num2   This is the limit on the total size of the memory space allowed for the file in this layer. Attempted growth beyond this limit causes either overflow or operation failure, depending on the overflow option specified. Default: 2^46 - 1.
num3   This is the minimum number of blocks that are allocated whenever more memory space is required to accommodate file growth. Default: 256 for SDS files and 32 for memory-resident files.

The num1 and num3 fields represent best-effort values. They are intended for tuning purposes and usually do not cause failure if they are not satisfied precisely as specified. For example, if the available memory space is only 100 blocks and the chosen num3 value is 200 blocks, growth is allowed to use the 100 available blocks rather than failing because the full 200 blocks requested for the increment are unavailable.

When using the mr layer, large memory-resident files may reduce I/O performance for sites that provide memory scheduling that favors small processes over large processes. Check with your system administrator if I/O performance is diminished.

Caution: Use of the default value for the maximum size (num2) parameter can cause program failure if the file grows and exhausts the entire amount of memory available to the process. If the file size might become quite large, always provide a limit.

Memory allocation is done by using the malloc(3C) and realloc(3C) library routines. The file space in memory is always allocated contiguously.

When allocating new chunks of memory space, the num3 argument is used in conjunction with realloc as a minimum first try for reallocation.

Table 39.
Data manipulation: mr layer

Primary function: Avoid I/O to the extent possible, by holding the file in memory.
Granularity: 1 bit
Data model: Stream (mimics UNICOS system calls)
Truncate on write: No

Table 40. Supported operations: mr layer

           Supported operations                     Required of next lower level?
Operation  Supported       Comments                 Used  Comments
ffopen     Yes                                      Yes   Sometimes delayed until overflow
ffread     Yes                                      Yes   Only on open
ffreada    Yes                                      No
ffreadc    Yes                                      No
ffwrite    Yes                                      Yes   Only on close, overflow
ffwritea   Yes                                      No
ffwritec   Yes                                      No
ffclose    Yes                                      Yes
ffflush    Yes             No-op                    No
ffweof     No              No representation        No    No representation
ffweod     Yes                                      Yes
ffseek     Yes             Full support (absolute,  Yes   Used in open and close processing
                           relative, and from end)
ffpos      Yes                                      NA
ffbksp     No              No records               No

14.2.16 The nosve Layer

The nosve layer handles record blocking for five common record types on CDC NOS/VE operating systems. The general format of the specification is as follows:

    nosve[.type]:[num1]:[num2]

The format of the keyword specification is as follows:

    nosve[.type][.recsize=num1][.mbs=num2]

The supported type fields follow:

Field  Definition
v      NOS/VE format record
f      ANSI F fixed-length records
s      ANSI S format (segmented) records
d      ANSI D format (variable-length) records
u      NOS/VE undefined record

The num1 field is the maximum record size that can be read or written. The s and v record types ignore it.

Table 41. Values for maximum record size

recfmt  Minimum  Maximum      Default  Comments
v       1        No maximum   None
f       1        65,536       None     Required
s       1        No maximum   None
d       1        9,995        4,123
u       1        32,760       32,760

Table 42. Values for maximum block size

recfmt  Minimum  Maximum      Default  Comments
v       32       Memory size  32,768   Working buffer size
f       1        65,536       num1
s       6        32,767       4,128
d       5        32,767       4,128
u       1        32,760       32,760

For the nosve.v format, the working buffer size can affect performance.
If the buffer size is at least as large as the largest record that will be written, the system overhead is minimized. For the nosve.u record format, num1 and num2 are the same thing. For nosve.f and nosve.d records, the maximum block size must be at least as large as the maximum record size. You can place more than one record in a block. For nosve.s records, one or more segments are placed in each block (a record is composed of one or more segments).

Table 43. Data manipulation: nosve layer

Granularity: 8 bits
Data model: Record
Truncate on write: No for f records. Yes for u, s, d, and v records.
Implementation strategy: f records for f. v records for u, s, and d. x records for v.

Table 44. Supported operations: nosve layer

           Supported operations                                 Required of next lower level?
Operation  Supported       Comments                             Used  Comments
ffopen     Yes                                                  Yes
ffread     Yes                                                  Yes
ffreada    Yes             Always synchronous                   No
ffreadc    Yes                                                  No
ffwrite    Yes                                                  Yes
ffwritea   Yes             Always synchronous                   No
ffwritec   Yes                                                  No
ffclose    Yes                                                  Yes
ffflush    Yes                                                  No
ffweof     Passed through  Yes for s records;                   Yes   Only if explicitly requested
                           passed through for others
ffweod     Yes                                                  Yes
ffseek     Yes             ffseek(fd,0,0) only (equals rewind)  Yes   Extensively for nosve.v
ffpos      Yes                                                  NA
ffbksp     No                                                   No

14.2.17 The null Layer

The null layer is a syntactic convenience for users; it has no effect. This layer is commonly used to simplify the writing of a shell script when a shell variable is used to specify an FFIO layer specification. For example, the following is a line from a shell script with a tape file using the assign command, where overlying blocking is expected on the tape (as specified by BLKTYP):

    assign -F $BLKTYP,bmx fort.1

If BLKTYP is undefined, the illegal specification list ,bmx results.
The existence of the null layer lets the programmer set BLKTYP to null as a default, and simplify the script, as in the following:

    assign -F null,bmx fort.1

This is identical to the following command:

    assign -F bmx fort.1

14.2.18 The sds Layer (Available Only on UNICOS Systems)

The sds layer is not available on Cray T3E systems.

The sds layer lets users declare that a file should reside on SDS. The specification for this layer follows:

    sds[.type:[subtype]]:[num1]:[num2]:[num3]

The keyword specification is as follows:

    sds[.type[.subtype]][.start_size=num1][.max_size=num2][.inc_size=num3]

The type field specifies whether the file to reside in SDS is intended to be saved. This field can have the following values:

Value  Definition
save   Loads (reads) as much of the file as possible into SDS as soon as the file is opened (if it exists). If the data in SDS is changed, the SDS data is written back to the next lower layer at close time. The save option also modifies the behavior of overflow. save is the default.
scr    Does not attempt to load at open and discards data on close (scratch file). The scr option also modifies the behavior of overflow processing.

The subtype field specifies the action to take when the data can no longer fit in the allowable SDS space. It can have the following values:

Value  Definition
ovfl   Excess data that does not fit in the specified medium is written to the next lower layer. This is the default.
novfl  When the SDS limit is reached, any further operation that tries to increase the size of the file fails.

The num1, num2, and num3 fields are nonnegative integer values that state the number of 4096-byte blocks to use in the following circumstances.

Field  Definition
num1   When the file is opened, this number of blocks is allocated for the file.
num2   This is the limit on the total size of the SDS space allowed for the file in this layer. Attempted growth beyond this limit causes either overflow or operation failure, depending on the overflow option specified.
num3   This is the minimum number of blocks that are allocated each time more SDS space is required to accommodate file growth.

The num1 and num3 fields are used for tuning purposes and usually do not fail if they are not used precisely as specified. For example, if the available SDS space is only 100 blocks, and the chosen increase (num3) value is 200 blocks, growth is allowed to use the 100 available blocks instead of failing to grow (because the full 200 blocks requested for the increment are unavailable). Similarly, the num3 value of 200 implies allocation in minimum size chunks of 200 blocks. If 200 blocks of contiguous space are unavailable, the allocation is satisfied with whatever space is available.

The specification for sds is equivalent to the following specification:

    sds.save.ovfl:0:35184372088832:256

Overflow is provided when the requested data cannot completely reside in SDS. This can occur either because the SDS space requested from the system is not available or because the num2 (maximum size) argument was specified. When overflow occurs, a message prints to standard error stating the file name and the overflow size. The overflow I/O to the next lower layer depends on the type argument.

If save is specified, the sds layer assumes that the part of the file that resides in SDS must eventually be written to the lower-level layer (usually disk). The overflowed data is written to the lower-level layer at a position in the file that corresponds to the position it will occupy after the SDS-resident data is flushed. Space is reserved at overflow time in the file to accommodate the SDS resident part of the file.

If the scr option is selected, the SDS resident part of the file is considered disposable. Space for it is not reserved in the file on the lower-level layer.
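The two placement rules just described (space reserved under save, no space reserved under scr) can be pictured as an offset calculation. The following is only a sketch under stated assumptions: the enum, the function name, and the use of word units are hypothetical and are not part of the FFIO library.

```c
/* Hypothetical names; this is an illustration, not library code. */
enum sds_type { SDS_SAVE, SDS_SCR };

/*
 * Return the word offset in the lower-level file at which word
 * `file_word` of the logical file is written when it overflows,
 * given that the first `sds_words` words are SDS resident.
 * Returns -1 if the word is still SDS resident (no overflow).
 */
long overflow_word_offset(enum sds_type type, long sds_words, long file_word)
{
    if (file_word < sds_words)
        return -1;                    /* fits in SDS, nothing overflows */
    if (type == SDS_SAVE)
        return file_word;             /* space for the SDS part is reserved */
    return file_word - sds_words;     /* scr: SDS part is disposable */
}
```

With one 512-word block of SDS, word 512 of the file lands at offset 512 under save and at offset 0 under scr.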
The overflow operations behave as though the first overflowed bit in the file is bit 0 of the lower-level layer, as in the following example:

    # requests a max of 1 512-word block of SDS
    assign -F sds.save.ovfl:0:1 fort.1

Assume that the file does not initially exist. The initial SDS size is 0 blocks, and the size is allowed to grow to a maximum of 1 block. If a single write of 513 words is done to this file, the first 512 words are written to SDS. The remaining word is written to file fort.1 at word position 512. Words 0 through 511 are not written until the sds layer is closed and the SDS data is flushed to the lower-level layer. Immediately after the write completes, SDS contains 512 words, and fort.1 consists of 513 words. Only the last word contains valid data until the file is closed.

If the assign command is of the following form, it is assumed that the entire file is disposable if 513 words are written to the file:

    # requests a max of 1 512-word block of SDS
    assign -F sds.scr.ovfl:0:1 fort.1

It is not necessary to reserve space in fort.1 for the SDS data. When the 513 words are written to the file, the first 512 words are written to SDS. The 513th word is written to word 0 of fort.1. After the completion of the write, fort.1 consists of 1 word. The fort.1 file is deleted when the file is closed.

SDS allocation is done through the sdsalloc(3F) library routine. The file space in SDS is allocated (as far as possible) in a contiguous manner, but if contiguous space is not found, any available fragments are used before overflow is forced on any file. When allocating new chunks of SDS space, the num3 argument is used as a minimum first try for allocation.

Table 45. Data manipulation: sds layer

Primary function    The sds layer lets users obtain the fastest possible I/O rates through the SDS hot path.
Granularity         1 bit
Data model          Stream (mimics UNICOS system calls)
Truncate on write   No

Table 46. Supported operations: sds layer

           Supported operations                                          Required of next lower level?
Operation  Supported  Comments                                           Used  Comments
ffopen     Yes                                                           Yes   Sometimes delayed until overflow
ffread     Yes                                                           Yes   Only on open
ffreada    Yes                                                           No
ffreadc    Yes                                                           No
ffwrite    Yes                                                           Yes   Only on close, overflow
ffwritea   Yes                                                           No
ffwritec   Yes                                                           No
ffclose    Yes                                                           Yes
ffflush    Yes        Flushes only internal buffer not SDS               No
ffweof     No                                                            No    No representation
ffweod     Yes        No representation                                  Yes
ffseek     Yes        Full support (absolute, relative, and from end)    Yes   Used in open and close processing
ffpos      Yes                                                           NA
ffbksp     No         No records                                         No    No records

14.2.19 The syscall Layer

The syscall layer directly maps each request to an appropriate system call. The layer does not accept any options on UNICOS or UNICOS/mk systems.

Table 47. Data manipulation: syscall layer

Granularity         8 bits (1 byte)
Data model          Stream (UNICOS system calls)
Truncate on write   No

Table 48. Supported operations: syscall layer

Operation  Supported  Comments
ffopen     Yes        open
ffread     Yes        read
ffreada    Yes        reada
ffreadc    Yes        read plus code
ffwrite    Yes        write
ffwritea   Yes        writea
ffwritec   Yes        write plus code
ffclose    Yes        close
ffflush    Yes        None
ffweof     No         None
ffweod     Yes        trunc(2)
ffseek     Yes        lseek(2)
ffpos      Yes
ffbksp     No

Lower-level layers are not allowed.

14.2.20 The system Layer

The system layer is implicitly appended to all specification lists, if not explicitly added by the user (unless the syscall, tape, er90, or fd layer is specified). It maps requests to appropriate system calls. If the file that is opened is a tape file, the system layer becomes the tape layer.

For a description of options, see the syscall layer. Lower-level layers are not allowed.

14.2.21 The text Layer

The text layer performs text blocking by terminating each record with a newline character. It can also recognize and represent the EOF mark. The text layer is used with character files and does not work with binary data.
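Text blocking of this kind can be pictured with a small stand-alone sketch. It is an illustration only: the function names are hypothetical and are not FFIO entry points, and the record terminator is passed in as an ordinary character.

```c
#include <stddef.h>

/*
 * Copy one record into `out` and terminate it with `newline`.
 * Returns the number of bytes stored, or 0 if the record plus its
 * terminator does not fit in `cap` bytes.
 */
size_t text_put_record(char *out, size_t cap,
                       const char *rec, size_t reclen, char newline)
{
    size_t i;

    if (reclen + 1 > cap)
        return 0;
    for (i = 0; i < reclen; i++)
        out[i] = rec[i];
    out[reclen] = newline;            /* the record mark */
    return reclen + 1;
}

/* Count the records in a blocked buffer by counting terminators. */
size_t text_count_records(const char *buf, size_t len, char newline)
{
    size_t i, n = 0;

    for (i = 0; i < len; i++)
        if (buf[i] == newline)
            n++;
    return n;
}
```

Because the terminator is itself a data byte value, a record that contains that byte would be split in two when read back, which is one way to see why this style of blocking suits character files rather than binary data.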
The general specification follows:

    text[.type]:[num1]:[num2]

The keyword specification follows:

    text[.type][.newline=num1][.bufsize=num2]

The type field can have one of the following three values:

Value  Definition
nl     Newline-separated records.
eof    Newline-separated records with a special string such as ~e. More than one EOF in a file is allowed.
c205   CYBER 205–style text file (on the CYBER 205, these are called R-type records).

The num1 field is the decimal value of a single character that represents the newline character. The default value is 10 (octal 012, ASCII line feed).

The num2 field specifies the working buffer size (in decimal bytes). If any lower-level layers are record oriented, this is also the block size.

Table 49. Data manipulation: text layer

Granularity         8 bits
Data model          Record
Truncate on write   No

Table 50. Supported operations: text layer

           Supported operations                      Required of next lower level?
Operation  Supported       Comments                  Used  Comments
ffopen     Yes                                       Yes
ffread     Yes                                       Yes
ffreada    Yes             Always synchronous        No
ffreadc    Yes                                       No
ffwrite    Yes                                       Yes
ffwritea   Yes             Always synchronous        No
ffwritec   Yes                                       No
ffclose    Yes                                       Yes
ffflush    Yes                                       No
ffweof     Passed through                            Yes   Only if explicitly requested
ffweod     Yes                                       Yes
ffseek     Yes                                       Yes
ffpos      Yes                                       No
ffbksp     No                                        No

14.2.22 The user and site Layers

The user and site layers let users and site administrators build layers that meet specific needs. The syntax follows:

    user[num1]:[num2]
    site:[num1]:[num2]

Open processing passes the num1 and num2 arguments to the layer, which interprets them.

See Chapter 15, page 221 for an example of how to create an FFIO layer.

14.2.23 The vms Layer

The vms layer handles record blocking for three common record types on VAX/VMS operating systems. The general format of the specification follows.

    vms.[type.subtype]:[num1]:[num2]

The following is the alternate keyword specification for this layer:

    vms.[type.subtype][.recsize=num1][.mbs=num2]

The following type values are supported:

Value  Definition
f      VAX/VMS fixed-length records
v      VAX/VMS variable-length records
s      VAX/VMS variable-length segmented records

In addition to the record type, you must specify a record subtype, which has one of the following four values:

Value  Definition
bb     Format used for binary blocked transfers
disk   Same as binary blocked
tr     Transparent format, for files transferred as a bit stream to and from the VAX/VMS system
tape   VAX/VMS labeled tape

The num1 field is the maximum record size that may be read or written. It is ignored by the s record type.

Table 51. Values for record size: vms layer

Field   Minimum  Maximum  Default  Comments
v.bb    1        32,767   32,767
v.tape  1        9,995    2,043
v.tr    1        32,767   2,044
s.bb    1        None     None     No maximum record size
s.tape  1        None     None     No maximum record size
s.tr    1        None     None     No maximum record size

The num2 field is the maximum segment or block size that is allowed on input and is produced on output. For vms.f.tr and vms.f.bb, num2 should be equal to the record size (num1). Because vms.f.tape places one or more records in each block, num2 for vms.f.tape must be greater than or equal to num1.

Table 52. Values for maximum block size: vms layer

Field   Minimum  Maximum  Default  Comments
v.bb    1        32,767   32,767
v.tape  6        32,767   2,048
v.tr    3        32,767   32,767   N/A
s.bb    5        32,767   2,046
s.tape  7        32,767   2,048
s.tr    5        32,767   2,046    N/A

For vms.v.bb and vms.v.disk records, num2 is a limit on the maximum record size. For vms.v.tape records, it is the maximum size of a block on tape; more specifically, it is the maximum size of a record that will be written to the next lower layer. If that layer is tape, num2 is the tape block size. If it is cos, it will be a COS record that represents a tape block. One or more records are placed in each block.
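The relationship stated above between num1 and num2 for fixed-length (vms.f) records can be written as a small check. This is a sketch, not library code: the function name is hypothetical, treating disk like bb follows the subtype list ("Same as binary blocked") and is an assumption for the f case, and rejecting sizes below 1 is likewise an assumption.

```c
#include <string.h>

/*
 * Return 1 if num1 (record size) and num2 (block size) are consistent
 * for a vms.f.<subtype> specification, 0 otherwise.
 */
int vms_f_sizes_ok(const char *subtype, long num1, long num2)
{
    if (num1 < 1 || num2 < 1)
        return 0;                         /* assumed lower bound */
    if (strcmp(subtype, "tape") == 0)
        return num2 >= num1;              /* one or more records per block */
    if (strcmp(subtype, "tr") == 0 ||
        strcmp(subtype, "bb") == 0 ||
        strcmp(subtype, "disk") == 0)     /* assumption: disk behaves as bb */
        return num2 == num1;              /* block size equals record size */
    return 0;                             /* unknown subtype */
}
```

For instance, 80-byte records with an 800-byte block pass for tape but fail for tr and bb, where the two sizes are expected to match.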
For segmented records, num2 is a limit on the block size that will be produced. No limit on record size exists. For vms.s.tr and vms.s.bb, the block size is an upper limit on the size of a segment. For vms.s.tape, one or more segments are placed in a tape block. It functions as an upper limit on the size of a segment and a preferred tape block size.

Table 53. Data manipulation: vms layer

Granularity              8 bits
Data model               Record
Truncate on write        No for f records. Yes for v and s records.
Implementation strategy  f records for f formats. v records for v formats.

Table 54. Supported operations: vms layer

           Supported operations                                                       Required of next lower level?
Operation  Supported               Comments                                           Used  Comments
ffopen     Yes                                                                        Yes
ffread     Yes                                                                        Yes
ffreada    Yes                     Always synchronous                                 No
ffreadc    Yes                                                                        No
ffwrite    Yes                                                                        Yes
ffwritea   Yes                     Always synchronous                                 No
ffwritec   Yes                                                                        No
ffclose    Yes                                                                        Yes
ffflush    Yes                                                                        No
ffweof     Yes and passed through  Yes for s records; passed through for others       Yes   Only if explicitly requested
ffweod     Yes                                                                        Yes
ffseek     Yes                     seek(fd,0,0) only (equals rewind)                  Yes   seek(fd,0,0) only
ffpos      Yes                                                                        NA
ffbksp     No                                                                         No

Creating a user Layer [15]

This chapter explains some of the internals of the FFIO system and explains the ways in which you can put together a user or site layer. Section 15.2, page 224, is an example of a user layer.

15.1 Internal Functions

The FFIO system has an internal model of data that maps to any given actual logical file type based on the following concepts:

• Data is a stream of bits. Layers must declare their granularity by using the fffcntl(3C) call.
• Record marks are boundaries between logical records.
• End-of-file marks (EOF) are a special type of record that exists in some file structures.
• End-of-data (EOD) is a point immediately beyond the last data bit, EOR, or EOF in the file. You cannot read past or write after an EOD. In a case when a file is positioned after an EOD, a write operation (if valid) immediately moves the EOD to a point after the last data bit, end-of-record (EOR), or EOF produced by the write.

All files are streams that contain zero or more data bits that may contain record or file marks.

No inherent hierarchy or ordering is imposed on the file structures. Any number of data bits or EOR and EOF marks may appear in any order. The EOD, if present, is by definition last. Given the EOR, EOF, and EOD return statuses from read operations, only EOR may be returned along with data. When data bits are immediately followed by EOF, the record is terminated implicitly.

Individual layers can impose restrictions for specific file structures that are more restrictive than the preceding rules. For instance, in COS blocked files, an EOR always immediately precedes an EOF.

Successful mappings were used for all logical file types supported, except formats that have more than one type of partitioning for files (such as end-of-group or more than one level of EOF). For example, some CDC file formats have level numbers in the partitions. FFIO and CDC map level 017 to an EOF. No other handling is provided for these level numbers.

Internally, there are two main protocol components: the operations and the stat structure.

15.1.1 The Operations Structure

Many of the operations try to mimic the UNICOS system calls. In the man pages for ffread(3C), ffwrite(3C), and others, the calls can be made without the optional parameters and appear like the system calls. Internally, all parameters are required.

The following list is a brief synopsis of the interface routines that are supported at the user level. Each of these ff entry points checks the parameters and issues the corresponding internal call. Each interface routine provides defaults and dummy arguments for those optional arguments that the user does not provide.
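The split between a user-level routine with optional arguments and an internal entry point that requires all of them can be sketched as follows. Everything here is hypothetical: the my_ names, the in-memory "file", and the choice of a full-record default are assumptions made for illustration and do not reproduce the ffread(3C) implementation.

```c
#include <stddef.h>
#include <string.h>

#define MY_PARTIAL 0
#define MY_FULL    1

struct my_stat { int error; long count; };              /* stand-in status block */
struct my_file { const char *data; size_t len; size_t pos; };

/* Internal entry point: every argument is required. */
long my_internal_read(struct my_file *f, char *buf, size_t nb,
                      struct my_stat *stat, int fulp, int *ubc)
{
    size_t left = f->len - f->pos;
    size_t n = nb < left ? nb : left;

    (void)fulp;                     /* record mode is unused in this toy */
    memcpy(buf, f->data + f->pos, n);
    f->pos += n;
    *ubc = 0;                       /* whole bytes only */
    stat->error = 0;
    stat->count = (long)n;
    return (long)n;
}

/* User-level routine: checks parameters, supplies dummies for the rest. */
long my_read(struct my_file *f, char *buf, size_t nb)
{
    struct my_stat dummy_stat;
    int dummy_ubc = 0;

    if (f == NULL || buf == NULL)
        return -1;
    return my_internal_read(f, buf, nb, &dummy_stat, MY_FULL, &dummy_ubc);
}
```

The caller of my_read sees a call shaped like a system call, while the layer code underneath never has to test whether an argument was supplied.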
Each layer must have an internal entry point for all of these operations, although in some cases the entry point may simply issue an error or do nothing. For example, the syscall layer uses _ff_noop for the ffflush entry point because it has no buffer to flush, and it uses _ff_err2 for the ffweof entry point because it has no representation for EOF. No optional parameters for calls to the internal entry points exist. All arguments are required.

A list of operations called as functions from a C program follows:

The following are the variables for the internal entry points and the variable definitions. An internal entry point must be provided for all of these operations:

Variable  Definition
fd        The FFIO pointer (struct fdinfo *)fd.
file      A char* file.
flags     File status flag for open, such as O_RDONLY.
buf       Bit pointer to the user data.
nb        Number of bytes.
ret       The status returned; >=0 is valid, <0 is error.
stat      A pointer to the status structure.
fulp      The value FULL or PARTIAL defined in ffio.h for full or partial-record mode.
&ubc      A pointer to the unused bit count; this ranges from 0 to 7 and represents the bits not used in the last byte of the operation. It is used for both input and output.
pos       A byte position in the file.
opos      The old position of the file, just like the system call.
whence    The same as the syscall.
cmd       The command request to the fffcntl(3C) call.
arg       A generic pointer to the fffcntl argument.
mode      Bit pattern denoting file's access permissions.
argp      A pointer to the input or output data.
len       The length of the space available at argp. It is used primarily on output to avoid overwriting the available memory.

15.1.2 FFIO and the Stat Structure

The stat structure contains four fields in the current implementation. They mimic the iosw structure of the UNICOS ASYNC syscalls to the extent possible. All operations are expected to update the stat structure on each call.
The SETSTAT and ERETURN macros are provided in ffio.h for this purpose. The fields in the stat structure are as follows:

Status field   Description
stat.sw_flag   0 indicates outstanding; 1 indicates I/O complete.
stat.sw_error  0 indicates no error; otherwise, the error number.
stat.sw_count  Number of bytes transferred in this request. This number is rounded up to the next integral value if a partial byte is transferred.
stat.sw_stat   This tells the status of the I/O operation. The FFSTAT(stat) macro accesses this field. The following are the possible values:
               FFBOD: At beginning-of-data (BOD).
               FFCNT: Request terminated by count (either the count of bytes before EOF or EOD in the file or the count of the request).
               FFEOR: Request termination by EOR or a full record mode read was processed.
               FFEOF: EOF encountered.
               FFEOD: EOD encountered.
               FFERR: Error encountered.

If count is satisfied simultaneously with EOR, FFEOR is returned.

The EOF and EOD status values must never be returned with data. This means that if a byte-stream file is being traversed and the file contains 100 bytes and then an EOD, a read of 500 bytes will return with a stat value of FFCNT and a return byte count of 100. The next read operation returns FFEOD and a count of 0. A FFEOF or FFEOD status is always returned with a zero-byte transfer count.

15.2 user Layer Example

This section gives a complete and working user layer. It traces I/O at a given level. All operations are passed through to the next lower-level layer, and a trace record is sent to the trace file.

The first step in generating a user layer is to create a table that contains the addresses for the routines which fulfill the required functions described in Section 15.1.1, page 222, and Section 15.1.2, page 223. The format of the table can be found in struct xtr_s, which is found in the file.
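The idea of a per-layer table of routine addresses, with each layer calling the table of the layer below it, can be sketched in a few lines. This is a simplified illustration with hypothetical names (op_table, layer, trace_read, and so on); it has only two operations and does not reproduce the layout of struct xtr_s.

```c
#include <stddef.h>

struct layer;

/* Table of routine addresses, one slot per operation. */
struct op_table {
    long (*readrtn)(struct layer *, char *, size_t);
    int  (*closertn)(struct layer *);
};

struct layer {
    const struct op_table *ops;   /* this layer's routines */
    struct layer *lower;          /* next lower-level layer, or NULL */
    int calls;                    /* trace counter for this sketch */
};

/* Bottom layer: produces data itself. */
static long base_read(struct layer *l, char *buf, size_t nb)
{
    size_t i;
    (void)l;
    for (i = 0; i < nb; i++)
        buf[i] = 'x';
    return (long)nb;
}

static int base_close(struct layer *l) { (void)l; return 0; }

/* Tracing layer: records the call, then passes it down unchanged. */
static long trace_read(struct layer *l, char *buf, size_t nb)
{
    l->calls++;
    return l->lower->ops->readrtn(l->lower, buf, nb);
}

static int trace_close(struct layer *l)
{
    l->calls++;
    return l->lower->ops->closertn(l->lower);
}

const struct op_table base_ops  = { base_read,  base_close  };
const struct op_table trace_ops = { trace_read, trace_close };
```

A caller that holds only a pointer to the top layer invokes top->ops->readrtn(top, buf, nb) and never needs to know how many layers sit underneath.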
No restriction is placed on the names of the routines, but the table must be called _usr_ffvect for it to be recognized as a user layer. In the example, the declaration of the table can be found with the code in the _usr_open routine.

To use this layer, you must take advantage of the soft external files in the library. The following script fragment is suggested for UNICOS systems:

    # -D_LIB_INTERNAL is required to obtain the
    # declaration of struct fdinfo in
    #
    cc -c -D_LIB_INTERNAL -hcalchars usr*.c
    cat usr*.o > user.o
    #
    # Note that the -F option is selected that loads
    # and links the entries despite not having any
    # hard references.
    segldr -o abs -F user.o myprog.o
    assign -F user,others... fort.1
    ./abs

For Cray T3E systems, replace the segldr command with the following:

    f90 main.f user.o -Wl"-D select(user)=yes"

static char USMID[] = "@(#)code/usrbksp.c 1.0 ";
/* COPYRIGHT CRAY RESEARCH, INC.
 * UNPUBLISHED -- ALL RIGHTS RESERVED UNDER
 * THE COPYRIGHT LAWS OF THE UNITED STATES.
 */
#include
#include "usrio.h"

/*
 * trace backspace requests
 */
int
_usr_bksp(struct fdinfo *fio, struct ffsw *stat)
{
    struct fdinfo *llfio;
    int ret;

    llfio = fio->fioptr;
    _usr_enter(fio, TRC_BKSP);
    _usr_pr_2p(fio, stat);
    ret = XRCALL(llfio, backrtn) llfio, stat);
    _usr_exit(fio, ret, stat);
    return(0);
}

static char USMID[] = "@(#)code.usrclose.c 1.0 ";
/* COPYRIGHT CRAY RESEARCH, INC.
 * UNPUBLISHED -- ALL RIGHTS RESERVED UNDER
 * THE COPYRIGHT LAWS OF THE UNITED STATES.
 */
#include
#include
#include
#include "usrio.h"

/*
 * trace close requests
 */
int
_usr_close(struct fdinfo *fio, struct ffsw *stat)
{
    struct fdinfo *llfio;
    struct trace_f *pinfo;
    int ret;

    llfio = fio->fioptr;
/*
 * lyr_info is a place in the fdinfo block that holds
 * a pointer to the layer's private information.
 */
    pinfo = (struct trace_f *)fio->lyr_info;
    _usr_enter(fio, TRC_CLOSE);
    _usr_pr_2p(fio, stat);
/*
 * close file
 */
    ret = XRCALL(llfio, closertn) llfio, stat);
    _usr_exit(fio, ret, stat);
/*
 * It is the layer's responsibility to clean up its mess.
 * The trace file is closed and the private data is released
 * only after the last use of pinfo.
 */
    (void) close(pinfo->usrfd);
    free(pinfo->name);
    pinfo->name = NULL;
    free(pinfo);
    return(0);
}

static char USMID[] = "@(#)code/usrfcntl.c 1.0 ";
/* COPYRIGHT CRAY RESEARCH, INC.
 * UNPUBLISHED -- ALL RIGHTS RESERVED UNDER
 * THE COPYRIGHT LAWS OF THE UNITED STATES.
 */
#include
#include "usrio.h"

/*
 * trace fcntl requests
 *
 * Parameters:
 *   fd   - fdinfo pointer
 *   cmd  - command code
 *   arg  - command specific parameter
 *   stat - pointer to status return word
 *
 * This fcntl routine passes the request down to the next lower
 * layer, so it provides nothing of its own.
 *
 * When writing a user layer, the fcntl routine must be provided,
 * and must provide correct responses to one essential function and
 * two desirable functions.
 *
 * FC_GETINFO: (essential)
 *   If the 'cmd' argument is FC_GETINFO, 'arg' is considered a
 *   pointer to an ffc_info_s structure, and the fields
 *   must be filled. The most important of these is the ffc_flags
 *   field, whose bits are defined in . (Look for FFC_STRM
 *   through FFC_NOTRN)
 * FC_STAT: (desirable)
 * FC_RECALL: (desirable)
 */
int
_usr_fcntl(struct fdinfo *fio, int cmd, void *arg, struct ffsw *stat)
{
    struct fdinfo *llfio;
    struct trace_f *pinfo;
    int ret;

    llfio = fio->fioptr;
    pinfo = (struct trace_f *)fio->lyr_info;
    _usr_enter(fio, TRC_FCNTL);
    _usr_info(fio, "cmd=%d ", cmd);
    ret = XRCALL(llfio, fcntlrtn) llfio, cmd, arg, stat);
    _usr_exit(fio, ret, stat);
    return(ret);
}

static char USMID[] = "@(#)code/usropen.c 1.0 ";
/* COPYRIGHT CRAY RESEARCH, INC.
 * UNPUBLISHED -- ALL RIGHTS RESERVED UNDER
 * THE COPYRIGHT LAWS OF THE UNITED STATES.
 */
#include
#include
#include
#include
#include "usrio.h"

#define SUFFIX ".trc"

/*
 * trace open requests;
 * The following routines compose the user layer. They are declared
 * in "usrio.h"
 */

/*
 * Create the _usr_ffvect structure. Note the _usr_err inclusion to
 * account for the listiortn, which is not supported by this user
 * layer
 */
struct xtr_s _usr_ffvect = {
    _usr_open,   _usr_read,   _usr_reada,  _usr_readc,
    _usr_write,  _usr_writea, _usr_writec, _usr_close,
    _usr_flush,  _usr_weof,   _usr_weod,   _usr_seek,
    _usr_bksp,   _usr_pos,    _usr_err,    _usr_fcntl
};

_ffopen_t
_usr_open(
    const char *name,
    int flags,
    mode_t mode,
    struct fdinfo * fio,
    union spec_u *spec,
    struct ffsw *stat,
    long cbits,
    int cblks,
    struct gl_o_inf *oinf)
{
    union spec_u *nspec;
    struct fdinfo *llfio;
    struct trace_f *pinfo;
    char *ptr = NULL;
    int namlen, usrfd;
    _ffopen_t nextfio;
    char buf[256];

    namlen = strlen(name);
    ptr = malloc(namlen + strlen(SUFFIX) + 1);
    if (ptr == NULL) goto badopen;
    pinfo = (struct trace_f *)malloc(sizeof(struct trace_f));
    if (pinfo == NULL) goto badopen;
    fio->lyr_info = (char *)pinfo;
/*
 * Now, build the name of the trace info file, and open it.
 */
    strcpy(ptr, name);
    strcat(ptr, SUFFIX);
    usrfd = open(ptr, O_WRONLY | O_APPEND | O_CREAT, 0666);
/*
 * Put the file info into the private data area.
 */
    pinfo->name = ptr;
    pinfo->usrfd = usrfd;
    ptr[namlen] = '\0';
/*
 * Log the open call
 */
    _usr_enter(fio, TRC_OPEN);
    sprintf(buf,"(\"%s\", %o, %o...);\n", name, flags, mode);
    _usr_info(fio, buf, 0);
/*
 * Now, open the lower layers
 */
    nspec = spec;
    NEXT_SPEC(nspec);
    nextfio = _ffopen(name, flags, mode, nspec, stat, cbits, cblks,
        NULL, oinf);
    _usr_exit_ff(fio, nextfio, stat);
    if (nextfio != _FFOPEN_ERR) {
        DUMP_IOB(fio); /* debugging only */
        return(nextfio);
    }
/*
 * End up here only on an error
 *
 */
badopen:
    if(ptr != NULL) free(ptr);
    if (fio->lyr_info != NULL) free(fio->lyr_info);
    _SETERROR(stat, FDC_ERR_NOMEM, 0);
    return(_FFOPEN_ERR);
}

int
_usr_err(struct fdinfo *fio)
{
    _usr_info(fio,"ERROR: not expecting this routine\n",0);
    return(0);
}

static char USMID[] = "@(#)code/usrpos.c 1.1 ";
/* COPYRIGHT CRAY RESEARCH, INC.
 * UNPUBLISHED -- ALL RIGHTS RESERVED UNDER
 * THE COPYRIGHT LAWS OF THE UNITED STATES.
 */
#include
#include "usrio.h"

/*
 * trace positioning requests
 */
_ffseek_t
_usr_pos(struct fdinfo *fio, int cmd, void *arg, int len, struct ffsw *stat)
{
    struct fdinfo *llfio;
    struct trace_f *usr_info;
    _ffseek_t ret;

    llfio = fio->fioptr;
    usr_info = (struct trace_f *)fio->lyr_info;
    _usr_enter(fio,TRC_POS);
    _usr_info(fio, " ", 0);
    ret = XRCALL(llfio, posrtn) llfio, cmd, arg, len, stat);
    _usr_exit_sk(fio, ret, stat);
    return(ret);
}

static char USMID[] = "@(#)code/usrprint.c 1.1 ";
/* COPYRIGHT CRAY RESEARCH, INC.
 * UNPUBLISHED -- ALL RIGHTS RESERVED UNDER
 * THE COPYRIGHT LAWS OF THE UNITED STATES.
 */
#include
#include
#include "usrio.h"

static char *name_tab[] = {
    "???",
    "ffopen",
    "ffread",
    "ffreada",
    "ffreadc",
    "ffwrite",
    "ffwritea",
    "ffwritec",
    "ffclose",
    "ffflush",
    "ffweof",
    "ffweod",
    "ffseek",
    "ffbksp",
    "ffpos",
    "fflistio",
    "fffcntl",
};

/*
 * trace printing stuff
 */
int
_usr_enter(struct fdinfo *fio, int opcd)
{
    char buf[256], *op;
    struct trace_f *usr_info;

    op = name_tab[opcd];
    usr_info = (struct trace_f *)fio->lyr_info;
    sprintf(buf, "TRCE: %s ",op);
    write(usr_info->usrfd, buf, strlen(buf));
    return(0);
}

void
_usr_info(struct fdinfo *fio, char *str, int arg1)
{
    char buf[256];
    struct trace_f *usr_info;

    usr_info = (struct trace_f *)fio->lyr_info;
    sprintf(buf, str, arg1);
    write(usr_info->usrfd, buf, strlen(buf));
}

void
_usr_exit(struct fdinfo *fio, int ret, struct ffsw *stat)
{
    char buf[256];
    struct trace_f *usr_info;

    usr_info = (struct trace_f *)fio->lyr_info;
    fio->ateof = fio->fioptr->ateof;
    fio->ateod = fio->fioptr->ateod;
    sprintf(buf, "TRCX: ret=%d, stat=%d, err=%d\n",
        ret, stat->sw_stat, stat->sw_error);
    write(usr_info->usrfd, buf, strlen(buf));
}

void
_usr_exit_ss(struct fdinfo *fio, ssize_t ret, struct ffsw *stat)
{
    char buf[256];
    struct trace_f *usr_info;

    usr_info = (struct trace_f *)fio->lyr_info;
    fio->ateof = fio->fioptr->ateof;
    fio->ateod = fio->fioptr->ateod;
#ifdef __mips
#if (_MIPS_SZLONG== 32)
    sprintf(buf, "TRCX: ret=%lld, stat=%d, err=%d\n",
        ret, stat->sw_stat, stat->sw_error);
#else
    sprintf(buf, "TRCX: ret=%ld, stat=%d, err=%d\n",
        ret, stat->sw_stat, stat->sw_error);
#endif
#else
    sprintf(buf, "TRCX: ret=%d, stat=%d, err=%d\n",
        ret, stat->sw_stat, stat->sw_error);
#endif
    write(usr_info->usrfd, buf, strlen(buf));
}

void
_usr_exit_ff(struct fdinfo *fio, _ffopen_t ret, struct ffsw *stat)
{
    char buf[256];
    struct trace_f *usr_info;

    usr_info = (struct trace_f *)fio->lyr_info;
#ifdef __mips
    sprintf(buf, "TRCX: ret=%lx, stat=%d, err=%d\n",
        ret, stat->sw_stat, stat->sw_error);
#else
    sprintf(buf, "TRCX: ret=%d, stat=%d, err=%d\n",
        ret, stat->sw_stat, stat->sw_error);
#endif
    write(usr_info->usrfd, buf, strlen(buf));
}

void
_usr_exit_sk(struct fdinfo *fio, _ffseek_t ret, struct ffsw *stat)
{
    char buf[256];
    struct trace_f *usr_info;

    usr_info = (struct trace_f *)fio->lyr_info;
    fio->ateof = fio->fioptr->ateof;
    fio->ateod = fio->fioptr->ateod;
#ifdef __mips
#if (_MIPS_SZLONG== 32)
    sprintf(buf, "TRCX: ret=%lld, stat=%d, err=%d\n",
        ret, stat->sw_stat, stat->sw_error);
#else
    sprintf(buf, "TRCX: ret=%ld, stat=%d, err=%d\n",
        ret, stat->sw_stat, stat->sw_error);
#endif
#else
    sprintf(buf, "TRCX: ret=%d, stat=%d, err=%d\n",
        ret, stat->sw_stat, stat->sw_error);
#endif
    write(usr_info->usrfd, buf, strlen(buf));
}

void
_usr_pr_rwc(
    struct fdinfo *fio,
    bitptr bufptr,
    size_t nbytes,
    struct ffsw *stat,
    int fulp)
{
    char buf[256];
    struct trace_f *usr_info;

    usr_info = (struct trace_f *)fio->lyr_info;
#ifdef __mips
#if (_MIPS_SZLONG == 64) && (_MIPS_SZPTR == 64)
    sprintf(buf,"(fd /* %lx */, &memc[%lx], %ld, &statw[%lx], ",
        fio, BPTR2CP(bufptr), nbytes, stat);
#elif (_MIPS_SZLONG == 32) && (_MIPS_SZPTR == 32)
    sprintf(buf,"(fd /* %lx */, &memc[%lx], %lld, &statw[%lx], ",
        fio, BPTR2CP(bufptr), nbytes, stat);
#endif
#else
    sprintf(buf,"(fd /* %x */, &memc[%x], %d, &statw[%x], ",
        fio, BPTR2CP(bufptr), nbytes, stat);
#endif
    write(usr_info->usrfd, buf, strlen(buf));
    if (fulp == FULL)
        sprintf(buf,"FULL");
    else
        sprintf(buf,"PARTIAL");
    write(usr_info->usrfd, buf, strlen(buf));
}

void
_usr_pr_rww(
    struct fdinfo *fio,
    bitptr bufptr,
    size_t nbytes,
    struct ffsw *stat,
    int fulp,
    int *ubc)
{
    char buf[256];
    struct trace_f *usr_info;

    usr_info = (struct trace_f *)fio->lyr_info;
#ifdef __mips
#if (_MIPS_SZLONG == 64) && (_MIPS_SZPTR == 64)
    sprintf(buf,"(fd /* %lx */, &memc[%lx], %ld, &statw[%lx], ",
        fio, BPTR2CP(bufptr), nbytes, stat);
#elif (_MIPS_SZLONG == 32) && (_MIPS_SZPTR == 32)
    sprintf(buf,"(fd /* %lx */, &memc[%lx], %lld, &statw[%lx], ",
        fio, BPTR2CP(bufptr), nbytes, stat);
#endif
#else
    sprintf(buf,"(fd /* %x */, &memc[%x], %d, &statw[%x], ",
        fio, BPTR2CP(bufptr), nbytes, stat);
#endif
    write(usr_info->usrfd, buf, strlen(buf));
    if (fulp == FULL)
        sprintf(buf,"FULL");
    else
        sprintf(buf,"PARTIAL");
    write(usr_info->usrfd, buf, strlen(buf));
    sprintf(buf,", &conubc[%d]; ", *ubc);
    write(usr_info->usrfd, buf, strlen(buf));
}

void
_usr_pr_2p(struct fdinfo *fio, struct ffsw *stat)
{
    char buf[256];
    struct trace_f *usr_info;

    usr_info = (struct trace_f *)fio->lyr_info;
#ifdef __mips
#if (_MIPS_SZLONG == 64) && (_MIPS_SZPTR == 64)
    sprintf(buf,"(fd /* %lx */, &statw[%lx], ", fio, stat);
#elif (_MIPS_SZLONG == 32) && (_MIPS_SZPTR == 32)
    sprintf(buf,"(fd /* %lx */, &statw[%lx], ", fio, stat);
#endif
#else
    sprintf(buf,"(fd /* %x */, &statw[%x], ", fio, stat);
#endif
    write(usr_info->usrfd, buf, strlen(buf));
}

static char USMID[] = "@(#)code/usrread.c 1.0 ";
/* COPYRIGHT CRAY RESEARCH, INC.
 * UNPUBLISHED -- ALL RIGHTS RESERVED UNDER
 * THE COPYRIGHT LAWS OF THE UNITED STATES.
 */
#include
#include "usrio.h"

/*
 * trace read requests
 *
 * Parameters:
 *   fio    - Pointer to fdinfo block
 *   bufptr - bit pointer to where data is to go.
 *   nbytes - Number of bytes to be read
 *   stat   - pointer to status return word
 *   fulp   - full or partial read mode flag
 *   ubc    - pointer to unused bit count
 */
ssize_t
_usr_read(
    struct fdinfo *fio,
    bitptr bufptr,
    size_t nbytes,
    struct ffsw *stat,
    int fulp,
    int *ubc)
{
    struct fdinfo *llfio;
    char *str;
    ssize_t ret;

    llfio = fio->fioptr;
    _usr_enter(fio, TRC_READ);
    _usr_pr_rww(fio, bufptr, nbytes, stat, fulp, ubc);
    ret = XRCALL(llfio, readrtn) llfio, bufptr, nbytes, stat,
        fulp, ubc);
    _usr_exit_ss(fio, ret, stat);
    return(ret);
}

/*
 * trace reada (asynchronous read) requests
 *
 * Parameters:
 *   fio    - Pointer to fdinfo block
 *   bufptr - bit pointer to where data is to go.
 *   nbytes - Number of bytes to be read
 *   stat   - pointer to status return word
 *   fulp   - full or partial read mode flag
 *   ubc    - pointer to unused bit count
 */
ssize_t
_usr_reada(
    struct fdinfo *fio,
    bitptr bufptr,
    size_t nbytes,
    struct ffsw *stat,
    int fulp,
    int *ubc)
{
    struct fdinfo *llfio;
    char *str;
    ssize_t ret;

    llfio = fio->fioptr;
    _usr_enter(fio, TRC_READA);
    _usr_pr_rww(fio, bufptr, nbytes, stat, fulp, ubc);
    ret = XRCALL(llfio,readartn)llfio,bufptr,nbytes,stat,fulp,ubc);
    _usr_exit_ss(fio, ret, stat);
    return(ret);
}

/*
 * trace readc requests
 *
 * Parameters:
 *   fio    - Pointer to fdinfo block
 *   bufptr - bit pointer to where data is to go.
 *   nbytes - Number of bytes to be read
 *   stat   - pointer to status return word
 *   fulp   - full or partial read mode flag
 */
ssize_t
_usr_readc(
    struct fdinfo *fio,
    bitptr bufptr,
    size_t nbytes,
    struct ffsw *stat,
    int fulp)
{
    struct fdinfo *llfio;
    char *str;
    ssize_t ret;

    llfio = fio->fioptr;
    _usr_enter(fio, TRC_READC);
    _usr_pr_rwc(fio, bufptr, nbytes, stat, fulp);
    ret = XRCALL(llfio, readcrtn)llfio, bufptr, nbytes, stat,
        fulp);
    _usr_exit_ss(fio, ret, stat);
    return(ret);
}

/*
 * _usr_seek()
 *
 * The user seek call should mimic the UNICOS lseek system call as
 * much as possible.
 */
_ffseek_t
_usr_seek(
        struct fdinfo *fio,
        off_t pos,
        int whence,
        struct ffsw *stat)
{
        struct fdinfo *llfio;
        _ffseek_t ret;
        char buf[256];

        llfio = fio->fioptr;
        _usr_enter(fio, TRC_SEEK);
#ifdef __mips
#if (_MIPS_SZLONG == 64)
        sprintf(buf,"pos %ld, whence %d\n", pos, whence);
#else
        sprintf(buf,"pos %lld, whence %d\n", pos, whence);
#endif
#else
        sprintf(buf,"pos %d, whence %d\n", pos, whence);
#endif
        _usr_info(fio, buf, 0);
        ret = XRCALL(llfio, seekrtn) llfio, pos, whence, stat);
        _usr_exit_sk(fio, ret, stat);
        return(ret);
}

static char USMID[] = "@(#)code/usrwrite.c 1.0 ";

/* COPYRIGHT CRAY RESEARCH, INC.
 * UNPUBLISHED -- ALL RIGHTS RESERVED UNDER
 * THE COPYRIGHT LAWS OF THE UNITED STATES.
 */

#include <ffio.h>
#include "usrio.h"

/*
 * trace write requests
 *
 * Parameters:
 *      fio    - Pointer to fdinfo block
 *      bufptr - bit pointer to the data to be written.
 *      nbytes - Number of bytes to be written
 *      stat   - pointer to status return word
 *      fulp   - full or partial write mode flag
 *      ubc    - pointer to unused bit count (not used for IBM)
 */
ssize_t
_usr_write(
        struct fdinfo *fio,
        bitptr bufptr,
        size_t nbytes,
        struct ffsw *stat,
        int fulp,
        int *ubc)
{
        struct fdinfo *llfio;
        ssize_t ret;

        llfio = fio->fioptr;
        _usr_enter(fio, TRC_WRITE);
        _usr_pr_rww(fio, bufptr, nbytes, stat, fulp, ubc);
        ret = XRCALL(llfio, writertn) llfio, bufptr, nbytes, stat,
                fulp,ubc);
        _usr_exit_ss(fio, ret, stat);
        return(ret);
}

/*
 * trace writea requests
 *
 * Parameters:
 *      fio    - Pointer to fdinfo block
 *      bufptr - bit pointer to the data to be written.
 *      nbytes - Number of bytes to be written
 *      stat   - pointer to status return word
 *      fulp   - full or partial write mode flag
 *      ubc    - pointer to unused bit count (not used for IBM)
 */
ssize_t
_usr_writea(
        struct fdinfo *fio,
        bitptr bufptr,
        size_t nbytes,
        struct ffsw *stat,
        int fulp,
        int *ubc)
{
        struct fdinfo *llfio;
        ssize_t ret;

        llfio = fio->fioptr;
        _usr_enter(fio, TRC_WRITEA);
        _usr_pr_rww(fio, bufptr, nbytes, stat, fulp, ubc);
        ret = XRCALL(llfio, writeartn) llfio, bufptr, nbytes, stat,
                fulp,ubc);
        _usr_exit_ss(fio, ret, stat);
        return(ret);
}

/*
 * trace writec requests
 *
 * Parameters:
 *      fio    - Pointer to fdinfo block
 *      bufptr - bit pointer to the data to be written.
 *      nbytes - Number of bytes to be written
 *      stat   - pointer to status return word
 *      fulp   - full or partial write mode flag
 */
ssize_t
_usr_writec(
        struct fdinfo *fio,
        bitptr bufptr,
        size_t nbytes,
        struct ffsw *stat,
        int fulp)
{
        struct fdinfo *llfio;
        ssize_t ret;

        llfio = fio->fioptr;
        _usr_enter(fio, TRC_WRITEC);
        _usr_pr_rwc(fio, bufptr, nbytes, stat, fulp);
        ret = XRCALL(llfio, writecrtn)llfio,bufptr, nbytes, stat,
                fulp);
        _usr_exit_ss(fio, ret, stat);
        return(ret);
}

/*
 * Flush the buffer and clean up
 * This routine should return 0, or -1 on error.
 */
int
_usr_flush(struct fdinfo *fio, struct ffsw *stat)
{
        struct fdinfo *llfio;
        int ret;

        llfio = fio->fioptr;
        _usr_enter(fio, TRC_FLUSH);
        _usr_info(fio, "\n",0);
        ret = XRCALL(llfio, flushrtn) llfio, stat);
        _usr_exit(fio, ret, stat);
        return(ret);
}

/*
 * trace WEOF calls
 *
 * The EOF is a very specific concept. Don’t confuse it with the
 * UNICOS EOF, or the trunc(2) system call.
 */
int
_usr_weof(struct fdinfo *fio, struct ffsw *stat)
{
        struct fdinfo *llfio;
        int ret;

        llfio = fio->fioptr;
        _usr_enter(fio, TRC_WEOF);
        _usr_info(fio, "\n",0);
        ret = XRCALL(llfio, weofrtn) llfio, stat);
        _usr_exit(fio, ret, stat);
        return(ret);
}

/*
 * trace WEOD calls
 *
 * The EOD is a specific concept.
 * Don’t confuse it with the UNICOS
 * EOF. It is usually mapped to the trunc(2) system call.
 */
int
_usr_weod(struct fdinfo *fio, struct ffsw *stat)
{
        struct fdinfo *llfio;
        int ret;

        llfio = fio->fioptr;
        _usr_enter(fio, TRC_WEOD);
        _usr_info(fio, "\n",0);
        ret = XRCALL(llfio, weodrtn) llfio, stat);
        _usr_exit(fio, ret, stat);
        return(ret);
}

/* USMID @(#)code/usrio.h 1.1 */

/* COPYRIGHT CRAY RESEARCH, INC.
 * UNPUBLISHED -- ALL RIGHTS RESERVED UNDER
 * THE COPYRIGHT LAWS OF THE UNITED STATES.
 */

#define TRC_OPEN        1
#define TRC_READ        2
#define TRC_READA       3
#define TRC_READC       4
#define TRC_WRITE       5
#define TRC_WRITEA      6
#define TRC_WRITEC      7
#define TRC_CLOSE       8
#define TRC_FLUSH       9
#define TRC_WEOF        10
#define TRC_WEOD        11
#define TRC_SEEK        12
#define TRC_BKSP        13
#define TRC_POS         14
#define TRC_UNUSED      15
#define TRC_FCNTL       16

struct trace_f
        {
        char *name;     /* name of the file */
        int usrfd;      /* file descriptor of trace file */
        };

/*
 * Prototypes
 */
extern int _usr_bksp(struct fdinfo *fio, struct ffsw *stat);
extern int _usr_close(struct fdinfo *fio, struct ffsw *stat);
extern int _usr_fcntl(struct fdinfo *fio, int cmd, void *arg,
        struct ffsw *stat);
extern _ffopen_t _usr_open(const char *name, int flags,
        mode_t mode, struct fdinfo * fio, union spec_u *spec,
        struct ffsw *stat, long cbits, int cblks,
        struct gl_o_inf *oinf);
extern int _usr_flush(struct fdinfo *fio, struct ffsw *stat);
extern _ffseek_t _usr_pos(struct fdinfo *fio, int cmd, void *arg,
        int len, struct ffsw *stat);
extern ssize_t _usr_read(struct fdinfo *fio, bitptr bufptr,
        size_t nbytes, struct ffsw *stat, int fulp, int *ubc);
extern ssize_t _usr_reada(struct fdinfo *fio, bitptr bufptr,
        size_t nbytes, struct ffsw *stat, int fulp, int *ubc);
extern ssize_t _usr_readc(struct fdinfo *fio, bitptr bufptr,
        size_t nbytes, struct ffsw *stat, int fulp);
extern _ffseek_t _usr_seek(struct fdinfo *fio, off_t pos, int whence,
        struct ffsw *stat);
extern
ssize_t _usr_write(struct fdinfo *fio, bitptr bufptr,
        size_t nbytes, struct ffsw *stat, int fulp, int *ubc);
extern ssize_t _usr_writea(struct fdinfo *fio, bitptr bufptr,
        size_t nbytes, struct ffsw *stat, int fulp, int *ubc);
extern ssize_t _usr_writec(struct fdinfo *fio, bitptr bufptr,
        size_t nbytes, struct ffsw *stat, int fulp);
extern int _usr_weod(struct fdinfo *fio, struct ffsw *stat);
extern int _usr_weof(struct fdinfo *fio, struct ffsw *stat);
extern int _usr_err();

/*
 * Prototypes for routines that are used by the user layer.
 */
extern int _usr_enter(struct fdinfo *fio, int opcd);
extern void _usr_info(struct fdinfo *fio, char *str, int arg1);
extern void _usr_exit(struct fdinfo *fio, int ret, struct ffsw *stat);
extern void _usr_exit_ss(struct fdinfo *fio, ssize_t ret,
        struct ffsw *stat);
extern void _usr_exit_ff(struct fdinfo *fio, _ffopen_t ret,
        struct ffsw *stat);
extern void _usr_exit_sk(struct fdinfo *fio, _ffseek_t ret,
        struct ffsw *stat);
extern void _usr_pr_rww(struct fdinfo *fio, bitptr bufptr,
        size_t nbytes, struct ffsw *stat, int fulp, int *ubc);
extern void _usr_pr_2p(struct fdinfo *fio, struct ffsw *stat);

Older Data Conversion Routines [A]

The UNICOS library contains newer conversion routines for the following foreign types:

Type              Routines
IBM               IBM2CRAY(3F), CRAY2IBM
CDC               CDC2CRAY(3F), CRAY2CDC
VAX/VMS           VAX2CRAY(3F), CRAY2VAX
NOS/VE            NVE2CRAY(3F), CRAY2NVE
ETA/CYBER 205     ETA2CRAY(3F), CRAY2ETA
IEEE              IEG2CRAY(3F), CRAY2IEG

The charts in this appendix list the older foreign data conversion routines that Cray supports for compatibility. The following abbreviations are used: int. (integer), f.p. (floating-point number), s.p. (single-precision number), and d.p. (double-precision number). Brackets in the synopsis indicate an optional parameter; it may be omitted. See the Application Programmer’s Library Reference Manual for a complete description of each routine.
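To give a sense of the kind of work the floating-point routines in these charts perform, the following C sketch decodes one IBM 32-bit hexadecimal floating-point word into a native double. It is an illustration only, not the implementation of USSCTC or of any other Cray library routine. The function name ibm32_to_double is hypothetical, and the code assumes the standard IBM single-precision layout (1 sign bit, a 7-bit excess-64 exponent of base 16, and a 24-bit fraction) with the input word already in host byte order.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not a Cray library routine): decode one IBM
 * 32-bit hexadecimal floating-point word.
 *
 * Assumed layout: bit 31 = sign, bits 30-24 = exponent (excess 64,
 * base 16), bits 23-0 = fraction with the radix point to its left.
 */
double ibm32_to_double(uint32_t word)
{
    int negative = (int)((word >> 31) & 1u);
    int exponent = (int)((word >> 24) & 0x7Fu) - 64;  /* power of 16 */
    uint32_t fraction = word & 0x00FFFFFFu;

    /* Scale the 24-bit integer fraction into the range [0, 1). */
    double value = (double)fraction / 16777216.0;     /* 2^24 */

    /* Apply the base-16 exponent; each step is exact in binary. */
    for (; exponent > 0; exponent--)
        value *= 16.0;
    for (; exponent < 0; exponent++)
        value /= 16.0;

    return negative ? -value : value;
}
```

For example, the word 0x41100000 has exponent 0x41 (16 to the power 1) and fraction 1/16, so it decodes to 1.0. A production routine would also handle packing, strides, and out-of-range values, which this sketch leaves out.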
A.1 Old IBM Data Conversion Routines

The following lists IBM data conversion routines for integer, single-precision, double-precision, logical, and character data:

Convert IBM to/from CRI           Synopsis
INTEGER*1                         CALL USICTC (src, isb, dest, num, len [, inc ])
INTEGER*4 / 64-bit int.           CALL USICTI (src, dest, isb, num, len [, inc ])
Packed decimal / 64-bit int.      CALL USICTP (ian, dest, isb, num)
                                  CALL USPCTC (src, isb, num, ian)
32-bit f.p. / 64-bit s.p.         CALL USSCTC (dpn, isb, dest, num [, inc ])
                                  CALL USSCTI (fpn, dest, isb, num, ier [, inc ])
64-bit d.p. / 64-bit s.p.         CALL USDCTC (dpn, isb, dest, num [, inc ])
                                  CALL USDCTI (fpn, dest, isb, num, ier [, inc ])
LOGICAL*1                         CALL USLCTC (src, isb, dest, num, len [, inc ])
LOGICAL*4 / 64-bit log.           CALL USLCTI (src, dest, isb, num, len [, inc ])
EBCDIC / ASCII                    CALL USCCTC (src, isb, dest, num, npw [, val ])
                                  CALL USCCTI (src, dest, isb, num, npw [, val ])

For UNICOS and UNICOS/mk IEEE systems, CRI2IBM(3F) and IBM2CRI(3F) provide all of the functionality of the preceding routines.

A.2 Old CDC Data Conversion Routines

The following lists CDC data conversion routines for single-precision numbers and character data:

Convert CDC to/from CRI           Synopsis
60-bit s.p. / 64-bit s.p.         CALL FP6064 (fpn, dest, num)
                                  CALL FP6460 (fpn, dest, num)
Display Code / ASCII              CALL DSASC (src, sc, dest, num)
                                  CALL ASCDC (src, sc, dest, num)

A.3 Old VAX/VMS Data Conversion Routines

The following lists VAX/VMS data conversion routines for integer, single-precision, double-precision, complex, and logical data:

Convert VAX/VMS to/from CRI       Synopsis
INTEGER*2                         CALL VXICTC (in, isb, dest, num, len [, inc ])
INTEGER*4 / 64-bit int.           CALL VXICTI (in, dest, isb, num, len [, inc ])
32-bit F format / 64-bit s.p.     CALL VXSCTC (fpn, isb, dest, num [, inc ])
                                  CALL VXSCTI (fpn, dest, isb, num, ier [, inc ])
64-bit D format / 64-bit s.p.
                                  CALL VXDCTI (fpn, dest, isb, num, ier [, inc ])
                                  CALL VXDCTC (dpn, isb, dest, num [, inc ])
64-bit G format / 64-bit s.p.     CALL VXGCTC (dpn, isb, dest, num [, inc ])
                                  CALL VXGCTI (fpn, dest, isb, num, ier [, inc ])
64-bit complex / complex          CALL VXZCTC (dpn, isb, dest, num [, inc ])
                                  CALL VXZCTI (fpn, dest, isb, num, ier [, inc ])
Logical / 64-bit logical          CALL VXLCTC (src, isb, dest, num, len [, inc ])

Glossary

assign object
The unit number, file name, or file name pattern to which assign objects are attached.

blocking
Adding I/O control words into the data stream.

deblocking
Removing I/O control words from the data stream.

disk striping
Multiplexing or interleaving a disk file across two or more disk drives to enhance I/O performance. The performance gain is a function of the number of drives and channels used.

external file
A file on disk. It is associated with a unit number.

file system
(1) A tree-structured collection of files and their associated data and attributes. A file system is mounted to connect it to the overall file system hierarchy and to make it accessible.
(2) An individual partition or cluster that has been formatted properly. The root file system is always mounted; other file systems are mounted as needed.
(3) The entire set of available disk space.
(4) A structure used to store programs and files on disk. A file system can be mounted (accessible for operations) or unmounted (noninteractive and unavailable for system use). The /etc/rc(8) script is the shell procedure that mounts file systems and activates accounting, error logging, and system activity logging. It is a major script that is called by the init(8) command in bringing UNICOS from single-user to multiuser mode. The /etc/rc.local script allows site modification of the start-up sequence.

internal file
A character variable that is used as the unit specifier in a READ or WRITE statement.
logical device
One or more physical device slices that the operating system treats as a single device.

named pipe
A first-in, first-out queue of read or write I/O requests. Piped I/O is faster than normal I/O.

raw I/O
A method of performing input/output in UNIX in which the programmer must handle all of the I/O control. This is basically unformatted I/O.

record
(1) A group of contiguous words or characters that are related by convention. A record may be of fixed or variable length.
(2) A record for a listable dataset; each line is a record.
(3) Each module of a binary-load dataset is a record.

sector
A part of the format scheme in disk drives. A disk drive is composed of equal segments called sectors; a sector is the smallest unit of transfer to or from a disk drive. The size of a sector depends on the disk drive. See also block.

slice
(1) As used in the context of the low-speed communication (networking) subsystem in an EIOP, a slice is a subdivision of a channel buffer; sections of the buffer are divided into slices used for buffering network messages and data.
(2) A contiguous storage address space on a physical disk, specified by a starting cylinder and number of blocks.

stream
(1) A software path of messages related to one file.
(2) A stream, or logical command queue, is associated with a slave in the intelligent peripheral interface (IPI) context. The stream is used in identifying IPI-3 commands destined for that slave. A slave may have 0, 1, or many streams associated with it at any given time.

unit
(1) A means of referring to an external file.
(2) In the context of disk software on the IOS-E, unit refers to one disk drive that is daisy-chained with others on one channel adapter. The unit number represents an ordinal for referring to one disk on the channel.

well formed
I/O requests that begin and end on disk sector boundaries, usually 512 words (4096 bytes) or a multiple thereof.
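The "well formed" definition above amounts to a simple alignment test, which the following C sketch spells out. The function name is_well_formed and the SECTOR_BYTES constant are hypothetical; the 4096-byte value is only the "usual" sector size quoted in the glossary entry, and the real sector size depends on the disk drive.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed sector size: 512 64-bit words = 4096 bytes (device dependent). */
#define SECTOR_BYTES 4096u

/*
 * Hypothetical helper: report whether a request that starts at byte
 * `offset` and transfers `nbytes` bytes begins and ends on sector
 * boundaries. A zero-length request is treated here as not well formed.
 */
bool is_well_formed(unsigned long long offset, size_t nbytes)
{
    if (nbytes == 0)
        return false;
    /* Start is aligned, and an aligned length keeps the end aligned. */
    return (offset % SECTOR_BYTES == 0) && (nbytes % SECTOR_BYTES == 0);
}
```

With these assumptions, a 12288-byte request at offset 8192 is well formed, while a 4096-byte request at offset 512 is not, because it starts in the middle of a sector.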