# DMA Accelerator Functional Unit (AFU) User Guide Updated for Intel® Acceleration Stack for Intel® Xeon® CPU with FPGAs: 1.2 **UG-20127 | 2018.12.04**Latest document on the web: **PDF | HTML** # **Contents** | 1. About this Document | 3 | |---------------------------------------------------------------------------------|----| | 1.1. Intended Audience | 3 | | 1.2. Conventions | | | 1.3. Acronyms | 3 | | 1.4. Acceleration Glossary | 4 | | 2. DMA AFU Description | 5 | | 2.1. Introduction | | | 2.2. The DMA AFU Software Package | | | 2.3. The DMA AFU Hardware Components | 6 | | 2.3.1. DMA Test System | | | 2.3.2. DMA BBB | 11 | | 3. Register Map and Address Spaces | 13 | | 3.1. DMA AFU Register Map | | | 3.1.1. DMA AFU Address Space | | | 4. Software Programming Model | 16 | | 4.1. Software APIs | | | 4.1.1. fpgaDmaOpen() | | | 4.1.2. fpgaDmaTransferSync() | | | 4.1.3. fpgaDmaClose() | 17 | | 5. Running DMA AFU Example | 18 | | 6. Optimization for Improved DMA Performance | 20 | | 7. DMA Accelerator Functional Unit (AFU) User Guide Archives | 22 | | 8. Document Revision History for the DMA Accelerator Functional Unit (AFU) User | 23 | # 1. About this Document #### 1.1. Intended Audience The intended audience comprises hardware or software developers that require an Accelerator Function (AF) to buffer data locally in memory connected to the Intel FPGA device. #### 1.2. Conventions **Table 1. Document Conventions** | Convention | Description | |---------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | # | Precedes a command that indicates the command is to be entered as root. | | \$ | Indicates a command is to be entered as a user. | | This font | Filenames, commands, and keywords are printed in this font. Long command lines are printed in this font. Although long command lines may wrap to the next line, the return is not part of the command; do not press enter. | | <variable_name></variable_name> | Indicates the placeholder text that appears between the angle brackets must be replaced with an appropriate value. Do not enter the angle brackets. | ## 1.3. Acronyms Table 2. Acronyms | Acronyms | Expansion | Description | |----------|-----------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------| | AF | Accelerator Function | Compiled Hardware Accelerator image implemented in FPGA logic that accelerates an application. | | AFU | Accelerator Functional<br>Unit | Hardware Accelerator implemented in FPGA logic which offloads a computational operation for an application from the CPU to improve performance. | | API | Application Programming Interface | A set of subroutine definitions, protocols, and tools for building software applications. | | ASE | AFU Simulation<br>Environment | Co-simulation environment that allows you to use the same host application and AF in a simulation environment. ASE is part of the Intel Acceleration Stack for FPGAs. | | CCI-P | Core Cache Interface | CCI-P is the standard interface AFUs use to communicate with the host. | | CL | Cache Line | 64-byte cache line | | | ! | continued | Intel Corporation. All rights reserved. Intel, the Intel logo, Altera, Arria, Cyclone, Enpirion, MAX, Nios, Quartus and Stratix words and logos are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. Intel warrants performance of its FPGA and semiconductor products to current specifications in accordance with Intel's standard warranty, but reserves the right to make changes to any products and services at any time without notice. Intel assumes no responsibility or liability arising out of the application or use of any information, product, or service described herein except as expressly agreed to in writing by Intel. Intel customers are advised to obtain the latest version of device specifications before relying on any published information and before placing orders for products or services. 9001:2015 Registered | Acronyms | Expansion | Description | |----------|------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | DFH | Device Feature Header | Creates a linked list of feature headers to provide an extensible way of adding features. | | FIM | FPGA Interface Manager | The FPGA hardware containing the FPGA Interface Unit (FIU) and external interfaces for memory, networking, etc. The Accelerator Function (AF) interfaces with the FIM at run time. | | FIU | FPGA Interface Unit | FIU is a platform interface layer that acts as a bridge between platform interfaces like PCIe*, UPI and AFU-side interfaces such as CCI-P. | | MPF | Memory Properties<br>Factory | The MPF is a Basic Building Block (BBB) that AFUs can use to provide CCI-P traffic shaping operations for transactions with the FIU. | # 1.4. Acceleration Glossary # Table 3. Acceleration Stack for Intel® Xeon® CPU with FPGAs Glossary | Term | Abbreviation | Description | |-------------------------------------------------------------------------------------|---------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | Intel <sup>®</sup> Acceleration Stack for Intel<br>Xeon <sup>®</sup> CPU with FPGAs | Acceleration Stack | A collection of software, firmware, and tools that provides performance-optimized connectivity between an Intel FPGA and an Intel Xeon processor. | | Intel Programmable Acceleration Card with Intel Arria® 10 GX FPGA | Intel PAC with Intel Arria 10 GX FPGA | PCIe accelerator card with an Intel<br>Arria 10 FPGA. Programmable<br>Acceleration Card is abbreviated PAC.<br>Contains a FPGA Interface Manager<br>(FIM) that connects to an Intel Xeon<br>processor over PCIe bus. | | Intel Xeon Scalable Platform with Integrated FPGA | Integrated FPGA Platform | A platform with the Intel Xeon and FPGA in a single package and sharing a coherent view of memory using the Intel Ultra Path Interconnect (UPI). | # 2. DMA AFU Description #### 2.1. Introduction The Direct Memory Access (DMA) AFU example shows how to manage memory transfers between the host processor and the FPGA. You can integrate the DMA AFU into your design to move data between the host memory and the FPGA local memory. Connecting a local memory directly to the FPGA should improve performance significantly for applications that frequently access FPGA memory. The DMA AFU comprises the following submodules: - MMIO Decoder Logic - Memory Properties Factory (MPF) BBB - Core Cache Interface (CCI-P) to the Avalon® Memory-Mapped (Avalon-MM) Adapter - DMA Test System which contains the DMA BBB These submodules are described in more detail in the DMA AFU Hardware Components topic below. #### **Related Information** - The DMA AFU Hardware Components on page 6 - Avalon Interface Specifications For more information about the Avalon-MM protocol, including timing diagrams for read and write transactions. #### 2.2. The DMA AFU Software Package The Intel Acceleration Stack for Intel Xeon CPU with FPGAs package file (.tar.gz), includes the DMA AFU example. This example provides a user space driver. The driver programs the host application to use the DMA to move data between host and FPGA memory. The hardware binaries, sources, and the user space driver are available in the following directory: \$OPAE PLATFORM ROOT/hw/samples/dma afu. Before experimenting with the DMA AFU, you must install the Open Programmable Acceleration Engine (OPAE) software package. Refer to *Installing the OPAE Software Package* in the *Intel Acceleration Stack Quick Start Guide for Intel Programmable Acceleration Card with Intel Arria 10 GX FPGA* for installation instructions. This *Quick Start Guide* also includes basic information about the Open Programmable Acceleration Engine (OPAE) and configuring an AFU. Intel Corporation. All rights reserved. Intel, the Intel logo, Altera, Arria, Cyclone, Enpirion, MAX, Nios, Quartus and Stratix words and logos are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. Intel warrants performance of its FPGA and semiconductor products to current specifications in accordance with Intel's standard warranty, but reserves the right to make changes to any products and services at any time without notice. Intel assumes no responsibility or liability arising out of the application or use of any information, product, or service described herein except as expressly agreed to in writing by Intel. Intel customers are advised to obtain the latest version of device specifications before relying on any published information and before placing orders for products or services. ISO 9001:2015 Registered After installing the Open Programmable Acceleration Engine (OPAE) software package, a sample host application and the DMA AFU are available in the following directory: \$OPAE\_PLATFORM\_ROOT/hw/samples/dma\_afu/sw . A sample application, fpga\_dma\_test implements the DMA AFU user space driver. #### **Related Information** - Intel Acceleration Stack Quick Start Guide for Intel Programmable Acceleration Card with Arria 10 GX FPGA - Installing the OPAE Software Package ## 2.3. The DMA AFU Hardware Components The DMA AFU interfaces with the FPGA Interface Unit (FIU) and two banks of local DDR4-SDRAM. The total memory addressable on the device is 8 gigabytes (8 GB). The memory comprises two, 4 GB banks. Note: The currently available hardware dictates this memory configuration. Future hardware may support different memory configurations. You can use the DMA AFU to copy data between the following source and destination locations: - The host to device local memory - Device local memory to the host - Device local memory to device local memory A Platform Designer system, <code>\$OPAE\_PLATFORM\_ROOT/hw/samples/dma\_afu/hw/rtl/qsys/dma\_test\_system.qsys</code> implements most of the DMA AFU. For 1.2, a Platform Designer system is located at <installation path>/hw/samples/dma\_afu/hw/rtl/qsys/<device>/dma\_test\_system.qsys Figure 1. DMA AFU Hardware Block Diagram The DMA AFU includes the following internal modules to interface with the FPGA Interface Unit (FIU): - MMIO Decoder Logic: detects MMIO read and write transactions and separates them from the CCI-P RX channel 0 that they arrive from. This ensures that MMIO traffic never reaches the MPF BBB and is serviced by an independent MMIO command channel. - Memory Properties Factory (MPF): This module ensures that read responses from the DMA return in the order that they were issued. The Avalon-MM protocol requires read responses to return in the correct order. - CCI-P to Avalon-MM Adapter: This module translates between CCI-P and Avalon-MM transactions, as follows: - The CCI-P to Avalon-MM path (MMIO path): This path translates CCI-P transactions to Avalon-MM reads and writes. - Note: MMIO accesses do not support backpressure. As a result, the CCI-P to Avalon-MM Adapter does not support the waitrequest signal. Intel recommends that you add an Avalon-MM Clock Crossing Bridge, available in the IP Catalog, between the CCI-P to Avalon MMIO Adapter master port and the DMA Test System Avalon-MM slave port. Intel recommends that you set the clock crossing command depth to 64 entries deep and disable burst support. - The Avalon-MM to CCI-P: These paths create separate read-only and writeonly paths for the DMA to access host memory. The Avalon-MM write slave of the CCI-P to Avalon Adapter includes an extra, high-order bit to implement write fences. When the high-order bit is set to 1'b1, the CCI-P adapter first issues a write fence. Then, the CCI-P bridge writes data to the host physical address space with the high-order bit is set to 1'b0. This operation allows the DMA to synchronize writes to host memory. The DMA BBB is not capable of receiving write responses so that the write fence is used to synchronize the write data with the host. DMA Test System: This module serves as a wrapper around the DMA BBB to expose the DMA masters and interrupt interfaces to the rest of the logic in the AFU. It provides the interface between the DMA BBB and the CCI-P to Avalon Adapter. It also provides the interface between the DMA BBB and the local FPGA SDRAM banks. #### 2.3.1. DMA Test System The DMA test system tests the DMA AFU. #### Figure 2. DMA Test System Block Diagram This block diagram shows the internals of the DMA test system. The DMA test system is shown as a monolithic block in Figure $\bf 1$ on page 7. CCIP to Avalon MMIO Adapter The DMA test system includes the following internal modules: - AFU ID: This component stores the 64-bit Device Feature Header (DFH) and also includes the universally unique identifier (UUID). The AFU\_ID\_L register stores the lower 32 bits of the AFU ID. The AFU\_ID\_H register stores the upper 32 bits of the AFU ID. A software driver scans the DMA Test System, finds the AFU ID, and identifies the DMA BBB. - DMA Basic Building Block (BBB): This component moves data between the host and local device memory spaces. DMA BBB interrupt connects to the IRQ 0 signal. The IRQ 0 signal is an input to the CCI-P to Avalon Adapter. The CCI-P to Avalon Adapter forwards the interrupt to the host. - Pipeline Bridge: The Pipeline Bridge inserts pipeline stages between memory mapped IP cores. By default, Platform Designer optimizes for low latency. Consequently, the Pipeline Bridge improve the system maximum operating frequency (F<sub>MAX</sub>) at the expense of additional latency. - Clock Crossing Bridge: The Clock Crossing Bridge isolates Avalon-MM masters and slaves that are in different clock domains. Because the Clock Crossing Bridge includes clock-crossing logic, it adds FIFOs that have a greater latency than the standard Pipeline Bridge. The Clock Crossing Bridge ensures that the memory transactions from the DMA BBB safely cross to the local SDRAM clock domain. #### 2.3.2. DMA BBB The DMA BBB subsystem transfers data from source to destination addresses using memory-mapped transactions. The DMA AFU accesses control and status registers in the DMA BBB subsystem. The DMA BBB comprises five IP cores available in the Platform Designer IP Catalog as shown in the following figure. Figure 3. DMA BBB Platform Designer Block Diagram This block diagram excludes some internal Pipeline Bridge IP cores. The components in the DMA BBB Platform Designer implement the following functions: - Modular Scatter-Gather DMA (MSGDMA): This IP core performs memory mapped transfers between source and destination addresses. The MSGDMA transfers 64 bytes per clock cycle. The data must be aligned to 64-byte boundaries. The transfer length must be a multiple of 64 bytes. The MSGDMA supports 50-bit addressing and can transfer up to 16,777,152 bytes per descriptor. In this implementation, the driver limits the transfer size to 1,047,552 bytes per descriptor. - Address Span Extender: This IP core implements memory transfers that are not aligned on a 64-byte boundary. The host uses it to perform MMIO accesses to FPGA device memory that are not aligned on a 64-byte boundary. The Address Span Extender accesses a 4 kilobyte (4 KB) window into the local device memory. The control port sets the base address of the (4 KB) window. The base address must be aligned to a 4 KB boundary so that the window is aligned to the window size. For example, to access FPGA memory address 0xF340, set the window address to 0xF000 and then access offset 0x0340 within the address span extender data window. - BBB ID: This component stores the 64-bit Device Feature Header(DFH) and the UUID. The BBB\_ID\_L register stores the lower 32 bits of the BBB ID. The BBB\_ID\_H register stores the upper 32 bits of the BBB ID. A software driver scans the BBB ID to identify the functionality of this DMA subsystem. - Magic Number ROM: This IP core contains a single, read-only 64-byte value. The DMA uses this value to create a write fence in host memory. This ROM is only visible to the MSGDMA. The host cannot access it. - ullet Pipeline Bridge: The Pipeline Bridge inserts pipeline stages to improve the system $F_{\text{MAX}}$ at the expense of latency. # 3. Register Map and Address Spaces The DMA AFU supports two memory views: The DMA view and the host view. The DMA view supports a 50-bit address space. The lower half of the DMA view maps to the FPGA device memory and the magic number ROM inside the DMA BBB. The upper half of the DMA view maps to the host memory which is split into a direct access and a write fence region. The direct access and memory write fence regions overlaps into the same 48-bit host physical address space. The host view includes all the registers accessible through MMIO accesses such as the DFH tables, and the control/status registers of the various IP cores used inside the DMA AFU. The host view includes an indirect mapping mechanism to the device memory so that the host can access the FPGA device memory using a 4KB windowing mechanism built into the DMA BBB. The MMIO registers in the DMA BBB and AFU support 32- and 64- bit access. The DMA AFU does not support 512-bit MMIO accesses. Accesses to the MSGDMA registers inside the DMA BBB must be 32 bits. ## 3.1. DMA AFU Register Map The DMA register map provides the absolute addresses of all the locations within the unit. These registers are in the host view because it's only the host that can access them. Table 4. DMA AFU Memory Map | Byte Address | Name | Span in Bytes | Description | |--------------|----------|---------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | 0x0_0000 | AFU DFH | 8 | Refer to Table 5 on page 13 for the bit fields. | | 0x0_0008 | AFU ID_L | 8 | Set to 0x9081F88B8F655CAA for the DMA AFU. | | 0x0_0010 | AFU ID_H | 8 | Set to 0x331DB30C988541EA for the DMA AFU. | | 0x0_2000 | MPF DFH | 240 | Specifies IDs, feature list, and control and status registers. The MPF decodes this information. This information is not available inside the DMA Platform Designer system. | | 0x2_0000 | DMA BBB | 8192 | The DMA BBB memory map. Refer to Table 6 on page 14 for the register offsets. | #### Table 5. DMA AFU DFH Encoding | Bit Field | Description | | | |-----------|----------------------------------------------------|--|--| | [11:0] | Feature ID. Set to 0. | | | | [15:12] | AFU major revision number. Set to 0. | | | | [39:16] | Next DFH byte offset/DFH region size. Set to 8192. | | | | | continued | | | Intel Corporation. All rights reserved. Intel, the Intel logo, Altera, Arria, Cyclone, Enpirion, MAX, Nios, Quartus and Stratix words and logos are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. Intel warrants performance of its FPGA and semiconductor products to current specifications in accordance with Intel's standard warranty, but reserves the right to make changes to any products and services at any time without notice. Intel assumes no responsibility or liability arising out of the application or use of any information, product, or service described herein except as expressly agreed to in writing by Intel. Intel customers are advised to obtain the latest version of device specifications before relying on any published information and before placing orders for products or services. 9001:2015 Registered | Bit Field | Description | | | | |-----------|---------------------------------------------------------------------------------------|--|--|--| | [40] | End of DFH list. When set, the DFH is at the end of the list. The default value is 0. | | | | | [47:41] | Reserved. | | | | | [51:48] | AFU minor revision number. Set to 0. | | | | | [59:52] | Reserved. | | | | | [63:60] | Feature type. Set to 1 (AFU). | | | | #### Table 6. DMA BBB Memory Map Add the byte addresses below to the BBB DMA base address, 0x2\_0000. | | • | | | |--------------|-------------------------------------|---------------|-------------------------------------------------------| | Byte Address | Name | Span in Bytes | Description | | 0x0000 | BBB DFH | 8 | Refer to Table 6 on page 14 for the bit fields. | | 0x0008 | BBB ID_L | 8 | Set to 0xA9149A35BACE01EA for the DMA BBB. | | 0x0010 | BBB ID_H | 8 | Set to 0xEF82DEF7F6EC40FC for the DMA BBB. | | 0x0040 | MSGDMA CSR | 32 | Controls the DMA. | | 0x0060 | MSGDMA<br>Descriptor | 32 | Receives DMA descriptors. | | 0x0200 | Address Span<br>Extender<br>Control | 8 | Moves the address window that the data port accesses. | | 0x1000 | Address Span<br>Extender<br>Data | 4096 | Maps a 4 KB window to a local device memory. | #### Table 7. DMA BBB DFH Encoding | Bit Field | Description | | |-----------|---------------------------------------------------------------------------------------|--| | [11:0] | Feature ID. Set to 0 | | | [15:12] | AFU major revision number. Set to 0. | | | [39:16] | Next DFH byte offset / DFH region size. Set to 8192. | | | [40] | End of DFH list. When set, the DFH is at the end of the list. The default value is 0. | | | [47:41] | Reserved. | | | [51:48] | AFU minor revision number. Set to 0 | | | [59:52] | Reserved. | | | [63:60] | Feature type. Set to 2 (BBB). | | ### 3.1.1. DMA AFU Address Space The host can access registers listed in the Table 4 on page 13 and the Table 6 on page 14. Host accesses to FPGA local memory must use the Address Span Extender IP core included in the DMA BBB subsystem. The MSGDMA in the DMA BBB subsystem has access to the full 50-bit address space. The lower have of this address space includes the local memories and the Magic Number ROM. The upper half of this address space includes host memory. The following figure shows the host and MSGDMA views of memory. Figure 4. The DMA AFU and Host Views of Memory <sup>(1)</sup> The Address Span Extender can only access the EMIF A and EMIF B address spaces. <sup>(2)</sup> The write fence aliased host memory, addresses <sup>0</sup>x3\_0000\_0000\_0000-0x3\_FFFF\_FFFF\_FFFF, aliases to the host memory spanning 0x2\_0000\_0000\_0000-0x2\_FFFF\_FFFF\_FFFF. The write fence aliased host memory span is write only. Writes to the write fence aliased host memory cause a write fence to be issued followed by the write data accompanying it. This address space should only be written to infrequently to send write fences to synchronize with the host. Reads to this address space are undefined. # 4. Software Programming Model The DMA AFU includes a software driver that you can use in your own host application. The fpga\_dma.c and fgpa\_dma.h files located in the $<installation_path>/hw/samples/dma_afu/sw$ directory implement the software driver. This driver supports the following functions: | API | Description | | | |------------------------|--------------------------------------------------------------------------------------------------------------------------------------|--|--| | fpgaDMAOpen() | Opens a handle to the DMA BBB. | | | | fpgaDMATransferSync() | Transfers data from a source location to a destination location. The source and destination can be located in host or device memory. | | | | fpgaDMATransferClose() | Closes DMA BBB handle previously allocated. | | | #### 4.1. Software APIs #### 4.1.1. fpgaDmaOpen() fpgaDmaOpen() scans the device feature chain to locate the DMA BBB and then creates a handle for the DMA BBB. | Prototype | fpga_result fpgaDmaOpen(fpga_handle fpga, fpga_dma_handle *dma) | | | |-----------|-----------------------------------------------------------------|------------------------------------------|--| | Arguments | fpga Input containing fpga object handle from fpgaOpen(). | | | | | dma | Output containing handle to the DMA BBB. | | | Returns | FPGA_OK on success; otherwise an error return code. | | | # 4.1.2. fpgaDmaTransferSync() | Prototype | fpga_result fpgaDmaTransferSync(fpga_dma_handle dma, uint64_t dst, uint64_t src, size_t count, fpga_dma_transfer_t type) | | |-----------|--------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------| | Arguments | dma | dmaInput specifying DMA handle obtained from fpgaDmaOpen(). | | | dst | Input specifying the destination byte address of the transfer. To maximize performance, make dst a multiple of 64 bytes. | | | src | Input specifying the source byte address of the transfer. To maximize performance, make src a multiple of 64 bytes. | | | count | Input specifying the length of the transfer in bytes. To maximize performance, make count a multiple of 64 bytes. | | | type | Input specifying the type of transfer. type has the following valid values: HOST_TO_FPGA_MM, FPGA_TO_HOST_MM, or FPGA_TO_FPGA_MM. | | Returns | FPGA_OK on success; otherwise error return code. | | Intel Corporation. All rights reserved. Intel, the Intel logo, Altera, Arria, Cyclone, Enpirion, MAX, Nios, Quartus and Stratix words and logos are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. Intel warrants performance of its FPGA and semiconductor products to current specifications in accordance with Intel's standard warranty, but reserves the right to make changes to any products and services at any time without notice. Intel assumes no responsibility or liability arising out of the application or use of any information, product, or service described herein except as expressly agreed to in writing by Intel. Intel customers are advised to obtain the latest version of device specifications before relying on any published information and before placing orders for products or services. 9001:2015 Registered UG-20127 | 2018.12.04 # 4.1.3. fpgaDmaClose() ${\tt fpgaDmaClose()} \ \ {\tt closes} \ \ {\tt the} \ \ {\tt previously} \ \ {\tt allocated} \ \ {\tt DMA} \ \ {\tt BBB} \ \ {\tt handle}.$ | Prototype | fpga_result fpgaDmaClose(fpga_Dma_handle dma) | | | |------------|--------------------------------------------------|----------------------------------------------------------|--| | Parameters | dma | Input containing DMA handle obtained from fpgaDmaOpen(). | | | Returns | FPGA_OK on success; otherwise error return code. | | | # 5. Running DMA AFU Example Before running this example, you should be familiar with the examples in the *Intel Acceleration Stack Quick Start Guide for Intel Programmable Acceleration Card with Intel Arria 10 GX FPGA*. You must define an environment variable. The environment variable is dependent on the Intel Acceleration Stack version you are using: For 1.2, set the environment variable to \$OPAE\_PLATFORM\_ROOT Note: Intel recommends using the GCC (C Compiler) to compile the example software. Compiling the DMA sample application and user space driver with g++ (C++ compiler) results in compilation errors. Complete the following steps to download the DMA AF bitstream and build and run the example software: If you have not already done so, configure the system hugepage to allocate 20, 2 MB hugepages that the DMA user space driver requires<sup>(1)</sup>. This command requires root privileges: ``` $ sudo sh -c "echo 20 > /sys/kernel/mm/hugepages/hugepages-\ 2048kB/nr_hugepages" ``` - 2. \$ sudo fpgaconf \$PLATFORM\_ROOT/hw/samples/dma\_afu/bin/ \ dma\_afu.gbs - 3. \$ cd \$OPAE\_PLATFORM\_ROOT/hw/samples/dma\_afu/sw - 4. \$ make - 5. For 1.1 and earlier release, enter \$ sudo LD\_LIBRARY\_PATH=`pwd`: \$LD\_LIBRARY\_PATH ./fpga\_dma\_test 0 ``` For 1.2 release, enter $ sudo ./fpga_dma_test 0 ``` The DMA software takes approximately a minute to populate test buffers and verify the results. The software prints the following messages during a successful run: ``` Running test in HW mode Buffer Verification Success! Buffer Verification Success! Running DDR sweep test Allocated test buffer Fill test buffer DDR Sweep Host to FPGA Measured bandwidth = 6710.886400 Megabytes/sec Clear buffer DDR Sweep FPGA to Host Measured bandwidth = 6927.366606 Megabytes/sec Verifying buffer.. Buffer Verification Success! ``` Intel Corporation. All rights reserved. Intel, the Intel logo, Altera, Arria, Cyclone, Enpirion, MAX, Nios, Quartus and Stratix words and logos are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. Intel warrants performance of its FPGA and semiconductor products to current specifications in accordance with Intel's standard warranty, but reserves the right to make changes to any products and services at any time without notice. Intel assumes no responsibility or liability arising out of the application or use of any information, product, or service described herein except as expressly agreed to in writing by Intel. Intel customers are advised to obtain the latest version of device specifications before relying on any published information and before placing orders for products or services. ISO 9001:2015 Registered UG-20127 | 2018.12.04 *Note:* The test application prints bandwidth results for different transfer sizes, source and destination addresses. #### **Related Information** Intel Acceleration Stack Quick Start Guide for Intel Programmable Acceleration Card with Arria 10 GX FPGA <sup>(1)</sup> If your host has multiple cards, you will need 20, 2 MB hugepages per card. For an example, a system with four cards requires total 80, 2 MB hugepages. Send Feedback # 6. Optimization for Improved DMA Performance Implementation of NUMA (non-uniform memory access) optimization in the fpga\_dma\_st\_test.c (application) allows processor to access it's own local memory faster than accessing the non-local memory (memory local to another processor). A typical NUMA configuration is shown in the diagram below. The orange arrow represents access from a core to memory local of the same core. The red arrow illustrates the path taken when a core on Node 0 access memory that resides in local memory of Node 1. Figure 5. Typical NUMA Configuration Use the following code to implement NUMA optimization in your test application: ``` // Set up proper affinity if requested if (cpu_affinity || memory_affinity) unsigned dom = 0, bus = 0, dev = 0, func = 0; fpga_properties props; int retval; #if(FPGA_DMA_DEBUG) char str[4096]; #endif res = fpgaGetProperties(afc_token, &props); ON_ERR_GOTO(res, out_destroy_tok, "fpgaGetProperties"); res = fpgaPropertiesGetBus(props, (uint8_t *) & bus); ON_ERR_GOTO(res, out_destroy_tok, "fpgaPropertiesGetBus"); res = fpgaPropertiesGetDevice(props, (uint8_t *) & dev); ON_ERR_GOTO(res, out_destroy_tok, "fpgaPropertiesGetDevice"); res = fpgaPropertiesGetFunction(props, (uint8_t *) & func); ON_ERR_GOTO(res, out_destroy_tok, "fpgaPropertiesGetFunction"); // Find the device from the topology hwloc_topology_t topology; hwloc_topology_init(&topology); hwloc_topology_set_flags(topology, HWLOC_TOPOLOGY_FLAG_IO_DEVICES); ``` Intel Corporation. All rights reserved. Intel, the Intel logo, Altera, Arria, Cyclone, Enpirion, MAX, Nios, Quartus and Stratix words and logos are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. Intel warrants performance of its FPGA and semiconductor products to current specifications in accordance with Intel's standard warranty, but reserves the right to make changes to any products and services at any time without notice. Intel assumes no responsibility or liability arising out of the application or use of any information, product, or service described herein except as expressly agreed to in writing by Intel. Intel customers are advised to obtain the latest version of device specifications before relying on any published information and before placing orders for products or services. ISO 9001:2015 Registered UG-20127 | 2018.12.04 ``` hwloc_topology_load(topology); hwloc_obj_t obj = hwloc_get_pcidev_by_busid(topology, dom, bus, dev, func); hwloc_obj_t obj2 = hwloc_get_non_io_ancestor_obj(topology, obj); #if (FPGA_DMA_DEBUG) hwloc_obj_type_snprintf(str, 4096, obj2, 1); printf("%s\n", str); hwloc_obj_attr_snprintf(str, 4096, obj2, " :: ", 1); printf("%s\n", str); hwloc_bitmap_taskset_snprintf(str, 4096, obj2->cpuset); printf("CPUSET is %s\n", str); hwloc_bitmap_taskset_snprintf(str, 4096, obj2->nodeset); printf("NODESET is %s\n", str); #endif if (memory_affinity) { #if HWLOC_API_VERSION > 0x00020000 retval = hwloc_set_membind(topology, obj2->nodeset, HWLOC_MEMBIND_THREAD, HWLOC_MEMBIND_MIGRATE HWLOC_MEMBIND_BYNODESET); #else hwloc_set_membind_nodeset(topology, obj2->nodeset, HWLOC MEMBIND THREAD, HWLOC_MEMBIND_MIGRATE); #endif ON_ERR_GOTO(retval, out_destroy_tok, "hwloc_set_membind"); if (cpu_affinity) { retval = hwloc_set_cpubind(topology, obj2->cpuset, HWLOC_CPUBIND_STRICT); ON_ERR_GOTO(retval, out_destroy_tok, "hwloc_set_cpubind"); ``` # 7. DMA Accelerator Functional Unit (AFU) User Guide Archives | Intel Acceleration Stack<br>Version | User Guide (PDF) | | | |-------------------------------------|--------------------------------------------------|--|--| | 1.1 | DMA Accelerator Functional Unit (AFU) User Guide | | | | 1.0 | DMA Accelerator Functional Unit (AFU) User Guide | | | # 8. Document Revision History for the DMA Accelerator Functional Unit (AFU) User Guide | Document Version | Intel<br>Acceleration<br>Stack Version | Changes | |------------------|-----------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | 2018.12.04 | 1.2 (supported<br>with Intel<br>Quartus® Prime<br>Pro Edition<br>17.1.1) | <ul> <li>Added new section Optimization for Improved DMA Performance.</li> <li>Added new path for a Platform Designer system in The DMA AFU Hardware Components section.</li> <li>Modified command in Running DMA AFU Example section.</li> </ul> | | 2018.08.15 | 1.1 (supported with Intel Quartus Prime Pro Edition 17.1.1) and 1.0 (supported with Intel Quartus Prime Pro Edition 17.0.0) | <ul> <li>Corrected the broken link in DMA Test System.</li> <li>Minor edits in Running DMA AFU Example.</li> </ul> | | 2018.08.06 | 1.1 (supported with Intel Quartus Prime Pro Edition 17.1.1) and 1.0 (supported with Intel Quartus Prime Pro Edition 17.0.0) | <ul> <li>Modified DMA AFU Block Diagram to show that MMIO Decoder Logic is used in place of Asynchronus Shim BBB for CCI-P transactions.</li> <li>Modified the environment variable from \$DCP_LOC to \$OPAE_PLATFORM_ROOT in the Running DMA AFU Example section.</li> </ul> | | 2018.04.11 | 1.0 (supported<br>with Intel<br>Quartus Prime<br>Pro Edition<br>17.0.0) | Updated Figure: DMA AFU Hardware Block Diagram to include the DMA BBB. Updated information in The DMA AFU Hardware Components Register Map and Address Spaces Running DMA AFU Example Corrected the title of the reference document from the Altera Acceleration Stack for Intel Xeon CPU with FPGAs Getting Started Guide to Intel Acceleration Stack Quick Start Guide for Intel Programmable Acceleration Card with Intel Arria 10 GX FPGA in The DMA AFU Software Package section. Editorial modifications. | | 2017.12.22 | 1.0 Beta<br>(supported with<br>Intel Quartus<br>Prime Pro Edition<br>17.0.0) | Initial release. | Intel Corporation. All rights reserved. Intel, the Intel logo, Altera, Arria, Cyclone, Enpirion, MAX, Nios, Quartus and Stratix words and logos are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. Intel warrants performance of its FPGA and semiconductor products to current specifications in accordance with Intel's standard warranty, but reserves the right to make changes to any products and services at any time without notice. Intel assumes no responsibility or liability arising out of the application or use of any information, product, or service described herein except as expressly agreed to in writing by Intel. Intel customers are advised to obtain the latest version of device specifications before relying on any published information and before placing orders for products or services. 9001:2015 Registered