- api: host code to set up and execute FPGA bitstreams. Compiled to a static library that can be linked to your application
- kernels: OpenCL kernel code for 1D, 2D and 3D FFT
- examples: sample code that makes use of the api
- cmake: CMake modules used by the build system
- scripts: convenience Slurm scripts
- docs: describes models regarding performance and resource utilization

These additional libraries are automatically fetched during system configuration:
- cxxopts for command line argument parsing
- hlslib for CMake Intel FPGA OpenCL find packages
- findFFTW for CMake FFTW find package
- gtest for unit tests
The following compile options can be set when creating a CMake build directory, either with the -D parameter on the command line or interactively with ccmake or cmake-gui, for example:
cmake -DCMAKE_BUILD_TYPE=Release ..
ccmake ..
| **Name** | **Description** | **Default Values** | **Alternate Values** |
|---|---|---|---|
| AOC\_FLAGS | Intel offline compiler flags used for kernel compilation | -g -v -no-interleaving=default | |
| EMU\_FLAGS | Compiler flags used for emulation, with fast emulation as default | -march=emulator | |
| FPGA\_BOARD\_NAME | Name of the target FPGA board | p520\_hpc\_sg280l | pac\_s10\_usm |
| LOG\_FFT\_SIZE | Currently supported log2 number of points along each FFT dimension | 6 | 5, 7, 8, 9 |
| BURST\_INTERLEAVING | Toggle to enable burst interleaved global memory accesses. Controls the -no-interleaving= flag in AOC\_FLAGS | NO | YES |
| DDR\_BUFFER\_LOCATION | Name of the global memory interface found in the board\_spec.xml. DDR: p520\_hpc\_sg280l, device: pac\_s10\_usm | DDR | device |
| SVM\_BUFFER\_LOCATION | Name of the SVM global memory interface found in the board\_spec.xml. "": p520\_hpc\_sg280l, host: pac\_s10\_usm | host | |
| CMAKE\_BUILD\_TYPE | Specify the build type | Debug | Release, RelWithDebInfo |

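LOG\_FFT\_SIZE is the base-2 logarithm of the number of points per dimension, so the default of 6 corresponds to 64 points, which is also the default of the `-n` option in the examples. The sketch below only illustrates that relationship; the helper names are hypothetical and are not part of this repository.

```python
# Supported values of LOG_FFT_SIZE, taken from the options table.
SUPPORTED_LOG_FFT_SIZES = (5, 6, 7, 8, 9)


def points_for(log_fft_size: int) -> int:
    """Return the number of points per dimension for a LOG_FFT_SIZE value."""
    if log_fft_size not in SUPPORTED_LOG_FFT_SIZES:
        raise ValueError(f"unsupported LOG_FFT_SIZE: {log_fft_size}")
    return 1 << log_fft_size


def log_fft_size_for(points: int) -> int:
    """Return the LOG_FFT_SIZE value matching a number of points per dimension."""
    if points <= 0 or points & (points - 1):
        raise ValueError(f"{points} is not a power of two")
    log_size = points.bit_length() - 1
    if log_size not in SUPPORTED_LOG_FFT_SIZES:
        raise ValueError(f"{points} points is outside the supported range")
    return log_size
```

For example, `log_fft_size_for(128)` gives the value to pass as `-DLOG_FFT_SIZE=` when a 128-point transform is wanted.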
Generation of Intel OpenCL Offline Compiler reports
make <kernel_name>_report
make fft1d_report

CMake debug builds let you step through code using gdb.
gdb --args ./fft -n 64 -d 2 -p <path-to-bitstream>
Offloading FFT on FPGA
Usage:
./fft<..> [OPTION...]
-n, --num arg Number of sample points in a dimension (default: 64)
-d, --dim arg Number of dimensions (default: 3)
-b, --back Toggle Backward FFT
-i, --iter arg Number of iterations (default: 1)
-p, --path arg Path to FPGA bitstream
-y, --noverify Toggle to not verify with FFTW
-c, --batch arg Number of batches of FFT calculations in FPGA (default: 1)
-t, --burst Toggle to use burst interleaved global memory accesses in
FPGA
-m, --use_bram Toggle to use BRAM instead of DDR for 3D Transpose
-s, --use_usm Toggle to use Unified Shared Memory features for data
transfers between host and device
-e, --emulate Toggle to enable emulation
-h, --help Print usage

The examples measure and output relevant performance metrics, as shown below:
------------------------------------------
FFT CONFIGURATION:
--------------------------------------------
Type : Complex to Complex
Points : 64
Direction : Forward
Placement : In Place
Batch : 1
Iterations : 1
Transpose3D : DDR
Burst Interleaving : No
Emulation : Yes
USM Feature : No
--------------------------------------------
-- Initializing FPGA ...
-- 1 platforms found
0: intel(r) fpga emulation platform for opencl(tm)
-- 1 devices found
Choosing first device by default
-- Getting program binary from path: p520_hpc_sg280l/emulation/fft1d_64_nointer/fft1d.aocx
-- Building the program
0: Calculating FFT -
-- Launching 1D FFT of 1 batches
Launching FFT transform for 1 batch
-- Copying data from host to device
-- Executing kernels
-- Transfering results back to host
-- Cleaning up FPGA resources ...
------------------------------------------
Measurements
--------------------------------------------
PCIe Write = 0.0000ms
Kernel Execution = 0.0182ms
Kernel Exec/Batch = 0.0182ms
PCIe Read = 0.0000ms
Total = 0.0182ms
Throughput = 0.0982GFLOPS/s | 26.8213 GB/s

- PCIe Write: the time taken in milliseconds to transfer data from the host memory of the CPU to the global memory of the FPGA.
- PCIe Read: the time taken in milliseconds to transfer data from the global memory of the FPGA to the host memory of the CPU.
- Kernel Execution: the time taken in milliseconds for the execution of the required kernels, which includes the global memory accesses.
- Total: PCIe Write + Kernel Execution + PCIe Read
- Throughput: $$ \frac{dim \cdot 5 \cdot N^{dim} \cdot \log_2 N}{runtime} $$ where N is the number of sample points per dimension (`-n`) and dim is the number of dimensions (`-d`).
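
The Total and Throughput definitions above can be sketched as a few lines of Python. This is a minimal illustration rather than the code the examples use: the function names are hypothetical, and it assumes the runtime is given in milliseconds and the result is wanted in GFLOPS.

```python
import math


def fft_flops(n: int, dim: int) -> float:
    """Operation count from the formula above: dim * 5 * N^dim * log2(N)."""
    return dim * 5 * n**dim * math.log2(n)


def throughput_gflops(n: int, dim: int, runtime_ms: float) -> float:
    """Throughput in GFLOPS, assuming runtime_ms is measured in milliseconds."""
    runtime_s = runtime_ms * 1e-3
    return fft_flops(n, dim) / runtime_s / 1e9


def total_ms(pcie_write_ms: float, kernel_exec_ms: float, pcie_read_ms: float) -> float:
    """Total = PCIe Write + Kernel Execution + PCIe Read."""
    return pcie_write_ms + kernel_exec_ms + pcie_read_ms
```

A 64-point 1D transform counts as 1 · 5 · 64 · 6 = 1920 operations, and a 64³ 3D transform as 3 · 5 · 64³ · 6 = 23,592,960.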