Welcome to streamblocks-partitioning, yet another platform for StreamBlocks!
This platform implements a profile-guided hardware-software partitioning flow
for StreamBlocks. It takes profiling information collected for a program on both
hardware and software and produces a set of partitions that are expected to
perform better than the trivial all-in-hardware or all-in-software partitions.
We assume you have a basic understanding of StreamBlocks' compilation flow, so we jump right into running a simple example.
The setup is similar to streamblocks-platforms: we use maven to build jar
files and a shell script to run the partitioning tool (you need to have
streamblocks-tycho installed first).
> cd partitioning/platform-partition
> mvn install
We use a mixed-integer linear programming approach to partition the design across hardware and software. To solve the resulting optimization problems, we rely on Gurobi 8.1.0, a commercial optimizer with free academic licenses. Please head over to gurobi to get a license and then download version 8.1.0:
> wget https://packages.gurobi.com/8.1/gurobi8.1.0_linux64.tar.gz
> mkdir -p gurobi
> tar -xzf gurobi8.1.0_linux64.tar.gz -C gurobi
Activate your license:
> ./gurobi/gurobi810/linux64/bin/grbgetkey ${YOUR_LICENSE}
To use the Java bindings you need to include gurobi/gurobi810/linux64/lib/ in your LD_LIBRARY_PATH:
> export LD_LIBRARY_PATH=${STREAMBLOCKS_HOME}/streamblocks-partitioning/partitioning/gurobi/gurobi810/linux64/lib
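Depending on your setup, you may also want GUROBI_HOME exported and the Gurobi tools on your PATH; the Java bindings themselves live in gurobi.jar under the lib directory. This is a sketch assuming the extraction layout used above:
> export GUROBI_HOME=${STREAMBLOCKS_HOME}/streamblocks-partitioning/partitioning/gurobi/gurobi810/linux64
> export PATH=${PATH}:${GUROBI_HOME}/bin
> export CLASSPATH=${CLASSPATH}:${GUROBI_HOME}/lib/gurobi.jar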
We rely on program profiles to perform partitioning. While software profiles
can easily be obtained through real execution, for hardware profiles we rely on
simulation. streamblocks-platforms can generate SystemC code for hardware-software
co-simulation to collect per-actor hardware profile information. The simulation
relies on Verilator. Install Verilator and ensure VERILATOR_ROOT is set properly.
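If you do not already have Verilator, a minimal sketch of building it from source and running it in place looks like the following (if you install Verilator through a package manager instead, point VERILATOR_ROOT at that installation's share directory):
> git clone https://github.com/verilator/verilator
> cd verilator
> autoconf && ./configure && make -j$(nproc)
> export VERILATOR_ROOT=$(pwd)
> export PATH=${VERILATOR_ROOT}/bin:${PATH}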
This repository works like the other StreamBlocks platforms, i.e., you have
to pass the CAL source files and a target directory. In addition, you pass
--set config=profile_data.json, where profile_data.json is a profiling config
file in json. It looks something like:
{
  "name": "RVCDecoder",
  "cores": 8,
  "mode": "heterogeneous",
  "systemc": {
    "multiplier": 1,
    "freq": 210,
    "path": "PATH_TO_SYSTEMC_PROFILE_XML"
  },
  "opencl": {
    "multiplier": 1,
    "freq": 1000,
    "path": "PATH_TO_OPENCL_PROFILE_XML"
  },
  "bandwidth": {
    "multiplier": 1,
    "freq": 3400.0,
    "path": "PATH_TO_FIFO_BW_PROFILE_XML"
  },
  "software": {
    "multiplier": 1.0,
    "freq": 3400.0,
    "path": "PATH_TO_SOFTWARE_PROFILE_XML"
  }
}
This json file points to four other xml files that contain profiling
information. opencl and bandwidth are platform dependent but program
independent: opencl models the read/write bandwidth over the CPU-FPGA
interconnect (e.g., PCIe) for various buffer sizes, and bandwidth models the
software FIFO bandwidth within and across threads. To obtain the platform
profile data, consult this guide.
Software profiles can be obtained by compiling a project for software-only execution.
Suppose you have followed the PassThrough example from the streamblocks-platforms guide:
> ./PassThrough --with-bandwidth --with-complexity --generate=software_profile.xml
This will log the per-actor software profile in software_profile.xml.
Hardware profiles require a bit more work. When compiling CAL, pass
--set enable-systemc=on to streamblocks to generate the code needed for
simulation-based profiling. The Streamblocks.cmake file in streamblocks-examples
provides a handy function for generating simulation code, which we recommend you use.
After building the binary, you can collect the hardware profile using:
> ./PassThrough --hardware-profile=systemc_profile.xml
The freq field denotes the clock frequency at which the profiling is
performed. For instance, OpenCL performance numbers are in nanoseconds,
therefore the frequency is 1 GHz. The SystemC freq is an estimate of the final
operating frequency on the FPGA. Note that you have to estimate this number,
e.g., put everything on the FPGA, implement your design, and use the achieved
frequency as an estimate for any other partitioning. The software freq is
the clock speed at which the CPU performance counters work. For a modern x86
processor this usually corresponds to the nominal processor speed, but for ARM
the number might be different.
The bandwidth freq should be the same as the software freq, since the
profiling relies on the same methodology and performance counters.
Since we profile hardware performance through simulation, you may want to use a down-sampled input to keep profiling times reasonable. If you do that, you can still use the full input for software; all you need to do is tell the partitioning tool to up-sample the hardware profiling numbers using the multiplier field.
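For example, if the SystemC profile was collected on an input one tenth the size of the one used for the software profile, you could scale the hardware numbers back up as follows (this assumes the profiled numbers scale roughly linearly with the input size):
"systemc": {
  "multiplier": 10,
  "freq": 210,
  "path": "PATH_TO_SYSTEMC_PROFILE_XML"
}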
Setting the cores field in the config file instructs the tool to try to find
partitions up to the given core count. For instance, if you set cores to 4,
the tool will solve 4 different optimization problems (i.e., for 1, 2, 3,
and 4 cores). A higher core count usually corresponds to a longer run time.
You can set the mode field to either heterogeneous or homogeneous. In the
former, the actor network is partitioned across CPU cores and an FPGA, whereas
in the latter the work is only partitioned across multiple cores. Note that in
heterogeneous mode, only actors that have a valid systemc profile will be
considered for hardware. This is a way for you to pin some actors to software
by essentially excluding them from the systemc profile xml file.
The tool will generate a bunch of .xcf and .xml files. The xcf files can be
given to streamblocks using the --xcf-path argument to specify which actors are
placed on hardware and which are placed on software. The xml files are loaded
at runtime by the executable using the --cfile argument and specify the
actor-to-thread mappings.
├── heterogeneous
│ ├── 1 <== partitions found for a single core system
│ │ ├── multicore
│ │ │ ├── config_0.xml <== fed to --cfile argument at runtime
│ │ │ ├── config_1.xml
│ │ │ ├── config_2.xml
│ │ │ └── config_3.xml
│ │ └── xcf
│ │ ├── configuration_0.xcf <== fed to streamblocks using --xcf-path
│ │ ├── configuration_1.xcf
│ │ ├── configuration_2.xcf
│ │ └── configuration_3.xcf
│ ├── 2 <== partitions found for a dual core system
│ │ ├── multicore
│ │ │ ├── config_0.xml
│ │ │ ├── config_1.xml
│ │ │ └── config_2.xml
│ │ └── xcf
│ │ ├── configuration_0.xcf
│ │ ├── configuration_1.xcf
│ │ └── configuration_2.xcf
│ ├── hardware.json
├── unique
│ ├── unique_0.xcf <== distinct hardware partitions (see below)
│ ├── unique_1.xcf
Each xcf file is therefore paired with an xml file. For instance,
heterogeneous/n/multicore/config_m.xml should be used with an executable
produced by heterogeneous/n/xcf/configuration_m.xcf.
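As a sketch, a matching compile/run pair for the PassThrough design from above could look like this; the omitted arguments and the exact flag spelling follow the streamblocks-platforms guide:
> streamblocks ... --xcf-path heterogeneous/1/xcf/configuration_0.xcf ...
> ./PassThrough --cfile heterogeneous/1/multicore/config_0.xml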
The unique directory enumerates all the distinct hardware partitions (i.e.,
unique subsets of actors placed on hardware). This can be used to avoid
redundant FPGA implementations; for instance, heterogeneous/1/xcf/configuration_2.xcf
might be the same as heterogeneous/2/xcf/configuration_3.xcf. The hardware.json
file contains a mapping from every heterogeneous/n/xcf/configuration_m.xcf
to a unique/unique_p.xcf file.