Merge branch 'master' of https://github.gatech.edu/casl/Vortex
@@ -18,7 +18,7 @@ Directory structure
|
|||||||
|
|
||||||
- benchmarks: OpenCL and RISC-V benchmarks
|
- benchmarks: OpenCL and RISC-V benchmarks
|
||||||
|
|
||||||
- docs: documentation.
|
- docs: [documentation](https://github.com/vortexgpgpu/vortex-dev/blob/master/doc/Vortex.md).
|
||||||
|
|
||||||
- hw: hardware sources.
|
- hw: hardware sources.
|
||||||
|
|
||||||
|
|||||||
35
doc/Codebase.md
Normal file
@@ -0,0 +1,35 @@
|
|||||||
|
# Vortex Codebase
|
||||||
|
|
||||||
|
The directory/file layout of the Vortex codebase is as follows:
|
||||||
|
|
||||||
|
- `benchmarks`: contains OpenCL, RISC-V, and vector tests
|
||||||
|
- `opencl`: contains basic kernel operation tests (e.g., vector add, transpose, dot product)
|
||||||
|
- `riscv`: contains the official RISC-V tests, pre-compiled into binaries
|
||||||
|
- `vector`: tests for vector instructions (not yet implemented)
|
||||||
|
- `ci`: contains tests to be run during continuous integration (Travis CI)
|
||||||
|
- driver, opencl, riscv_isa, and runtime tests
|
||||||
|
- `driver`: contains driver software implementation (software that is run on the host to communicate with the vortex processor)
|
||||||
|
- `opae`: contains code for driver that runs on FPGA
|
||||||
|
- `rtlsim`: contains code for the driver that runs on the local machine (built using Verilator, which converts the RTL into a C++ binary)
|
||||||
|
- `simx`: contains code for driver that runs on local machine (vortex)
|
||||||
|
- `include`: contains vortex.h which has the vortex API that is used by the drivers
|
||||||
|
- `runtime`: contains software used inside kernel programs to expose GPGPU capabilities
|
||||||
|
- `include`: contains vortex API needed for runtime
|
||||||
|
- `linker`: contains linker file for compiling kernels
|
||||||
|
- `src`: contains implementation of vortex API (from include folder)
|
||||||
|
- `tests`: contains runtime tests
|
||||||
|
- `simple`: contains test for GPGPU functionality allowed in vortex
|
||||||
|
- `simx`: contains simX, the cycle-approximate simulator for Vortex
|
||||||
|
- `miscs`: contains old code that is no longer used
|
||||||
|
- `hw`:
|
||||||
|
- `unit_tests`: contains unit tests for the RTL of the cache and queue
|
||||||
|
- `syn`: contains all synthesis scripts (quartus and yosys)
|
||||||
|
- `quartus`: contains code to synthesize the cache, core, pipeline, top, and Vortex stand-alone
|
||||||
|
- `simulate`: contains RTL simulator (verilator)
|
||||||
|
- `testbench.cpp`: runs either the riscv, runtime, or opencl tests
|
||||||
|
- `opae`: contains source code for the accelerator functional unit (AFU) and code which programs the fpga
|
||||||
|
- `rtl`: contains rtl source code
|
||||||
|
- `cache`: contains cache subsystem code
|
||||||
|
- `fp_cores`: contains floating point unit code
|
||||||
|
- `interfaces`: contains code that handles communication for each of the units of the microarchitecture
|
||||||
|
- `libs`: contains general-purpose modules (e.g., buffers, encoders, arbiters, pipe registers)
|
||||||
BIN
doc/Images/vortex_microarchitecture_v2.png
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 517 KiB |
94
doc/Microarchitecture.md
Normal file
@@ -0,0 +1,94 @@
|
|||||||
|
# Vortex Microarchitecture
|
||||||
|
|
||||||
|
### Vortex GPGPU Execution Model
|
||||||
|
|
||||||
|
Vortex uses the SIMT (Single Instruction, Multiple Threads) execution model with a single warp issued per cycle.
|
||||||
|
|
||||||
|
- **Threads**
|
||||||
|
- Smallest unit of computation
|
||||||
|
- Each thread has its own register file (32 int + 32 fp registers)
|
||||||
|
- Threads execute in parallel
|
||||||
|
- **Warps**
|
||||||
|
- A logical cluster of threads
|
||||||
|
- Each thread in a warp executes the same instruction
|
||||||
|
- The PC is shared; maintain thread mask for Writeback
|
||||||
|
- Warp's execution is time-multiplexed at log steps
|
||||||
|
- Ex. warp 0 executes at cycle 0, warp 1 executes at cycle 1
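As a rough illustration of the time-multiplexing described above, the following shell sketch prints which warp would issue on each cycle under plain round-robin selection. It assumes four warps that are always active and never stalled, which a real warp scheduler does not guarantee; `NUM_WARPS` and `warp_for_cycle` are hypothetical names used only for this example.

```shell
#!/bin/sh
# Sketch only: assumes NUM_WARPS warps that are always ready
# (no stalls, no barriers, no divergence).
NUM_WARPS=4   # hypothetical value chosen for illustration

# Print the warp that issues on a given cycle under simple round-robin.
warp_for_cycle() {
    echo $(( $1 % NUM_WARPS ))
}

cycle=0
while [ "$cycle" -lt 8 ]; do
    echo "cycle $cycle: warp $(warp_for_cycle "$cycle")"
    cycle=$(( cycle + 1 ))
done
```

With these assumptions, warp 0 issues at cycle 0, warp 1 at cycle 1, and the pattern wraps around after the last warp.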
|
||||||
|
|
||||||
|
### Vortex RISC-V ISA Extension
|
||||||
|
|
||||||
|
- **Thread Mask Control**
|
||||||
|
- Control the number of threads to activate during execution
|
||||||
|
- `TMC` *count*: activate count threads
|
||||||
|
- **Warp Scheduling**
|
||||||
|
- Control the number of warps to activate during execution
|
||||||
|
- `WSPAWN` *count, addr*: activate count warps and jump to addr location
|
||||||
|
- **Control-Flow Divergence**
|
||||||
|
- Control threads to activate when a branch diverges
|
||||||
|
- `SPLIT` *predicate*: apply 'taken' predicate thread mask and save 'not-taken' into IPDOM stack
|
||||||
|
- `JOIN`: restore 'not-taken' thread mask
|
||||||
|
- **Warp Synchronization**
|
||||||
|
- `BAR` *id, count*: stall warps entering barrier *id* until count is reached
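The `SPLIT` and `JOIN` descriptions above can be sketched with plain integer bitmasks. This is a minimal model, not the hardware implementation: it assumes a 4-thread warp, represents the IPDOM stack as a space-separated string, and the function names `split` and `join` are made up for the example. It only models saving and restoring the 'not-taken' mask; reconvergence back to the full mask is outside the sketch.

```shell
#!/bin/sh
# Sketch only: thread masks are integers, bit i set means thread i is active.
tmask=15        # 0b1111, all four threads of a hypothetical warp are active
ipdom_stack=""  # saved masks, most recent first (hypothetical representation)

# SPLIT: keep the threads whose predicate bit is set, save the rest.
split() {
    taken=$(( tmask & $1 ))
    not_taken=$(( tmask & ~$1 ))
    ipdom_stack="$not_taken $ipdom_stack"
    tmask=$taken
}

# JOIN: restore the most recently saved 'not-taken' mask.
join() {
    [ -n "$ipdom_stack" ] || return 1
    set -- $ipdom_stack
    tmask=$1
    shift
    ipdom_stack="$*"
}

split 5                             # predicate 0b0101: threads 0 and 2 take the branch
echo "after SPLIT: tmask=$tmask"    # mask of the 'taken' threads
join
echo "after JOIN:  tmask=$tmask"    # mask of the 'not-taken' threads
```

Keeping the saved masks on a stack is what lets nested branches be unwound in the reverse order in which they diverged.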
|
||||||
|
|
||||||
|
### Vortex Pipeline/Datapath
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
Vortex has a 5-stage pipeline: FI | ID | Issue | EX | WB.
|
||||||
|
|
||||||
|
- **Fetch**
|
||||||
|
- Warp Scheduler
|
||||||
|
- Track stalled & active warps, resolve branches and barriers, maintain split/join IPDOM stack
|
||||||
|
- Instruction Cache
|
||||||
|
- Retrieve instruction from cache, issue I-cache requests/responses
|
||||||
|
- **Decode**
|
||||||
|
- Decode fetched instructions, notify warp scheduler when the following instructions are decoded:
|
||||||
|
- Branch, tmc, split/join, wspawn
|
||||||
|
- Precompute used_regs mask (needed for Issue stage)
|
||||||
|
- **Issue**
|
||||||
|
- Scheduling
|
||||||
|
- In-order issue (operands/execute unit ready), out-of-order commit
|
||||||
|
- IBuffer
|
||||||
|
- Store fetched instructions, separate queues per-warp, selects next warp through round-robin scheduling
|
||||||
|
- Scoreboard
|
||||||
|
- Track in-use registers
|
||||||
|
- GPRs (General-Purpose Registers) stage
|
||||||
|
- Fetch issued instruction operands and send operands to execute unit
|
||||||
|
- **Execute**
|
||||||
|
- ALU Unit
|
||||||
|
- Single-cycle operations (+,-,>>,<<,&,|,^), Branch instructions (Share ALU resources)
|
||||||
|
- MULDIV Unit
|
||||||
|
- Multiplier - done in 2 cycles
|
||||||
|
- Divider - division and remainder, done in 32 cycles
|
||||||
|
- Implements serial algorithm (stalls the pipeline)
|
||||||
|
- FPU Unit
|
||||||
|
- Multi-cycle operations, uses `FPnew` Library on ASIC, uses hard DSPs on FPGA
|
||||||
|
- CSR Unit
|
||||||
|
- Store constant status registers - device caps, FPU status flags, performance counters
|
||||||
|
- Handle external CSR requests (requests from host CPU)
|
||||||
|
- LSU Unit
|
||||||
|
- Handle load/store operations, issue D-cache requests, handle D-cache responses
|
||||||
|
- Commit load responses - saves storage, Scoreboard tracks completion
|
||||||
|
- GPGPU Unit
|
||||||
|
- Handle GPGPU instructions
|
||||||
|
- TMC, WSPAWN, SPLIT, BAR
|
||||||
|
- JOIN is handled by Warp Scheduler (upon SPLIT response)
|
||||||
|
- **Commit**
|
||||||
|
- Commit
|
||||||
|
- Update CSR flags, update performance counters
|
||||||
|
- Writeback
|
||||||
|
- Write result back to GPRs, notify Scoreboard (release in-use register), select candidate instruction (ALU unit has highest priority)
|
||||||
|
- **Clustering**
|
||||||
|
- Group multiple cores into clusters (optionally share L2 cache)
|
||||||
|
- Group multiple clusters (optionally share L3 cache)
|
||||||
|
- Configurable at build time
|
||||||
|
- Default configuration:
|
||||||
|
- #Clusters = 1
|
||||||
|
- #Cores = 4
|
||||||
|
- #Warps = 4
|
||||||
|
- #Threads = 4
|
||||||
|
- **FPGA AFU Interface**
|
||||||
|
- Manage CPU-GPU communication
|
||||||
|
- Query device caps, load kernel instructions and resource buffers, start kernel execution, read destination buffers
|
||||||
|
- Local Memory - GPU access to local DRAM
|
||||||
|
- Reserved I/O addresses - redirect to host CPU, console output
|
||||||
@@ -24,12 +24,27 @@ Running tests under specific drivers (rtlsim,simx,fpga) is done using the script
|
|||||||
- *L3cache* - used to enable the shared l3cache among the Vortex clusters.
|
- *L3cache* - used to enable the shared l3cache among the Vortex clusters.
|
||||||
- *Driver* - used to specify which driver to run the Vortex simulation (either rtlsim, vlsim, fpga, or simx).
|
- *Driver* - used to specify which driver to run the Vortex simulation (either rtlsim, vlsim, fpga, or simx).
|
||||||
- *Debug* - used to enable debug mode for the Vortex simulation.
|
- *Debug* - used to enable debug mode for the Vortex simulation.
|
||||||
- *Scope* -
|
- *Perf* - used to enable the detailed performance counters within the Vortex simulation.
|
||||||
- *Perf* - is used to enable the detailed performance counters within the Vortex simulation.
|
- *App* - used to specify which test/benchmark to run in the Vortex simulation. The main choices are vecadd, sgemm, basic, demo, and dogfood. Other tests/benchmarks are located in the `/benchmarks/opencl` folder, though not all of them work with the current version of Vortex.
|
||||||
- *App* - is used to specify which test/benchmark to run in the Vortex simulation. The main choices are vecadd, sgemm, basic, demo, and dogfood. Other tests/benchmarks are located in the `/benchmarks/opencl` folder though not all of them work wit the current version of Vortex.
|
- *Args* - used to pass additional arguments to the application.
|
||||||
- *Args* -
|
|
||||||
|
|
||||||
Example use of command line arguments: Run the sgemm benchmark using the vlsim driver with a Vortex configuration of 1 cluster, 4 cores, 4 warps, and 4 threads.
|
Example use of command line arguments: Run the sgemm benchmark using the vlsim driver with a Vortex configuration of 1 cluster, 4 cores, 4 warps, and 4 threads.
|
||||||
|
|
||||||
$ ./ci/blackbox.sh --clusters=1 --cores=4 --warps=4 --threads=4 --driver=vlsim --app=sgemm
|
$ ./ci/blackbox.sh --clusters=1 --cores=4 --warps=4 --threads=4 --driver=vlsim --app=sgemm
|
||||||
|
|
||||||
|
Output from terminal:
|
||||||
|
```
|
||||||
|
Create context
|
||||||
|
Create program from kernel source
|
||||||
|
Upload source buffers
|
||||||
|
Execute the kernel
|
||||||
|
Elapsed time: 2463 ms
|
||||||
|
Download destination buffer
|
||||||
|
Verify result
|
||||||
|
PASSED!
|
||||||
|
PERF: core0: instrs=90802, cycles=52776, IPC=1.720517
|
||||||
|
PERF: core1: instrs=90693, cycles=53108, IPC=1.707709
|
||||||
|
PERF: core2: instrs=90849, cycles=53107, IPC=1.710678
|
||||||
|
PERF: core3: instrs=90836, cycles=50347, IPC=1.804199
|
||||||
|
PERF: instrs=363180, cycles=53108, IPC=6.838518
|
||||||
|
```
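The final `PERF` line in the sample output can be recomputed from the per-core lines. The sketch below assumes that total instructions are the sum over cores and total cycles are the maximum over cores, which is consistent with the numbers shown above but is not confirmed against the simulator source; `aggregate_perf` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: rebuild the aggregate PERF line from the per-core lines.
# Assumption: instrs = sum over cores, cycles = max over cores.
aggregate_perf() {
    awk -F'[=, ]+' '
        /^PERF: core[0-9]+:/ {
            # Fields: PERF: coreN: instrs <n> cycles <n> IPC <x>
            instrs += $4
            if ($6 + 0 > cycles + 0) cycles = $6 + 0
        }
        END {
            if (cycles > 0)
                printf "PERF: instrs=%d, cycles=%d, IPC=%f\n", instrs, cycles, instrs / cycles
        }
    '
}

# Feed the per-core lines from the sample output above.
aggregate_perf <<'EOF'
PERF: core0: instrs=90802, cycles=52776, IPC=1.720517
PERF: core1: instrs=90693, cycles=53108, IPC=1.707709
PERF: core2: instrs=90849, cycles=53107, IPC=1.710678
PERF: core3: instrs=90836, cycles=50347, IPC=1.804199
EOF
```

Under these assumptions the aggregate IPC exceeds any single core's IPC because the cores run concurrently: the instruction counts add up while the cycle count is bounded by the slowest core.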
|
||||||
31
doc/Vortex.md
Normal file
@@ -0,0 +1,31 @@
|
|||||||
|
# Vortex Documentation
|
||||||
|
|
||||||
|
### Table of Contents
|
||||||
|
|
||||||
|
- [Vortex Codebase Layout](https://github.com/vortexgpgpu/vortex-dev/blob/master/doc/Codebase.md)
|
||||||
|
- [Vortex Microarchitecture and Extended RISC-V ISA](https://github.com/vortexgpgpu/vortex-dev/blob/master/doc/Microarchitecture.md)
|
||||||
|
- Vortex Software
|
||||||
|
- [Vortex Simulation](https://github.com/vortexgpgpu/vortex-dev/blob/master/doc/Simulation.md)
|
||||||
|
- [FPGA Configuration, Program and Test](https://github.com/vortexgpgpu/vortex-dev/blob/master/doc/Flubber_FPGA_Startup_Guide.md)
|
||||||
|
- Debugging
|
||||||
|
- Useful Links
|
||||||
|
|
||||||
|
### Quick Start
|
||||||
|
|
||||||
|
Setup Vortex environment:
|
||||||
|
```
|
||||||
|
$ export RISCV_TOOLCHAIN_PATH=/opt/riscv-gnu-toolchain
|
||||||
|
$ export PATH=/opt/verilator/bin:$PATH
|
||||||
|
$ export VERILATOR_ROOT=/opt/verilator
|
||||||
|
```
|
||||||
|
|
||||||
|
Test Vortex with different drivers and configurations:
|
||||||
|
- Run basic driver test with rtlsim driver and Vortex config of 2 clusters, 2 cores, 2 warps, 4 threads
|
||||||
|
|
||||||
|
$ ./ci/blackbox.sh --clusters=2 --cores=2 --warps=2 --threads=4 --driver=rtlsim --app=basic
|
||||||
|
- Run demo driver test with vlsim driver and Vortex config of 1 cluster, 4 cores, 4 warps, 2 threads
|
||||||
|
|
||||||
|
$ ./ci/blackbox.sh --clusters=1 --cores=4 --warps=4 --threads=2 --driver=vlsim --app=demo
|
||||||
|
- Run dogfood driver test with simx driver and Vortex config of 4 clusters, 4 cores, 8 warps, 6 threads
|
||||||
|
|
||||||
|
$ ./ci/blackbox.sh --clusters=4 --cores=4 --warps=8 --threads=6 --driver=simx --app=dogfood
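When several configurations need to be tested, the command lines above can be generated rather than typed by hand. The sketch below only prints the commands (a dry run) and does not invoke `blackbox.sh`; the helper name `blackbox_cmd` and the particular sweep values are assumptions made for illustration, while the flags are the ones used in the examples above.

```shell
#!/bin/sh
# Sketch only: print blackbox.sh command lines for a sweep, without running them.
# Arguments: clusters cores warps threads driver app
blackbox_cmd() {
    echo "./ci/blackbox.sh --clusters=$1 --cores=$2 --warps=$3 --threads=$4 --driver=$5 --app=$6"
}

# Hypothetical sweep: vary the core count for the basic test on the simx driver.
for cores in 1 2 4; do
    blackbox_cmd 1 "$cores" 4 4 simx basic
done
```

Piping the printed lines to `sh` from the repository root would run them one after another.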
|
||||||
@@ -5,19 +5,16 @@ Description: Makes the build in the opae directory with the specified core
|
|||||||
exists, a make clean command is run before the build. Script waits
|
exists, a make clean command is run before the build. Script waits
|
||||||
until the inteldev script or quartus program is finished running.
|
until the inteldev script or quartus program is finished running.
|
||||||
|
|
||||||
Usage: ./build.sh -c [1|2|4|8|16] [-p perf] [-w wait]
|
Usage: ./build.sh -c [1|2|4|8|16] [-p [y|n]]
|
||||||
|
|
||||||
Options:
|
Options:
|
||||||
-c
|
-c
|
||||||
Core count (1, 2, 4, 8, or 16).
|
Core count (1, 2, 4, 8, or 16).
|
||||||
|
|
||||||
-p
|
-p
|
||||||
Performance profiling enable. Changes the source file in the
|
Performance profiling enable (y or n). Changes the source file in the
|
||||||
opae directory to include/exclude "+define+PERF_ENABLE".
|
opae directory to include/exclude "+define+PERF_ENABLE".
|
||||||
|
|
||||||
-w
|
|
||||||
Wait for the build to complete
|
|
||||||
|
|
||||||
_______________________________________________________________________________
|
_______________________________________________________________________________
|
||||||
|
|
||||||
|
|
||||||
@@ -27,6 +24,7 @@ Description: Runs build.sh with performance profiling enabled for all valid
|
|||||||
core configurations.
|
core configurations.
|
||||||
|
|
||||||
_______________________________________________________________________________
|
_______________________________________________________________________________
|
||||||
|
_______________________________________________________________________________
|
||||||
|
|
||||||
|
|
||||||
-program_fpga.sh-
|
-program_fpga.sh-
|
||||||
@@ -41,6 +39,7 @@ Options:
|
|||||||
Core count (1, 2, 4, 8, or 16).
|
Core count (1, 2, 4, 8, or 16).
|
||||||
|
|
||||||
_______________________________________________________________________________
|
_______________________________________________________________________________
|
||||||
|
_______________________________________________________________________________
|
||||||
|
|
||||||
|
|
||||||
-gather_perf_results.sh-
|
-gather_perf_results.sh-
|
||||||
@@ -65,3 +64,53 @@ _______________________________________________________________________________
|
|||||||
Description: Programs fpga and runs gather_perf_results.sh for all valid core
|
Description: Programs fpga and runs gather_perf_results.sh for all valid core
|
||||||
configurations. All builds should already be made before running
|
configurations. All builds should already be made before running
|
||||||
this.
|
this.
|
||||||
|
|
||||||
|
_______________________________________________________________________________
|
||||||
|
_______________________________________________________________________________
|
||||||
|
|
||||||
|
|
||||||
|
-export_csv.sh-
|
||||||
|
|
||||||
|
Description: Creates specified .csv output file from an input directory, file,
|
||||||
|
and parameter. The .csv file contains two columns: cores, and the input
|
||||||
|
parameter. The output file is located within the directory specified with -d.
|
||||||
|
|
||||||
|
Usage: ./export_csv.sh -c [cores] -d [directory] -i [input filename] -o
|
||||||
|
[output filename] -p '[parameter]'
|
||||||
|
|
||||||
|
Example: ./export_csv.sh -c 16 -d perf_2021_03_07 -i sgemm.result -o output.csv
|
||||||
|
-p 'PERF: scoreboard stalls'
|
||||||
|
|
||||||
|
Options:
|
||||||
|
-c
|
||||||
|
Upper limit of cores to be read in. Core directories should exist in
|
||||||
|
the directory specified by -d e.g. 1c, 2c, 4c for -c 4.
|
||||||
|
|
||||||
|
-d
|
||||||
|
The directory of the form perf_{date} located in the evaluation
|
||||||
|
directory.
|
||||||
|
|
||||||
|
-i
|
||||||
|
The input filename located in each core directory within the
|
||||||
|
directory specified by -d.
|
||||||
|
|
||||||
|
-o
|
||||||
|
The output filename to be created within the directory specified
|
||||||
|
by -d.
|
||||||
|
|
||||||
|
-p
|
||||||
|
The parameter corresponding to the core count in the .csv file. The
|
||||||
|
full name of the parameter from the start of the line should be
|
||||||
|
inputted to avoid the parameter name being matched multiple times.
|
||||||
|
|
||||||
|
_______________________________________________________________________________
|
||||||
|
|
||||||
|
|
||||||
|
-export_ipc_csv.sh-
|
||||||
|
|
||||||
|
Description: Runs export_csv.sh for the parameter IPC.
|
||||||
|
|
||||||
|
Usage: ./export_ipc_csv.sh -c [cores] -d [directory] -i [input filename] -o
|
||||||
|
[output filename]
|
||||||
|
|
||||||
|
Example: ./export_ipc_csv.sh -c 16 -d perf_2021_03_07 -i sgemm.result -o output.csv
|
||||||
|
|||||||
33
evaluation/scripts/export_csv.sh
Executable file
@@ -0,0 +1,33 @@
|
|||||||
|
#!/bin/bash
|
||||||
|
|
||||||
|
while getopts c:d:i:o:p: flag
|
||||||
|
do
|
||||||
|
case "${flag}" in
|
||||||
|
c) cores=${OPTARG};; #1, 2, 4, 8, 16
|
||||||
|
d) dir=${OPTARG};; #directory name (e.g. perf_2021_03_07)
|
||||||
|
i) ifile=${OPTARG};; #input filename
|
||||||
|
o) ofile=${OPTARG};; #output filename
|
||||||
|
p) param=${OPTARG};; #parameter to be made into csv
|
||||||
|
esac
|
||||||
|
done
|
||||||
|
|
||||||
|
if [[ ! "$cores" =~ ^(1|2|4|8|16)$ ]]; then
|
||||||
|
echo 'Invalid parameter for argument -c (1, 2, 4, 8, or 16 expected)'
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [ -z "$ifile" ]; then
|
||||||
|
echo 'No input filename given for argument -i'
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [ -z "$dir" ]; then
|
||||||
|
echo 'No directory given for argument -d'
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
|
||||||
|
printf 'cores,%s\n' "${param}" > "../${dir}/${ofile}"
|
||||||
|
for ((i=1; i<=$cores; i=i*2)); do
|
||||||
|
printf "${i}," >> "../${dir}/${ofile}"
|
||||||
|
(sed -n "s/${param}=\(.*\)/\1/p" < "../${dir}/${i}c/${ifile}") >> "../${dir}/${ofile}"
|
||||||
|
done
|
||||||
32
evaluation/scripts/export_ipc_csv.sh
Executable file
@@ -0,0 +1,32 @@
|
|||||||
|
#!/bin/bash
|
||||||
|
|
||||||
|
while getopts c:d:i:o: flag
|
||||||
|
do
|
||||||
|
case "${flag}" in
|
||||||
|
c) cores=${OPTARG};; #1, 2, 4, 8, 16
|
||||||
|
d) dir=${OPTARG};; #directory name (e.g. perf_2021_03_07)
|
||||||
|
i) ifile=${OPTARG};; #input filename
|
||||||
|
o) ofile=${OPTARG};; #output filename
|
||||||
|
esac
|
||||||
|
done
|
||||||
|
|
||||||
|
if [[ ! "$cores" =~ ^(1|2|4|8|16)$ ]]; then
|
||||||
|
echo 'Invalid parameter for argument -c (1, 2, 4, 8, or 16 expected)'
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [ -z "$ifile" ]; then
|
||||||
|
echo 'No input filename given for argument -i'
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [ -z "$dir" ]; then
|
||||||
|
echo 'No directory given for argument -d'
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
|
||||||
|
printf "cores,IPC\n" > "../${dir}/${ofile}"
|
||||||
|
for ((i=1; i<=$cores; i=i*2)); do
|
||||||
|
printf "${i}," >> "../${dir}/${ofile}"
|
||||||
|
(sed -n "s/IPC=\(.*\)/\1/p" < "../${dir}/${i}c/${ifile}" | awk 'END {print $NF}') >> "../${dir}/${ofile}"
|
||||||
|
done
|
||||||