Spawn tasks so that the threads in a warp see contiguous thread_id
values, unlike the original variant, where each thread was allocated
a range of thread_id values spanning the number of batches.
E.g. in a 4-thread config, instead of mapping IDs (0,2,4,6)->(1,3,5,7),
map (0,1,2,3)->(4,5,6,7).
TODO: remaining logic not implemented.
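
A minimal sketch of the two ID mappings, assuming the arrows in the example
above denote successive batches run by the same four threads. The function
names and the (thread, batch) parameterization are hypothetical and are not
taken from the actual spawn code; the sketch only reproduces the 4-thread
example.

```c
#include <assert.h>

/* Hypothetical: original mapping. Each thread owns a contiguous range of
 * num_batches IDs, so the threads of one batch see strided IDs. */
int task_id_blocked(int thread, int batch, int num_batches) {
    return thread * num_batches + batch;
}

/* Hypothetical: new mapping. Each batch covers num_threads consecutive IDs,
 * so the threads of one batch see contiguous IDs. */
int task_id_contiguous(int thread, int batch, int num_threads) {
    return batch * num_threads + thread;
}

/* Checks the 4-thread, 2-batch example from the description. */
void check_example(void) {
    const int num_threads = 4, num_batches = 2;
    for (int t = 0; t < num_threads; ++t) {
        /* Original: batch 0 -> (0,2,4,6), batch 1 -> (1,3,5,7). */
        assert(task_id_blocked(t, 0, num_batches) == 2 * t);
        assert(task_id_blocked(t, 1, num_batches) == 2 * t + 1);
        /* New: batch 0 -> (0,1,2,3), batch 1 -> (4,5,6,7). */
        assert(task_id_contiguous(t, 0, num_threads) == t);
        assert(task_id_contiguous(t, 1, num_threads) == t + 4);
    }
}
```

With the contiguous mapping, neighbouring threads in a batch work on
neighbouring IDs, which is the property the change is after.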
+ Microarchitecture optimizations
+ 64-bit support
+ Xilinx FPGA support
+ LLVM-16 support
+ Refactoring and quality control fixes
minor update
cleanup
cache bindings and memory perf refactoring
hw unit test fixes