Hansung Kim
4aba018733
sgemm_impl: Fix wrong barrier count; add barrier for write_to_smem
2024-08-19 15:33:23 -07:00
Hansung Kim
e93e54cdec
sgemm_impl: Drop volatile qualifier
...
It doesn't seem to do much and creates excessive type errors.
2024-08-19 15:19:53 -07:00
Hansung Kim
42ddb9a48e
sgemm_impl: Accept layout template param at gemm_single_tile and wmma_load
2024-08-19 13:16:51 -07:00
Hansung Kim
1b133e7b5c
sgemm_impl: Rename dmem load function
2024-08-18 22:26:49 -07:00
Hansung Kim
46b5047775
sgemm_impl: Remove GMEM_COALESCED_A option
...
Uncoalesced GMEM accesses are verified to be slow, and the relevant
code is no longer used; remove the cruft.
2024-08-18 22:26:02 -07:00
Hansung Kim
04643fa64d
sgemm_impl: Refactor dmem_load into one unified logic
...
Replace the confusing logic, which used BM/BN/BK slightly differently
for A and B, with a single code path that accepts the matrix memory
layout as a proper argument and determines the right dimensions at
compile time.
TODO: !GMEM_COALESCED_A is not updated yet
2024-08-18 22:05:22 -07:00
Hansung Kim
b44b202a21
sgemm_impl: Rename to wmma
2024-08-18 16:21:22 -07:00
Hansung Kim
b978bf8757
sgemm_impl: Split tile offset addr gen from wmma store
...
& add an option to write to smem in gemm_single_tile.
2024-08-18 16:10:29 -07:00
Hansung Kim
d0809d292a
sgemm: Specify A/B tile SMEM address via template args
...
& split single-tile GEMM into a separate function.
2024-08-16 18:01:57 -07:00
Hansung Kim
a1858e0c80
sgemm_impl: Parameterize BK/TCK by FP_SIZE
2024-08-15 20:33:33 -07:00
Hansung Kim
014f7cd06f
sgemm_tcore: Unpack arg params, remove threadblock_dim_y
...
thread_block_gemm is meant to be reusable, so it shouldn't assume what
the kernel arg struct looks like.
threadblock_dim_y was ambiguous, and its use didn't match its name either
(it was used as the number of warps that participate in a barrier).
2024-08-14 20:34:49 -07:00
Hansung Kim
1b1264207b
sgemm_tcore: Add compile-time write_to_gmem param to thread_block_gemm
2024-08-14 17:48:31 -07:00
Hansung Kim
ee6339a35f
sgemm_tcore: Split all impl code into sgemm_impl.hpp
...
This makes thread_block_gemm a reusable library function for GEMM
operations in other kernels.
2024-08-14 16:24:48 -07:00