13 Commits

Author SHA1 Message Date
Hansung Kim
4aba018733 sgemm_impl: Fix wrong barrier count; add barrier for write_to_smem 2024-08-19 15:33:23 -07:00
Hansung Kim
e93e54cdec sgemm_impl: Drop volatile qualifier
It doesn't seem to do much and creates excessive type errors.
2024-08-19 15:19:53 -07:00
Hansung Kim
42ddb9a48e sgemm_impl: Accept layout template param at gemm_single_tile and wmma_load 2024-08-19 13:16:51 -07:00
Hansung Kim
1b133e7b5c sgemm_impl: Rename dmem load function 2024-08-18 22:26:49 -07:00
Hansung Kim
46b5047775 sgemm_impl: Remove GMEM_COALESCED_A option
Uncoalesced GMEM accesses are verified to be slow and the relevant code
is no longer used; remove the cruft.
2024-08-18 22:26:02 -07:00
Hansung Kim
04643fa64d sgemm_impl: Refactor dmem_load into one unified logic
Replace the confusing logic, which used BM/BN/BK slightly differently
for A and B, with a single routine that accepts the matrix memory layout
as a proper argument and determines the right dimensions at compile
time.

TODO: !GMEM_COALESCED_A is not updated yet
2024-08-18 22:05:22 -07:00
Hansung Kim
b44b202a21 sgemm_impl: Rename to wmma 2024-08-18 16:21:22 -07:00
Hansung Kim
b978bf8757 sgemm_impl: Split tile offset addr gen from wmma store
& add an option to write to smem in gemm_single_tile.
2024-08-18 16:10:29 -07:00
Hansung Kim
d0809d292a sgemm: Specify A/B tile SMEM address via template args
& split single-tile GEMM into a separate function.
2024-08-16 18:01:57 -07:00
Hansung Kim
a1858e0c80 sgemm_impl: Parameterize BK/TCK by FP_SIZE 2024-08-15 20:33:33 -07:00
Hansung Kim
014f7cd06f sgemm_tcore: Unpack arg params, remove threadblock_dim_y
thread_block_gemm is meant to be reusable, so it shouldn't assume what
the kernel arg struct looks like.

threadblock_dim_y was ambiguous and didn't match the literal name either
(it was used as # of warps that participate in a barrier).
2024-08-14 20:34:49 -07:00
Hansung Kim
1b1264207b sgemm_tcore: Add compile-time write_to_gmem param to thread_block_gemm 2024-08-14 17:48:31 -07:00
Hansung Kim
ee6339a35f sgemm_tcore: Split all impl code into sgemm_impl.hpp
This is to make thread_block_gemm a re-usable library function for GEMM
operations for use in other kernels.
2024-08-14 16:24:48 -07:00
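
Illustration for 04643fa64d: the commit describes a loader that takes the matrix memory layout as an argument and picks the right tile dimensions at compile time. The sketch below shows one way that idea could look in plain C++. It is not the code from sgemm_impl.hpp: the names MemLayout, Operand, TileDims and load_tile, and the BM/BN/BK values, are all assumptions made for illustration only.

```cpp
#include <cstddef>

// Hypothetical layout tag; the real code may name this differently.
enum class MemLayout { row_major, col_major };

// Which operand of C = A * B the tile belongs to (hypothetical).
enum class Operand { A, B };

// Hypothetical tile sizes: BM x BN output tile, BK reduction depth.
constexpr std::size_t BM = 64, BN = 64, BK = 32;

// Resolve tile dimensions at compile time from (operand, layout),
// so A and B share one code path instead of two slightly different ones.
template <Operand op, MemLayout layout>
struct TileDims {
  // Logical shape: an A tile is BM x BK, a B tile is BK x BN.
  static constexpr std::size_t rows = (op == Operand::A) ? BM : BK;
  static constexpr std::size_t cols = (op == Operand::A) ? BK : BN;
  // The contiguous dimension in memory depends on the layout.
  static constexpr std::size_t contiguous =
      (layout == MemLayout::row_major) ? cols : rows;
  static constexpr std::size_t strided =
      (layout == MemLayout::row_major) ? rows : cols;
};

// Copy one tile from a source with leading dimension `ld` into a densely
// packed destination. A single loop nest serves every (operand, layout) pair.
template <Operand op, MemLayout layout>
void load_tile(const float* src, std::size_t ld, float* dst) {
  using D = TileDims<op, layout>;
  for (std::size_t s = 0; s < D::strided; ++s) {
    for (std::size_t c = 0; c < D::contiguous; ++c) {
      dst[s * D::contiguous + c] = src[s * ld + c];
    }
  }
}
```

A caller would instantiate it as, for example, load_tile<Operand::A, MemLayout::row_major>(a_ptr, lda, tile_buf). Because the dimensions are compile-time constants, the loop bounds are known to the compiler for each instantiation.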