Replace the duplicated z4c_gpu_rhs_ss.cu with a lightweight
gpu_rhs_z4c_ss wrapper inside bssn_gpu_rhs_ss.cu (guarded by
#if ABEtype==2). The wrapper:
1. Builds trKd = trK + 2*TZ on host and passes it to gpu_rhs_ss
2. After BSSN GPU returns, computes TZ_rhs = alpn1*Hcon/2 and
applies kappa1/kappa2 constraint damping on CPU
This avoids duplicate kernel definitions (linker errors) and
keeps all shell GPU code in a single file. The CPU-side Z4C
corrections are O(100K) operations — negligible vs GPU RHS time.
Also remove the separate z4c_gpu_rhs_ss.cu and its build rule.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>