Replaces blocking Parallel::Sync + MPI_Allreduce in Z4c_class Step() with
non-blocking MPI_Iallreduce overlapped with Sync_start/Sync_finish, matching
the pattern already used in bssn_class on cjy-oneapi-opus-hotfix. Covers
both the ABEtype==2 and CPBC variants, in the predictor and corrector steps
of each (4 call sites in total).

The optimization is cherry-picked from afd4006 and adapted to the SyncCache
infrastructure instead of the separate SyncPlan API.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>