Cache Sync in RestrictProlong: replace 11 basic Parallel::Sync() calls
with Parallel::Sync_cached() across RestrictProlong, RestrictProlong_aux,
and ProlongRestrict to avoid rebuilding grid segment lists every call.
Merge paired MPI_Allreduce in surface_integral: combine 9 pairs of
consecutive RP/IP Allreduce calls into single calls with count=2*NN.
Merge scalar MPI_Allreduce in surf_MassPAng: combine 3 groups of 7
scalar Allreduce calls (mass + angular/linear momentum) into single
calls with count=7.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>