diff --git a/docs/contest_runners.md b/docs/contest_runners.md
index 28c0d33..8db6ddf 100644
--- a/docs/contest_runners.md
+++ b/docs/contest_runners.md
@@ -1,254 +1,53 @@
-# Contest Runners
-
-This directory contains two self-contained contest entrypoints:
-
-- `tools/tn_contest_runner.py`: general tensor-network path search and contraction.
-- `tools/mps_contest_runner.py`: Vidal/MPS multi-node expectation runner.
-
-Both scripts keep circuit and observable definitions inside the script so a
-contest case can be edited in one place.
-
-## Environment
-
-Run commands from the repository root:
-
+# TN
 ```bash
-cd /home/yx/qibotn
-```
+# Run from the qibotn directory
+I_MPI_FABRICS=shm:ofi \
+I_MPI_OFI_PROVIDER=tcp \
+FI_PROVIDER=tcp \
+CASE=main1 \
+OBSERVABLES=long_z_string \
+NQUBITS=34 \
+NLAYERS=20 \
+TORCH_THREADS=48 \
+SEARCH_REPEATS=2048 \
+SEARCH_TIME=300 \
+SCHEDULER_HOST=10.20.1.103 \
+WORKER_HOSTS="10.20.1.103 10.20.6.101" \
+DASK_ADDRESS="tcp://10.20.1.103:8786" \
+NWORKERS=84 \
+NTHREADS=1 \
+MPIEXEC_FULL="mpirun -np 4 -hostfile /home/yx/qibotn/hostfile -perhost 2" \
+tools/run_tn_dask_mpi_all.sh
 
-For Intel MPI on two nodes, use the known working style:
+# Run only the contraction (contract) step
 
-```bash
-mpirun -np 4 -hostfile /home/yx/qibotn/hostfile -perhost 2 ...
-```
-
-Set `TCM_ENABLE=1` for CPU runs:
-
-```bash
-export TCM_ENABLE=1
-```
-
-## TN Workflow
-
-List built-in TN contest cases:
-
-```bash
-python -u tools/tn_contest_runner.py list
-```
-
-TN path search uses dask by default. Without `--dask-address`, the script starts
-a local dask cluster. For multiple servers, start one scheduler and workers
-with the helper script, then pass the scheduler address to the search command.
-
-Start the default two-node dask cluster:
-
-```bash
-cd /home/yx/qibotn
-tools/manage_tn_dask_cluster.sh start
-```
-
-Check status:
-
-```bash
-cd /home/yx/qibotn
-tools/manage_tn_dask_cluster.sh status
-```
-
-Stop the cluster:
-
-```bash
-cd /home/yx/qibotn
-tools/manage_tn_dask_cluster.sh stop
-```
-
-The helper defaults are:
-
-```bash
-SCHEDULER_HOST=10.20.1.103
-WORKER_HOSTS="10.20.1.103 10.20.1.102"
-NWORKERS=48
-NTHREADS=1
-ROOT_DIR=/home/yx/qibotn
-PYTHON_BIN=.venv/bin/python
-DASK_WORKER_TTL="24 hours"
-DASK_TICK_LIMIT="30 minutes"
-DASK_LOST_WORKER_TIMEOUT="30 minutes"
-```
-
-Override them inline if needed:
-
-```bash
-WORKER_HOSTS="10.20.1.103 10.20.1.102" NWORKERS=48 \
-  tools/manage_tn_dask_cluster.sh restart
-```
-
-Check that both nodes are connected by adding `--tn-debug-trials` to a small
-search. The output should include `qibotn_dask_workers` with both hosts.
-
-`tools/tn_contest_runner.py search` stops the external dask cluster after the
-search phase by default. Pass `--keep-dask` if you want to reuse the same dask
-cluster for several searches.
-
-Use enough trials to fill the cluster. With the default two-node setup there are
-96 worker slots, so `--tn-search-repeats` should be at least 96. The contest
-runner default is 2048.
-
-Cotengra trials are CPU-bound and can hold the Python GIL long enough for dask
-to report `Event loop was unresponsive`. Dask defaults are much more aggressive:
-`scheduler.worker-ttl=5 minutes`, `admin.tick.limit=3s`, and
-`deploy.lost-worker-timeout=15s`. The helper script raises these limits so
-workers are not killed by dask during search. The intended timeout is
-`--tn-search-time`; after that, the runner stops the external dask cluster.
-
-Small correctness check against statevector:
-
-```bash
-python -u tools/tn_contest_runner.py validate \
-  --case main1 \
-  --nqubits 8 \
-  --nlayers 2 \
-  --torch-threads 4 \
-  --tn-search-repeats 8 \
-  --tn-search-time 5
-```
-
-Search and save contraction trees:
-
-```bash
-TCM_ENABLE=1 python -u tools/tn_contest_runner.py search \
-  --case main1 \
-  --torch-threads 48 \
-  --dtype complex64 \
-  --dask-address tcp://10.20.1.103:8786 \
-  --tn-search-repeats 2048 \
-  --tn-search-time 300
-```
-
-Contract using the saved tree on one node:
-
-```bash
-TCM_ENABLE=1 mpirun -np 2 python -u tools/tn_contest_runner.py contract \
+I_MPI_FABRICS=shm:ofi \
+I_MPI_OFI_PROVIDER=tcp \
+FI_PROVIDER=tcp \
+mpirun -np 4 -hostfile /home/yx/qibotn/hostfile -perhost 2 \
+  .venv/bin/python -u tools/tn_contest_runner.py contract \
   --mpi \
   --case main1 \
+  --nqubits 34 \
+  --nlayers 20 \
+  --observables long_z_string \
+  --tree-dir trees/contest_tn \
   --torch-threads 48 \
   --dtype complex64
 ```
 
-Contract using the saved tree on two nodes:
-
-```bash
-TCM_ENABLE=1 mpirun -np 4 -hostfile /home/yx/qibotn/hostfile -perhost 2 \
-  python -u tools/tn_contest_runner.py contract \
-  --mpi \
-  --case main1 \
-  --torch-threads 48 \
-  --dtype complex64
+# MPS
 ```
+cd /home/yx/qibotn
 
-Run search and contract in one command:
-
-```bash
-TCM_ENABLE=1 python -u tools/tn_contest_runner.py all \
-  --case main1 \
-  --torch-threads 48 \
-  --dtype complex64 \
-  --dask-address tcp://10.20.1.103:8786 \
-  --tn-search-repeats 2048 \
-  --tn-search-time 300
-```
-
-Run only selected observables:
-
-```bash
-python -u tools/tn_contest_runner.py search \
-  --case main2 \
-  --observables open_zz
-```
-
-Tree files are written to `trees/contest_tn/` by default. The tree filename
-contains case, observable, qubit count, layer count, and target slice count.
-If any of these change, search again.
-
-Edit TN contest cases in `tools/tn_contest_runner.py`:
-
-- `CASES`: case name, circuit kind, observable list, default scale.
-- `build_circuit`: circuit definitions.
-- `pauli_sum_observable`: observable definitions.
-
-## MPS Workflow
-
-List built-in Vidal/MPS contest cases:
-
-```bash
-python -u tools/mps_contest_runner.py list
-```
-
-Small correctness check against statevector:
-
-```bash
-mpirun -np 2 python -u tools/mps_contest_runner.py validate \
-  --case main1 \
-  --nqubits 8 \
-  --nlayers 2 \
-  --bond 64 \
-  --torch-threads 4
-```
-
-Run one MPS case on one node:
-
-```bash
-TCM_ENABLE=1 mpirun -np 2 python -u tools/mps_contest_runner.py run \
-  --case main1 \
-  --torch-threads 48
-```
-
-Run one MPS case on two nodes:
-
-```bash
-TCM_ENABLE=1 mpirun -np 4 -hostfile /home/yx/qibotn/hostfile -perhost 2 \
-  python -u tools/mps_contest_runner.py run \
-  --case main1 \
-  --torch-threads 48
-```
-
-Run only one observable:
-
-```bash
-TCM_ENABLE=1 mpirun -np 4 -hostfile /home/yx/qibotn/hostfile -perhost 2 \
-  python -u tools/mps_contest_runner.py run \
-  --case main1 \
-  --observables ring_xz \
-  --torch-threads 48
-```
-
-Override scale:
-
-```bash
-TCM_ENABLE=1 mpirun -np 4 -hostfile /home/yx/qibotn/hostfile -perhost 2 \
-  python -u tools/mps_contest_runner.py run \
-  --case main1 \
-  --nqubits 128 \
-  --nlayers 24 \
-  --bond 1024 \
-  --torch-threads 48
-```
-
-Edit MPS contest cases in `tools/mps_contest_runner.py`:
-
-- `CASES`: case name, circuit kind, observable list, default scale and bond.
-- `build_circuit`: circuit definitions.
-- `observable`: observable definitions, including dense local terms.
-
-## Notes
-
-- TN uses path search plus contraction. Reuse tree files only for the exact same
-  circuit, observable, qubit count, layer count, seed, and slicing setup.
-- TN path search defaults to dask. Use `--tn-search-backend processpool` only
-  for fallback/debugging.
-- Prefer the default `--tn-target-size 4294967296` memory target. Do not force
-  `--tn-target-slices` unless you have already verified that cotengra can find
-  valid trees for that exact setting.
-- MPS/Vidal does not use contraction-tree search. It runs the circuit directly
-  and reports `trunc_sum` and `trunc_max`.
-- Default TN contraction is the stable torch/quimb path. Do not pass
-  `--tn-contract-implementation cpp` for contest runs.
+I_MPI_FABRICS=shm:ofi \
+I_MPI_OFI_PROVIDER=tcp \
+FI_PROVIDER=tcp \
+MPIEXEC_FULL="mpirun -np 4 -hostfile /home/yx/qibotn/hostfile -perhost 2" \
+TORCH_THREADS=48 \
+OBS_FILTER=ring_xz \
+MAIN1_NQ=128 \
+MAIN1_LAYERS=24 \
+MAIN1_BOND=1024 \
+tools/run_vidal_mpi_contest_cases.sh main1
+```
\ No newline at end of file
diff --git a/hostfile b/hostfile
index e596b93..19358eb 100644
--- a/hostfile
+++ b/hostfile
@@ -1,2 +1,2 @@
 10.20.1.103:2
-10.20.1.102:2
+10.20.6.101:2
diff --git a/src/qibotn/backends/cpu.py b/src/qibotn/backends/cpu.py
index 5259fe0..27db0b5 100644
--- a/src/qibotn/backends/cpu.py
+++ b/src/qibotn/backends/cpu.py
@@ -41,11 +41,19 @@ def _bind_numa_node(rank):
     Returns the NUMA domain that was selected, or ``None`` if the binding
     could not be determined.
     """
+    current_affinity = os.sched_getaffinity(0)
+    online_cpus = set(range(os.cpu_count() or 1))
+    if current_affinity and current_affinity != online_cpus:
+        # MPI launchers such as Intel MPI often pin local ranks correctly
+        # before Python starts.  Do not narrow that placement further.
+        return None
+
     local_rank = rank
     for name in (
         "OMPI_COMM_WORLD_LOCAL_RANK",
         "MV2_COMM_WORLD_LOCAL_RANK",
         "MPI_LOCALRANKID",
+        "I_MPI_LOCAL_RANK",
         "SLURM_LOCALID",
     ):
         try:
@@ -54,13 +62,27 @@ def _bind_numa_node(rank):
         except (KeyError, ValueError):
             pass
 
-    domain = local_rank % 2
-    cpulist = f"/sys/devices/system/node/node{domain}/cpulist"
+    domains = _available_numa_domains()
+    if not domains:
+        return None
+
+    local_size = _local_world_size()
+    assigned_domains = domains[local_rank::local_size]
+    if not assigned_domains:
+        assigned_domains = [domains[local_rank % len(domains)]]
+
+    domain = assigned_domains[0]
+    cpus = set()
+    for selected in assigned_domains:
+        cpulist = f"/sys/devices/system/node/node{selected}/cpulist"
+        try:
+            cpus.update(_parse_cpu_list(open(cpulist, encoding="utf-8").read().strip()))
+        except (FileNotFoundError, OSError):
+            pass
     try:
-        cpus = _parse_cpu_list(open(cpulist, encoding="utf-8").read().strip())
         if cpus:
             os.sched_setaffinity(0, cpus)
-    except (FileNotFoundError, OSError):
+    except OSError:
         pass
 
     try:
@@ -76,6 +98,38 @@ def _bind_numa_node(rank):
     return domain
 
 
+def _available_numa_domains():
+    nodes = []
+    base = Path("/sys/devices/system/node")
+    try:
+        for path in base.glob("node[0-9]*"):
+            try:
+                nodes.append(int(path.name[4:]))
+            except ValueError:
+                pass
+    except OSError:
+        return []
+    return sorted(nodes)
+
+
+def _local_world_size():
+    for name in (
+        "OMPI_COMM_WORLD_LOCAL_SIZE",
+        "MV2_COMM_WORLD_LOCAL_SIZE",
+        "MPI_LOCALNRANKS",
+        "I_MPI_LOCAL_SIZE",
+        "SLURM_NTASKS_PER_NODE",
+    ):
+        value = os.environ.get(name)
+        if not value:
+            continue
+        try:
+            return max(1, int(str(value).split("(", 1)[0]))
+        except ValueError:
+            pass
+    return 1
+
+
 def _parse_cpu_list(text):
     cpus = set()
     for item in text.split(","):
diff --git a/src/qibotn/parallel.py b/src/qibotn/parallel.py
index 2603b4f..46ecc53 100644
--- a/src/qibotn/parallel.py
+++ b/src/qibotn/parallel.py
@@ -745,6 +745,12 @@ def _contract_mpi(
     is_torch = backend == "torch"
     nslices = int(getattr(tree, "multiplicity", 1))
     stats = SlicedContractStats(rank, size, nslices, 0, assignment)
+    nslices_by_rank = comm.allgather(nslices)
+    if len(set(nslices_by_rank)) != 1:
+        raise RuntimeError(
+            "Inconsistent contraction tree slices across MPI ranks: "
+            f"{nslices_by_rank}. Ensure all nodes load the same tree file."
+        )
 
     if not set(getattr(tree, "sliced_inds", ())).isdisjoint(set(getattr(tree, "output", ()))):
         raise NotImplementedError(
diff --git a/tools/manage_tn_dask_cluster.sh b/tools/manage_tn_dask_cluster.sh
index 2fb7446..b91cd84 100755
--- a/tools/manage_tn_dask_cluster.sh
+++ b/tools/manage_tn_dask_cluster.sh
@@ -5,7 +5,7 @@ set -euo pipefail
 #
 # Defaults target two servers:
 #   scheduler: 10.20.1.103:8786
-#   workers:   10.20.1.103, 10.20.1.102
+#   workers:   10.20.1.103, 10.20.6.101
 #
 # Usage:
 #   tools/manage_tn_dask_cluster.sh start
@@ -14,7 +14,7 @@ set -euo pipefail
 #
 # Common overrides:
 #   SCHEDULER_HOST=10.20.1.103
-#   WORKER_HOSTS="10.20.1.103 10.20.1.102"
+#   WORKER_HOSTS="10.20.1.103 10.20.6.101"
 #   NWORKERS=48
 #   NTHREADS=1
 #   ROOT_DIR=/home/yx/qibotn
@@ -25,8 +25,8 @@ PYTHON_BIN="${PYTHON_BIN:-.venv/bin/python}"
 SCHEDULER_HOST="${SCHEDULER_HOST:-10.20.1.103}"
 SCHEDULER_PORT="${SCHEDULER_PORT:-8786}"
 DASHBOARD_ADDRESS="${DASHBOARD_ADDRESS:-:8787}"
-WORKER_HOSTS="${WORKER_HOSTS:-10.20.1.103 10.20.1.102}"
-NWORKERS="${NWORKERS:-48}"
+WORKER_HOSTS="${WORKER_HOSTS:-10.20.1.103 10.20.6.101}"
+NWORKERS="${NWORKERS:-84}"
 NTHREADS="${NTHREADS:-1}"
 MEMORY_LIMIT="${MEMORY_LIMIT:-0}"
 LOCAL_DIRECTORY="${LOCAL_DIRECTORY:-/tmp/qibotn-dask}"
diff --git a/tools/run_tn_dask_mpi_all.sh b/tools/run_tn_dask_mpi_all.sh
new file mode 100755
index 0000000..c273534
--- /dev/null
+++ b/tools/run_tn_dask_mpi_all.sh
@@ -0,0 +1,93 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
+cd "$ROOT_DIR"
+
+CASE="${CASE:-main1}"
+OBSERVABLES="${OBSERVABLES:-long_z_string}"
+NQUBITS="${NQUBITS:-34}"
+NLAYERS="${NLAYERS:-20}"
+TORCH_THREADS="${TORCH_THREADS:-48}"
+SEARCH_REPEATS="${SEARCH_REPEATS:-2048}"
+SEARCH_TIME="${SEARCH_TIME:-300}"
+TN_TARGET_SIZE="${TN_TARGET_SIZE:-8589934592}"
+TN_TARGET_SLICES="${TN_TARGET_SLICES:-}"
+
+PYTHON_BIN="${PYTHON_BIN:-.venv/bin/python}"
+DTYPE="${DTYPE:-complex64}"
+TREE_DIR="${TREE_DIR:-trees/contest_tn}"
+DASK_ADDRESS="${DASK_ADDRESS:-tcp://10.20.1.103:8786}"
+MPIEXEC_FULL="${MPIEXEC_FULL:-mpirun -np 4 -hostfile /home/yx/qibotn/hostfile -perhost 2}"
+SYNC_TREES="${SYNC_TREES:-1}"
+SYNC_HOSTS="${SYNC_HOSTS:-${WORKER_HOSTS:-}}"
+SSH_BIN="${SSH_BIN:-ssh}"
+
+export TCM_ENABLE="${TCM_ENABLE:-1}"
+
+tn_slice_args=(--tn-target-size "$TN_TARGET_SIZE")
+if [[ -n "$TN_TARGET_SLICES" ]]; then
+  tn_slice_args+=(--tn-target-slices "$TN_TARGET_SLICES")
+fi
+
+is_local_host() {
+  local host="$1"
+  [[ "$host" == "localhost" || "$host" == "127.0.0.1" ]] && return 0
+  [[ "$host" == "$(hostname)" ]] && return 0
+  [[ "$host" == "$(hostname -f 2>/dev/null || true)" ]] && return 0
+  hostname -I 2>/dev/null | tr ' ' '\n' | grep -qx "$host"
+}
+
+sync_trees_to_hosts() {
+  [[ "$SYNC_TREES" == "1" ]] || return 0
+  [[ -n "$SYNC_HOSTS" ]] || return 0
+
+  local src_dir="$TREE_DIR"
+  local dst_dir="$TREE_DIR"
+  if [[ "$TREE_DIR" != /* ]]; then
+    src_dir="$ROOT_DIR/$TREE_DIR"
+    dst_dir="$ROOT_DIR/$TREE_DIR"
+  fi
+
+  for host in $SYNC_HOSTS; do
+    is_local_host "$host" && continue
+    echo "Sync tree dir to $host:$dst_dir"
+    "$SSH_BIN" "$host" "mkdir -p $(printf '%q' "$dst_dir")"
+    if command -v rsync >/dev/null 2>&1; then
+      rsync -a "$src_dir/" "$host:$dst_dir/"
+    else
+      scp -q "$src_dir"/*.pkl "$host:$dst_dir/"
+    fi
+  done
+}
+
+tools/manage_tn_dask_cluster.sh start
+
+echo "Search with dask: $DASK_ADDRESS"
+"$PYTHON_BIN" -u tools/tn_contest_runner.py search \
+  --case "$CASE" \
+  --nqubits "$NQUBITS" \
+  --nlayers "$NLAYERS" \
+  --observables $OBSERVABLES \
+  --tree-dir "$TREE_DIR" \
+  --dask-address "$DASK_ADDRESS" \
+  --torch-threads "$TORCH_THREADS" \
+  --dtype "$DTYPE" \
+  --tn-search-repeats "$SEARCH_REPEATS" \
+  --tn-search-time "$SEARCH_TIME" \
+  "${tn_slice_args[@]}"
+
+sync_trees_to_hosts
+
+echo "Contract with MPI: $MPIEXEC_FULL"
+read -r -a mpi_prefix <<< "$MPIEXEC_FULL"
+"${mpi_prefix[@]}" "$PYTHON_BIN" -u tools/tn_contest_runner.py contract \
+  --mpi \
+  --case "$CASE" \
+  --nqubits "$NQUBITS" \
+  --nlayers "$NLAYERS" \
+  --observables $OBSERVABLES \
+  --tree-dir "$TREE_DIR" \
+  --torch-threads "$TORCH_THREADS" \
+  --dtype "$DTYPE" \
+  "${tn_slice_args[@]}"
diff --git a/tools/tn_contest_runner.py b/tools/tn_contest_runner.py
index 680cecd..40de960 100644
--- a/tools/tn_contest_runner.py
+++ b/tools/tn_contest_runner.py
@@ -199,7 +199,7 @@ def build_parallel_opts(args, tree_file=None, search_only=False):
         "search_workers": args.tn_search_workers or args.torch_threads,
         "max_repeats": args.tn_search_repeats,
         "max_time": args.tn_search_time,
-        "print_stats": not args.no_tn_stats,
+        "print_stats": False,
     }
     if args.tn_search_backend is not None:
         opts["search_backend"] = args.tn_search_backend
@@ -303,7 +303,7 @@ def run_one(args, case_name, obs_name, mode):
             f"failed_trials={search_stats.get('failed_trials', 'na')} "
             f"requested_trials={search_stats.get('requested_trials', 'na')} "
             f"best_score={search_stats.get('best_score', float('nan')):.6g} "
-            f"slices={cost.get('slices')} "
+            f"slices={cost.get('nslices')} "
             f"log10_flops={cost.get('log10_flops', float('nan')):.3f} "
             f"log10_write={cost.get('log10_write', float('nan')):.3f} "
             f"log2_size={cost.get('log2_size', float('nan')):.3f} "
@@ -337,6 +337,11 @@ def apply_case_defaults(args):
 def stop_dask_cluster(args):
     if args.keep_dask or args.tn_search_backend != "dask" or not args.dask_address:
         return
+    if args.mpi:
+        from mpi4py import MPI
+
+        if MPI.COMM_WORLD.Get_rank() != 0:
+            return
     script = ROOT / "tools" / "manage_tn_dask_cluster.sh"
     if not script.exists():
         print(f"dask_stop_skipped reason=missing_script path={script}", flush=True)
diff --git a/trees/contest_tn/main1_long_z_string_34q20l_auto.pkl b/trees/contest_tn/main1_long_z_string_34q20l_auto.pkl
index 76eeedd..55ac205 100644
Binary files a/trees/contest_tn/main1_long_z_string_34q20l_auto.pkl and b/trees/contest_tn/main1_long_z_string_34q20l_auto.pkl differ