From b3b8283f87211bc48c6bb92e6d9afc27b416f9e6 Mon Sep 17 00:00:00 2001 From: Masamichi Takagi Date: Wed, 6 Nov 2019 13:00:24 +0900 Subject: [PATCH] Add NEWS.md Change-Id: Iecf193e3d5dac57f87ef8db2f43add5fb99f6a6e --- NEWS.md | 516 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 516 insertions(+) create mode 100644 NEWS.md diff --git a/NEWS.md b/NEWS.md new file mode 100644 index 00000000..7f9c1808 --- /dev/null +++ b/NEWS.md @@ -0,0 +1,516 @@ +=========================================== +What's new in V1.7.0 (Nov 15, 2019) +=========================================== + +---------------------- +McKernel major updates +---------------------- +1. arm64: Contiguous PTE support +2. arm64: Scalable Vector Extension (SVE) support +3. arm64 port: Direct access to Mckernel memory from Linux +4. arm64 port: utility thread offloading, which spawns thread onto + Linux CPU +5. Crash utility extension +6. Replace mcoverlayfs with a soft userspace overlay +7. Build system is switched to cmake +8. Core dump includes thread information + +------------------------ +McKernel major bug fixes +------------------------ +1. shmobj: Fix rusage counting for large page +2. mcctrl control: task start_time changed to u64 nsec +3. mcctrl: add handling for one more level of page tables +4. Add kernel argument to turn on/off time sharing +5. flatten_string/process env: realign env and clear trailing bits +6. madvise: Add MADV_HUGEPAGE support +8. mcctrl: remove in-kernel calls to syscalls +9. arch_cpu_read_write_register: error return fix. +10. set_cputime(): interrupt enable/disable fix. +11. set_mempolicy(): Add mode check. +12. mbind(): Fix memory_range_lock deadlock. +13. ihk_ikc_recv: Record channel to packet for release +14. Add set_cputime() kernel to kernel case and mode enum. +15. execve: Call preempt_enable() before error-exit +16. memory/x86_64: fix linux safe_kernel_map +17. do_kill(): fix pids table when nr of threads is larger than num_processors +18. shmget: Use transparent huge pages when page size isn't specified +19. prctl: Add support for PR_SET_THP_DISABLE and PR_GET_THP_DISABLE +20. monitor_init: fix undetected hang on highest numbered core +21. init_process_stack: change premapped stack size based on arch +22. x86 syscalls: add a bunch of XXat() delegated syscalls +23. do_pageout: fix direct kernel-user access +24. stack: add hwcap auxval +25. perf counters: add arch-specific perf counters +26. Added check of nohost to terminate_host(). +27. kmalloc: Fix address order in free list +28. sysfs: use nr_cpu_ids for cpumasks (fixes libnuma parsing error on ARM) +29. monitor_init: Use ihk_mc_cpu_info() +30. Fix ThunderX2 write-combined PTE flag insanity +31. ARM: eliminate zero page mapping (i.e, init_low_area()) +32. eliminate futex_cmpxchg_enabled check (not used and dereffed a NULL pointer) +33. page_table: Fix return value of lookup_pte when ptl4 is blank +34. sysfs: add missing symlinks for cpu/node +35. Make Linux handler run when mmap to procfs. +36. Separate mmap area from program loading (relocation) area +37. move rusage into kernel ELF image (avoid dynamic alloc before NUMA init) +38. arm: turn off cpu on panic +39. page fault handler: protect thread accesses +40. Register PPD and release_handler at the same time. +41. fix to missing exclusive processing between terminate() and + finalize_process(). +42. perfctr_stop: add flags to no 'disable_intens' +43. fileobj, shmobj: free pages in object destructor (as opposed to page_unmap()) +44. clear_range_l1, clear_range_middle: Fix handling contiguous PTE +45. do_mmap: don't pre-populate the whole file when asked for smaller segment +46. invalidate_one_page: Support shmobj and contiguous PTE +47. ubsan: fix undefined shifts +48. x86: disable zero mapping and add a boot pt for ap trampoline +49. rusage: Don't count PF_PATCH change +50. Fixed time processing. +51. copy_user_pte: vmap area not owned by McKernel +52. gencore: Zero-clear ELF header and memory range table +53. rpm: ignore CMakeCache.txt in dist and relax BuildRequires on cross build +54. gencore: Allocate ELF header to heap instead of stack +55. nanosleep: add cpu_pause() in spinwait loop +56. init_process: add missing initializations to proc struct +57. rus_vm_fault: always use a packet on the stack +58. process stack: use PAGE_SIZE in aux vector +59. copy_user_pte: base memobj copy on range & VR_PRIVATE +60. arm64: ptrace: Fix overwriting 1st argument with return value +61. page fault: use cow for private device mappings +62. reproductible builds: remove most install paths in c code +63. page fault: clear writable bit for non-dirtying access to shared ranges +64. mcreboot/mcstop+release: support for regular user execution +65. irqbalance_mck: replace extra service with service drop-in +66. do_mmap: give addr argument a chance even if not MAP_FIXED +67. x86: fix xchg() and cmpxchg() macros +68. IHK: support for using Linux work IRQ as IKC interrupt (optional) +69. MCS: fix ARM64 issue by using smp_XXX() functions (i.e., barrier()s) +70. procfs: add number of threads to stat and status +71. memory_range_lock: Fix deadlock in procfs/sysfs handler +72. flush instruction cache at context switch time if necessary +73. arm64: Fix PMU related functions +74. page_fault_process_memory_range: Disable COW for VM region with zeroobj +75. extend_process_region: Fall back to demand paging when not contiguous + +=========================================== +What's new in V1.6.0 (Nov 11, 2018) +=========================================== + +----------------------------------------------- +McKernel new features, improvements and changes +----------------------------------------------- +1. McKernel and Linux share one unified kernel virtual address space. + That is, McKernel sections resides in Linux sections spared for + modules. In this way, Linux can access the McKernel kernel memory + area. +2. hugetlbfs support +3. IHK is now included as a git submodule +4. Debug messages are turned on/off in per souce file basis at run-time. +5. It's prohibited for McKernel to access physical memory ranges which + Linux didn't give to McKernel. +6. UTI (capability to spawn a thread on Linux CPU) improvement: + * System calls issued from the thread are hooked by modifying + binary in memory. + +--------------------------- +McKernel bug fixes (digest) +--------------------------- +# below corresponds to the redmine issue number +(https://postpeta.pccluster.org/redmine/). + +1. #926: shmget: Hide object with IPC_RMID from shmget +2. #1028: init_process: Inherit parent cpu_set +3. #995: Fix shebang recorded in argv[0] +4. #1024: Fix VMAP virtual address leak +5. #1109: init_process_stack: Support "ulimit -s unlimited" +6. x86 mem init: do not map identity mapping +7. mcexec_wait_syscall: requeue potential request on interrupted wait +8. mcctrl_ikc_send_wait: fix interrupt with do_frees == NULL +9. pager_req_read: handle short read +10. kprintf: only call eventfd() if it is safe to interrupt +11. process_procfs_request: Add Pid to /proc//status +12. terminate: fix oversubscribe hang when waiting for other threads on same CPU to die +13. mcexec: Do not close fd returned to mckernel side +14. #976: execve: Clear sigaltstack and fp_regs +15. #1002: perf_event: Specify counter by bit_mask on start/stop +16. #1027: schedule: Don't reschedule immediately when wake up on migrate +17. #mcctrl: lookup unexported symbols at runtime +18. __sched_wakeup_thread: Notify interrupt_exit() of re-schedule +19. futex_wait_queue_me: Spin-sleep when timeout and idle_halt is specified +20. #1167: ihk_os_getperfevent,setperfevent: Timeout IKC sent by mcctrl +21. devobj: fix object size (POSTK_DEBUG_TEMP_FIX_36) +22. mcctrl: remove rus page cache +23. #1021: procfs: Support multiple reads of e.g. /proc/*/maps +24. #1006: wait: Delay wake-up parent within switch context +25. #1164: mem: Check if phys-mem is within the range of McKernel memory +26. #1039: page_fault_process_memory_range: Remove ihk_mc_map_virtual for CoW of device map +27. partitioned execution: pass process rank to LWK +28. process/vm: implement access_ok() +29. spinlock: rewrite spinlock to use Linux ticket head/tail format +30. #986: Fix deadlock involving mmap_sem and memory_range_lock +31. Prevent one CPU from getting chosen by concurrent forks +32. #1009: check_signal: system call restart is done only once +33. #1176: syscall: the signal received during system call processing is not processed. +34. #1036 syscall_time: Handle by McKernel +35. #1165 do_syscall: Delegate system calls to the mcexec with the same pid +36. #1194 execve: Fix calling ptrace_report_signal after preemption is disabled +37. #1005 coredump: Exclude special areas +38. #1018 procfs: Fix pread/pwrite to procfs fail when specified size is bigger than 4MB +39. #1180 sched_setaffinity: Check migration after decrementing in_interrupt +40. #771, #1179, #1143 ptrace supports threads +41. #1189 procfs/do_fork: wait until procfs entries are registered +42. #1114 procfs: add '/proc/pid/stat' to mckernel side and fix its comm +43. #1116 mcctrl procfs: check entry was returned before using it +44. #1167 ihk_os_getperfevent,setperfevent: Return -ETIME when IKC timeouts +45. mcexec/execve: fix shebangs handling +46. procfs: handle 'comm' on mckernel side +47. ihk_os_setperfevent: Return number of registered events +48. mcexec: fix terminating zero after readlink() + +=========================================== +What's new in V1.5.1 (July 9, 2018) +=========================================== + +----------------------------------------------- +McKernel new features, improvements and changes +----------------------------------------------- +1. Watchdog timer to detect hang of McKernel + mcexec prints out the following line to its stderr when a hang of + McKernel is detected. + + mcexec detected hang of McKernel + + The watchdog timer is enabled by passing -i option + to mcreboot.sh. specifies the interval of checking + if McKernel is alive. + Example: mcreboot.sh -i 600: Detect the hang with 10 minutes interval + + The detailed step of the hang detection is as follows. + (1) mcexec acquires eventfd for notification from IHK and perform + epoll() on it. + (2) A daemon called ihkmond monitors the state of McKernel periodically + with the interval specified by the -i option. It judges that + McKernel is hanging and notifies mcexec by the eventfd if its + state hasn't changed since the last check. + +2. Documentation + man page: Installed directory is changed to /share/man + +--------------------------- +McKernel bug fixes (digest) +--------------------------- +1. #1146: pager_req_map(): do not take mmap_sem if not needed +2. #1135: prepare_process_ranges_args_envs(): fix saving cmdline +3. #1144: fileobj/devobj: record path name +4. #1145: fileobj: use MCS locks for per-file page hash +5. #1076: mcctrl: refactor prepare_image into new generic ikc send&wait +6. #1072: execve: fix execve with oversubscribing +7. #1132: execve: use thread variable instead of cpu_local_var(current) +8. #1117: mprotect: do not set page table writable for cow pages +9. #1143: syscall wait4: add _WALL (POSTK_DEBUG_ARCH_DEP_44) +10. #1064: rusage: Fix initialization of rusage->num_processors +11. #1133: pager_req_unmap: Put per-process data at exit +12. #731: do_fork: Propagate error code returned by mcexec +13. #1149: execve: Reinitialize vm_regions's map area on execve +14. #1065: procfs: Show file names in /proc//maps +15. #1112: mremap: Fix type of size arguments (from ssize_t to size_t) +16. #1121: sched_getaffinity: Check arguments in the same order as in Linux +17. #1137: mmap, mremap: Check arguments in the same order as in Linux +18. #1122: fix return value of sched_getaffinity +19. #732: fix: /proc//maps outputs a unnecessary NULL character + +=================================== +What's new in V1.5.0 (Apr 5, 2018) +=================================== + +-------------------------------------- +McKernel new features and improvements +-------------------------------------- +1. Aid for Linux version migration: Detect /proc, /sys format change + between two kernel verions +2. Swap out + * Only swap-out anonymous pages for now +3. Improve support of /proc/maps +4. mcstat: Linux tool to show resource usage + +--------------------------- +McKernel bug fixes (digest) +--------------------------- +1. #727: execve: Fix memory leak when receiving SIGKILL +2. #829: perf_event_open: Support PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE +3. #906: mcexec: Check return code of fork() +4. #1038: mcexec: Timeout when incorrect value is given to -n option +5. #943 #945 #946 #960 $961: mcexec: Support strace +6. #1029: struct thread is not released with stress-test involving signal + and futex +7. #863 #870: Respond immediately to terminating signal when + offloading system call +8. #1119: translate_rva_to_rpa(): use 2MB blocks in 1GB pages on x86 +11. #898: Shutdown OS only after no in-flight IKC exist +12. #882: release_handler: Destroy objects as the process which opened it +13. #882: mcexec: Make child process exit if the parent is killed during + fork() +14. #925: XPMEM: Don't destroy per-process object of the parent +15. #885: ptrace: Support the case where a process attaches its child +16. #1031: sigaction: Support SA_RESETHAND +17. #923: rus_vm_fault: Return error when a thread not performing + system call offloading causes remote page fault +18. #1032 #1033 #1034: getrusage: Fix ru_maxrss, RUSAGE_CHILDREN, + ru_stime related bugs +19. #1120: getrusage: Fix deadlock on thread->times_update +20. #1123: Fix deadlock related to wait_queue_head_list_node +21. #1124: Fix deadlock of calling terminate() from terminate() +22. #1125: Fix deadlock related to thread status + * Related functions are: hold_thread(), do_kill() and terminate() +23. #1126: uti: Fix uti thread on the McKernel side blocks others in do_syscall() +24. #1066: procfs: Show Linux /proc/self/cgroup +25. #1127: prepare_process_ranges_args_envs(): fix generating saved_cmdline to + avoid PF in strlen() +26. #1128: ihk_mc_map/unmap_virtual(): do proper TLB invalidation +27. #1043: terminate(): fix update_lock and threads_lock order to avoid deadlock +28. #1129: mcreboot.sh: Save /proc/irq/*/smp_affinity to /tmp/mcreboot +29. #1130: mcexec: drop READ_IMPLIES_EXEC from personality + +-------------------- +McKernel workarounds +-------------------- +1. Forbid CPU oversubscription + * It can be turned on by mcreboot.sh -O option + + +=================================== +What's new in V1.4.0 (Oct 30, 2017) +=================================== + +----------------------------------------------------------- +Feature: Abstracted event type support in perf_event_open() +----------------------------------------------------------- +PERF_TYPE_HARDWARE and PERF_TYPE_CACHE types are supported. + +---------------------------------- +Clean-up: Direct user-space access +---------------------------------- +Code lines using direct user-space access (e.g. passing user-space +pointer to memcpy()) becomes more portable across processor +architectures. The modification follows the following rules. + +1. Move the code section as it is to the architecture dependent + directory if it is a part of the critical-path. +2. Otherwise, rewrite the code section by using the portable methods. + The methods include copy_from_user(), copy_to_user(), + pte_get_phys() and phys_to_virt(). + +-------------------------------- +Test: MPI and OpenMP micro-bench +-------------------------------- +The performance figures of MPI and OpenMP primitives are compared with +those of Linux by using Intel MPI Benchmarks and EPCC OpenMP Micro +Benchmark. + + +=================================== +What's new in V1.3.0 (Sep 30, 2017) +=================================== + +-------------------- +Feature: Kernel dump +-------------------- +1. A dump level of "only kernel memory" is added. + +The following two levels are available now: + 0: Dump all + 24: Dump only kernel memory + +The dump level can be set by -d option in ihkosctl or the argument +for ihk_os_makedumpfile(), as shown in the following examples: + + Command: ihkosctl 0 dump -d 24 + Function call: ihk_os_makedumpfile(0, NULL, 24, 0); + +2. Dump file is created when Linux panics. + +The dump level can be set by dump_level kernel argument, as shown in the +following example: + + ihkosctl 0 kargs "hidos dump_level=24" + +The IHK dump function is registered to panic_notifier_list when creating +/dev/mcdX and called when Linux panics. + +----------------------------- +Feature: Quick Process Launch +----------------------------- + +MPI process launch time and some of the initialization time can be +reduced in application consisting of multiple MPI programs which are +launched in turn in the job script. + +The following two steps should be performed to use this feature: +1. Replace mpiexec with ql_mpiexec_start and add some lines for + ql_mpiexec_finalize in the job script +2. Modify the app so that it can repeat calculations and wait for the + instructions from ql_mpiexec_{start,finalize} at the end of the + loop + +The first step is explained using an example. Assume the original job +script looks like this: + +/* Execute ensamble simulation and then data assimilation, and repeat this + ten times */ +for i in {1..10}; do + + /* Each ensamble simulation execution uses 100 nodes, launch ten of them + in parallel */ + for j in {1..10}; do + mpiexec -n 100 -machinefile ./list1_$j p1.out a1 & pids[$i]=$!; + done + + /* Wait until the ten ensamble simulation programs finish */ + for j in {1..10}; do wait ${pids[$j]}; done + + /* Launch one data assimilation program using 1000 nodes */ + mpiexec -n 1000 -machinefile ./list2 p2.out a2 +done + +The job script should be modified like this: + +for i in {1..10}; do + for j in {1..10}; do + /* Replace mpiexec with ql_mpiexec_start */ + ql_mpiexec_start -n 100 -machinefile ./list1_$j p1.out a1 & pids[$j]=$!; + done + + for j in {1..10}; do wait ${pids[$j]}; done + + ql_mpiexec_start -n 1000 -machinefile ./list2 p2.out a2 +done + +/* p1.out and p2.out don't exit but are waiting for the next calculation. + So tell them to exit */ +for j in {1..10}; do + ql_mpiexec_finalize -machinefile ./list1_$i p1.out a1; +done +ql_mpiexec_finalize -machinefile ./list2 p2.out a2; + + +The second step is explained using a pseudo-code. + + MPI_Init(); + Prepare data exchange with preceding / following MPI programs +loop: + foreach Fortran module + Initialize data using command-line argments, parameter files, + environment variables + Input data from preceding MPI programs / Read snap-shot + Perform main calculation + Output data to following MPI programs / Write snap-shot + /* ql_client() waits for command of ql_mpiexec_{start,finish} */ + if (ql_client() == QL_CONTINUE) { goto loop; } + MPI_Finalize(); + +qlmpilib.h should be included in the code and libql{mpi,fort}.so +should be linked to the executable file. + + +======================== +Restrictions on McKernel +======================== + + 1. Pseudo devices such as /dev/mem and /dev/zero are not mmap()ed + correctly even if the mmap() returns a success. An access of their + mapping receives the SIGSEGV signal. + + 2. clone() supports only the following flags. All the other flags + cause clone() to return error or are simply ignored. + + * CLONE_CHILD_CLEARTID + * CLONE_CHILD_SETTID + * CLONE_PARENT_SETTID + * CLONE_SETTLS + * CLONE_SIGHAND + * CLONE_VM + + 3. PAPI has the following restriction. + + * Number of counters a user can use at the same time is up to the + number of the physical counters in the processor. + + 4. msync writes back only the modified pages mapped by the calling process. + + 5. The following syscalls always return the ENOSYS error. + + * migrate_pages() + * move_pages() + * set_robust_list() + + 6. The following syscalls always return the EOPNOTSUPP error. + + * arch_prctl(ARCH_SET_GS) + * signalfd() + + 7. signalfd4() returns a fd, but signal is not notified through the + fd. + + 8. set_rlimit sets the limit values but they are not enforced. + + 9. Address randomization is not supported. + +10. brk() extends the heap more than requestd when -h + (--extend-heap-by=) option of mcexec is used with the value + larger than 4 KiB. syscall_pwrite02 of LTP would fail for this + reason. This is because the test expects that the end of the heap + is set to the same address as the argument of sbrk() and expects a + segmentation violation occurs when it tries to access the memory + area right next to the boundary. However, the optimization sets + the end to a value larger than the requested. Therefore, the + expected segmentation violation doesn't occur. + +11. setpriority()/getpriority() won't work. They might set/get the + priority of a random mcexec thread. This is because there's no + fixed correspondence between a McKernel thread which issues the + system call and a mcexec thread which handles the offload request. + +12. mbind() can set the policy but it is not used when allocating + physical pages. + +13. MPOL_F_RELATIVE_NODES and MPOL_INTERLEAVE flags for + set_mempolicy()/mbind() are not supported. + +14. The MPOL_BIND policy for set_mempolicy()/mbind() works as the same + as the MPOL_PREFERRED policy. That is, the physical page allocator + doesn't give up the allocation when the specified nodes are + running out of pages but continues to search pages in the other + nodes. + +15. Kernel dump on Linux panic requires Linux kernel CentOS-7.4 and + later. In addition, crash_kexec_post_notifiers kernel argument + must be given to Linux kernel. + +16. setfsuid()/setfsgid() cannot change the id of the calling thread. + Instead, it changes that of the mcexec worker thread which takes + the system-call offload request. + +17. mmap (hugeTLBfs): The physical pages corresponding to a map are + released when no McKernel process exist. The next map gets fresh + physical pages. + +18. Sticky bit on executable file has no effect. + +19. Linux (RHEL-7 for x86_64) could hang when offlining CPUs in the + process of booting McKernel due to the Linux bug, found in + Linux-3.10 and fixed in the later version. One way to circumvent + this is to always assign the same CPU set to McKernel. + +20. madvise: + * MADV_HWPOISON and MADV_SOFT_OFFLINE always returns -EPERM. + * MADV_MERGEABLE and MADV_UNMERGEABLE always returns -EINVAL. + * MADV_HUGEPAGE and MADV_NOHUGEPAGE on file map returns -EINVAL + (It succeeds on RHEL-8 for aarch64). + +21. brk() and mmap() doesn't report out-of-memory through its return + value. Instead, page-fault reports the error. + +22. anonymous mmap pre-maps requested number of pages when contiguous + pages are available. Demand paging is used when not available. \ No newline at end of file