task snapshot mechanism
Javier Martinez Canillas
martinez.javier at gmail.com
Sun Apr 17 07:51:04 EDT 2011
On Sun, Apr 17, 2011 at 7:41 AM, Tharindu Rukshan Bamunuarachchi
<btharindu at gmail.com> wrote:
> hi all,
> has anyone heard about or used task snapshot mechanism for Linux ?
> what i mean by process hibernation ... stop process , take snapshot of
> current state and later start/continue from the point of snapshot. (in
> case of failure of original process)
I work in a research group in the HPC field. Our group develops many
tools that use process checkpoint/restart (CR). The people here use
three CR mechanisms that I'm aware of:
1- Berkeley Lab Checkpoint/Restart (BLCR)
- Probably the most robust CR framework for Linux. It is a hybrid
kernel/user-space implementation.
- You can compile the OpenMPI message-passing library with BLCR
support to checkpoint distributed applications, which is very useful
in HPC.
- Development seems to be slowing down. The last official release is
0.8.2 (June 16, 2009), which supports kernels up to 2.6.30 (pretty
old). There are some patches floating around on the development
mailing list to compile against newer kernels, but I think they only
add support up to 2.6.34.
- You need root permissions to insert the blcr kernel module. One of
our tools used BLCR and we couldn't run it on many clusters because the
sysadmins were skeptical about inserting a kernel module with a few
random patches published on a mailing list.
2- DMTCP: Distributed MultiThreaded CheckPointing
- A completely user-space solution. You don't need to bother the
sysadmins to install kernel modules.
- Can checkpoint distributed computations (we already tried it with
OpenMPI and it also checkpoints the orte daemon).
- There is ongoing work to integrate DMTCP into OpenMPI, so that
parallel applications can be checkpointed from OpenMPI as an
alternative to BLCR.
- Since it is implemented in user space, it needs a lot of workarounds
to maintain process state there.
- It duplicates kernel-space process information.
- It only works with socket-based communication (it doesn't work with
proprietary InfiniBand protocols, for example).
3- Linux-cr checkpoint/restart mechanism
- The checkpoint/restart mechanism is implemented in the kernel as
syscalls, plus some user-space tools.
- Their intention is to push the mechanism upstream for kernel inclusion.
- Since their implementation is kernel-based, it is very robust.
- The patch set has not been accepted for kernel inclusion yet, and the
whole subject is complicated. Not all kernel developers agree that
implementing CR in the kernel is a good idea.
- You need a custom kernel that has linux-cr support.
So which CR mechanism you choose will depend on many factors (whether
you control the machine, whether your application only uses sockets,
whether you can boot a custom kernel, etc.).
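As a rough sketch of how those factors narrow the choice: the Site
fields and the candidate_mechanisms name below are made up for
illustration, and the rules only encode the constraints listed above
(they ignore kernel versions and everything else).

```python
from dataclasses import dataclass


@dataclass
class Site:
    """Hypothetical description of the machine and application."""
    can_load_kernel_module: bool  # root access to insert the blcr module
    can_boot_custom_kernel: bool  # able to run a kernel with linux-cr patches
    sockets_only: bool            # application communicates only over sockets


def candidate_mechanisms(site):
    """Return the CR mechanisms that are not ruled out by the constraints."""
    candidates = []
    if site.can_load_kernel_module:
        candidates.append("BLCR")      # needs its kernel module inserted
    if site.sockets_only:
        candidates.append("DMTCP")     # user space, socket-based only
    if site.can_boot_custom_kernel:
        candidates.append("linux-cr")  # needs a patched kernel
    return candidates


if __name__ == "__main__":
    # A shared cluster where you have no root and use plain sockets.
    shared_cluster = Site(
        can_load_kernel_module=False,
        can_boot_custom_kernel=False,
        sockets_only=True,
    )
    print(candidate_mechanisms(shared_cluster))
```

An empty list would mean that none of the three mechanisms fits as
described here.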
Hope it helps.
Javier Martínez Canillas
(+34) 682 39 81 69
PhD Student in High Performance Computing
Computer Architecture and Operating System Department (CAOS)
Universitat Autònoma de Barcelona