The Design and Implementation of Berkeley Lab's Linux Checkpoint/Restart
Clusters of commodity computers running Linux are becoming an increasingly popular platform for high-performance computing, as they provide the best price/performance ratio in the marketplace. But while the size and raw power of Linux clusters continue to increase, many aspects of their software environments continue to lag behind those provided by proprietary supercomputing systems. One feature missing from Linux clusters is a robust, kernel-level checkpoint/restart implementation that can support a wide variety of parallel scientific codes. This paper describes Berkeley Lab’s Linux Checkpoint/Restart project (BLCR), which seeks to provide a foundation for this capability. BLCR can be used either as a standalone system for checkpointing applications on a single machine, or as a component used by a scheduling system or parallel communication library for checkpointing and restoring parallel jobs running on multiple machines.
http://ftg.lbl.gov/CheckpointRestart/blcr.pdf...
