Introduction to Archive Tools

I want tools that will do several things for us:

Archive data from tapes or disk to tapes or disk
Allow us to identify files which have become corrupted
Be reasonably quick and not too costly in disk/tape space
Allow us to recover files despite minor corruption. (This is controversial.)
Retrieve efficiently in bulk after a disaster. This is not intended as a competitor to dCache.

The tools serve two purposes: to archive data that is currently on disk and not backed up, and to migrate data from old media to new technologies.

Tools

Version 3.0.8 of rsync scans incrementally instead of first building one giant list of all files to be copied. It also calculates an MD5 checksum for each file it transfers. I modified rsync to write out the checksum and the file name, but because rsync runs several concurrent processes, the output lines are sometimes stepped on.
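To illustrate what the checksum list is for, here is a minimal sketch of building a manifest of MD5 checksums and later using it to identify corrupted files. This is not the rsync modification described above: the function names (md5_of_file, build_manifest, find_corrupted) are hypothetical, and the sketch assumes a single process walks the tree, which sidesteps the problem of output lines being stepped on.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its MD5 digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = md5_of_file(full)
    return manifest


def find_corrupted(root, manifest):
    """Return the sorted relative paths that are missing or whose digest changed."""
    bad = []
    for rel, expected in manifest.items():
        full = os.path.join(root, rel)
        if not os.path.isfile(full) or md5_of_file(full) != expected:
            bad.append(rel)
    return sorted(bad)
```

The manifest would be built once at archive time and stored alongside the archive; running find_corrupted against the archived copy later reports any file whose contents no longer match.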

The dar program allows one to create slices instead of one giant tarball-like archive. The catalog can be extracted separately, and the whole thing can be used together with Parchive, which adds redundancy so that files can be reconstructed despite some corruption.

Tape drives waste tape when many small files are written, and disk drives also perform better with larger chunks aligned to their block boundaries. Bundling small files into larger slices helps on both media.
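As a rough illustration of bundling small files into larger units before writing, the sketch below groups files, in order, into batches of about a target size. The function name batch_by_size and the target size are assumptions for illustration only; dar does its own slicing, so this only shows the idea.

```python
def batch_by_size(files, target_bytes):
    """Group (name, size) pairs, in order, into batches of roughly target_bytes.

    A batch is closed as soon as adding the next file would exceed the target.
    A single file larger than the target gets a batch of its own.
    """
    batches = []
    current = []
    current_size = 0
    for name, size in files:
        if current and current_size + size > target_bytes:
            batches.append(current)
            current = []
            current_size = 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

Keeping the files in their original order means that files from the same directory tend to land in the same batch, which should help when retrieving in bulk.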

Constraints

We are copying from Lustre filesystems or from tape for the first iteration of archiving. A single Lustre connection has limited bandwidth. Lustre shines when many computers are accessing it at the same time, but any one computer gets only a limited stream.

Partitioning the copy into multiple sections that can be farmed out to a processor farm is therefore an important technical issue to solve.
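One possible way to partition the copy, sketched under the assumption that the total size of each top-level directory is already known, is a greedy assignment: take the directories largest first and give each one to whichever farm job currently has the least data. The function name partition_for_farm and its inputs are hypothetical; this is one simple heuristic, not a settled design.

```python
import heapq


def partition_for_farm(sizes_by_dir, n_jobs):
    """Assign directories to n_jobs so that the bytes per job are roughly equal.

    Greedy heuristic: largest directory first, each one going to the job with
    the smallest load so far. Returns (assignment, loads), where assignment[i]
    lists the directories for job i and loads[i] is that job's total size.
    """
    heap = [(0, job) for job in range(n_jobs)]  # (current load, job index)
    heapq.heapify(heap)
    assignment = [[] for _ in range(n_jobs)]
    ordered = sorted(sizes_by_dir.items(), key=lambda kv: (-kv[1], kv[0]))
    for name, size in ordered:
        load, job = heapq.heappop(heap)
        assignment[job].append(name)
        heapq.heappush(heap, (load + size, job))
    loads = [sum(sizes_by_dir[d] for d in dirs) for dirs in assignment]
    return assignment, loads
```

Balancing by bytes rather than by file count matters here because each job gets only a limited stream from Lustre, so the slowest job is the one with the most data.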

Writing to a single archive disk array will be limited by the write speed of that array. This will limit the number of farm jobs that can be active.
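The limit on active farm jobs can be estimated with simple arithmetic: divide the array's sustained write speed by the rate a single job can deliver. The sketch below assumes both rates are measured in the same units; the function name max_useful_jobs and the numbers in the usage note are purely illustrative, not measurements.

```python
def max_useful_jobs(array_write_rate, per_job_rate):
    """Return how many concurrent jobs the array can absorb at full per-job speed.

    Both rates must be in the same units (for example MB/s). At least one job
    is always allowed, even if a single job could saturate the array.
    """
    if array_write_rate <= 0 or per_job_rate <= 0:
        raise ValueError("rates must be positive")
    return max(1, int(array_write_rate // per_job_rate))
```

For example, with a hypothetical array that sustains 400 MB/s and jobs that each read about 60 MB/s from Lustre, max_useful_jobs(400, 60) gives 6; running more jobs than that would only make them wait on the array.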

I don't know of any direct way of having multiple computers write to a single tape drive: that requires an intermediary program. The farm computers don't have direct tape access in any event. Therefore writing to tape means that data has to be staged to disk first.