Low Level Tool Testing

Hardware

Chuma has a disk array divided into four 7.2 TB partitions, which are NFS-served to Sam.

Rsync and dar

The dataset in question is 46 GB. rsync was configured to report the transfer status of each file, which may have slowed it down somewhat.

Step 1: Copy files from lustre to the array.

time ~jbellinger/rsync/rsync-3.0.8/rsync -rLptgoDvWP --inplace /data/exp/IceCube/2005/FAT /data/F00/ >& /mnt/space/testcopy/step1.log

Step 2: Do it again, and compare times

Trial                     Real(min)   User   Sys
1  rsync lustre to disk   130.5       7.5    10
2  rsync lustre to disk   128.5       7.5    10

This is pretty consistent. However, I discovered that the strings containing the checksum and the file name in the log were sometimes not as cleanly separated as I wanted.

Step 3: Use dar to create stripes from the array to the array

export PATH=$PATH:/mnt/space/dar/dar-2.4.7/src/dar_suite
dar -s 1024M -c /data/F02/slices/FAT_Trial -R /data/F01/FAT
This refused to work as a background process! I had to reconnect it to the terminal.
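
If I am reading the dar documentation right, its -Q option is meant for runs without a controlling terminal (it suppresses the interactive warning and aborts rather than asking questions). Something along these lines might avoid the reattach dance; the log path is arbitrary and I have not verified this here:

nohup dar -Q -s 1024M -c /data/F02/slices/FAT_Trial -R /data/F01/FAT > /mnt/space/testcopy/dar_bg.log 2>&1 &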

Trial              Real(min)
dar disk to disk   37

Step 4: Copy stripes from the array to the array, using rsync

time ~jbellinger/rsync/rsync-3.0.8/rsync -rLptgoDvWP --inplace /data/F02/slices /data/F03/ > /mnt/space/testcopy/step3.log 2>/mnt/space/testcopy/step3.err

Trial                        Real(min)
rsync stripes disk to disk   21.5

Step 5: Copy files from the array to the array, using rsync

time ~jbellinger/rsync/rsync-3.0.8/rsync -rLptgoDvWP --inplace /data/F01/FAT /data/F03/ > /mnt/space/testcopy/step4.log 2>/mnt/space/testcopy/step4.err

Trial                      Real(min)
rsync files disk to disk   53.5

First conclusion: the combination of using dar to create slices and then rsync to copy the slices is about 10% slower than using rsync to copy the individual files. In other words, the times are comparable. The overhead of writing many small files is tangible.
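
For reference, the arithmetic behind the 10% figure, using the times from the tables above:

echo "scale=3; (37 + 21.5) / 53.5" | bc    # dar + rsync of slices vs rsync of files: ~1.09, i.e. about 9-10% slower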

Step 6: Copy files from lustre to stripes in the array, using dar

export PATH=$PATH:/mnt/space/dar/dar-2.4.7/src/dar_suite
dar -s 1024M -c /data/F03/slices2/slices2  -R /data/exp/IceCube/2005/FAT

Trial                Real(min)
dar lustre to disk   110.5

There are several unknowns. Since rsync writes so much to the log file, I'm assuming that this dominates its processing time. It also calculates a checksum, and that may not be trivial.

Step 7: Check dar copy NFS stripe to NFS stripe with different block size

dar -s 2048M -c /data/F03/slices3/FAT_Trial -R /data/F02/slices

Trial                                  Real(min)
dar NFS to NFS disk, new stripe size   26

Step 8: Check rsync NFS file copy without checksum or terminal logging. Note the absence of the -P and -v options.

time ~jbellinger/rsync/rsync-3.0.8/rsync -rLptgoDW --inplace /data/F01/FAT /data/F02/FATX/ > /mnt/space/testcopy/step8.log 2>/mnt/space/testcopy/step8.err

Trial                                Real(min)
rsync NFS to NFS disk, non-verbose   51.5

The first test showed a 2-minute spread out of 130 minutes. Apparently the verbosity doesn't change the time here by more than a couple of minutes out of 53.5, i.e. about 4%.

Name   Description
RL     Time required to read the files from lustre
RD     Time required to read the files from the NFS array
RS     Time required to read the stripes from the NFS array
WD     Time required to write the files to the NFS array
WS     Time required to write the stripes to the NFS array
P      dar processing time
R      rsync processing time (dominated by log output?) for files
0      rsync processing time for the stripes, presumed negligible

From the above we have several equations (times in minutes, rounded):

RL + (R  + WD) = 130    (step 2: rsync, lustre to disk)
RD + (R  + WD) = 54     (step 5: rsync, files disk to disk)
RD + (P  + WS) = 37     (step 3: dar, files to stripes, disk to disk)
RS + WS        = 22     (step 4: rsync, stripes disk to disk)
RL + (P  + WS) = 110    (step 6: dar, lustre to stripes)
RS + (P' + WS) = 26     (step 7: dar, stripes to stripes, 2048M slices)

From this we quickly see that reading the files from the NFS disk takes 76 minutes less than reading them from lustre. The processing time P' for dar reading large stripes as input adds only 4 minutes over rsync copying them. RD + (P + WS) is in the range 34-37 minutes (34 inferred from the lustre dar run after subtracting the 76-minute lustre read penalty, 37 measured directly). The difference between (R + WD) and (P + WS) is therefore about 17 minutes.

Since P is presumably not smaller than P', P ≥ 4 minutes, and from the verbose/non-verbose comparison in step 8, R ≥ 2 minutes; so the contribution due to writing to disk is of order 15-19 minutes larger when writing 118,000 small files than when writing 46 large files. Writing to a RAID array like this takes longer than reading, by a factor of about 2 or so. From RS + WS = 22 that implies roughly 14 minutes of write and 7 of read (assuming the program isn't clever), which is not badly inconsistent with the previous estimate.
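
The subtractions, spelled out (all numbers in minutes, straight from the equations above):

echo "130 - 54" | bc    # RL - RD: lustre read penalty = 76
echo "26 - 22" | bc     # P': dar overhead when the input is large stripes = 4
echo "110 - 76" | bc    # RD + (P + WS) inferred from the lustre dar run = 34 (measured: 37)
echo "54 - 37" | bc     # (R + WD) - (P + WS) = 17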

So writing stripes runs at roughly 46 GB in 15 minutes, about 3 GB/min, and writing files at roughly 46 GB in 32 minutes, about 1.4 GB/min. This is all over NFS.
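
The same rates as explicit arithmetic:

echo "scale=1; 46/15" | bc    # stripes: ~3 GB/min
echo "scale=1; 46/32" | bc    # files:   ~1.4 GB/min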


rsync 3.0.8 starts copying quickly and does not maintain a monster memory footprint. dar also starts copying quickly, but it keeps O(650 bytes/file) of state, which for (e.g.) /net/user/aura would demand 50 GB of main memory. Not happening. Therefore dar can only be used on an already-partitioned set of files. For example, a job archiving /net/user/aura copied 27 GB in 100 minutes while eating (by the end) 21% of the 8 GB of memory on sam, and it wasn't anywhere near done.
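
Taking the 650 bytes/file figure at face value, 50 GB of dar state corresponds to on the order of 10^8 files:

echo "50 * 10^9 / 650" | bc    # roughly 77 million files before dar's per-file tables reach 50 GB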

I can try rsync | tar | split | parchive

That doesn't work because par2 wants the filenames specified. Also, rsync doesn't like to write to a pipe: it can't read the destination first to check whether the file already exists (the "sync" in rsync is for synchronization).

BFI is possible: write all the files into one big file and process that. Lots of latency, lots of overhead, and do you scramble the file order to try to spread the lustre server overhead around? Or not?

Or do I need a custom program that uses an incremental scan for files, checksums them on the fly, and creates archive files of the desired size? And writes an index stream and uses Reed-Solomon redundancy to the degree specified?

How long would it take to create such a program? Maybe a week to steal the incremental scan from rsync, a week to steal the checksum from rsync, a week to devise a blocking scheme, a week to steal code from cp (mostly error handling), and a week to steal the Reed-Solomon redundancy code: 5 weeks. Double that and use the next higher unit: 10 months. Not affordable. And I didn't specify the unpacking program... Staging can simplify the coding, since the output blocks can be par2'd in a script (see the sketch below).
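
A minimal sketch of that staged approach, assuming GNU tar/split and the par2 command-line tool; the chunk size, the /data/F03/bfi directory, and the 10% redundancy level are illustrative, not tested:

tar -cf - /data/exp/IceCube/2005/FAT | split -b 1024M - /data/F03/bfi/FAT_chunk_
for f in /data/F03/bfi/FAT_chunk_*; do
    par2 create -r10 "$f.par2" "$f"    # ~10% Reed-Solomon redundancy per chunk
done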

cpio has an 8GB file size limit.

pax? It has a useless checksum (of the header block).

Incremental search: has state; given a location to search, it discovers files until a file count is reached, fills a buffer with the file names and paths, hands this to the copier, and waits until it is restarted. Possible errors: failure to read. Possible states: more files, no more files.

Copier: reads files individually (default block size?), checksums them on the fly, and writes the result into a chunk buffer. File names, sizes, checksums, and chunk file names go into a list which, when it reaches a given size, is written out to an index file. When the chunk buffer fills, the chunk is written out (a file may span chunks). Redundancy is added here at the specified level. Possible errors: failure to read a file, failure to write a chunk file, failure to write the index, out of memory, overflow of the index buffer, user interrupt. A chunk has file indexing information, header info, and file contents.

Restoration tool: uses the index to find the chunks a file needs, reads each chunk and verifies it (or repairs the original chunk from the redundancy), then extracts the relevant portion of the file and writes it out, using other chunks if required.
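
While the custom program is only a design, much of the scanner/copier behaviour can be approximated with existing tools: batch the file list, checksum each batch into an index, pack each batch into a chunk, and par2 the chunk. A rough bash sketch, with arbitrary batch size, paths, and redundancy level; it assumes file names contain no whitespace, and it does not split files across chunks the way the design above would:

find /data/exp/IceCube/2005/FAT -type f | split -l 2000 - /tmp/batch_    # the "incremental search", in batches
n=0
for list in /tmp/batch_*; do
    chunk=$(printf '/data/F03/chunks/chunk_%05d' $n)
    xargs -a "$list" sha1sum >> /data/F03/chunks/index.txt    # per-file checksums for the index
    tar -cf "$chunk.tar" -T "$list"                           # pack this batch into one chunk file
    par2 create -r10 "$chunk.tar.par2" "$chunk.tar"           # Reed-Solomon redundancy per chunk
    n=$((n+1))
done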