Copying very large files with error handling


Once in a while I need to copy multi-gigabyte files to one of the servers over my very slow internet connection. This means a copy can run for days, saturating my network, and any failure means redoing the whole copy from scratch. So, for example, if you used normal scp

scp bigfile username@server.example.com:

and the connection failed when you were 99% done, you would have to start all over again.

My perfect solution would be some scheme where, on failure, the data already copied to the server is validated and the copy resumes from that point. This lets me stop the copy manually at any time, get my network back at full speed, and start the copy again later with minimal recopying.

I'm sure there are many solutions, probably some better than this one, such as some kind of super copy with resume. But the following works for me: I use the unix split command to split the file into smaller pieces, copy the pieces over with rsync, then join them back together on the target using cat.

Note: this requires twice the file's size in free space on both the source and the target, since we are splitting the large file into many smaller pieces, copying them, then joining them back together.
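
If you want to make sure there is enough free space before starting, a quick check on both ends could look something like this (adjust /tmp to wherever your temporary directory actually lives):

ls -lh bigfile
df -h /tmp
ssh server.example.com 'df -h /tmp ~'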

Assuming the file to be copied is named bigfile and the target server is server.example.com, the following works. I create a temporary directory on both source and target named /tmp/copying.

mkdir -p /tmp/copying
split -n676 bigfile /tmp/copying/bigfile.
rsync -avc --progress /tmp/copying server.example.com:/tmp
ssh server.example.com 'cat /tmp/copying/bigfile* > ~/bigfile'
ssh server.example.com 'rm -fRv /tmp/copying'
rm -fRv /tmp/copying

The first line simply creates the requisite local working directory. This directory must be on a partition large enough to hold the entire file, in 676 chunks.

The second line splits bigfile into 676 chunks of approximately equal size, storing them as /tmp/copying/bigfile.aa through /tmp/copying/bigfile.zz. The suffixes aa through zz give 676 possibilities, i.e. 26 × 26.
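
If you want to sanity-check the split before starting the copy, you can count the chunks and compare the totals (du measures disk usage, so the numbers will only roughly match the file size):

ls /tmp/copying/bigfile.* | wc -l     # should print 676
du -ch /tmp/copying/bigfile.* | tail -n1
du -h bigfile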

Line three is the workhorse. It may be started and stopped at will. Stopping the rsync command and then restarting it will first verify that all files already copied to the server are correct (the -c option does a checksum match), then begin copying any files which have not yet been transferred successfully. Thus, the worst case is stopping the copy just before the last byte of one of the chunks, in which case that entire chunk is recopied, but that is only 1/676th of the total file size, so it helps. Note that the --progress flag gives more information about how far along each file's sync is, and that the -c option takes a lot of time every time the sync is restarted. I generally do NOT use -c while copying, but do one final pass with it at the end to make sure the copies succeeded correctly.
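
In practice my copy pass ends up looking something like the sketch below, with the exact same paths as above; the only difference is when -c gets used:

# run, and re-run after any interruption; no -c, so restarts are quick
rsync -av --progress /tmp/copying server.example.com:/tmp
# one final pass with -c to checksum every chunk against the server copy
rsync -avc --progress /tmp/copying server.example.com:/tmp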

Line four concatenates all the chunks on the server and writes the output to ~/bigfile (a file named bigfile in the user's home directory).
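
If you want extra reassurance that the reassembled file matches the original before deleting anything, comparing checksums is worth the wait (sha256sum could just as well be md5sum or cksum, whatever both machines have):

sha256sum bigfile
ssh server.example.com 'sha256sum ~/bigfile'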

Lines 5-6 simply clean up the temp directories on both machines.

Analysis

The big part is the chunk size. What I did was simply tell split to create the largest number of files its default two-letter suffix allows, 'aa' through 'zz', which is 676. That seems like a good number, since keeping more than a thousand files in one directory can really hurt filesystem performance, and if bigfile is 10 GB this results in chunks of about 15 MB each. Each chunk takes about 5 minutes to copy on my connection, so that is acceptable to me.
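
If you want to work out the chunk size for your own file before splitting, a bit of shell arithmetic does it (stat -c%s is the GNU coreutils form; BSD and macOS use stat -f%z instead):

echo "$(( $(stat -c%s bigfile) / 676 / 1024 / 1024 )) MB per chunk (roughly)"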

Using smaller chunks (and thus more of them) would result in less recopying. Increasing the number of chunks to 1024 would mean each chunk of a 10 GB file is about 10 MB. You would need to increase the suffix length with the -a parameter, changing the above split command to:

split -a3 -n1024 bigfile /tmp/copying/bigfile.

That would give you the files bigfile.aaa through bigfile.bnj, 1024 in all. With the -a3 parameter you can create up to 17,576 (26 × 26 × 26) files, from .aaa to .zzz, but of course most file systems are going to barf big time on that many files in one directory. However, each chunk of a 10 GB file would then only be about 597 KB in size!

Tags: copy, large file
Last update: 2015-08-23 23:13
Author: Rod
Revision: 1.2