[repost] How To Repair Corrupt tar Archives

[update]: There is a follow-up article with a more efficient solution. You can find it here: http://riaschissl.bestsolution.at/2015/03/repair-corrupt-tar-archives-the-better-way/

preface: This is a repost of an article I wrote more than ten years ago on our company homepage. Despite its age, the page still receives a huge amount of traffic, so I am reposting it here on my blog because the original article will soon vanish from our official company homepage.

Every sysadmin’s nightmare: You made a backup of important files using tar and for whatever reason you need to restore the files – but find the tar archive broken.

This happened to me once (and hopefully never again), and it took me a very long time to get the data back (or at least the usable part of it).

Before we start, some assumptions to make things clear:

  • tar is GNU-tar
  • your archive has been bzip2 compressed
    (although the compression type is secondary)
  • you have the tar-file ready in some accessible place

(GNU-)tar itself has some options that claim to be suitable for recovering lost data (you’ll understand the sarcasm here if you read on …). So let’s first check what the problem is:

Now this indicates that I should use bzip2recover “to *attempt* to recover data from undamaged sections”. Well, that doesn’t sound too bad, does it?

So I used bzip2recover:

That way at least something happened. Depending on the size of the archive, bzip2recover produces a nice number of small ‘rec*’ files, each holding one bzip2 block (up to 900K of uncompressed data at the default block size bzip2 uses for compression). The “nice number of small files”, however, is likely to become a “huge number of small files” if your archive is big, like mine was.

The archive I had to deal with was more than 200MB in size, leaving me with several hundred(!) of those “small files”. But I was still optimistic that I could retrieve the data from the small files by weeding out the corrupted ones. So I tried to find out which of the small files were corrupted and which ones were good:

bunzip2 stops when it finds the first (and hopefully last) corrupted file, which is exactly what I wanted to know. Krush, kill and destroy: there is no use for a corrupt file, so I deleted it and repeated the above command plus the deletion for all further bad files. The only important thing is to remember the numbers of the deleted files.
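
The same check can also be sketched in Python, which reports all bad pieces in one pass instead of stopping at the first one. This is only an illustration, not the command used in the original recovery: the function names and the `rec*.bz2` file pattern are assumptions made for the example.

```python
import bz2
from pathlib import Path


def is_readable_bz2(path, chunk_size=1 << 20):
    """Return True if the whole bzip2 file decompresses without error."""
    try:
        with bz2.open(path, "rb") as fh:
            # Read and discard the data; we only care whether it decodes.
            while fh.read(chunk_size):
                pass
    except (OSError, EOFError):
        # OSError: invalid data stream, EOFError: stream ends too early.
        return False
    return True


def find_bad_parts(directory, pattern="rec*.bz2"):
    """List the names of all matching files that fail to decompress."""
    return [p.name for p in sorted(Path(directory).glob(pattern))
            if not is_readable_bz2(p)]


# Hypothetical usage in the directory that holds the recovered pieces:
#     for name in find_bad_parts("."):
#         print("corrupt:", name)
```

Nothing is deleted here on purpose; the list of bad file names is exactly the information you need to remember for the next steps.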

So now I thought it would be easy: use tar on the bunzip’ed files. But I was taught otherwise. Say that rec00199 was the first (and last) corrupted file, so starting at rec00200:

Headache time … I could also try it with any of the >200 remaining allegedly “fixed” files, but I always got the same error. Searches on Google and postings to some mailing lists did not provide me with any useful results, and my headache grew.

Tar claims to have a feature that scans even corrupt files for tar headers, but this feature has one major blemish: it only works if no bytes are lost in the file, because tar works in 512-byte blocks and expects every header to start on a block boundary. If only one byte is lost in such a header (or in a following data block), everything after it is shifted, and this “recovery feature” fails and becomes an annoyance.
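
To make the alignment problem concrete, here is a small Python sketch. It builds a tiny demo archive in memory (the file names and contents are made up for the example) and looks for the "ustar" magic, which sits at offset 257 inside a header block, only at 512-byte boundaries. After a single byte is dropped, none of the headers line up any more.

```python
import io
import tarfile

BLOCK = 512          # tar reads and writes fixed 512-byte blocks
MAGIC_OFFSET = 257   # the "ustar" magic sits here inside a header block


def build_demo_archive():
    """Create a small in-memory tar archive with two members (demo data)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.GNU_FORMAT) as tar:
        for name, payload in (("a.txt", b"first file\n"),
                              ("b.txt", b"second file\n")):
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def aligned_headers(data):
    """Offsets of 512-byte blocks that carry the "ustar" magic."""
    return [off for off in range(0, len(data) - BLOCK + 1, BLOCK)
            if data[off + MAGIC_OFFSET:off + MAGIC_OFFSET + 5] == b"ustar"]


archive = build_demo_archive()
damaged = archive[:100] + archive[101:]   # lose one byte in the first header

print(aligned_headers(archive))   # headers found on block boundaries
print(aligned_headers(damaged))   # the same scan after losing one byte
```

The headers are all still present in the damaged copy, just one byte earlier than a block-wise reader expects them.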

Luck returned a couple of weeks later when I received an email from a nice guy who had written a nice Perl script that really searches a file for tar headers byte by byte and not in the 512-byte manner of tar itself. You can download it from here.
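
The Perl script itself is not reproduced here. As a rough idea of how such a byte-wise search can work, the following Python sketch looks for the "ustar" magic at every byte position and accepts a candidate only if the header checksum matches. It is not the original script: the function names are invented for this example, it reports zero-based offsets, and it will miss very old tar headers that carry no "ustar" magic.

```python
MAGIC = b"ustar"
MAGIC_OFFSET = 257   # position of the magic inside a 512-byte header
BLOCK = 512


def header_checksum_ok(block):
    """Verify the octal checksum stored in bytes 148..155 of a header."""
    field = bytes(block[148:156]).strip(b" \0")
    try:
        stored = int(field, 8)
    except ValueError:
        return False
    # The checksum is computed with its own field counted as 8 spaces.
    computed = sum(block[:148]) + 8 * ord(" ") + sum(block[156:])
    return stored == computed


def find_tar_headers(data):
    """Yield the zero-based offset of every plausible tar header in data."""
    pos = data.find(MAGIC)
    while pos != -1:
        start = pos - MAGIC_OFFSET
        if start >= 0 and start + BLOCK <= len(data):
            if header_checksum_ok(data[start:start + BLOCK]):
                yield start
        pos = data.find(MAGIC, pos + 1)


# Hypothetical usage on a file small enough to fit in memory:
#     with open("good_tail.tar", "rb") as fh:
#         print(next(find_tar_headers(fh.read()), None))
```

Checking the checksum in addition to the magic keeps the scan from reporting a stray "ustar" string inside file contents as a header.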

In order to get things working, I joined the second part of the bunzip’ed files (the ones after the bad rec00199):

The command above joins all files starting at rec00200 up to rec004999 together in good_tail.tar.
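
For readers who prefer a script, the joining step can be sketched in Python as well. This is an illustration under assumptions, not the command from the original recovery: the helper names are made up, and the selection relies on the zero-padded numbers in the file names so that plain string comparison gives the right order.

```python
import shutil
from pathlib import Path


def tail_parts(directory, first_good):
    """Return all rec* files whose name sorts at or after first_good."""
    return [p for p in sorted(Path(directory).glob("rec*"))
            if p.name >= first_good]


def join_parts(parts, target):
    """Concatenate the given files, in the order given, into target."""
    with open(target, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)


# Hypothetical usage, with rec00199 being the bad piece:
#     join_parts(tail_parts(".", "rec00200"), "good_tail.tar")
```

The order of the pieces matters, because the tar data simply continues from one decompressed block into the next.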

And now the only thing I had to do was to use the script below to find the position of the first good tar header in good_tail.tar:

The only thing that matters is the first line of the output: it tells you that the first good tar header in good_tail.tar is at position 17185. What remained was to extract the content starting at this position and then untar it:
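
This last step can be sketched in Python too. The function below is an assumption-laden illustration rather than the original command: `extract_from_offset` is a made-up name, and `offset` is the zero-based number of bytes to skip, so check whether the position reported by your search tool counts from 0 or from 1 before passing it in.

```python
import tarfile


def extract_from_offset(archive_path, offset, dest_dir):
    """Untar everything that starts at the given byte offset."""
    with open(archive_path, "rb") as raw:
        # Skip the damaged bytes in front of the first good header.
        raw.seek(offset)
        # Mode "r:" means plain tar without compression; tarfile starts
        # reading at the current position of the file object.
        with tarfile.open(fileobj=raw, mode="r:") as tar:
            names = tar.getnames()
            # Use the safer "data" extraction filter where Python has it.
            kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
            tar.extractall(dest_dir, **kwargs)
    return names


# Hypothetical usage:
#     extract_from_offset("good_tail.tar", 17185, "restored")
```

The function returns the member names it found, which gives a quick overview of what could be rescued.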

Happy end of the story!

Comments

Anonymous (Guest)

Hello, I believe the problem of broken bz2 archives is present everywhere. Most sysadmins don’t try to repair corrupted archives and just delete them… But I still think this is not a solution! I had this problem myself. An HDD had bad sectors but was still spinning. I extracted the files that were still readable. There were several big .tar.bz2 archives too, but two of them were corrupted. So I had to find a way to recover the still readable parts of them. If you work along the lines of this article, you can run into a lot of trouble if one of…

Mark (Guest)

Can you please post this program or a link to it here (use pastebin or similar if you don’t have a website), so other members of the open source community can benefit from it?

Mark (Guest)

Dear Udo,
Thank you very much,
but it would be great if the Anonymous who posted on March 11 2014 could also share his C program.
(The idea is to do on-the-fly bzip2 decompression and tar recovery, and to write out useful info plus a log of the corrupt files from the tar archive.)
PS: Partially corrupt files can be kept.
