Every file is broken up into blocks that are (for example) no larger than 1MB (Dropbox uses 4MB). Each block is hashed using MD5, SHA-256, or a similar algorithm and copied to the storage location. Each new block has its hash checked against the list of existing hashes, and blocks that are already stored are not copied to the server again.

Because blocks and files are abstracted from each other, files with identical contents but different names/locations only take the space of one file, plus the list/database entries for each file's name, location, and blocks.

A separate file/database contains a list of which blocks belong to which files, and where in those files they go. Files with 99% similarity only need extra storage space for the blocks that differ between the two files.

As an example, assume you have a 20.5MB file. This could be broken down into 21 blocks, with 20 being 1MB each and 1 being 0.5MB. You store the blocks, the sliding hash, and the MD5 of each of the blocks in your backup location. You also store a list that contains the file name, location, and blocks contained in the file. You then edit the file, removing 0.3MB from somewhere in the middle of the file, and save it as a new file. Because the removal shifts everything after it, the later blocks no longer start on 1MB boundaries, which is why the sliding hash is needed: it can be checked at every byte offset rather than only at block boundaries. You use the sliding hash (confirmed with the MD5s) and determine that you already have 19 of the 1MB blocks, the 0.5MB block, and 0.7MB of data that matches no existing hashes. The 0.7MB of data becomes a new block, which is uploaded to your backup location along with its hashes. You then upload the new file's name, location, and the list of blocks contained in the file. Using this basic dedupe method, storing an additional 20.2MB file only required transferring and storing an additional 0.7MB.

Deduplication for versioning works in an identical manner. If the file had been edited and saved with the same name, the exact same process would have occurred, and the same 0.7MB would have been stored.
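The 20.5MB example above can be tried at a smaller scale. What follows is only a rough sketch under several assumptions of mine, not how any particular product does it: blocks are 1KB instead of 1MB, the sliding hash is an rsync-style weak checksum, SHA-256 stands in for the MD5 confirmation, and the "backup location" is a set of in-memory dictionaries. The names `BlockStore`, `weak`, `strong`, `BLOCK`, and `MOD` are made up for illustration.

```python
import hashlib
import random

BLOCK = 1024   # demo block size in bytes; stands in for the 1MB blocks in the text
MOD = 1 << 16  # modulus for the weak sliding checksum (an assumption)

def weak(window):
    """rsync-style weak checksum (a, b) of one full window."""
    n = len(window)
    a = sum(window) % MOD
    b = sum((n - j) * x for j, x in enumerate(window)) % MOD
    return a, b

def strong(data):
    """Strong hash that confirms a weak match and names the block."""
    return hashlib.sha256(data).hexdigest()

class BlockStore:  # hypothetical in-memory stand-in for the backup location
    def __init__(self):
        self.blocks = {}          # strong hash -> block bytes
        self.weak_index = {}      # weak checksum -> hashes of full-size blocks
        self.short_sizes = set()  # sizes of stored blocks shorter than BLOCK
        self.files = {}           # file name -> ordered list of block hashes

    def _store(self, data):
        """Store one block if it is new; return (hash, bytes added)."""
        h = strong(data)
        if h in self.blocks:
            return h, 0
        self.blocks[h] = bytes(data)
        if len(data) == BLOCK:
            self.weak_index.setdefault(weak(data), set()).add(h)
        else:
            self.short_sizes.add(len(data))
        return h, len(data)

    def _match(self, data, i, sums):
        """Return the hash of a stored block that starts at data[i], else None."""
        rest = len(data) - i
        if rest >= BLOCK:
            if sums in self.weak_index:
                h = strong(data[i:i + BLOCK])
                if h in self.weak_index[sums]:
                    return h
        elif rest in self.short_sizes:
            h = strong(data[i:])  # a short tail can only match as a whole
            if h in self.blocks:
                return h
        return None

    def _flush(self, literal, recipe):
        """Store pending unmatched bytes as a new block; return bytes added."""
        if not literal:
            return 0
        h, size = self._store(literal)
        recipe.append(h)
        literal.clear()
        return size

    def add_file(self, name, data):
        """Record a file, storing only unseen data; return bytes added."""
        recipe, literal, added = [], bytearray(), 0
        n, i = len(data), 0
        sums = weak(data[:BLOCK]) if n >= BLOCK else None
        while i < n:
            h = self._match(data, i, sums)
            if h is not None:
                added += self._flush(literal, recipe)
                recipe.append(h)
                i += len(self.blocks[h])
                sums = weak(data[i:i + BLOCK]) if n - i >= BLOCK else None
                continue
            literal.append(data[i])
            if len(literal) == BLOCK:
                added += self._flush(literal, recipe)
            if n - i > BLOCK:  # slide the window one byte to the right
                a, b = sums
                a = (a - data[i] + data[i + BLOCK]) % MOD
                b = (b - BLOCK * data[i] + a) % MOD
                sums = (a, b)
            else:
                sums = None
            i += 1
        added += self._flush(literal, recipe)
        self.files[name] = recipe
        return added

rng = random.Random(0)  # seeded so the demo data is reproducible
original = bytes(rng.getrandbits(8) for _ in range(20 * BLOCK + BLOCK // 2))
cut = 10 * BLOCK + 200  # somewhere inside the eleventh block
edited = original[:cut] + original[cut + 307:]  # drop about 0.3 of a block

store = BlockStore()
first = store.add_file("report.dat", original)
second = store.add_file("report-edited.dat", edited)
print(first, second)
```

The first call stores all 20.5 blocks (20,992 bytes). The second call reuses 19 full blocks and the half-size tail, and stores only the 717 bytes left over from the damaged block, which mirrors the 0.7MB in the example.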
Removing old versions of files/backups involves building a list of all hashes used by the files you want to keep, and then deleting all other blocks. When restoring a file, you pull up a list of the blocks used in that file, download the blocks, and then reassemble them into the proper file. With unused blocks deleted, the data still stored is only what is used by your files.

This is possible because none of the blocks overlap or are overlaid onto other blocks. This means that all backups are incrementals. There is no reason to perform a "full backup", since one would bring no space or speed savings.

Some companies, such as Dropbox, dedupe across all customers, which results in some pretty wild space savings. I'm personally interested in deduplicating files and data across multiple computers and locations. As long as each of the clients has the same decryption key, or the blocks are stored unencrypted on the server, this becomes trivial. Yes, the de-duplication features in Dropbox work extremely well.
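The pruning and restore steps described above come down to a few lines once the block list exists. This is a minimal sketch, assuming the file list and the block store are plain dictionaries. `manifest`, `blocks`, `prune`, and `restore` are hypothetical names, and SHA-256 is just one possible choice of hash.

```python
import hashlib

def sha(data):
    """Name a block by the hex digest of its contents."""
    return hashlib.sha256(data).hexdigest()

def restore(manifest, blocks, name):
    """Reassemble a file by joining its blocks in the recorded order."""
    return b"".join(blocks[h] for h in manifest[name])

def prune(manifest, blocks, keep):
    """Forget every file not in keep, then delete blocks no kept file uses."""
    for name in list(manifest):
        if name not in keep:
            del manifest[name]
    # Build the list of all hashes still referenced by kept files.
    live = {h for hashes in manifest.values() for h in hashes}
    dead = [h for h in blocks if h not in live]
    for h in dead:
        del blocks[h]
    return len(dead)

# Two versions of a file that share their first block.
shared, old_only, new_only = b"A" * 8, b"B" * 8, b"C" * 8
blocks = {sha(b): b for b in (shared, old_only, new_only)}
manifest = {
    "report-v1": [sha(shared), sha(old_only)],
    "report-v2": [sha(shared), sha(new_only)],
}

removed = prune(manifest, blocks, keep={"report-v2"})
print(removed)                                  # 1
print(restore(manifest, blocks, "report-v2"))   # b'AAAAAAAACCCCCCCC'
```

Only the block unique to the old version is deleted. The shared block survives because the kept version still lists it, so the newer file restores in full without any "full backup" behind it.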