All my talk of hashes over on my Windows XP SP3 post on ZDNet has prompted a series of questions that I really should have anticipated - starting with, what is a hash?
Here goes ...
Think of a hash as a mathematical fingerprint (a digest or checksum) of the contents of a file. You take the file (which is just a long sequence of binary 0s and 1s), pass it through a mathematical function (of which there are many ...), and out the other end you get a fixed-length string that is the hash.
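To make the "fixed-length" part concrete, here is a minimal sketch in Python. It assumes MD5 as the hash function (just one of the many available) and uses Python's standard hashlib module; the helper name md5_hex is made up for this illustration.

```python
import hashlib

def md5_hex(data: bytes) -> str:
    # Hypothetical helper: returns the MD5 digest of some bytes as hex text
    return hashlib.md5(data).hexdigest()

# Five bytes of input versus half a million bytes of input
short_digest = md5_hex(b"Hello")
long_digest = md5_hex(b"Hello" * 100000)

# Both digests come out as 32 hexadecimal characters (128 bits)
print(len(short_digest), len(long_digest))
```

However much data goes in, the digest that comes out is always the same length.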
Fortunately for those of us who aren't Rain Man, there are tools that handle all the heavy mathematical lifting for us. A firm favorite of mine is called HashCalc. The chance of two files with different content giving the same hash is very small indeed (these events are known as collisions, and the mathematical functions used try to keep them to a minimum). Hashes are a great way to determine whether a file that you've downloaded has become corrupted or been tampered with in any way.
Here's a quick example that you can follow along with at home if you download a copy of HashCalc.
Open up Notepad (not Word or WordPad, as these add formatting and structure to the file) and type the following:
Hello
Save the file (giving it any name you want, since the name of the file doesn't form part of the hash - you can try this out for yourself if you want). Now load the file into HashCalc. You'll notice that there are several different hash functions available but, for now, I'm going to concentrate only on MD5. The MD5 hash of a file containing the word Hello is as follows:
8b1a9953c4611296a827abf8c47804d7
Now reopen the file that you created and make a change, for example:
hello
Take the MD5 hash of this. It should be:
5d41402abc4b2a76b9719d911017c592
That single character difference makes a huge (and immediately noticeable) difference to the hash.
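If you don't have HashCalc to hand, the same comparison can be sketched in a few lines of Python using the standard hashlib module. This assumes the file contains exactly the five characters shown, stored as plain ASCII with no trailing line break (which is what Notepad produces if you don't press Enter); the helper name md5_of_text is made up for this illustration.

```python
import hashlib

def md5_of_text(text):
    # Assumption: the file holds exactly these characters as ASCII bytes,
    # with no trailing newline and no byte-order mark
    return hashlib.md5(text.encode("ascii")).hexdigest()

upper = md5_of_text("Hello")
lower = md5_of_text("hello")

# Count how many of the hex characters differ position by position
differing = sum(1 for a, b in zip(upper, lower) if a != b)

print(upper)
print(lower)
print(f"{differing} of {len(upper)} hex characters differ")
```

If your own hashes don't match the ones above, the usual culprit is a stray line break or a different text encoding in the saved file.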
Now scale this up. The Windows XP SP3 file that I was talking about on ZDNet is huge - 316MB. That's a lot of binary data, and a small change wouldn't be anywhere near as noticeable. A hash is a simple way for us to know if we are talking about the same file. If you check the file that you have and the hash is bb25707c919dd835a9d9706b5725af58, and mine is the same, we know that we are talking about the same file, that the download hasn't been corrupted, and that no one has tampered with it and maybe planted a virus inside. A single change to the file would result in a vastly different hash.
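For a file that size you wouldn't want to load the whole thing into memory just to hash it. Here's a rough sketch of how the check could be done in Python, reading the file a chunk at a time. The function names file_md5 and matches_published_hash are my own inventions for this example, and the demo runs against a small throwaway file rather than the real download.

```python
import hashlib
import os
import tempfile

def file_md5(path, chunk_size=1024 * 1024):
    """Return the MD5 hex digest of the file at path, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read 1MB at a time until read() returns an empty bytes object
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_published_hash(path, published_hash):
    """Compare a file's MD5 with a published value, ignoring case and padding."""
    return file_md5(path) == published_hash.strip().lower()

# Demo on a small throwaway file so the sketch runs anywhere
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"Hello")
    demo_path = tmp.name

demo_ok = matches_published_hash(demo_path, "8B1A9953C4611296A827ABF8C47804D7")
demo_bad = matches_published_hash(demo_path, "5d41402abc4b2a76b9719d911017c592")
os.remove(demo_path)

print(demo_ok, demo_bad)
```

To check the real thing, you would call matches_published_hash with the path to your own copy of the download and the hash quoted above.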
One final point worth noting - hashes are one way. That is, you can create a hash from a particular binary sequence, but you can't take that hash and work backwards to find out what the contents of the file were in the beginning.
If you want to know more about hashes, here's a good place to start.