The Curious Case of MD5

December 31, 2023

Comments on Hacker News | Average reading time: 11 minutes.

Recently I came across a puzzling fact: the International Criminal Court hashes electronic evidence with MD5¹.

What’s wrong with that? Well, MD5 is badly broken. So broken that experts have been saying for over a decade that “no one should be using MD5 anymore.”

Given the wide variety of better alternatives, using MD5 today makes no sense.

And more puzzling: it’s not just the International Criminal Court using MD5. Apparently, the entire American legal and forensics community uses MD5.

So, why are lawyers using broken, outdated technology?

In asking this, I should be clear: I’m not a lawyer and I’m not a cryptographer. I’m a software engineer² and consultant in applied cryptography. And I suspect that I might be one of the only people interested in both cryptography and law, otherwise someone else would have written this article already.

There are a few reasons why lawyers are still using broken, outdated technology. Fundamentally, the legal community disputes that MD5 is broken, for them. Yes, they say, MD5 is broken for encryption, but since they’re not doing encryption, it’s fine for them to use it.

This post is my exploration of how this dispute came about, and whether the legal community is correct in believing they can safely use MD5.

How law uses hashing

Let’s start with how MD5 and other hash functions are used in law.

Let’s imagine that we have 200 documents that we must store as evidence for a case. It’s important that each of these 200 documents is 1) identified correctly, 2) copied correctly from the originals, and 3) unchanged since the time we collected it.

We can use something called a cryptographic hash function to accomplish these goals. A cryptographic hash function produces a unique identifier (called a “hash” or “hash value”) for a document, based on the contents of the document.

Importantly, if anything at all in the document changes, even just a single character added or subtracted, the hash value will be different. As Judge Paul W. Grimm points out in Lorraine v. Markel American Ins. Co., hashing a document can help to easily distinguish between the “final” (and thus legally operative) version and earlier versions. While both documents might look similar to the human eye, the hash value of two very similar but different documents is completely different.

We can test this out. For instance, in less than a second of computation, the 783 page book Ulysses by James Joyce hashes to:

c6061f63b03774425a5b06083af4c9cb33f6f47cf0efd71b1258828f3332a604

And if we change a single letter three-quarters of the way through the book, we get a completely different hash:

2e605eca536c927d629fec4c8ab4759af59bd2ba15ef562989766c348b1a72a6

And thus, by hashing, we know–without reading a single line of the book–that the book has been altered and is no longer the original. And we can do the same check for millions of documents at a time. That’s pretty useful as a time-saving device!

So, in law, we want to be able to identify a document with a particular id, and then be able to know that 1) we’re all talking about the same document and aren’t confusing it with some other document or a different version, and that 2) our copy of the document is the same as the original, and 3) that the document hasn’t been changed or altered without our awareness. Cryptographic hash functions help us accomplish all of these goals.

MD5 is obsolete

There are many cryptographic hash functions, and only a few are recommended for current use. The others are obsolete. MD5 is extremely old, in tech years. It was introduced in 1992, problems were noticed in 1996 and 2005, and by 2008, it was deemed unusable. Carnegie Mellon’s Software Engineering Institute stated in 2008:

“Do not use the MD5 algorithm. Software developers, Certification Authorities, website owners, and users should avoid using the MD5 algorithm in any capacity. As previous research has demonstrated, it should be considered cryptographically broken and unsuitable for further use.”

Note that this was not controversial in any sense among technologists. But, we find ourselves in the year 2023, a full 15 years after being told not to use MD5, with the legal community using and even recommending MD5.

What exactly is wrong with MD5, and does it matter for law?

As far as I can tell as an outsider, there’s very little current discussion in the legal community about whether MD5 should be replaced. For instance, the Sedona Conference 2021 commentary on ESI Evidence & Admissibility mentions MD5 as one of the “most commonly used algorithms” for hashing, and eDiscovery and forensics software such as OpenText and Exterro are apparently proud to use MD5.

And, when the legal community does attempt to grapple with MD5 being broken, the problem is dismissed because it is “only broken for encryption”, whatever that means.

For example, the Sedona commentary references a 2008 article by Don L. Lewis (now taken down and only available through the Internet Archive) that has the following exchange:

When I testified recently a defense attorney brought this subject up. The testimony went something like this.

Q. “Mr. Lewis, are you aware that the MD5 algorithm has been compromised?”

A. “Yes, I am.”

Q. “So, its use to authenticate evidence is no longer valid!”

A. “No, the use of the MD5 algorithm is still a valid function for authentication.”

Q. “Why is that?”

A. “There are multiple uses for hash algorithms. One is cryptography (encryption), another is identification, and another is authentication. In digital evidence forensics, we use hash algorithms for known file identification and evidence authentication, which differs from its use in encryption.”

Here, Lewis is simply wrong. The primary use of hashing in cryptography is identification and authentication, same as in law. Of course, the context differs – outside of the law, the hashed files don’t end up in a courtroom. But, this idea that the tech world mainly uses cryptographic hash functions for something other than identification and authentication is wrong.

How Tech Uses Hashing

Cryptographic hash functions have a number of use cases in tech.

First, hashes are often used to verify that data was not tampered with or altered in transit. For instance, a file download page might also include a signed hash so that users can check that the file they have on their computer is an identical copy of the original. This is very similar to the legal use case of checking that a forensic image is an identical copy of the original.

Second, a major use of hashes in tech is in digital signatures. A digital signature is like a handwritten signature in that it connotes that the signer has either witnessed or approved of whatever is being signed, but a digital signature is very different in form. A digital signature uses cryptography, and effectively proves to the general public that only the person or entity that knows a particular secret has signed the file or data, without revealing the secret to anyone else.

There’s a lot more to understand about digital signatures (see a talk on digital signatures for economists that I gave at APEE), but for our purposes, the important thing about digital signature schemes is that they use cryptographic hash functions to get a unique identifier for the file, and it’s the unique identifier that is signed, not the entire file.

Now, there are a few more ways in which cryptographic hash functions are used in tech – for instance, as a way of identifying files in peer-to-peer file sharing and content-addressed storage (so still identification, contrary to Lewis.) But let’s focus on the digital signature example.

Back in 2005, a group of researchers showed how the problems with MD5 could be used in a real world attack. Their example is a bit convoluted, but essentially, they showed that an attacker could make the victim sign a harmful document. Importantly, it requires that the attacker can create two files, one innocuous and one harmful, both of which have the same MD5 hash. MD5 is broken in this particular way: given access to two files, it is easy to change some data in both of them to result in the same MD5 hash.

In their particular example, the innocuous file is a letter of recommendation, and the harmful file is a security clearance, both postscript files. (If your computer can’t read postscript files, here are screenshots of both so that you see what each says. The hashes of the screenshots are of course not the same; only the postscript files have the same hash.)

The Innocuous File: A Letter of Recommendation

The Harmful File: The Security Clearance Order

In this case, the key element is that a third party believes that Julius Caesar has signed a security clearance for Alice when he has not, and Alice has set up both files for the deception. In other words, the deception must be preplanned.

We can easily imagine a similar example in the legal world, even without digital signatures. For example, suppose that a witness has identified a particular file provided by the opposition as the legally operative document. And that the particular file is identified by its MD5 hash. It would be easy for the opposing party to create two documents with the same MD5 hash, which say different things. Depending on which is advantageous to them at the time, they could say either document is the legally operative version, since the legally operative version was only identified by the broken MD5 hash. Of course, like the postscript files, on close and thorough manual inspection, the two files will be found to be distinct. But the whole point of cryptographic hashes is that we don’t need to use manual inspection, especially for large files that are hundreds of pages long.

Another example: suppose someone is arrested for possession of CSAM, but ahead of time, they have manipulated both the harmful image and an innocuous image so that they both have the same MD5 hash. If law enforcement only records the MD5 hash of the image, law enforcement cannot say for sure, based on the hash alone, whether the image was CSAM or not. The image must be manually inspected, which defeats the entire purpose of the cryptographic hash. A cryptographic hash should uniquely identify a file.

It might be possible that the adversarial setup of the American court system can help in this regard, to provide an opportunity for manual inspection. But this assumes that the attack is detected in the first place. What if the attack is undetected? What if it is subtle enough, or the files large enough that it is not caught? For instance, what if the harmful file has a slightly different excel function hidden in the background, that changes the calculations just a tad, but enough to make a difference?

It’s best not to gamble, especially since the fix is trivial.

MD5 isn’t collision resistant

In each of the above examples, MD5 is badly broken in a particular way: it isn’t collision resistant. Practically, for MD5, this means that given access to two files, it is easy, given the right software, to change both of them such that they will produce the same MD5 hash. In other words, to produce a particular collision. In fact, producing any arbitrary collision would be enough to make MD5 unusable.

But, I want to be careful not to claim too much. So far, it is not possible, given a particular, non-manipulated file and the MD5 hash for that file, to find a second file that has that same MD5 hash. This is called a second-preimage attack. This is good news in the case in which we know an attacker or a malicious party doesn’t have access to files before we hash them. For instance, as Lewis points out, the “hash sets” (databases of MD5 hashes for known operating system and application files) are unlikely to be affected by MD5’s lack of collision-resistance, given that the hashes are of presumably non-manipulated files.

However, as society becomes more tech-adept, and particularly when the defendants and their associates are familiar with forgery, malicious manipulation of evidence should be prevented as much as possible, especially since doing so is trivial: just use an up-to-date, unbroken hash function instead of MD5.

What to use instead of MD5

NIST’s current recommendation is the SHA3 family. It does not have any of the flaws that MD5 does, and works just as well.

If speed is a particular concern, perhaps with very large files, Blake3 is one of the fastest algorithms.

Do not use SHA1, which is also broken.

Shameless plug: If you’d like help integrating SHA3 or Blake3 into your current practice, feel free to contact me at katelynsills@gmail.com for consulting.

How did this happen? The difference between law and tech

So, going back to our question: why are lawyers using broken, outdated technology?

In addition to lawyers mistakenly believing Lewis’ argument that MD5 is only broken for “encryption,” I think there are a few more reasons:

The legal community started using MD5 at the same time that they started hashing, and it stuck.
The legal community doesn’t know that better alternatives to MD5 exist, because digital forensics is isolated from computer science.
The legal community hasn’t updated to the latest technology because legal culture isn’t accustomed to (and might even be hostile to) performing continuity-breaking updates at the pace that tech requires.

For example, in the previously mentioned Sedona commentary, the mention of MD5 is from a 2007 text: Managing Discovery of Electronic Information: A Pocket Guide for Judges. It’s admirable that a 2007 text included hashing, and it is somewhat excusable that MD5 is recommended, since the text was published before the major publicity of MD5’s flaws in 2008.

But making software recommendations based on what a 2007 text recommends is deeply negligent. There’s no reason to believe that what was true in 2007 is true now. In software years, 2007 is ancient history.

However, in legal culture, 2007 is yesterday, which is why this reference wasn’t seen as problematic.

It also appears that the legal community does not know that better alternatives to MD5 exist – algorithms that don’t have known weaknesses. For example, many forensics textbooks only reference MD5 and SHA1, both of which are not recommended for use by NIST. The textbook Digital Evidence and Computer Crime states:

“One approach to addressing concerns about weaknesses in any given hash algorithm is to use two independent hash algorithms. For this reason, some digital forensic tools automatically calculate both the MD5 and SHA-1 hash value of acquired digital evidence, and other tools provide multiple hashing options for the user to select.”

Or, you know, you could just use a single cryptographic hash function that actually works.

Lastly, I think a big part of the problem is cultural. Legal culture is hostile to rapid updates, and for good reason: lawyers and law-makers are used to running “code” on human beings, who simply can’t handle rapid change.

Law needs to learn to update

In the Morality of Law, legal theorist Lon L. Fuller identified the ways in which we can “fail to make law.” One of those is “introducing such frequent changes in the rules that the subject cannot orient his action by them.”

This makes sense: people need to be able to know what the rules are and will be, so that they can plan their actions accordingly.

But software, unlike law, doesn’t run on people. Software runs on computers, which can handle rapid change. And software has to defend against motivated attackers, who will exploit any weaknesses in old code at a rapid pace themselves. So software has a very different culture than law. In software security, it is expected that if any weakness is found or even suspected, it is patched and everyone updates as quickly as possible to new code. At times, this can be jarring, but for the most part, it happens entirely outside of human perception.

When law adopts technology, law must also adopt tech culture: a culture of regular updates for that particular technology. Otherwise, as in the case of hashing, the legal community will be going through the motions but won’t be able to reap the full benefits of the technology, because what they are using is badly outdated.

Imagine being able to instantly uniquely identify a document, no matter how large it is, and know for certain that it is distinct from any other document, without any confusion. Imagine knowing that the document is 1) identified correctly, 2) copied correctly from the originals, and 3) unchanged since the time we collected it.

If law uses a proper cryptographic hash function rather than MD5, we can be certain of all three.

Comments

Hacker News Discussion

Notes

I learned this from Chelsea Quilling, in https://jpia.princeton.edu/news/future-digital-evidence-authentication-international-criminal-court ↩
For those credentially-minded, I’ll mention that I have a degree in Computer Science from UC Berkeley, where I was a member of Upsilon Pi Epsilon, the International CS Honors Society.↩