Genii Weblog

Data mining with MD5 hash values

Thu 24 Mar 2016, 02:03 PM

by Ben Langhinrichs

After the relatively simple data mining example I showed previously using the Midas LSX, I thought I'd show a more useful one. This time, I want to find duplicate file attachments, whether or not they have the same name. That could mean tracking down how a particular attachment has been circulated, or simply finding out whether replies and replies-to-replies have duplicated the same attachment many times, wasting valuable space. This could obviously be done with any database, but I wanted to use a real example rather than a demo example.

The basic concept here is that I generate an MD5 hash value for each file attachment and save the hash value, the file size and the attachment name into a line in a CSV file. Again, no additional programming was necessary; I just used the Export to CSV sample db. Here is what I found. The top red arrow points to the item I moused over. This is a file called licmidas.lic, an old license file for our Midas LSX, but clearly it has been sent a few times, or has been left attached as the replies and replies-to-replies have been used.
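For readers without the Midas LSX at hand, the same idea can be sketched in a few lines of Python. This is not how the Export to CSV sample does it, just a rough illustration of the concept under some assumptions: the attachments are already available as (name, bytes) pairs, and the function names and the `md5,size,name` column order are made up for the example.

```python
import csv
import hashlib
import io


def md5_row(name, data):
    """Return (MD5 hex digest, size in bytes, name) for one attachment."""
    return (hashlib.md5(data).hexdigest(), len(data), name)


def rows_to_csv(attachments):
    """Build CSV text with one line per attachment.

    attachments: iterable of (name, bytes) pairs (a hypothetical input shape).
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["md5", "size", "name"])  # assumed column order
    for name, data in attachments:
        writer.writerow(md5_row(name, data))
    return out.getvalue()


if __name__ == "__main__":
    # Two attachments with different names but identical content
    # produce the same hash and size.
    print(rows_to_csv([("first.gif", b"same bytes"), ("second.gif", b"same bytes")]))
```

Identical content always yields the same hash regardless of the file name, which is what makes the renamed duplicates stand out.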

Below that, the row of red arrows shows that the file attachments have different names. The numbers are the file sizes, all the same obviously for identical files. In this case, the same attachment has eleven different names. (These happen to be image attachments created by one of our customers every time they reply.)

By simply viewing the data, we can see the UNID of the documents where the attachments live, so we can go and deal with them.
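If you would rather not eyeball the data, the grouping can also be done in code. Here is a minimal sketch in Python, assuming the CSV has a header row with columns named `md5`, `size`, `name` and `unid`. Those column names and the `find_duplicates` function are my own for illustration, not something the export produces by default.

```python
import csv
import io
from collections import defaultdict


def find_duplicates(csv_text):
    """Group rows by (md5, size) and keep only groups with more than one row.

    Returns {(md5, size): [(name, unid), ...]}; size stays a string as read.
    """
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[(row["md5"], row["size"])].append((row["name"], row["unid"]))
    return {key: found for key, found in groups.items() if len(found) > 1}


if __name__ == "__main__":
    # Placeholder values only; real rows would hold actual hashes and UNIDs.
    sample = (
        "md5,size,name,unid\n"
        "abc,10,a.gif,U1\n"
        "abc,10,b.gif,U2\n"
        "def,5,c.txt,U3\n"
    )
    for (md5, size), found in find_duplicates(sample).items():
        print(md5, size, found)
```

Each group that comes back lists the attachment names together with the UNIDs of the documents holding them, so you know where to go to deal with the duplicates.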

We can do the same with embedded images, as well as some other rich text constructs I will show another time. As an example of how the duplicate images could be useful, imagine that your company needs to track the use of a licensed image that appears on various web pages. This could ensure that they were all found, even though images don't have an internal name to make them easy to find.

As before, here is the Export Directive I created in the Export to CSV sample. (This uses a soon-to-be-released version of the Midas LSX.)

Update: I created a more specifically useful visualization for my own email database, so I thought I'd add it here.
