Ben Langhinrichs

Genii Weblog

Data mining with MD5 hash values

Thu 24 Mar 2016, 02:03 PM

by Ben Langhinrichs
After the relatively simple data mining example I showed previously using the Midas LSX, I thought I'd show a more useful example. This time, I want to find duplicate file attachments, whether or not they have the same name. The goal might be to track down how a particular attachment has been circulated, or simply to find out whether replies and replies-to-replies have duplicated the same attachment many times, wasting valuable space. This could obviously be done with any database, but I wanted to use a real example rather than a demo example.

The basic concept here is that I generate an MD5 hash value for each file attachment and save the hash value, the file size and the attachment name into a line in a CSV file. Again, no additional programming was necessary; I just used the Export to CSV sample db. Here is what I found. The top red arrow points to the item I moused over. This is a file called licmidas.lic, an old license file for our Midas LSX, but clearly it has been sent a few times, or has been left attached as the replies and replies-to-replies have been used.

Below that, the row of red arrows shows file attachments that have different names. The numbers are the file sizes, all the same obviously for identical files. In this case, the same attachment has eleven different names. (These happen to be image attachments created by one of our customers every time they reply.)

Inline JPEG image
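
For readers who want to try the hash-per-attachment idea outside of Notes, here is a minimal sketch in Python. It is not how the Midas LSX or the Export to CSV sample does it; the function names and the sample data are hypothetical, and the attachments are just in-memory byte strings standing in for real files.

```python
import csv
import hashlib
import io


def md5_of_bytes(data: bytes) -> str:
    """Return the MD5 digest of data as a lowercase hex string."""
    return hashlib.md5(data).hexdigest()


def attachment_rows(attachments):
    """Yield one (md5, size, name) tuple per (name, content) pair."""
    for name, content in attachments:
        yield md5_of_bytes(content), len(content), name


def write_csv(rows) -> str:
    """Serialize the rows to CSV text, one attachment per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["md5", "size", "name"])
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample data: two different names, identical content.
sample = [
    ("licmidas.lic", b"example license bytes"),
    ("licmidas (1).lic", b"example license bytes"),
    ("notes.txt", b"something else"),
]

print(write_csv(attachment_rows(sample)))
```

The first two lines of output share a hash and a size even though the names differ, which is exactly the pattern to look for.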

By simply viewing the data, we can see the UNID of the documents where the attachments live, so we can go and deal with them.

Inline JPEG image
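
Once each exported row carries the hash, the size, the name and the document it came from, finding the duplicates is a simple grouping step. The sketch below shows one way to do that in Python, assuming rows of (md5, size, name, doc_id). The hash strings and document IDs are made-up placeholders, not real MD5 values or UNIDs.

```python
from collections import defaultdict


def find_duplicates(rows):
    """Group rows by (md5, size) and keep only groups with two or more members.

    Each row is (md5, size, name, doc_id). The result maps (md5, size)
    to a list of (name, doc_id) pairs.
    """
    groups = defaultdict(list)
    for md5, size, name, doc_id in rows:
        groups[(md5, size)].append((name, doc_id))
    return {key: members for key, members in groups.items() if len(members) > 1}


# Hypothetical rows; the hashes and document IDs are placeholders.
rows = [
    ("hash-aaa", 1204, "image001.jpg", "DOC-1"),
    ("hash-aaa", 1204, "image002.jpg", "DOC-2"),
    ("hash-bbb", 310, "licmidas.lic", "DOC-3"),
    ("hash-aaa", 1204, "image003.jpg", "DOC-4"),
]

for (md5, size), members in find_duplicates(rows).items():
    names = sorted({name for name, _ in members})
    docs = [doc_id for _, doc_id in members]
    print(f"{md5} ({size} bytes): {len(names)} names in documents {docs}")
```

Grouping on the size as well as the hash is a cheap extra check, since identical files always have identical sizes.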

We can do the same with embedded images, as well as some other rich text constructs I will show another time. As an example of how the duplicate images could be useful, imagine that your company needs to track the use of a licensed image that appears on various web pages. This technique could ensure that every copy was found, even though images don't have an internal name to make them easy to find.
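
To illustrate the licensed image case, here is a rough Python sketch that takes the bytes of a known image and returns every document containing an identical copy. The function name, document IDs and byte strings are hypothetical. Note that an MD5 match only catches byte-identical copies, so a resized or re-encoded version of the image would not be found this way.

```python
import hashlib


def find_uses(reference: bytes, images):
    """Return the doc_ids whose image bytes have the same MD5 as reference.

    images is an iterable of (doc_id, content) pairs.
    """
    target = hashlib.md5(reference).hexdigest()
    return [
        doc_id
        for doc_id, content in images
        if hashlib.md5(content).hexdigest() == target
    ]


# Hypothetical data: short byte strings standing in for image content.
licensed = b"licensed image bytes"
embedded = [
    ("DOC-1", b"licensed image bytes"),
    ("DOC-2", b"some other picture"),
    ("DOC-3", b"licensed image bytes"),
]

print(find_uses(licensed, embedded))
```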

As before, here is the Export Directive I created in the Export to CSV sample. (This uses a soon-to-be-released version of the Midas LSX.)

Inline JPEG image

Update: I created a more specifically useful visualization for my own email database, so I thought I'd add it here.

Inline JPEG image

Copyright 2016 Genii Software Ltd.
