Why we need open standards for fingerprinting

Posted by on Apr 11, 2016 in Uncategorised

A fingerprint is a series of letters and numbers that holds no meaning to the human eye. It is the outcome of an algorithm applied to a media file. If you apply the same algorithm to a perceptually identical video, you always get the same fingerprint. When a different algorithm is used, however, a different series of letters and numbers comes out. Fingerprinting is not sexy. It is a complex mathematical construct that captures some identifying aspects of a media file. There are, however, good reasons to know what fingerprinting is and why fingerprinting methods need to be developed in the open. In this post, I explain why you should care and why we should set standards for video fingerprinting.

To understand the need for media fingerprinting, imagine a library without any order. There is no way to determine if a book is already in the library or to find a particular book. Ordering books by the author's last name or by category of work seems straightforward, but this is not the case for online images and videos. Videos do not have a straightforward author like books do. We simply publish them on our own sites, YouTube, Vimeo, etc. Provenance information is often lost when videos are published online or copied, and videos are not registered in the way an ISBN identifies a particular book.

'Books' by Jukka Zitting (CC BY)

The Internet works a bit differently than a library. Due to the electronic nature of the Internet, everyone can copy everything they encounter on the web. The resulting copies can be published in turn and add to the unstructured pile of media files online. Videos can also be modified, which leads to multiple sizes and formats while the content stays the same (imagine the same book with a different font and/or cover, while the words in the book stay the same).

Creating a fingerprinting standard

Fingerprinting is like the Dewey Decimal System or the Library of Congress Classification of the Internet, but instead of categorising topics it creates a system of identification that can generate a unique n-digit code based on the perceptual content of the media. These fingerprints create order after the videos have been copied, altered and/or shared online. This makes it possible to make a card catalog of all these media files.
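To make "based on the perceptual content" concrete, here is a minimal sketch of one simple, well-known perceptual technique, the average hash, applied to a single grayscale frame. It is an illustration only and not the method used by any particular service; the function names, the 4x4 grid and the toy frame are assumptions made for this example. It shows that an enlarged copy of a frame keeps its perceptual fingerprint, while a cryptographic hash of the raw pixel bytes changes.

```python
import hashlib


def average_hash(pixels, grid=4):
    """Hypothetical helper: average-hash fingerprint of a grayscale frame.

    `pixels` is a list of rows of integers (0-255). The frame is reduced to
    a grid x grid table of block averages; each cell becomes one bit that
    says whether the cell is brighter than the overall mean.
    """
    height, width = len(pixels), len(pixels[0])
    if height % grid or width % grid:
        raise ValueError("frame dimensions must be multiples of the grid size")
    block_h, block_w = height // grid, width // grid
    cells = []
    for gy in range(grid):
        for gx in range(grid):
            block = [
                pixels[y][x]
                for y in range(gy * block_h, (gy + 1) * block_h)
                for x in range(gx * block_w, (gx + 1) * block_w)
            ]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    bits = "".join("1" if cell > mean else "0" for cell in cells)
    # Render the bit string as fixed-width hexadecimal (4 bits per digit).
    return f"{int(bits, 2):0{grid * grid // 4}x}"


def upscale(pixels, factor=2):
    """Enlarge a frame by repeating every pixel `factor` times in each direction."""
    return [
        [value for value in row for _ in range(factor)]
        for row in pixels
        for _ in range(factor)
    ]


def raw_sha256(pixels):
    """Cryptographic hash of the raw pixel bytes, for comparison."""
    return hashlib.sha256(bytes(v for row in pixels for v in row)).hexdigest()


if __name__ == "__main__":
    # Toy 4x4 frame, invented for this example.
    frame = [
        [10, 200, 30, 180],
        [20, 190, 40, 170],
        [200, 10, 180, 30],
        [190, 20, 170, 40],
    ]
    bigger = upscale(frame)
    print(average_hash(frame), average_hash(bigger))    # same fingerprint
    print(raw_sha256(frame) == raw_sha256(bigger))      # False: the bytes differ
```

Real video fingerprinting has to cope with far more than resizing (re-encoding, cropping, changes in frame rate), so this only illustrates the principle of hashing what a file looks like rather than its bytes.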

The problem with fingerprinting is that these fingerprints are rarely made in the open: there is no clear universal standard that is used over and over again. One widely used system is YouTube's Content ID, which relies on a closed, proprietary algorithm. It is not a standard that is shared with the world. We cannot use its fingerprinting technology to make our own indexes or card catalogs.

'Card Catalog 2' by bookfinch (CC BY)

A fingerprint whose method of creation cannot be identified has no value. Such fingerprints cannot significantly contribute to the Internet, nor can they be used or shared by users online. This means that it is not possible to collaborate to locate videos, identify duplicates or see where videos are used on the Internet.

Identifying the fingerprint

As I explained earlier, the actual fingerprint that comes out depends on the method that is used to create that fingerprint.

For example, these are both fingerprints of the same file, made with different fingerprinting methods (cryptographic hashes in this example). Can you spot which was created with which algorithm?

A3a1c25d9b71a19d412188fa9ee0949a

71db803472debd08beb63b7e99b72802344358c211f90fc114a85c041a3057a3

When different algorithms are used, and different fingerprints result, it is still not possible to develop one clear card catalog for online video files. We need to be able to identify the type of fingerprint so that a useful card catalog can emerge and be used to find information on the work. Fingerprints become easier to understand for people familiar with hashes and fingerprinting if we identify them by their method, like:

md5:A3a1c25d9b71a19d412188fa9ee0949a

sha256:71db803472debd08beb63b7e99b72802344358c211f90fc114a85c041a3057a3
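The method-prefixed notation above can be sketched in a few lines of Python with the standard `hashlib` module. This is a minimal illustration, not a proposed standard: the function names `labeled_fingerprint` and `parse_fingerprint` and the sample bytes are assumptions made for this example, and the cryptographic hashes stand in for real perceptual fingerprints, as they do in the example above.

```python
import hashlib


def labeled_fingerprint(data: bytes, method: str) -> str:
    """Hypothetical helper: return '<method>:<hex digest>' for the given bytes."""
    digest = hashlib.new(method, data).hexdigest()
    return f"{method}:{digest}"


def parse_fingerprint(labeled: str) -> tuple[str, str]:
    """Hypothetical helper: split '<method>:<digest>' back into its two parts."""
    method, _, digest = labeled.partition(":")
    if not method or not digest:
        raise ValueError(f"not a method-prefixed fingerprint: {labeled!r}")
    return method, digest


if __name__ == "__main__":
    media = b"hello world"  # stand-in for the bytes of a media file
    for method in ("md5", "sha256"):
        labeled = labeled_fingerprint(media, method)
        print(labeled, "->", parse_fingerprint(labeled)[0])
```

With the method carried alongside the digest, a catalog can tell which fingerprints are comparable with each other, and a reader no longer has to guess the algorithm from the length of the string.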

We need to work towards a standard that identifies the methods used for fingerprinting media files, agree on a common way of communicating these fingerprints and start sharing them, like any library system does.

1 Comment

  1. Gabriel
    15th August 2016

    Is there any plan to create a group to work in this?

    best regards
