Testing algorithms

Posted on May 2, 2016 in Software

One of the main challenges in Videorooter is identifying and describing the algorithms available for fingerprinting video files, a task which sometimes reminds us of the exchange between Hamlet and Horatio in Act 1, Scene 5 of William Shakespeare's The Tragedy of Hamlet. We're not talking about a single algorithm, but about more than you can dream of. Over the last few weeks, we've put together a set of tests which calculate the accuracy of different algorithms. The tests run automatically on any repository forked from our GitHub repository, meaning everyone can participate in the work of developing the algorithms further.

There are more things in heaven and earth, Horatio,
Than are dreamt of in your philosophy.
– Hamlet (1.5.167-8)

To get an idea of what kind of output we get from this, you can head over to our statistics page and read a bit more. To elaborate on what you see on that page: each “box” (aside from the information boxes) represents one specific algorithm, identified by its GitHub username, repository and branch.

The test suite runs every 30 minutes and checks for new changes to any of the GitHub repositories. Every repository cloned from our template repository that contains the configuration file videorooter.conf is included in the testing. The configuration file gives some basic information about how to compile the source code and how to run the fingerprint algorithm (more information is available in the template).
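To make the polling step concrete, here is a rough Python sketch of what "check each repository for new commits" can look like. The helper names and the use of git ls-remote are our own illustration, not the actual test scripts:

```python
# Minimal sketch of the polling step (not the actual test-suite code).
# Assumption: we know each repository's clone URL and branch, and we use
# `git ls-remote` to read the commit that the branch currently points to.
import subprocess

def head_commit(repo_url, branch="master"):
    """Return the commit hash at the tip of `branch`, or None if missing."""
    out = subprocess.check_output(
        ["git", "ls-remote", repo_url, "refs/heads/" + branch],
        text=True,
    )
    return out.split()[0] if out else None

def changed_repos(repos, last_seen):
    """Yield (repo_url, commit) for every repository whose branch tip moved."""
    for repo_url in repos:
        commit = head_commit(repo_url)
        if commit and commit != last_seen.get(repo_url):
            yield repo_url, commit
```

Anything such a loop reports as changed would then be handed over to the build-and-test stage described below.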

Whenever there is a new commit to an included repository, our test scripts notice this and run the test suite. A separate Docker container is started for each test, in which the code is compiled and the tests are run. Two tests are the most relevant: one which shows how well the algorithm matches verbatim works that have been recoded (accuracy), and a second which shows how many unrelated works are matched as the same even though they are not (false positives).
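To illustrate what those two numbers measure, here is a small Python sketch. It assumes fingerprints are fixed-length bit strings compared by Hamming distance against a threshold, which is a common setup for perceptual hashes but only an assumption about any particular algorithm under test:

```python
# Sketch of the two headline metrics, assuming integer bit fingerprints and a
# Hamming-distance threshold. Helper names and thresholds are illustrative.

def hamming(a, b):
    """Number of differing bits between two equal-length fingerprints."""
    return bin(a ^ b).count("1")

def accuracy(recoded_pairs, threshold):
    """Fraction of (original, recoded copy) pairs still reported as a match."""
    hits = sum(1 for orig, recoded in recoded_pairs
               if hamming(orig, recoded) <= threshold)
    return hits / len(recoded_pairs)

def false_positive_rate(unrelated_pairs, threshold):
    """Fraction of unrelated pairs wrongly reported as the same work."""
    hits = sum(1 for a, b in unrelated_pairs
               if hamming(a, b) <= threshold)
    return hits / len(unrelated_pairs)
```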

(You can read more about algorithm testing in our post about measuring accuracy.)

‘1040’ by x6e38 (licensed under CC BY)

For each algorithm, we calculate the "best performance". This is tricky and should not be relied upon in general: what is best depends heavily on what you want to do! In some applications, high accuracy matters more than a low number of false positives; sometimes it's the other way around. By best performance here, we mean the threshold at which the algorithm works as intended for most cases. We get this number by taking the accuracy and subtracting the false positive rate. The logic is that an algorithm with high accuracy but many false positives ranks lower than one with slightly lower accuracy but far fewer false positives.
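In code, the ranking boils down to a few lines; the numbers below are made up purely to show the arithmetic:

```python
# Score each tested threshold by accuracy minus false positive rate and keep
# the highest-scoring one. The figures here are invented for the example.

def best_performance(results):
    """results: list of (threshold, accuracy, false_positive_rate) tuples."""
    return max(results, key=lambda r: r[1] - r[2])

# 95% accuracy with 20% false positives scores 0.75 and loses to
# 90% accuracy with only 5% false positives, which scores 0.85.
results = [(4, 0.95, 0.20), (8, 0.90, 0.05), (12, 0.80, 0.02)]
print(best_performance(results))  # -> (8, 0.9, 0.05)
```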

If you expand the box for a particular algorithm, you will get more details about it, such as the exact false positive and accuracy statistics for each threshold we test. What's left to do? One of the remaining tasks is gathering a more representative sample of images and videos to test with. Specifically, what we need is:

  • About 100 images and the same number of videos, each unique, and each roughly representative of some type of work (for instance, we should probably have some images of cats, some of famous landmarks, some portraits, some professional shots, some amateur shots, some sandy beaches, some food, some memes, and so on: as representative a sample of the types of images out there as possible).
  • About 1000-5000 images and the same number of videos of all sorts and forms, as random as possible, and known to be unique, meaning no two images or videos in the set should be the same (see the sketch after this list for a quick way to catch exact duplicates).
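For the larger set, a simple byte-level check is enough to weed out accidental exact duplicates before the files enter the test data; it will not catch recoded copies of the same work, only byte-identical files. A minimal sketch, not part of the test suite:

```python
# Flag byte-identical files in a candidate data set by comparing SHA-256
# digests. Recoded copies of the same work are NOT caught this way.
import hashlib
import sys
from pathlib import Path

def sha256(path, chunk=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

seen = {}
for path in Path(sys.argv[1]).rglob("*"):
    if path.is_file():
        digest = sha256(path)
        if digest in seen:
            print("duplicate:", path, "==", seen[digest])
        else:
            seen[digest] = path
```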

The better the data set, the more representative our statistics will be!
