Measuring & testing algorithm accuracy

Posted on Feb 17, 2016

One of Videorooter’s contributions to the video commons is developing algorithms that provide fingerprints of a video file. An API matches those fingerprints against videos that are openly licensed. These algorithms build on work that Commons Machinery did with fingerprinting of images in 2012-2014. This post gives the background of that work, which informs our development and choice of algorithms for video fingerprinting.

We can look at different types of metrics to determine what kind of algorithm we want to use. We’re primarily concerned with two: accuracy and false positives.

Accuracy

If you take an image, apply a transformation to it (like compressing it or changing its dimensions), and then compare it to the original using a matching algorithm, you’ll either get a match or you won’t, depending on whether the algorithm thinks the two are identical. If our algorithm were perfect, we’d get a match every time, since all the transformed images are exact copies with only small variations. In practice, no algorithm is perfect, and we may get matches for only 87% of them. We then say that such an algorithm has an accuracy of 87%, meaning that for any work, there’s an 87% chance that it will correctly match against the original.

To determine the accuracy during our development, we transform and compare images in 12 different ways (four recodings, each at three widths):

  1. Recode to PNG at three different widths (100, 200, 640 px)
  2. Recode to JPEG at three different widths (100, 200, 640 px)
  3. Recode to JPEG at low quality (50) at three different widths (100, 200, 640 px)
  4. Recode to PNG with an 8-bit color palette at three different widths (100, 200, 640 px)

We compare each transformed image with the original to determine the accuracy.
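To make this concrete, here is a minimal sketch of what such a test harness could look like in Python, using Pillow for the recoding. The `fingerprint` and `distance` functions are placeholders for whatever algorithm is under test; this is an illustration, not Videorooter’s actual test code.

```python
# Sketch of the accuracy test. `fingerprint` and `distance` are
# placeholders for the algorithm under test, not a real API.
import io
from PIL import Image

WIDTHS = [100, 200, 640]

def variants(original):
    """Yield the 12 transformed copies described in the list above."""
    for width in WIDTHS:
        height = round(original.height * width / original.width)
        scaled = original.resize((width, height))
        recodings = [
            ("PNG", {}),                # 1. plain PNG
            ("JPEG", {}),               # 2. default-quality JPEG
            ("JPEG", {"quality": 50}),  # 3. low-quality JPEG
        ]
        for fmt, kwargs in recodings:
            buf = io.BytesIO()
            scaled.convert("RGB").save(buf, format=fmt, **kwargs)
            buf.seek(0)
            yield Image.open(buf)
        # 4. PNG with an 8-bit color palette
        buf = io.BytesIO()
        scaled.convert("P", palette=Image.ADAPTIVE).save(buf, format="PNG")
        buf.seek(0)
        yield Image.open(buf)

def accuracy(originals, fingerprint, distance, threshold):
    """Fraction of transformed copies that match their original."""
    matches = total = 0
    for original in originals:
        reference = fingerprint(original)
        for copy in variants(original):
            total += 1
            if distance(reference, fingerprint(copy)) <= threshold:
                matches += 1
    return matches / total
```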

Most algorithms allow us to do a partial match as well. If 90% of a fingerprint overlaps with another fingerprint, we can assume that the two images might be the same. In fingerprinting, we call this the distance: a measure of how far apart two images are. For the Blockhash algorithm that Commons Machinery implemented, that distance is given as the number of bits (out of 256) that differ between two image fingerprints.

So rather than a plain match, you may get a distance of 56 bits, or 4 bits. In order to determine whether two images match, you need to decide what the threshold for matching should be. If we say that the threshold is 10, we assume that any two fingerprints with a distance of 10 or less are probably fingerprints of the same image.
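In code, the distance check is straightforward. Here is a small sketch of a Hamming-distance comparison between two hex-encoded 256-bit fingerprints; our own illustration, not the official Blockhash API:

```python
def hamming_distance(fp1: str, fp2: str) -> int:
    """Number of bits that differ between two hex-encoded fingerprints."""
    return bin(int(fp1, 16) ^ int(fp2, 16)).count("1")

def is_match(fp1: str, fp2: str, threshold: int = 10) -> bool:
    """Treat two fingerprints as the same image when they differ
    in no more than `threshold` of their 256 bits."""
    return hamming_distance(fp1, fp2) <= threshold
```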

Where we place the threshold significantly influences the accuracy we get from an algorithm. The larger the distance we allow, the more likely it is that the image we’re looking for will be matched; this is what we measure as accuracy. In the Blockhash algorithm specifically, for example, if we set the threshold to 4, we’d get an accuracy of 68.59%, but if we increase it to 10, we’d get an accuracy of 85.68%.

False positives

The downside of a large threshold is false positives. A false positive occurs when the algorithm claims that two images are identical when in practice they are not. One reason this can happen is that the image is composed in a way that doesn’t work well with the algorithm.

In the first versions of our algorithms, we had the biggest problem with images that had a significant contrast between the upper half and the lower half. That may not seem like a big problem, but in fact, a lot of images are exactly like this: a clear blue sky against the skyline of a city, the green forest beyond a snow covered field, and so on. Every time our algorithm came across an image like that, it would think that it matched against another image that was similar in style, if not in content.

We developed the algorithm further until that particular problem went away, but it demonstrates one type of false positive that can occur. Another type occurs when the two images are indeed very similar!

An example of this is someone sitting in a lecture hall and recording several different speakers. The only thing that changes between two recordings is the speaker. The setting looks the same, and very likely, the speaker will be small enough on the recording that a recording of a different speaker may result in a false positive.

In order to get a false-positive metric for images, we’ve put together a collection of about 4800 images, taken from OpenImages.eu and Wikimedia Commons. For each image, we calculate the fingerprint and then compare it against the fingerprint of every other image; in total, we compare about 3.7 million fingerprint pairs. For videos, we do the same, but with about 3800 videos from the same sources.
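In outline, the cross-compare could look like the following sketch, which reuses the `hamming_distance` helper from above. Since every work in the collection is unique, any match it reports is a false positive:

```python
from itertools import combinations

def false_positive_rate(fingerprints, threshold):
    """Compare every fingerprint against every other; the fraction
    of pairs that match is the false-positive rate, since all the
    source works are known to be unique."""
    false_positives = total = 0
    for fp1, fp2 in combinations(fingerprints, 2):
        total += 1
        if hamming_distance(fp1, fp2) <= threshold:
            false_positives += 1
    return false_positives / total
```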

We know that the works we retrieved are all unique, so if our algorithm were perfect, we’d get no matches in this cross-compare. As with the accuracy test, in practice we get quite a few fingerprint pairs that match even though they shouldn’t. These are the false positives.

We sum up all the false positives we get, which gives us a measure of the likelihood that two images which are different would wrongly be matched as the same. Here too, the threshold plays a major role. If we allow a threshold of 10, we may get 3.18% false positives. Lowering the threshold to 4 would get us 1.97% false positives.

Balancing false positives against false negatives

These two metrics work against each other: if you have high accuracy, you usually have a high number of false positives, and vice versa. Lowering the threshold to get fewer false positives leads to a decrease in accuracy. Any implementation using fingerprints for images needs to balance these two metrics against each other to determine which threshold is most suitable for the specific application.

If you want very high accuracy but don’t care too much about false positives, you’d select a higher threshold. If you cannot live with false positives but are okay with lower accuracy, you’d pick a very low threshold.

We’ve developed an automated system that we run every time we make a change to our algorithms. It runs through the same tests with every algorithm change and, for each algorithm and each threshold between 0 and 15, gives the false positives and accuracy.
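Conceptually, the system boils down to a sweep over thresholds. A simplified sketch, reusing the `accuracy` and `false_positive_rate` helpers from earlier in this post:

```python
def sweep(originals, fingerprints, fingerprint, distance):
    """Print accuracy and false positives for thresholds 0 through 15."""
    print("threshold  accuracy  false positives")
    for threshold in range(16):
        acc = accuracy(originals, fingerprint, distance, threshold)
        fpr = false_positive_rate(fingerprints, threshold)
        print(f"{threshold:9d}  {acc:8.2%}  {fpr:15.2%}")
```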

You may find it interesting to compare the first test, which is of the current stock version of Blockhash, with the second one, which is the same set run with pHash, a different hashing implementation. You’ll find the numbers roughly comparable, but there’s a distinct difference in a third metric that we also include (even if we often don’t spend much time on it): time.

Timing

Blockhash runs through the tests in 133 seconds, whereas pHash takes about 20 minutes. The rest of the tests on that page are different versions of the Blockhash algorithm. The most interesting one is the last one, which shows the accuracy and false positives for a 64 bit (8×8) fingerprint using the Blockhash algorithm.

As you can see, the false positives spiral completely out of control for any threshold that’s higher than 0, but what’s really interesting is that the accuracy, even with a threshold of 0, is 78.53%. There’s a particular reason this is interesting: when you store data in a database and need to search for a value that is *close to, but not the same* as another, this usually takes a very long time if you have a lot of data.

You can employ various techniques, like the one we implemented in the HmSearch project, to get the time down, but it’s never going to be as fast as searching for a unique value. So if you need an algorithm that is exceptionally fast when searching and gives a reasonable accuracy, you may want to go for a 64-bit Blockhash. You’ll have an 8-in-10 accuracy, and you’ll be able to store and search for unique values in your database without considering the threshold.
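To illustrate the point: with a threshold of 0, matching degenerates to exact equality, so a 64-bit fingerprint can be stored and looked up like any other unique value. A toy sketch; a real system would use a unique database index instead of a dict:

```python
index = {}  # hex fingerprint -> image identifier

def register(image_id, fp):
    """Store an image under its exact 64-bit fingerprint."""
    index[fp] = image_id

def lookup(fp):
    """Constant-time exact lookup: no threshold scan over the table."""
    return index.get(fp)
```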

As you can appreciate, choosing an algorithm isn’t easy. There are a lot of considerations to be made that depend on the actual problem you’re trying to solve. But with this background about accuracy, thresholds and false positives, we’re now ready to start with the real work for Videorooter: video algorithms.
