Building on our previous experience with fingerprinting images, combined with what we learned when looking at ContentID, we’ve developed the first rough algorithm for video fingerprinting for use in the Videorooter project.
The algorithm we’ve settled on as our first one provides the baseline for any future work: it’s something to get us started, and developments to improve on that algorithm will be compared against the baseline we now have.
While there are many ways in which an algorithm for video fingerprinting could be envisioned, we’ve selected the simplest implementation for our baseline tests, in the hope that it can then be improved on.
How it works
Here’s what the algorithm does: it picks four key frames (I-frames) from a video stream, and for each frame, it calculates a 64-bit fingerprint based on our existing image fingerprint code, then concatenates the four fingerprints together to form a 256-bit fingerprint for the whole video.
The frames are taken from the beginning, the middle, and the end of the video file, on the assumption that if the beginning, middle, and end correspond between two different files, then we’re likely talking about the same video.
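The concatenation step can be sketched as follows. This is an illustrative sketch, not our actual implementation: it assumes the four 64-bit frame hashes have already been computed by the image fingerprint code, and simply packs them into one 256-bit value.

```python
def video_fingerprint(frame_hashes):
    """Concatenate four 64-bit frame hashes into one 256-bit integer.

    `frame_hashes` holds the hashes of the key frames picked from the
    beginning, middle and end of the stream (hypothetical input; in
    practice these would come from the image fingerprinting code).
    """
    assert len(frame_hashes) == 4
    fp = 0
    for h in frame_hashes:
        # Shift the accumulated bits left and append the next 64-bit hash.
        fp = (fp << 64) | (h & 0xFFFFFFFFFFFFFFFF)
    return fp  # 256-bit fingerprint as a Python int
```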
Where it breaks
You’ll note, of course, that this does not cover audio, one of the current shortcomings of this algorithm, and something which does affect the false positives. One of the problems we’ve identified is with President Obama’s weekly address. Each of them is on Wikimedia Commons, and each one is very similar to any other. The first frames look identical (only the date changes), the last frames are identical, and for the middle frames of many videos, you see President Obama in the same shirt and tie, in the same room, in roughly the same position.
No wonder our algorithm gets confused and thinks they are all the same! This will likely continue to be a problem for as long as we don’t consider audio, so any alteration to the algorithm that takes audio into account would be a very useful and interesting future development.
How we evaluate it
In order to test the accuracy of the algorithm, we’ve taken 100 videos from Wikimedia Commons, which we then transformed in eight different ways:
- mpeg1video, yuv420p(tv), 352×240 [SAR 1:1 DAR 22:15], 1150 kb/s, 29.97 fps, 29.97 tbr, 90k tbn, 29.97 tbc (bitrate: 1391 kb/s)
- vp8, yuv420p, 400×300, SAR 1:1 DAR 4:3, 29.97 fps, 29.97 tbr, 1k tbn, 1k tbc (bitrate: 1056 kb/s)
- vp8, yuv420p, 400×300, SAR 1:1 DAR 4:3, 29.97 fps, 29.97 tbr, 1k tbn, 1k tbc (bitrate: 159 kb/s)
- h264 (High), yuv420p, 400×300, SAR 1:1 DAR 4:3, 29.97 fps, 29.97 tbr, 1k tbn, 59.94 tbc (bitrate: 645 kb/s)
- mpeg2video (Main), yuv420p(tv), 480×480 [SAR 1:1 DAR 1:1], max. 2516 kb/s, 23.98 fps, 23.98 tbr, 90k tbn, 47.95 tbc (bitrate: 558 kb/s)
- mpeg1video, yuv420p(tv), 160×120 [SAR 1:1 DAR 4:3], 104857 kb/s, 29.97 fps, 29.97 tbr, 90k tbn, 29.97 tbc (bitrate: 242 kb/s)
- mpeg1video, yuv420p(tv), 320×240 [SAR 1:1 DAR 4:3], 104857 kb/s, 29.97 fps, 29.97 tbr, 90k tbn, 29.97 tbc (bitrate: 243 kb/s)
- h264 (Constrained Baseline) (H264 / 0x34363248), yuv420p, 480×360, 249 kb/s, 29.97 fps, 29.97 tbr, 29.97 tbn, 59.94 tbc (bitrate: 360 kb/s)
We then compare each transformation against the original to get the distance. Something we’ve learned in this, which is different from our work with images, is that there are very many ways in which you can encode a video, and decoding a video is sometimes difficult and depends very much on what software you use for decoding.
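The distance between two fingerprints is the usual Hamming distance, i.e. the number of bits in which the two 256-bit values differ. A minimal sketch:

```python
def hamming_distance(fp_a, fp_b):
    """Number of differing bits between two fingerprints.

    Works on the 256-bit fingerprints represented as Python ints:
    XOR leaves a 1 wherever the two fingerprints disagree, and we
    count those bits.
    """
    return bin(fp_a ^ fp_b).count("1")
```

Two videos are then considered the same when this distance falls below a chosen threshold.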
Just to illustrate the complexity; video files are usually organised into a number of different types of frames. You have key frames (I) and incremental frames (P and B). If you get an I frame in a video stream, you can decode that image from only the information in the I frame. If, however, you get a P frame, that frame only includes the incremental changes from the previous frame.
This makes P and B frames much smaller and easier to compress, but they depend upon other data. If you read a video stream sequentially, you may expect to find an I frame to start with, then some P frames, then an I frame again, and so on. Except of course when you don’t.
It turns out that formats such as H.264 can have an almost arbitrary relation between the order of frames in the file and the order in which they should be displayed. How this looks in the encoded file depends on which encoder was used. What it means in practice for us is that, in general, we end up with thresholds which must be much higher than for images to get reasonable accuracy, and some video files simply cannot be decoded.
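To illustrate the reordering problem with a toy example (the frame records below are entirely made up): with B-frames, the order in which frames are stored and decoded differs from the order in which they are presented, so before picking "the frame in the middle," the frames have to be re-sorted by their presentation timestamp (PTS).

```python
# Hypothetical frame records as (decode_index, pts, frame_type).
# The P-frame is stored and decoded before the two B-frames that
# reference it, even though it is displayed after them.
decode_order = [
    (0, 0, "I"),
    (1, 3, "P"),
    (2, 1, "B"),
    (3, 2, "B"),
]

# Re-sort into presentation order by PTS before picking key frames
# by position in the video.
display_order = sorted(decode_order, key=lambda frame: frame[1])
```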
As with images, we’ve written a test suite which runs through the tests automatically for different versions of our algorithm, the results of which you can see on our statistics page. We currently have two different versions: one implemented with our original blockhash and one implemented with pHash.
The data for accuracy is calculated on 100 different videos, each encoded in 8 different ways, and the data for false positives is based on the crosswise comparison of about 4,000 video files (about 92 GB) from OpenImages and Wikimedia Commons.
What’s interesting about this baseline is that while the two algorithms perform comparably on images, they give very different results on videos, clearly in favor of pHash, which produces a very low rate of false positives while retaining reasonable accuracy.
The current baseline seems to hint at a threshold of 50 bits, which gives a false positive rate of 2.12% and an accuracy of 86.79% using pHash. For smaller thresholds, Blockhash does give a higher accuracy. At the moment, our baseline algorithm is the blockhash version. The pHash improvement seems useful, but we’ve not yet made the switch throughout.
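The accuracy and false positive figures above can be derived from a set of labeled distance measurements. A minimal sketch of that evaluation, assuming each comparison is recorded as a (distance, same_video) pair, with accuracy as the fraction of matching pairs at or below the threshold and the false positive rate as the fraction of non-matching pairs at or below it:

```python
def evaluate(pairs, threshold):
    """Compute (accuracy, false_positive_rate) at a given bit threshold.

    `pairs` is a list of (distance, same_video) tuples: distance is the
    Hamming distance between two fingerprints, same_video is True when
    the pair is a transformation of the same original video.
    """
    true_dists = [d for d, same in pairs if same]
    false_dists = [d for d, same in pairs if not same]
    accuracy = sum(d <= threshold for d in true_dists) / len(true_dists)
    fpr = sum(d <= threshold for d in false_dists) / len(false_dists)
    return accuracy, fpr
```

Sweeping the threshold over such data is what produces the trade-off curve from which a value like 50 bits is chosen.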
Where to from now?
Do you have ideas for how this algorithm could be improved? We’d love to hear from you, and if you change our code, please get in touch with a link to your repository so we can include it in the same set of tests for comparison! We’re working on a feature which will take any forks from our repository and automatically run our test cases against them, to provide immediate (or close to immediate!) feedback on how any change of algorithm affects the statistics.