A quick look at YouTube’s ContentID

Posted on Feb 17, 2016 in Uncategorised

In working with Videorooter, we’ve taken a good look at the ContentID system used by YouTube to identify (potentially) infringing uses. The system has taken some criticism, not least when it identified the Creative Commons licensed movie Sintel by the Blender Foundation as infringing the copyright of Sony Pictures. Videorooter was envisioned in part to address the problem of clearly identifying videos that are openly licensed, and to provide a whitelist of such videos.

To do this, we’ve looked at how ContentID works in order to compare it with our own work in the area. While the actual algorithms of ContentID aren’t public, and likely change frequently, it’s possible to make some educated guesses based on how the system behaves when facing different types of content.

Materials in the Videorooter system are publicly available

First off, the ContentID system is designed to allow content owners to upload material to YouTube, which YouTube then process internally and add to their records. This material is then matched against videos uploaded to YouTube. The uploads to the ContentID system are not made public.

This in itself is different from the way that Videorooter and Elog.io work. Our databases only hold a fingerprint of the videos and images we index: we don’t store the entire video or image. The advantage is that our index is very small and lightweight, but the disadvantage is that we cannot re-calculate fingerprints very often. Nor can we use any algorithms that work against the video files directly; we must make use of intermediate fingerprints.
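To make the idea of a fingerprint-only index concrete, here is a minimal sketch in Python. It is not the algorithm Videorooter or Elog.io actually use: the average-hash fingerprint, the `FingerprintIndex` class, the `max_distance` threshold and the `sintel-frame` identifier are all assumptions made for illustration. It only shows that matching can work with nothing but a small integer stored per work.

```python
from typing import Dict, List, Tuple


def average_hash(pixels: List[List[int]]) -> int:
    """Toy fingerprint (an assumption, not Videorooter's real one):
    one bit per pixel, set when the pixel is brighter than the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of bits that differ between two fingerprints."""
    return bin(a ^ b).count("1")


class FingerprintIndex:
    """Hypothetical index that keeps fingerprints only, never the media."""

    def __init__(self) -> None:
        self._entries: Dict[str, int] = {}

    def add(self, work_id: str, pixels: List[List[int]]) -> None:
        # The pixel data is discarded once the fingerprint is computed.
        self._entries[work_id] = average_hash(pixels)

    def lookup(self, pixels: List[List[int]], max_distance: int = 2) -> List[Tuple[str, int]]:
        """Return (work_id, distance) pairs within max_distance, closest first."""
        query = average_hash(pixels)
        hits = [(wid, hamming(query, fp)) for wid, fp in self._entries.items()]
        return sorted((h for h in hits if h[1] <= max_distance), key=lambda h: h[1])


# A 4x4 greyscale "frame", a uniformly brighter copy, and an unrelated frame.
original = [[10, 200, 10, 200], [200, 10, 200, 10]] * 2
brighter = [[p + 20 for p in row] for row in original]
unrelated = [[200, 200, 10, 10]] * 4

index = FingerprintIndex()
index.add("sintel-frame", original)
```

With this toy index, `index.lookup(brighter)` still finds the stored work, while `index.lookup(unrelated)` returns nothing. The trade-off described above is visible too: because only the integer is kept, switching to a different fingerprint would require going back to the original files.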

ContentID algorithm changes relatively often

As YouTube store the content uploaded to them through the ContentID system, it’s fair to assume that they make changes relatively often to the algorithms they use. This means that some of our findings, and some of the findings of others who’ve looked at ContentID, may no longer be relevant if the algorithm has changed.

What does not change, however, is what happens when the ContentID system identifies a match with another video. In principle, the identified “owner” of the content is then given a range of options for how to deal with this, including the slightly sneaky and devious option of allowing the video to remain but redirecting all monetisation (revenue from ads played and similar) to the content owner instead of the user who uploaded it. There’s also an option of whether to allow the “infringement” or to block it completely.

Infringement or fair use

It’s important to keep in mind that an “infringement” could in fact be fair use, as was the case when Larry Lessig’s video Open got blocked by Australian record label Liberation Music for including parts of the song Lisztomania. Larry included those parts in the video as a demonstration of what constitutes fair use, and eventually settled with Liberation Music in a way that recognised it as such.

On the ContentID algorithm itself, there are naturally a lot of rumours going around that may or may not have anything to do with the actual algorithm in use, but it seems clear that the algorithm (or, most likely, algorithms) considers not only the content of the video but also the metadata: descriptions, titles, tags, etc. There are reports of users being subject to ContentID claims only after adding descriptions or tags that identify the music or video they’re claiming to use under fair use.

In 2009, Scott Smitelli did an analysis of the ContentID algorithm by uploading several videos which he had altered in different ways to figure out which ones got caught by the ContentID algorithm. His findings were that playing the song in reverse, altering pitch or time, or resampling to speed it up or slow it down made the ContentID algorithm skip over his upload, unless the change was very minor.

Testing YouTube’s algorithm

Introducing noise to cover the song track worked only as long as the volume of the noise roughly surpassed the volume of the original audio. Reducing volume had no effect at all, but perhaps the most telling of his findings was that if he only took a chunk of the song, a 5-120 second long chunk that did not include the beginning of the song, this was not caught by the algorithm in place.

This is particularly interesting as it hints that the ContentID algorithm heavily depends, or depended, on capturing the beginning of an audio track, and that the track needed to be longer than 15 seconds. This matches well with our experience from similar algorithms for images.
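As a rough illustration of why a start-anchored fingerprint would behave this way, here is a toy sketch in Python. It is emphatically not ContentID's algorithm, which is not public: the `prefix_fingerprint` function, the window size and the sample values are all made up. The fingerprint only looks at whether the energy rises or falls between the first few windows of a track, which makes it insensitive to volume but blind to anything that does not start at the beginning.

```python
from typing import List, Sequence


def window_energies(samples: Sequence[float], window: int = 4) -> List[float]:
    """Sum of absolute sample values over consecutive, non-overlapping windows."""
    return [
        sum(abs(s) for s in samples[i:i + window])
        for i in range(0, len(samples) - window + 1, window)
    ]


def prefix_fingerprint(samples: Sequence[float], window: int = 4, bits: int = 4) -> List[int]:
    """Toy fingerprint (hypothetical): for the first few windows only,
    record 1 if the energy rises to the next window and 0 otherwise."""
    energies = window_energies(samples, window)[: bits + 1]
    return [1 if later > earlier else 0 for earlier, later in zip(energies, energies[1:])]


# Made-up "audio": eight windows of four samples each.
track = [1] * 4 + [5] * 4 + [2] * 4 + [9] * 4 + [3] * 4 + [4] * 4 + [8] * 4 + [1] * 4

quieter = [s * 0.5 for s in track]   # same track at half volume
excerpt = track[4:]                  # chunk that skips the beginning
backwards = list(reversed(track))    # track played in reverse

reference = prefix_fingerprint(track)
```

In this toy setup the quieter copy produces the same fingerprint as the reference, while the excerpt and the reversed track do not, which mirrors the pattern Smitelli reported. A real system would almost certainly fingerprint many overlapping segments rather than only the opening.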

From everything that’s been written about ContentID, it’s clear that any diverging views on how it works are due to the ever-changing nature of the algorithms themselves. Ultimately, YouTube seem to employ algorithms that are focused on capturing as many derivative works as possible, even at the expense of making mistakes along the way, and rely on individuals working for the copyright holders to make the ultimate determination of whether a work infringes their copyright or not.

In this way, ContentID works similarly to Videorooter: whatever algorithms are employed, they can only ever be so good. The algorithms can give you the potential matches as accurately as we can make them, but ultimately, it’s down to the people using Videorooter to make the final determination of whether two matched works are the same or not.
