How does Chromaprint work?

I’ve been meaning to write this post for a long time, but never really finished it. I hope it will help people understand how does the Chromaprint algorithm work, where do the individual ideas come from and what do the fingerprints really represent. It’s not meant to be a detailed description, just the basics to get the general idea.

Being primarily based on the Computer Vision for Music Identification paper, images play an important role in the algorithm. When people “see” audio, they usually see it as waveforms:

This is what most applications display, but it’s not really useful for analysis. A more useful representation is the spectrogram, which shows how does the intensity on specific frequencies changes over time:

You can get this kind of image by splitting the original audio into many overlapping frames and applying the Fourier transform on them (“Short-time Fourier transform”). In the case of Chromaprint, the input audio is converted to the sampling rate 11025 Hz and the frame size is 4096 (0.371 s) with 2/3 overlap.

Many fingerprinting algorithms work with this kind of audio representation. Some are comparing differences across time and frequency, some are looking for peaks in the image, etc.

Chromaprint processes the information further by transforming frequencies into musical notes. We are only interested in notes, not octaves, so the result has 12 bins, one for each note. This information is called “chroma features”. (I believe they were mentioned in the paper Audio Thumbnailing of Popular Music Using Chroma-Based Representations for the first time.) After some filtering and normalization, the final image looks like this:

Now we have a representation of the audio that is pretty robust to changes caused by lossy codecs or similar things and also it isn’t very hard to compare such images to check how “similar” they are, but if we want to search for them in a database, we need a more compact form. The idea how to do it again comes from the Computer Vision for Music Identification paper with some modifications based on the Pairwise Boosted Audio Fingerprint paper. You can imagine having a 16×12 pixel large window and moving it over the image from the left to the right, one pixel at a time. This will generate a lot of small subimages. On each of them we apply a pre-defined set of 16 filters that capture intensity differences across musical notes and time. What the filters do is they calculate the sum of specific areas of the grayscale subimage and then compare the two sums. There are six possible ways to arrange the areas:

You can basically take any of the six filter images, place it anywhere on the subimage and also make it as large as you want (as long as it fits the 16×12 pixel subimage). Then you calculate the sum of the black and white areas and subtract them. The result is a single real number. Every filter has three coefficients associated with it, that say how to quantize the real number, so that the final result is an integer between 0 and 3. These filters and coefficients were selected by a machine learning algorithm on a training data set of audio files during the development of the library.

There is 16 filters and each can produce an integer that can be encoded into 2 bits (using the Gray code), so if you combine all the results, you get a 32-bit integer. If you do this for every subimage generated by the sliding window, you get the full audio fingerprint. The simplest way to compare such fingerprints is to calculate bit error rates. I took the last UNKLE album and generated a few fingerprints for the original FLAC files as well as low quality 32 kbps MP3 versions and differences between them:


Heaven FLAC


Heaven 32kbps MP3


Differences between Heaven FLAC and Heaven 32kbps MP3


Under The Ice FLAC


Under The Ice 32kbps MP3


Differences between Under The Ice FLAC and Under The Ice 32kbps MP3


Differences between Heaven FLAC and Under The Ice FLAC

You can see that there is very few differences when comparing FLAC and MP3 of the same track, but comparing two different tracks will generate a lot of noise. It’s not perfect and some parts could be done differently to improve the rates, but overall I’m pretty happy with the results. I especially like how little conflicts does the algorithm generate, even on the individual subfingerprints.

Update:

Examples images for some instrumental versions of songs:


Differences between instrumental and normal versions of Britney Spears – Toxic


Differences between instrumental and normal versions of Coldplay – Viva la Vida

This entry was posted in Acoustid and tagged , , , . Bookmark the permalink.

28 Responses to How does Chromaprint work?

  1. Brian Schweitzer says:

    Thanks, that made it quite understandable. :) I’m curious, though; what would those same types of ‘bin images’ look like if you were to compare two different versions of the same song – a kareoke vs ‘normal’ version for instance? That’s where I’ve seen other fingerprinting scheme break down… Here, the notes would presumably be the same, so would there be any expectation of better performance than other fingerprint methods, such as last.fm’s or PUIDs?

  2. I have not specifically tested karaoke/normal versions, I’ll give it a try. I guess the results depend on how much voice is there.

  3. Added a couple of examples. Britney Spears apparently doesn’t add much to the result, so there are still very few differences. :) The second example looks much more like I expected it to look. The non-vocal parts that should be identical are still quite noisy though, but that’s probably because of an extra violin track in the background of the instrumental version.

  4. Brian Schweitzer says:

    Thanks, that looks promising. I’m thinking the worst/toughest case will be “no percussion”/”no drums” ( http://musicbrainz.org/search/textsearch.html?type=track&query=no+percussion&handlearguments=1 ) mixes of production music, vs the standard “full” mixes.

  5. KeithCu says:

    Does it work if you speed up or slow down the music like DJs often do?

  6. No, the algorithm was designed to not match tracks like that.

    The goal of the library is to match audio files to the MusicBrainz database. A DJ-
    modified version should have a different entry in the database, which means the fingerprint algorithm should see the as different too.

  7. Mike Haller says:

    Has it been tested with “non-music” audio like speech? I don’t mean speech recognition (translating speech to text). But audio pattern recognition using say 2 recordings of the same speech. We’d like to use a sample from one recording to find the same place on a different recording of the same speech.

  8. I’ve haven’t tested it in a situation like that, but I don’t expect it to work well.

  9. Matt says:

    Does the algorithm require you send the fingerprint for the entire song or can you send, say, a 10-15 second sample?

  10. Berker says:

    Hi,

    How can I find a small audio segment in a streaming (long) audio with chromaprint ?

    The code generated as FINGERPRINT (separated by comma) do not exist in the fingerprint code of the long audio file although that small audio segments exists at some part of the long audio.

    I am using fpwav.py script to generate FINGERPRINT codes.

    Regards

  11. Here is a simple Python script that can do that https://gist.github.com/1099996 (you will probably have to change the way fingerprints are loaded).

    Because of how the fingerprint algorithm works, this will only match audio that is really the same. If you have for example a filtered track in a DJ-mix, it will not match it. On the other hand, it works fine for finding songs in online radio streams.

    The basic idea is that there are at least some items that are common to both fingerprints. You can see that I’m stripping the fingerprints to 20 bits in the script to increase the chance. Once you find these common items, you locate them in both fingerprints and check at what offset of the long stream can the short snippet be. Then you calculate a histogram and take the most common offsets. You compare the fingerprints at these offsets and take the best matches. This tells you where in the long stream is the snippet located and how good is the match.

  12. Berker says:

    Hi,

    Thanks for the script. I am trying to match a snippet which is located in a long stream as the same, no DJ-mix, etc..

    Which form of the fingerprints does the script take ? fpwav.py outputs both encoded fingerprint and fingerprint.

    What I did up to now was,
    First, I put the integers , output of: chromaprint.decode_fingerprint(fp)[0] in fpwav.py, into lists named full and short. Then, joy_full.txt contains the fingerprint output of long audio and joy_short.txt contains the fingerprint output of the snippet. When I run the locate audio snippet script there was no output as indicating the position of snippet.

    Regards,

    Berker

  13. Berker says:

    Hi again,

    I finally got an output that indicates the position of snippet.
    My long audio file is about 20 min. length. When I use fpwav.py without any arguments, it didn’t work. I know that the snippet is about 2.51 min. so I give -l 300 when using fpwav.py on my long audio file. Then I use those fingerprints and fingerprints of the snippet with locate audio snippet script and it worked.
    is there a time limit in using fpwav.py ?

    Regards,

    Berker

  14. Berker says:

    Hi again,

    I figure out that in fpwav.py default length is 30, everything is working well now.

    Thanks for your work on chromaprint.

    Best Regards,

    Berker

  15. I’m glad it works for you. Let me know if you have any questions/problems with it.

  16. evilive says:

    Hi Lucas. Thank you for your great job, it helps me alot.
    But I have a question about the time that one fingerprint lasts. There is a comment in “fp-find.py” says that one fingerprint represent 0.1238 seconds. But I made few measures and noticed that it depend on how long track is.

    For example:
    10.0 sec = 60 fingerprints ( 0.167 sec/fp)
    60.0 sec = 464 fingerprints ( 0.129 sec/fp)
    126.3674 sec = 1000 fingerprints ( 0.126 sec/fp)
    1240.744 = 10000 fingerprints (0.124 sec/fp)

    Why does it happen?

  17. There are some constant values you have to subtract from the length before you can divide it by the magic value 0.1238. This is the exact formula:

    FFT_FRAME_SIZE = 4096
    SAMPLE_RATE = 11025
    OVERLAP = 3
    STEP_LENGTH = FFT_FRAME_SIZE / SAMPLE_RATE / OVERLAP
    WINDOW_SIZE = 16
    FILTER_SIZE = 5
    FP_SIZE = (LENGTH - STEP_LENGTH * (WINDOW_SIZE + FILTER_SIZE)) / STEP_LENGTH
    

    Also note that some rounding can happen on many places in the code, so the result should be considered within the +/- 1 range.

  18. jesus2099 says:

    If your Britney Spear example is a karaoke, could you label your pictures karaoke (just cut off vocal track) instead of the misleading instrumental (with instrumental lead melody added) ?

  19. Rickie says:

    Hi there

    I just came across this website.

    My question is: Can I use Acoustid/Chromaprint to basically compare two pieces of sound (very short pieces of sound 1-3 seconds only) ? Let’s say I have a reference sound and I want to check if another sound does match the reference.

    I actually just would need to compare the two generated fingerprints of the sounds, correct? Is there a possibility to not get black or white as comparing result output but a matching percentage (i.e. input sound matches reference sound to 80%)?

    If my thought are wrong do you know any better tool to do so?

    P.S: Performance is important.

    Thanks,
    Rickie

  20. The default configuration of Chromaprint is optimized for much longer audio samples. 3 seconds is just about enough to fill the internal buffers, so it will not generate much interesting data. Changing the configuration is not easy, as it requires you to run a machine learning process to generate new coefficients for the new settings.

    I guess the classic PRH algorithm would work good enough for this. You can find an implementation of that in pHash. I haven’t actually used the library, but I expect that it will be able to give you the fingerprint which is just an array of integers and you are supposed to compare it for bit differences.

  21. JWDC says:

    How sensitive is this algorithm to noise? Would it match if I compared a song from a radio station to a DB populated with fingerprints from CDs?

  22. To be honest, I don’t really know. It was designed for exact matches, with lossy encoding artefacts as the only noise. I might work in your scenario, but I haven’t tested it.

  23. Pingback: Inside the Acoustid server | Lukáš Lalinský

  24. Euna Syrstad says:

    Hello, I just hopped onto your web web site thru StumbleUpon. Certainly not somthing I’d usually surf, but I enjoyed your thoughts it’s unlikely that any the less. Thank you for making something worthy of browsing.

  25. Denis Mikhalkin says:

    Hello Lukas,

    thanks for the wonderful explanation of the algorithm – it’s very clear and easy to understand. I am investigating whether it is possible to implement automatic identification of bird songs from a recording against a DB of known birds.

    Judging by the description, and by your feedback to other people, I figure that this algorithm, at least in its current form, may not be suitable for bird songs because of the differences amplitude, frequencies and durations of a song between two different birds (the same specie).

    I am wondering though whether there is a modification of the algorithm that might work on bird songs, or whether you are aware of a different approach that might be applicable. Would appreciate any pointers – from general ideas to references to publications.

    Thanks.

    Denis

  26. Chef says:

    Hello, I was wondering if I can use this to identify songs from a online stream server (icecast) then in turn have them display on a website. The idea is that the listener would have to call to find out what a song is ….Like shazzam but for audio stream for a fm radio station. Any help on creating this would be much appreciated.

  27. Sorry, no, this isn’t possible yet. You could use the fingerprinting algorithm to implement a service like that, but Acoustid currently can’t do it. It is something I want to implement in the future though.

  28. Tomas Koctur says:

    Ahoj Lukas,

    som student Technickej Univerzity v Kosiciach a pisem bakalarsku pracu na temu Vyhladavanie informacie v audiodokumentoch alebo aj CBID. Moj veduci je doc. Jozef Juhar Csc. mozno ho aj z pocutia poznate on sa viac zameriava na rozpoznavanie reci. Ale aby som presiel k jadru mojej otazky. V BP sa chcem poukazat na musicbrainz.org a softwary ktore ste napisali. Jedna sa mi o detaily prace fingerprintov a system vyhladavania v databaze. Za odpoved vopred dakujem.

    S pozdravom

    Tomas Koctur
    Radio 9

Leave a Reply

Your email address will not be published. Required fields are marked *

*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>