Acoustid

After asking for fingerprint test data in my last post about Chromaprint I received about 270k fingerprints from. This helped me to run some larger tests on the proof-of-concept lookup server I had implemented using Java and PostgreSQL. I had to make some technical changes, but the main idea seemed to work pretty well. I wasn't sure about this when I was working on Chromaprint, but now I believe that I can realistically run an audio recognition service.

So, let me introduce Acoustid (named by herojoker), the server-side component for audio fingerprint lookups. As I mentioned before, it uses PostgreSQL's GIN indexes with the intarray extension to do the hard work. Java is used as a glue language that provides a web-based service for interacting with the database. I originally considered using C++ for this (using the POCO libraries), but there are significantly more tools in Java for an application like this, so even though I don't particularly like Java, it seemed like the best language for the job.

Chromaprint fingerprints are basically timed sequences of 32-bit numbers ("subfingerprints"). They are extracted from the audio stream, using a technique which is not important here, every 0.1238 seconds, and each one covers about 1.9814 seconds of the following audio data. This means that to represent a minute of audio, you need around 460 such numbers. For two audio files representing the same song in different file formats, these numbers should be very similar in terms of bit errors. Acoustid's job is to take these numbers and find the fingerprint in the database that has the fewest bit errors compared to the query. Theoretically, this could be used to identify songs based on a few seconds of the original audio, but that would require indexing whole fingerprints, because we don't know which part of the original fingerprint the query matches. This requires a lot of RAM to perform efficiently, so I decided to collect the data that is necessary to perform searches like this, but not implement the feature for now.
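To make "similar in terms of bit errors" a bit more concrete, here is a minimal Python sketch that counts differing bits between two fingerprints, position by position. The function names are made up for illustration and are not part of Chromaprint or Acoustid, and the sketch assumes the two fingerprints are already aligned.

```python
def bit_errors(a, b):
    """Number of differing bits between two 32-bit subfingerprints."""
    return bin((a ^ b) & 0xFFFFFFFF).count("1")


def total_bit_errors(query, candidate):
    """Sum of bit errors over the overlapping part of two fingerprints.

    Assumes both sequences start at the same point in the audio;
    zip() stops at the end of the shorter one.
    """
    return sum(bit_errors(q, c) for q, c in zip(query, candidate))


def bit_error_rate(query, candidate):
    """Fraction of differing bits over the overlapping part (0.0 to 1.0)."""
    n = min(len(query), len(candidate))
    if n == 0:
        raise ValueError("fingerprints must not be empty")
    return total_bit_errors(query, candidate) / (32 * n)
```

Two encodings of the same song should give a rate close to 0, while unrelated audio should land somewhere near 0.5, since unrelated bits agree about half of the time.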

Instead, Acoustid's primary task is to identify audio files, and for this purpose it can assume that the query is extracted from the beginning of the audio. It indexes only about 15 seconds of audio (currently from 0:25 to 0:40) and uses this index to search for possible matches. Currently it assumes that at least one number from that range matches exactly. This can yield some false positive matches, so the next step is to compare the full query to the full fingerprint. As the fingerprints might not be ideally aligned, it first finds the best alignment and then calculates a score as the number of subfingerprints with fewer than 2 bit errors. This matching and scoring is done completely on the database server, using the intarray extension and a custom C extension for calculating the score.
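The two steps above can be sketched in Python roughly like this. This is not the actual implementation, which lives in the database and a C extension. The brute-force search over offsets, the max_shift parameter and all function names are assumptions made for the example; only the "at least one exact match" filter and the "fewer than 2 bit errors" rule come from the description above.

```python
def popcount32(x):
    """Number of set bits in a 32-bit value."""
    return bin(x & 0xFFFFFFFF).count("1")


def shares_indexed_value(query_window, candidate_window):
    """Candidate filter: at least one subfingerprint matches exactly."""
    return not set(query_window).isdisjoint(candidate_window)


def score_at_offset(query, candidate, offset, max_bit_errors=1):
    """Count positions with fewer than 2 bit errors at a given alignment."""
    score = 0
    for i, q in enumerate(query):
        j = i + offset
        if 0 <= j < len(candidate) and popcount32(q ^ candidate[j]) <= max_bit_errors:
            score += 1
    return score


def best_alignment(query, candidate, max_shift=80):
    """Try every offset in [-max_shift, max_shift]; return (offset, score).

    max_shift is an assumed bound, chosen only for illustration.
    """
    best_offset, best_score = 0, score_at_offset(query, candidate, 0)
    for offset in range(-max_shift, max_shift + 1):
        score = score_at_offset(query, candidate, offset)
        if score > best_score:
            best_offset, best_score = offset, score
    return best_offset, best_score
```

A query that starts two subfingerprints later than the stored fingerprint would come back with an offset of 2 and a score close to the query length. A false positive from the index lookup would score near zero at every offset.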

This all seems to work fine from a technical point of view. What I'm struggling with is the API for this: how to pass the fingerprint data efficiently, how to authenticate users for submissions, and whether to have the concept of users at all. The main problems currently are:

The fingerprints are quite large and it would be ideal to send them as binary data. This conflicts with most existing standards for web services. I've implemented a compression algorithm that makes the size of a base64-encoded fingerprint similar to the size of an uncompressed binary fingerprint, but I'd still like to not waste the bandwidth if I can avoid it. I've considered bencode-formatted requests, but that is probably too non-standard. I've also considered accepting "Content-Encoding: gzip" in HTTP requests, which everybody seems to be using only for responses, but it seems usable for requests as well.
I'd like to track who submitted which fingerprints. I don't need any private data, but I'd like to be able to delete all fingerprints from one source if I find out that there is something wrong with the data. Maybe I'm paranoid, but I'm afraid that some people might try to submit wrong data in order to reduce the usefulness of the database.
If I want to track users, where should the user accounts come from? I definitely don't want people to have to register and I'd like the service to be integrated with the MusicBrainz database, so accepting MB usernames/passwords would be one option. Another option would be to accept any OpenID account and use something like OAuth for API authentication. I guess some people would prefer to not use their Google/Yahoo/etc. OpenID accounts, so perhaps having MB as an OpenID provider would help?
There is also the question of how to authenticate the users. If I accept MB usernames/passwords I could use standard HTTP Digest authentication, but normally that means that applications would have to send most authenticated HTTP requests twice. Only submissions would have to be authenticated, and these will have a lot of data in the HTTP body, so it would be a waste to send them twice. Other options are OAuth or a custom token-based method.
How to handle metadata? Should it rely only on MusicBrainz identifiers? That would make it less usable for taggers that prefer other metadata databases. Should it allow submissions that are not in MusicBrainz?
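To illustrate the size problem from the first point, here is a small Python sketch that packs a fingerprint as raw binary, compares that with its base64 form, and builds (without sending) a POST request with a gzip-compressed body. The URL, the little-endian byte order and the function names are assumptions made for the example, not the real Acoustid API, and gzip here is plain deflate, not the custom compression algorithm mentioned above.

```python
import base64
import gzip
import struct
import urllib.request


def pack_fingerprint(subfingerprints):
    """Pack 32-bit subfingerprints as little-endian unsigned integers.

    The byte order is an assumption for this sketch.
    """
    return struct.pack("<%dI" % len(subfingerprints), *subfingerprints)


def build_submission(url, subfingerprints):
    """Build a POST request whose body is the gzip-compressed binary fingerprint.

    The request is only constructed here, never sent.
    """
    body = gzip.compress(pack_fingerprint(subfingerprints))
    headers = {
        "Content-Type": "application/octet-stream",
        "Content-Encoding": "gzip",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


if __name__ == "__main__":
    fingerprint = [0x12345678, 0x9ABCDEF0, 0x0BADF00D]
    raw = pack_fingerprint(fingerprint)
    print("binary bytes:", len(raw))
    print("base64 bytes:", len(base64.b64encode(raw)))
    # Hypothetical endpoint, used only to construct the request object.
    request = build_submission("http://localhost:8080/submit", fingerprint)
    print("gzip body bytes:", len(request.data))
```

Base64 produces 4 output bytes for every 3 input bytes, so a plain base64 fingerprint is about a third larger than the binary one. That overhead is what the compression step has to win back.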

Regarding licensing, the source code is distributed under the GPLv3 license. I've considered more permissive licenses, but while there isn't much code yet, I'd like to be sure that any improvements stay open source. The service is not yet live, so perhaps it's too soon to speak about data licensing, but I intend to use something like CC BY-SA for the data (including database dumps), while not allowing commercial usage of the free service. If a commercial application becomes interested in the service in the future, it should either run its own mirror or help pay the hosting costs so that the service can stay free for free applications.

Anyway, the source code is on Launchpad and there is a mailing list on Google Groups. There is also a wiki, which will eventually contain documentation for the project. If you would like to help me with finishing this project or are just interested in its future, please join the mailing list.

Lukáš Lalinský
