Acoustid

After asking for fingerprint test data in my last post about Chromaprint, I received about 270k fingerprints. This helped me run some larger tests on the proof-of-concept lookup server I had implemented using Java and PostgreSQL. I had to make some technical changes, but the main idea seemed to work pretty well. I wasn’t sure about this when I was working on Chromaprint, but now I believe that I can realistically run an audio recognition service.

So, let me introduce Acoustid (named by herojoker), the server-side component for audio fingerprint lookups. As I mentioned before, it uses PostgreSQL’s GIN indexes with the intarray extension to do the hard work. Java is used as a glue language that provides a web-based service for interacting with the database. I originally considered using C++ for this (using the POCO libraries), but there are significantly more tools in Java for an application like this, so even though I don’t particularly like Java, it seemed like the best language for this.

Chromaprint fingerprints are basically timed sequences of 32-bit numbers (“subfingerprints”). They are extracted from the audio stream every 0.1238 seconds, using a technique which is not important here, and each one covers about 1.9814 seconds of the following audio data. This means that to represent a minute of audio, you need around 460 such numbers. For two audio files representing the same song in different file formats, these numbers should be very similar in terms of bit errors. Acoustid’s job is to take these numbers and find the fingerprint in the database that has the fewest bit errors compared to the query. Theoretically, this could be used to identify songs based on a few seconds of the original audio, but that would require indexing whole fingerprints, because we don’t know which part of the original fingerprint the query matches. This requires a lot of RAM to perform efficiently, so I decided to collect the data that is necessary to perform searches like this, but not implement the feature for now.
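
To make "similar in terms of bit errors" concrete, here is a minimal Python sketch of how two fingerprints could be compared. The function names `bit_errors` and `bit_error_rate` are made up for this illustration and are not part of Chromaprint or Acoustid, and the sketch assumes the two fingerprints are already aligned item by item.

```python
def bit_errors(a: int, b: int) -> int:
    """Count the bits that differ between two 32-bit subfingerprints."""
    # Mask to 32 bits so values stored as signed integers are handled too.
    return bin((a ^ b) & 0xFFFFFFFF).count("1")


def bit_error_rate(fp_a, fp_b) -> float:
    """Fraction of differing bits over the overlapping part of two fingerprints.

    Assumption for this sketch: item i of fp_a corresponds to item i of fp_b.
    """
    n = min(len(fp_a), len(fp_b))
    if n == 0:
        raise ValueError("cannot compare empty fingerprints")
    total = sum(bit_errors(x, y) for x, y in zip(fp_a, fp_b))
    return total / (32 * n)


# Two made-up fingerprints that differ in one bit of the first and last item.
original = [0x12345678, 0x9ABCDEF0, 0x0F0F0F0F]
reencoded = [0x12345679, 0x9ABCDEF0, 0x0F0F0F0B]
print(bit_error_rate(original, reencoded))
```

A low rate suggests the two files contain the same recording; what counts as "low" is a tuning decision that this sketch does not try to answer.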

Instead, Acoustid’s primary task is to identify audio files, and for this purpose it can assume that the query is extracted from the beginning of the audio. It indexes only about 15 seconds of audio (currently from 0:25 to 0:40) and uses this index to search for possible matches. Currently it assumes that at least one number from that range matches exactly. This can yield some false positive matches, so the next step is to compare the full query to the full fingerprint. As the fingerprints might not be ideally aligned, it first finds the best alignment and then calculates a score as the number of subfingerprints with fewer than 2 bit errors. This matching and scoring is done completely on the database server, using the intarray extension and a custom C extension for calculating the score.
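
The two steps described above can be sketched in plain Python. This is only an in-memory illustration, not the actual server code, which runs inside PostgreSQL with intarray and a C extension. All names, the `max_offset` limit and the brute-force alignment search are assumptions made for the example.

```python
from collections import defaultdict

ITEM_SECONDS = 0.1238                  # spacing between subfingerprints
INDEX_START = int(25 / ITEM_SECONDS)   # first indexed item, roughly 0:25
INDEX_END = int(40 / ITEM_SECONDS)     # end of indexed range, roughly 0:40


def popcount32(x: int) -> int:
    """Number of set bits in the low 32 bits of x."""
    return bin(x & 0xFFFFFFFF).count("1")


def build_index(database, start=INDEX_START, end=INDEX_END):
    """Map each subfingerprint value in the indexed range to fingerprint ids."""
    index = defaultdict(set)
    for fp_id, fp in database.items():
        for value in fp[start:end]:
            index[value].add(fp_id)
    return index


def candidates(index, query, start=INDEX_START, end=INDEX_END):
    """Ids sharing at least one exact value with the query's indexed range."""
    found = set()
    for value in query[start:end]:
        found |= index.get(value, set())
    return found


def score(query, stored, max_offset=10, max_bit_errors=1):
    """Best count of matching items over a small range of alignments.

    An item matches when it has fewer than 2 bit errors. Trying every
    offset and keeping the best count is a simplification of "find the
    best alignment, then score".
    """
    best = 0
    for offset in range(-max_offset, max_offset + 1):
        matches = 0
        for i, q in enumerate(query):
            j = i + offset
            if 0 <= j < len(stored) and popcount32(q ^ stored[j]) <= max_bit_errors:
                matches += 1
        best = max(best, matches)
    return best
```

A lookup would call `candidates` first and then run `score` only on the returned ids, so the expensive full comparison touches a small part of the database.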

This all seems to work fine from a technical point of view. What I’m fighting with is the API for this: how to pass the fingerprint data efficiently, how to authenticate users for submissions, and whether to have the concept of users at all. The main problems currently are:

  1. The fingerprints are quite large and it would be ideal to send them as binary data. This conflicts with most existing standards for web services. I’ve implemented a compression algorithm that makes the size of a base64-encoded fingerprint similar to the size of an uncompressed binary fingerprint, but I’d still like to not waste the bandwidth if I can avoid it. I’ve considered bencode-formatted requests, but that is probably too non-standard. I’ve also considered accepting “Content-Encoding: gzip” in HTTP requests. Everybody seems to be using it only for responses, but it seems usable for requests as well.
  2. I’d like to track who submitted which fingerprints. I don’t need any private data, but I’d like to be able to delete all fingerprints from one source if I find out that there is something wrong with the data. Maybe I’m paranoid, but I’m afraid that some people might try to submit wrong data in order to reduce the usefulness of the database.
  3. If I want to track users, where should the user accounts come from? I definitely don’t want people to have to register and I’d like the service to be integrated with the MusicBrainz database, so accepting MB usernames/passwords would be one option. Another option would be to accept any OpenID account and use something like OAuth for API authentication. I guess some people would prefer to not use their Google/Yahoo/etc. OpenID accounts, so perhaps having MB as an OpenID provider would help?
  4. There is also the question of how to authenticate the users. If I accept MB usernames/passwords I could use standard HTTP Digest authentication, but normally that means that applications would have to send most authenticated HTTP requests twice. Only submissions would have to be authenticated, and since these will have a lot of data in the HTTP body, it would be a waste to send them twice. Other options are OAuth or a custom token-based method.
  5. How to handle metadata? Should it rely only on MusicBrainz identifiers? That would make it less usable for taggers that prefer other metadata databases. Should it allow submissions that are not in MusicBrainz?
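
To put rough numbers on the first problem, the following Python sketch compares the raw binary and base64 sizes of a one-minute fingerprint and builds (without sending) a POST request with a gzip-compressed body. The URL, the little-endian packing and the function names are assumptions for illustration only. The custom compression algorithm mentioned above is not shown here, and gzip is used only because it is the encoding named in the list.

```python
import base64
import gzip
import struct
import urllib.request


def pack_fingerprint(fp):
    """Serialize subfingerprints as little-endian unsigned 32-bit integers.

    Assumption: every value is in the range 0 .. 2**32 - 1.
    """
    return struct.pack("<%dI" % len(fp), *fp)


def build_submit_request(url, fp):
    """Build a POST request whose body is the gzip-compressed binary fingerprint.

    The request is only constructed here, never sent. The receiving
    server would have to be set up to accept gzip-encoded request bodies.
    """
    body = gzip.compress(pack_fingerprint(fp))
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/octet-stream",
            "Content-Encoding": "gzip",
        },
    )


# About one minute of audio: roughly 460 subfingerprints (made-up values).
fingerprint = [(i * 2654435761) & 0xFFFFFFFF for i in range(460)]
raw = pack_fingerprint(fingerprint)
print("raw binary bytes:", len(raw))
print("base64 bytes:", len(base64.b64encode(raw)))

# Hypothetical endpoint, used only to show the request headers.
request = build_submit_request("https://example.invalid/submit", fingerprint)
print(request.header_items())
```

Base64 adds about a third on top of the binary size, which is the overhead the first item is trying to avoid. How much gzip saves depends on the data, so it is worth measuring on real fingerprints before committing to it.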

Regarding licensing, the source code is distributed under the GPLv3 license. I’ve considered more permissive licenses, but while there isn’t much code yet, I’d like to be sure that any improvements stay open source. The service is not yet live, so perhaps it’s too soon to speak about data licensing, but I intend to use something like CC BY-SA for the data (including database dumps), while not allowing commercial usage of the free service. If a commercial application is interested in the service in the future, it should either run its own mirror or help pay the hosting costs so that the service can stay free for free applications.

Anyway, the source code is on Launchpad and there is a mailing list on Google Groups. There is also a wiki, which will eventually contain documentation for the project. If you would like to help me with finishing this project or are just interested in its future, please join the mailing list.


5 Responses to Acoustid

  1. intgr says:

    > I could use standard HTTP Digest authentication, but normally
    > that means that applications would have to send most authenticated
    > HTTP requests twice

    There are some solutions to this problem. One is using the “Expect: 100-continue” header to ask whether the server is willing to accept your request:
    http://www.w3.org/Protocols/rfc2616/rfc2616-sec8.html#sec8.2.3

    You can also early-abort HTTP submissions by sending the HTTP response from the server before receiving the whole POST content, although some clients/servers do not implement this.

    These are far from elegant, but they’re used in the wild.

  2. I think anything that relies on the client side is not an option, because people will use the (wrong) defaults if they can. Aborting unauthenticated requests early would probably help a little, but the body would not be *that* large, so a large part of it would probably be sent already.

    My favorite solution at the moment is initializing a session on a special URL using HTTP Digest authentication and then passing a temporary token. Not exactly nice, but I don’t see a way to make it stateless.

  3. Mayhem says:

    > perhaps having MB as an OpenID provider would help?

    I would love MB to be an OpenID provider.

    > I’d like to track who submitted which fingerprints. I don’t need any private data, but I’d like to be able to delete all fingerprints from one source if I find out that there is something wrong with the data. Maybe I’m paranoid, but I’m afraid that some people might try to submit wrong data in order to reduce the usefulness of the database.

    I recently discussed this with Wendell Hicken and he says that it’s not so much users that are important to track, but versions of the software and what platform/OS they come from. The biggest problem for them had been buggy versions on various platforms, not human douchebaggery. So, keeping track of architecture, OS and OS version of the client seems to be much more important. That way if one client is deemed buggy, everything can be pulled from the DB.

  4. The problem is that I can’t be sure which client I’m dealing with. This is primarily intended for open source apps, so traditional API keys are out of the question (you can’t hide them). With user accounts, I know exactly what the source of the data is. If there is something wrong (either caused by bugs or intentionally), I know which data to remove.

  5. Pingback: Inside the Acoustid server | Lukáš Lalinský
