MusicBrainz Prototype for Higher-Performance Batch-Mode Processing
I have been prototyping some code for a new way to use MusicBrainz, in a batch mode. I found that the tagger mechanism is not efficient in terms of computing resources and bandwidth.
Currently, the MB tagger requests information from an Apache server running a Perl script that queries a database for each file. This is pretty high overhead, considering that for the first pass all that is needed is to know:
- Is the file corrupt?
- Is the file a duplicate?
- Is the file in the database?
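A minimal sketch of that first pass (all names here are mine, not from the actual tagger; it assumes the known TRMs are kept locally as a sorted, tightly packed array of 16-byte IDs):

```c
#include <stdio.h>
#include <string.h>

#define TRM_LEN 16 /* a TRM is a 128-bit ID: 16 bytes when packed */

enum first_pass { FP_CORRUPT, FP_DUPLICATE, FP_KNOWN, FP_NEW };

/* Binary search in a sorted, packed array of 16-byte TRMs. */
static int trm_in_index(const unsigned char *index, size_t count,
                        const unsigned char *trm)
{
    size_t lo = 0, hi = count;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        int c = memcmp(index + mid * TRM_LEN, trm, TRM_LEN);
        if (c == 0) return 1;
        if (c < 0) lo = mid + 1; else hi = mid;
    }
    return 0;
}

/* 'seen' holds the TRMs already processed in this run, for
 * duplicate detection; 'decodable' is the corruption check result. */
enum first_pass classify(const unsigned char *trm, int decodable,
                         const unsigned char *index, size_t index_count,
                         const unsigned char *seen, size_t seen_count)
{
    size_t i;
    if (!decodable) return FP_CORRUPT;
    for (i = 0; i < seen_count; i++)
        if (memcmp(seen + i * TRM_LEN, trm, TRM_LEN) == 0)
            return FP_DUPLICATE;
    if (trm_in_index(index, index_count, trm)) return FP_KNOWN;
    return FP_NEW;
}
```

All three questions can then be answered locally, with no round trip to the server per file.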
Once you know all this, you could submit the new information to MusicBrainz, or submit one large file that contains all the needed information at once.
Otherwise, the database could be queried locally and only the new information submitted.
In any case, when you have 10K MP3 files, processing with MusicBrainz is painfully slow and the tp_tagger software crashes regularly.
This is what I have done so far.
1. I downloaded mbdump.tar.bz2, unzipped it, took the track file, and used cut and sort to extract a sorted list of TRMs. Then I converted this into a 50 MB binary file that is packed very tightly; bzip2 compresses it only a further 8%. This 50 MB file will be easy to download and use.
You can find the packer here:
It uses these two files from MusicBrainz:
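For scale: a TRM printed as a standard 36-character hex GUID shrinks to 16 raw bytes when packed, which accounts for most of the "very tight" packing above. A minimal packer sketch (assuming the dump stores TRMs as hex GUID strings, which is an assumption about the dump format; the function name is mine):

```c
#include <stdio.h>

/* Pack one TRM GUID string, e.g. "00112233-4455-...", into 16 raw
 * bytes, skipping the dashes. Returns 0 on success, -1 if malformed. */
int pack_trm(const char *guid, unsigned char out[16])
{
    int i, n = 0;
    unsigned int byte;
    for (i = 0; guid[i] && n < 16; ) {
        if (guid[i] == '-') { i++; continue; }
        if (sscanf(guid + i, "%2x", &byte) != 1) return -1;
        out[n++] = (unsigned char)byte;
        i += 2;
    }
    return n == 16 ? 0 : -1;
}
```

Writing the packed TRMs out in sorted order is what makes the linear-time join below possible.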
Now, I want to use this index to do a merge join, in linear time, against my database of MP3s. So I want to extract their TRMs into a file that is also sorted and can be read quickly. For this purpose, I have modified mp3info to extract all the attributes needed to create the TRM.
It links against libmusicbrainz and takes a list of files as a parameter, but it runs pretty slowly: about 1000 files an hour.
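The join itself is simple once both sides are sorted 16-byte records: advance whichever stream is behind, emit on equality. A sketch of that merge join (my names; this is not the actual join program, which is not written yet):

```c
#include <stdio.h>
#include <string.h>

#define TRM_LEN 16

/* Merge-join two sorted streams of packed 16-byte TRMs.
 * Writes to 'matched' every local TRM also present in the index.
 * Both inputs must be sorted in memcmp order; runs in linear time. */
void merge_join(FILE *index, FILE *local, FILE *matched)
{
    unsigned char a[TRM_LEN], b[TRM_LEN];
    int have_a = fread(a, 1, TRM_LEN, index) == TRM_LEN;
    int have_b = fread(b, 1, TRM_LEN, local) == TRM_LEN;
    while (have_a && have_b) {
        int c = memcmp(a, b, TRM_LEN);
        if (c < 0) {
            have_a = fread(a, 1, TRM_LEN, index) == TRM_LEN;
        } else if (c > 0) {
            have_b = fread(b, 1, TRM_LEN, local) == TRM_LEN;
        } else {
            fwrite(b, 1, TRM_LEN, matched);
            have_b = fread(b, 1, TRM_LEN, local) == TRM_LEN;
        }
    }
}
```

Each record is read exactly once, so the whole 50 MB index can be joined against any number of local TRMs in a single sequential pass.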
Today, I took the mp3.cpp code and reworked it into plain C, all in one file, and also made it read the entire file into memory. It runs at about twice the speed of mp3info.
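A minimal sketch of that slurp approach (error handling trimmed; the function name is mine, not from the reworked code): one large read replaces thousands of small buffered reads while scanning MP3 frames.

```c
#include <stdio.h>
#include <stdlib.h>

/* Read a whole file into one malloc'd buffer.
 * Returns NULL on error; on success *len holds the file size
 * and the caller owns (and must free) the buffer. */
unsigned char *read_whole_file(const char *path, long *len)
{
    FILE *f = fopen(path, "rb");
    unsigned char *buf;
    if (!f) return NULL;
    fseek(f, 0, SEEK_END);
    *len = ftell(f);
    rewind(f);
    buf = malloc(*len > 0 ? (size_t)*len : 1);
    if (buf && fread(buf, 1, (size_t)*len, f) != (size_t)*len) {
        free(buf);
        buf = NULL;
    }
    fclose(f);
    return buf;
}
```

With the whole file in memory, the frame scanner can walk a flat buffer instead of issuing a read per frame header, which is where the roughly 2x speedup plausibly comes from.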
When it is finished, I will create a simple TRM generator based on its output and write the join program.