Spud wrote: I know that most of the work has been done in harvesting the lyric archive, but has anyone made any progress in harvesting the review threads?
Here's what I've done. It's not finished, and I have a history of leaving things half-complete, so you're welcome to steal anything you see.
1) Wrote a Perl script to parse songfight's "archive.txt" file (containing vote counts, song names, MP3 URLs, art URLs, etc.) and build database tables from it (a song table and an entries table).
2) Wrote a Web interface to this database, so you can explore the songs. The newer songs are colored, newest being red. You can also sort results by vote count, or by name. Ignore the "Dillfrog" header/footer branding, that's just to make it easier for me to code. It's not linked from anywhere else on the Dillfrog site.
http://www.dillfrog.com/tools/songfight_explorer/
3) Wrote a Perl script that downloads messages from the *OLD* messageboard (not the new one) for each known song review URL and song lyric URL. Note that the current archive.txt file has many dead links to old (non-archived) reviews. The reviews seem to exist; it's just a matter of finding the corresponding archive URL and updating this file to point to it. I can provide a list of these dead links if any group of people wants to resolve them manually. I tried, but got bored real fast.
4) Wrote a Perl script that parses the messageboard HTML downloaded in step 3. From this, we get clean forum data, where for each post we know the poster's username, ID#, and of course their message.
5) Configured the parser from step 4 to compare the review posts to the list of known artists/songs, and automagically decide which band/song each post belongs to. This is about 50-66% accurate, but could be better and should probably be replaced with a manual process (imagine a Web page where you see the user's messageboard post, and you have to choose whether it's a lyric at all, and if so, pick the song and band name from a dropdown list). That page wouldn't take much longer than 30-60 minutes to code.
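For anyone who wants to experiment with the matching in step 5, here is a minimal Python sketch of the idea. It is not the Perl parser described above, and the function names, the sample band/song list, and the "both band and song must appear in the post" rule are all assumptions made for illustration. Anything it can't match unambiguously comes back as None, which is where the manual dropdown page would take over.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical sample data; the real list would come from the song/entries tables.
KNOWN_ENTRIES = [
    ("The Example Band", "Paper Boats"),
    ("Another Artist", "Paper Boats"),
    ("The Example Band", "Night Shift"),
]


def normalize(text: str) -> str:
    """Lowercase and collapse punctuation/whitespace so comparisons are forgiving."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def guess_entry(post_text: str,
                entries: List[Tuple[str, str]]) -> Optional[Tuple[str, str]]:
    """Return the one (band, song) pair named in the post, or None.

    Assumed rule: both the band name and the song title must appear as whole
    words. Zero matches or several matches return None so that a human can
    decide instead of the script guessing.
    """
    haystack = f" {normalize(post_text)} "
    hits = [
        (band, song)
        for band, song in entries
        if f" {normalize(band)} " in haystack
        and f" {normalize(song)} " in haystack
    ]
    return hits[0] if len(hits) == 1 else None
```

Usage would look something like guess_entry(post["message"], KNOWN_ENTRIES) for each parsed post, with the None results queued up for manual review.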
So the end result is that you can see some lyrics in the URL from #2.
I haven't added review information to the database yet. I was planning on writing some code to automagically parse the reviews and split them up by band, though since I haven't tried that yet, I'm not sure how successful that'll be. It might be another semi-automatic process.
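To give a feel for what the review splitting might look like, here is an untested-against-real-data Python sketch. It leans on a big assumption: that reviewers start each band's section on a new line beginning with the band name (e.g. "Band A: great hook"). Real posts are probably messier than that, which is why this may end up semi-automatic. The function name and sample format are made up for illustration.

```python
import re
from typing import Dict, Iterable


def split_review_by_band(post_text: str, band_names: Iterable[str]) -> Dict[str, str]:
    """Split one review post into {band name: review text}.

    Assumption: a line that starts with a known band name opens that band's
    section, and following lines belong to it until the next band name.
    Text before the first band name (greetings, general comments) is dropped.
    """
    names = list(band_names)
    if not names:
        return {}
    # Try longer names first so a short name can't shadow a longer one.
    ordered = sorted(names, key=len, reverse=True)
    pattern = re.compile(
        r"^\s*(" + "|".join(re.escape(n) for n in ordered) + r")(?!\w)\s*[:\-]*\s*(.*)$",
        re.IGNORECASE,
    )
    canonical = {n.lower(): n for n in names}

    sections: Dict[str, list] = {}
    current = None
    for line in post_text.splitlines():
        match = pattern.match(line)
        if match:
            found = match.group(1)
            current = canonical.get(found.lower(), found)
            sections.setdefault(current, [])
            rest = match.group(2).strip()
            if rest:
                sections[current].append(rest)
        elif current is not None and line.strip():
            sections[current].append(line.strip())
    return {band: " ".join(lines) for band, lines in sections.items()}
```

Posts where this returns an empty dict, or far fewer bands than were in the fight, would be good candidates for the manual pass.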
But I don't really want to do much before hearing what everyone else is doing, to avoid rework. If you want to use any of the parsers or tables or whatever, I'm happy to share. It's not all well commented right now, though, since I'm doing it mainly as a hack.
The other thing I'd like to do is write a client-side or mixed client-server application that cleans up the MP3s and ID3 tags (theoretically this could be done server-side, but that's risky) to reflect the band's name and song title, so they show up prettier on my MP3 player.
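As a rough idea of the tag cleanup, here is a Python sketch that handles only the old ID3v1 tag: a fixed 128-byte block at the end of the file ("TAG", then 30 bytes of title, 30 of artist, 30 of album, 4 of year, 30 of comment, and 1 genre byte). It works on bytes in memory and returns a new copy rather than touching the original, which sidesteps some of the risk. The function names are made up, and it deliberately ignores ID3v2 tags at the start of the file, which many players read first, so treat it as a starting point only.

```python
TAG_SIZE = 128
# Album, year and comment left blank; genre byte 255 is commonly used for "none".
EMPTY_REST = b"\x00" * 64 + b"\xff"


def _field(text: str, width: int) -> bytes:
    """Encode as Latin-1, then truncate and NUL-pad to a fixed width."""
    return text.encode("latin-1", errors="replace")[:width].ljust(width, b"\x00")


def retag_mp3_bytes(data: bytes, title: str, artist: str) -> bytes:
    """Return a copy of the MP3 data with the ID3v1 title and artist set.

    If the data already ends with an ID3v1 tag, its album, year, comment and
    genre are kept; otherwise a fresh tag is appended. The audio bytes are
    never modified.
    """
    has_tag = len(data) >= TAG_SIZE and data[-TAG_SIZE:-TAG_SIZE + 3] == b"TAG"
    audio = data[:-TAG_SIZE] if has_tag else data
    rest = data[-65:] if has_tag else EMPTY_REST
    return audio + b"TAG" + _field(title, 30) + _field(artist, 30) + rest
```

In practice you would read the file with open(path, "rb"), pass the bytes through retag_mp3_bytes, and write the result to a new file, keeping the download untouched until you're happy with the output.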
It would also be cute to have a scripted review system, e.g. a checkbox to label the song as a "keeper", a textarea box to write your review, and the option to rate songs from 1-10 on production, performance, and the song itself (or similar). But anyway, I think I'm digressing; that's more of a system for entering new reviews than for archiving/categorizing existing ones.
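Just to pin down what one submitted review might hold, here is a small Python sketch of the record behind such a form. The class and field names are hypothetical, and treating every rating as optional is my own assumption.

```python
from dataclasses import dataclass
from typing import Optional

RATING_FIELDS = ("production", "performance", "song")


@dataclass
class Review:
    """One scripted review of one song entry (illustrative field names)."""
    reviewer: str
    band: str
    song_title: str
    text: str = ""                      # free-form review from the textarea
    keeper: bool = False                # the "keeper" checkbox
    production: Optional[int] = None    # each rating is 1-10, or None if skipped
    performance: Optional[int] = None
    song: Optional[int] = None

    def __post_init__(self) -> None:
        # Reject anything that isn't a whole number from 1 to 10.
        for name in RATING_FIELDS:
            value = getattr(self, name)
            if value is None:
                continue
            if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 10:
                raise ValueError(f"{name} must be an integer from 1 to 10, got {value!r}")
```

A form handler would build one of these per song and store it alongside the entries table, so reviews entered this way would already be split by band.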
Anyway, food for thought. What have others done? Does anyone want to try resolving the remaining review thread URLs?