10.25
So this semester I have been investigating and working on my thesis. Right now, my focus is in Statistical Natural Language Processing. I don’t want to discuss the specifics of the research just yet, but it has the potential of completely up-ending the entire search industry.
I have been investigating how to build a large corpus from the web. My advisor favors using Google directly since they already exist and provide their search for free.
The first thing I did was investigate the Google SOAP API only to find out that they deprecated it when they introduced the AJAX API. The new API only allows for about 60 results with no paging. Then I looked into the REST::Google API, but that only returns 10 results. Neither of those options seem feasible. I checked Yahoo’s Yahoo::Search interface and it only seemed to return 10 results (paging, if possible, was not obvious). I could write a direct scraper but that would take a good deal of effort and I am not sure it would be worth it.
Then I even started looking at writing my own spider using WWW::Robot. This is a fairly complex module that does a ton of grunt work for you. The downside is that it behaves and follows the robots.txt protocol; that’s a problem for someone who wants to scrape everything with no regard for such a protocol.
I spent maybe about 20-30 hours flipping over this in the last 6 weeks, I finally made the effort to meet with my advisor. Since he is no longer answering email or his phone, I met him after his late class and talked it over with him while he ate dinner in the campus restaurant. We talked and waffled back and forth about our approach. In the end, we decided to investigate Lucene’s capabilities.
Frustrated and lost, I went about my week until I talked with a PhD student currently being advised by my advisor as well. Her patience for our advisor has been continually declining. She missed a publication deadline because he failed to review a paper of hers. She also divulged that she intended on switching advisors because she is not making progress. I have been contemplaing this myself, so it was good to hear that I am not the only one at their wits’ end.
I am not making progress and I am not willing to sacrifice my graduation. If I change advisors, hopefully I will find an advisor that provides much more support and direction yet gives me the option to continue developing in perl. One of the professors I want to speak with runs a programming language lab.
Maybe I can merge my interest with Perl 6 with my thesis!
Try YahooBoss.
They provide an xml interface here:
http://developer.yahoo.com/search/boss/
Wow, that is very interesting. They actually provide a link to the next page so that you can implement your own cursor!
Thanks for telling me about BOSS though I think I’m convinced I need to change advisors.
[...] my last advisor all but abandoned me, but I wasn’t the only one. He almost abandoned other PhD students and has apparently done [...]