10.31
I am now a PumpKing. No, not that kind of pumpking, this kind:


I stole the Perl Foundation logo and created a stencil from it using GIMP. It took me approximately 3 hours to create that pumpkin.
Happy Halloween!
We define only out of despair, we must have a formula… to give a facade to the void.
So this semester I have been investigating and working on my thesis. Right now, my focus is on Statistical Natural Language Processing. I don’t want to discuss the specifics of the research just yet, but it has the potential to completely upend the entire search industry.
I have been investigating how to build a large corpus from the web. My advisor favors using Google directly, since it already exists and provides its search for free.
The first thing I did was investigate the Google SOAP API, only to find out that they deprecated it when they introduced the AJAX API. The new API only allows for about 60 results with no paging. Then I looked into the REST::Google API, but that only returns 10 results. Neither of those options seems feasible. I checked Yahoo’s Yahoo::Search interface and it also seemed to return only 10 results (paging, if possible, was not obvious). I could write a direct scraper, but that would take a good deal of effort and I am not sure it would be worth it.
Then I even started looking at writing my own spider using WWW::Robot. This is a fairly complex module that does a ton of grunt work for you. The downside is that it behaves and follows the robots.txt protocol; that’s a problem for someone who wants to scrape everything with no regard for such a protocol.
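For reference, honoring robots.txt boils down to a check like the one below. This is a minimal sketch in plain Perl, not WWW::Robot's actual code: the function names disallowed_prefixes and path_allowed are made up for illustration, and it only handles "User-agent: *" groups with plain Disallow prefixes (no Allow lines, no wildcards).

```perl
use strict;
use warnings;

# Return the Disallow path prefixes from the groups that apply to all
# agents ("User-agent: *"). Allow lines and wildcards are ignored here.
sub disallowed_prefixes {
    my ($robots_txt) = @_;
    my @prefixes;
    my ($applies, $in_agent_lines) = (0, 0);
    for my $line (split /\r?\n/, $robots_txt) {
        $line =~ s/\s*#.*//;    # drop comments
        next unless $line =~ /^\s*([A-Za-z-]+)\s*:\s*(.*?)\s*$/;
        my ($field, $value) = (lc $1, $2);
        if ($field eq 'user-agent') {
            $applies = 0 unless $in_agent_lines;    # a new group begins
            $applies = 1 if $value eq '*';
            $in_agent_lines = 1;
        }
        else {
            $in_agent_lines = 0;
            push @prefixes, $value
                if $field eq 'disallow' && $applies && length $value;
        }
    }
    return @prefixes;
}

# A path is allowed unless it starts with one of the disallowed prefixes.
sub path_allowed {
    my ($path, @prefixes) = @_;
    my $blocked = grep { index($path, $_) == 0 } @prefixes;
    return !$blocked;
}

my $robots_txt = <<'END';
User-agent: *
Disallow: /private/
Disallow: /tmp/   # scratch space
END

my @prefixes = disallowed_prefixes($robots_txt);
for my $path ('/index.html', '/private/notes.html') {
    printf "%-22s %s\n", $path,
        path_allowed($path, @prefixes) ? 'fetch' : 'skip';
}
```

A well-behaved spider runs a check like this before every request, which is exactly what gets in the way of scraping everything.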
I spent maybe 20-30 hours going back and forth on this over the last 6 weeks before I finally made the effort to meet with my advisor. Since he is no longer answering email or his phone, I met him after his late class and talked it over with him while he ate dinner in the campus restaurant. We talked and waffled back and forth about our approach. In the end, we decided to investigate Lucene’s capabilities.
Frustrated and lost, I went about my week until I talked with a PhD student who is also advised by my advisor. Her patience with our advisor has been steadily declining. She missed a publication deadline because he failed to review a paper of hers. She also divulged that she intends to switch advisors because she is not making progress. I have been contemplating this myself, so it was good to hear that I am not the only one at my wits’ end.
I am not making progress and I am not willing to sacrifice my graduation. If I change advisors, hopefully I will find one who provides much more support and direction yet gives me the option to continue developing in Perl. One of the professors I want to speak with runs a programming language lab.
Maybe I can merge my interest in Perl 6 with my thesis!
Apparently Ovid never heard of Clarke’s Laws:
Any sufficiently advanced technology is indistinguishable from magic.
I’m sure people thought the same thing about black powder, mechanical engines, computers, TVs, and, well, basically everything else that has ever graced his lifetime. Remember, Ovid, there was a point in time when structured programming reigned supreme.
I dropped off the Perl Ironman blogging challenge again. This time it wasn’t due to a date miscalculation; I took a midterm exam on Thursday. My current class has taken up most of my time in the past 6 weeks. There haven’t been any chances for coding just yet. I did want to talk about something I have been looking into lately for my thesis.
First, I want to post a lazyweb question: has anyone worked with REST::Google? My first question is whether anyone knows how to advance the page cursor with this module. I tried reading the code itself, but it uses Class::Accessor and Class::Data. I’m a bit unfamiliar with those modules (are they still popular?), and it looks like the cursor is read-only. I don’t really see the use for the module if it returns 10 results and cannot paginate.
So from that question, I took the sample and tried playing with it. This is just an experiment to exercise the module’s capabilities. The sample on its own is fairly boring, so I want to see if I can get some code working that can paginate through Google results using their REST API (if that’s even possible).
The author of this module probably feels very clever; he hid subpackages from CPAN by putting the package name on a separate line from the package keyword. He also used __PACKAGE__->mk_ro_accessors, which looks like it will generate attribute accessors at runtime. I’m guessing CPAN cannot index that either. What’s the point of uploading your code to a public repository if you take measures to hide it from the repository?
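To make those two tricks concrete, here is a minimal sketch in plain Perl. It is not REST::Google's code and not Class::Accessor's implementation: My::Result, make_ro_accessors, and the title/url fields are invented for illustration. The package name sits on its own line, which is the usual idiom for keeping a package out of the PAUSE index, and the accessors are closures installed into the symbol table at runtime that refuse to be called with a new value.

```perl
use strict;
use warnings;

# The package name is on its own line, so an indexer that looks for
# "package NAME" on a single line does not register this package.
package    # hide from PAUSE
    My::Result;

use Carp ();

# Hand-rolled stand-in for a read-only accessor generator: installs one
# closure per field into the calling class at runtime.
sub make_ro_accessors {
    my ($class, @fields) = @_;
    for my $field (@fields) {
        my $accessor = sub {
            my $self = shift;
            Carp::croak("'$field' is read-only") if @_;
            return $self->{$field};
        };
        no strict 'refs';
        *{"${class}::$field"} = $accessor;
    }
    return;
}

sub new {
    my ($class, %args) = @_;
    return bless {%args}, $class;
}

# No "sub title" or "sub url" appears anywhere in the source text.
__PACKAGE__->make_ro_accessors(qw(title url));

package main;

my $result = My::Result->new(
    title => 'Example',
    url   => 'http://example.com/',
);
print $result->title, "\n";    # reading works

# Writing through the accessor dies, which is why a read-only cursor
# cannot simply be advanced by assigning to it.
my $ok = eval { $result->title('changed'); 1 };
print "write refused: $@" unless $ok;
```

Since the methods only exist once the generator call has run, anything that inspects the source statically sees neither the package name nor the accessors.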
Anyways, I’m soliciting ideas for paginating Google’s REST API results. Note: the SOAP API has been terminated, so that avenue is closed.
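To pin down what "paginating" would mean here, the following is a minimal sketch of an offset-driven paging loop. Everything in it is an assumption: collect_results and the $fetch_page callback are hypothetical names, and the fake search below stands in for a real request. Whether REST::Google or Google's API actually accepts a start offset like this is exactly the open question, so the sketch does not call either.

```perl
use strict;
use warnings;

# Collect up to $want results from a paged source.
# $fetch_page is a hypothetical callback: given a zero-based start offset,
# it returns an array ref with one page of results (empty when exhausted).
sub collect_results {
    my ($fetch_page, $want) = @_;
    my @results;
    my $start = 0;
    while (@results < $want) {
        my $page = $fetch_page->($start);
        last unless @$page;               # no more pages
        push @results, @$page;
        $start += @$page;                 # advance the cursor by what we got
    }
    splice @results, $want if @results > $want;    # trim any overshoot
    return @results;
}

# Stand-in for a real search call: 25 fake hits served 10 at a time.
my @fake_hits = map { "hit $_" } 1 .. 25;
my $fake_search = sub {
    my ($start) = @_;
    return [] if $start > $#fake_hits;
    my $end = $start + 9;
    $end = $#fake_hits if $end > $#fake_hits;
    return [ @fake_hits[ $start .. $end ] ];
};

my @all = collect_results($fake_search, 100);
print scalar(@all), " results\n";
```

The loop only needs some way to say "give me the page starting at offset N"; if the module's cursor really is read-only, that offset would have to be passed in when the request is built rather than set afterwards.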