2009
05.26

Peter Markholm is right, benchmarking is hard. In typical fashion, I’m going to respond to a blog post response with another blog post.

Peter Markhold took issue with the improvements I made to the benchmark of grep, first, and smart match. So I felt compelled to explain why those are improvements and should actually improve the accuracy of the results. While my choice of wording may have been strong, my apologies to the original author (I can’t believe I criticized Michael Schwern’s code!), I stand by the improvements. Keep in mind, my opinions are formed from my time benchmarking things for my graduate work and for my employer on a few occasions.

The most important thing to providing a benchmark is the only thing within the timing loop is the operation you wish to time. In this case, we didn’t want to time the operation of creating the random data (rand). We didn’t want to time the data conversion and construction (chr and .) either. Those operations are things that the system may or may not optimize; since we have little control over how the system will optimize the code, we have to assume it adds unwanted ops to the timing loop. This is why I moved the creation of the @array outside of the annonymous sub. I left the annonymous sub because I didn’t want to write my own timing loop (Benchmark.pm can’t be used otherwise).

Also, we want to avoid any memory caching optimization that the system may provide for us. While this is typically useful, we absolutely do not want the system to load the entire @array of one test into memory and reuse the preloaded memory in another test. That could give an unreliable advantage to one of the tests depending on the system’s memory optimizer and the order of the tests. The reason why it’s a good idea to give each test it’s own set of data is it then avoid data prefetching.

Lastly, the argument that we might find the $needle in the beginning of the haystack is off track. With a large enough haystack, the probability that we will find the $needle within an eyeshot is negligible. We could increase the size of the haystack to be something like 100 billion random elements or even 1 trillion elements. Sure, the sample haystack was not large enough, but the testing method was much more sound (but not perfect).

Having said that, I will agree with Peter. My results absolutely should not be taken serious. There are many more things I would do differently. I would probably write my own timing loop so I could make sure the only thing within it would be the operation I want to test. I would also repeat the test a significant enough times with a variety of randomly generated large arrays (statistically large enough). I would test with different data types, different data structures, and different versions of all software involved. I’d also make sure not to test the code in multi-user mode with anything as few applications vying for the CPU as possible (this can have misleading results).

At the end of the day, those results were just examples. The changes were improvements but did not make for a complete and irrefutable test.

2009
05.25

We’ve all used splice. It’s a great function that basically encapsulates push, shift, unshift, and pop to name a few. While it’s overly useful, it can have its limitations. One limitation that has always bugged me was how it required indices. Sometimes, you may not know these or want to know these indices ahead of time. That’s why I created another splice function.

In this version, which I will officially call “yasplice”, it only takes references to 2 arrays and returns an array. It removes all of the elements in the second array from the first array by value rather than index. Note this requires perl5.10 because of the extremely useful smartmatch operator (~~). This snippet also overrides the splice function so this code should be localized.


use feature ':5.10';
use subs 'splice';

sub splice {
my ($aa, $bb) = @_;
grep { $one = $_; !grep { $_ ~~ $one } @$bb } @$aa;
}

@a = (1,2,4,5,7,5,4,2,6,6,6,5,4,4,5,6,6,4,3,4,5);
@b = (4,5,6);
@c = splice(\@a, \@b);
say join ", ", @c

Prints:

1, 2, 7, 2, 3

Simple and useful. It might not be a good idea to use this on N sized arrays as the runtime would then be N2.

While looking around for other interesting perl things, I found this use.perl.org blog post about smartmatch’s performance. That blog post illustrates a problem with his testing methodology. If I were to publish a paper with such a gaping hole like that, it would never be taken serious. In each test, he’s generating a random number and a random character and storing the result! That’s not part of the test. The $needles should each be generated outside the timing loop. This is adding tons of additional instructions that are not considered part of the test. These results are basically useless and should be redone.

So I did:


use Benchmark;
use 5.010;
use List::Util qw(first);

my $needle1 = chr(64+int(rand(26))).int(rand(1000)+1);
my $needle2 = chr(64+int(rand(26))).int(rand(1000)+1);
my $needle3 = chr(64+int(rand(26))).int(rand(1000)+1);
my $needle4 = chr(64+int(rand(26))).int(rand(1000)+1);
my @array1 = map { chr(64+int(rand(26)))."$_" } 1..1000;
my @array2 = map { chr(64+int(rand(26)))."$_" } 1..1000;
my @array3 = map { chr(64+int(rand(26)))."$_" } 1..1000;
my @array4 = map { chr(64+int(rand(26)))."$_" } 1..1000;

timethese(100_000, {
'first' => sub {
first { $_ eq $needle1 } @array1;
},
'grep BLOCK' => sub {
grep { $_ eq $needle2 } @array2;
},
'grep EXPR' => sub {
grep $_ eq $needle3, @array3;
},
'~~' => sub {
$needle4 ~~ @array4;
}
});

With the following results:

Benchmark: timing 100000 iterations of first, grep BLOCK, grep EXPR, ~~…
first: 20 wallclock secs (19.22 usr + 0.08 sys = 19.30 CPU) @ 5181.35/s (n=100000)
grep BLOCK: 15 wallclock secs (14.12 usr + 0.06 sys = 14.18 CPU) @ 7052.19/s (n=100000)
grep EXPR: 13 wallclock secs (12.96 usr + 0.06 sys = 13.02 CPU) @ 7680.49/s (n=100000)
~~: 5 wallclock secs ( 4.61 usr + 0.02 sys = 4.63 CPU) @ 21598.27/s (n=100000)

Those results are astonishing! Smart match is 4 times faster than first, not twice as fast as the original blog poster found. Notice that I changed each test to look for a different $needle in each @array. All of the randomly created data was done outside the timing loop so the only operations being tested is the creation of the annonymous sub, the execution of the timing, and the search function.

So I’m sure you could clearly write my yasplice function to search in a much quicker manner using smart match. I’ll leave that as an exercise to the reader as this is about all I want to write for today.

2009
05.23

I am officially booked for YAPC|10 in Pittsburg this year. I’m excited!

I’ll be staying in the dorms of CMU in a double occupancy room (all that was available). I’m leaving DFW at about 4:30pm on Friday 19 June 2009 and will arrive in PIT by 9:30pm. I’ll be taking the Port Authority shuttle from the airport, which is an astoundingly cheap $2.80.

I will be attending the Parrot Workshop on both Saturday and Sunday.

I have already lined out my schedule for the entire event, which I will need to see if I can post it or provide a link somehow. Maybe an import into Google Calendar is possible (anyone tried this?).

I’ll be leaving to go back to DFW on Wednesday at 3:45pm. It’s unfortunately a bit early but I need my wife to pick me up from the airport (public transportation in Dallas is terrible and expensive). I should only miss the lightning talks and the Closing so I’m not too disappointed.

Now I need to contact the university’s Computer Science Graduate department and the Pittsburgh runners club.

2009
05.18

In my exploration of traits, I stumbled across something that Tene created while working on the Web.pm grant. It basically uses trait_auxiliary to handle unknown sub/method traits. Here’s the actual code itself: LolDispatch.

Here’s the run-down of how this works. Be prepared, I may butcher the proper terminology or explanation since this is something that is new to me and have only started understanding it.

Line 7 defines a method called trait_auxiliary:<is>, which is what perl6 will call when an unknown sub/method trait is encountered. In this case, the trait “is” has the “http-handler” trait added to known traits. Line 8 just stashes some method information to use later.

Line 11 is the real meat of this module. It defines an exported subroutine “dispatch” that loops through the known methods and calls their postfix:() (aka operator()) method. $item<block> stashes the Closure that caused the unknown trait to be processed by trait_auxiliary:<is>.

The sample code starting in line 22 really has to go in a different file to work with Rakudo. While technically the ‘is http-handler’ trait works with methods as well, I encountered problems getting it to work.

I have an interesting use as I’m sure most people do. This basically helps remove unnecessary switch statements and dispatch tables. I simply just define my method and what the “is handler” looks like. This is really impressive, so everyone_involved++.

I wonder if there is a way (useful anyways) to combine this with a grammar.

2009
05.14

You know something is afoot when I start blogging more regularly.

This is starting to feel like a poetry slam from the 90′s. I wanted to take a minute a respond to moritz’s response to my article lambasting the confusing nature of subroutines and methods.

So pmichaud showed up late to our Perl6 Mongers meeting and resolved the whole mess. He explained that subs are the typical subs from perl5. They have no invocant associated to them and they are scoped to the class or package they are defined in. Oh yeah, in classes and modules are just sugarized packages.

Methods on the other hand require an invocant, such as an object or self. This as supposed to eliminate the confusion about how subs worked in perl5.

The source of confusion seemed to stem from my original thought that a sub in a class was something akin to Java’s Class-level function or c++’s public static function. pmichaud explained that there really is no such thing and the only real difference with subs from perl5 is the addition of method to indicate an invocant is required. Also, “$.method” and “$_ = self” hacks don’t really work and the latter leads to seriously dangerous code.

However, pmichaud said that self will generally be required when calling a method from a method (within the same class/package/module). The second example I gave is a legitimate gripe. In fact, pmichaud said that I am definitely not the first person to voice a complaint about it.

In case you can’t recall my kvetch about this issue, the following requires the self invocants:


class A {
method foo { ... };
method bar { self.foo; };
}

I can’t just call foo without the self invocant. The problem I have is that it’s not implicitly passed to the foo function for me. Everyone’s argument is that this smells too much like the problem with subs from perl5. So let me explain myself a bit more.

To be properly Huffmanized, not specifying an invocant from a method should pass self to the method within the scoped class. From a sub, it should call the sub by the name without passing self. That is, this should work:


class A {
method foo { ... };
method bar { foo };
sub baz { ... };
sub zab { baz };
}

So then I was asked, “well how do you call something like IO::open if my class has an open?” Just like that, specify the whole package namespace.

The reason why I expect implicit invocant passing to work is because, from within the scope of a method, I should not have to continue specifing my invocant. This leeds to overly terse code that is not properly Huffmanized like this:


class A {
method s {...};
method t {...};
method u {...};
method v {...};
method w {...};
method x {...};
method y {...};
method z {
self.s;
self.t;
self.u;
self.v;
self.w;
self.x;
self.y;
};

where method z should look more like


method z { s; t; u; v; w; x; y; }

I can work around this by doing clever given (read: Basic/VB ‘With’ blocks) statements:


method z { given self { .?s; .?t; .?u; .?v; .?w; .?x; .?y; }; }

It’s close, but no cigar. Of coarse, I could always be wrong and I’m just looking at the problem from the wrong angle. I’d love to hear any feedback.

2009
05.11

So today, I ended up starting a very long discussion with several members of the Perl6 community about how functions are declared. Basically, I made several mistakes in trying to call my class functions “sub” when I meant method. Perl6 happily lets me shoot myself in the foot and then complains when I try calling the sub against an object, which doesn’t exist. Here’s a few examples of things to illustrate the clunky OO model:

Snippet 1:

class A {
sub foo {
say "never gonna happen"
}
};
my A $a .= new;
$a.foo;

Snippet 2:

class A {
method foo {
say "nope"
};
method bar {
foo()
}
}

note: the code highlighter does not support perl6

So basically, that first snippet is an error. ‘sub’ apparently means something other than what we want it to mean in this case. That snippet will fail with an error about no such method. The second snippet will fail because the ‘self’ invocant is not assumed in the bar function. jnth says it’s because they don’t want methods like push being overwritten if I define a push method in my class.

The really sad thing is that I still don’t get it. I can’t make heads or tails of this object model. In some cases, Perl6 chooses to reuse syntax correctly, and in others it fails. There are about a billion ways to define a function. Each function can be a sub, method or submethod. It can have modifiers like multi and my and attributes like does and a huge multitude of parameters. The list of function modifiers just keeps going. How anyone is going to be able to use all of this is beyond me; I doubt I’ll ever be able to make sense of it.

Okay, so my raw opinion is this: they are doing it wrong. The “many paradigms” argument is wrong because the language should do one thing and do it well. It can’t serve all masters and still do a good job. If the object model is to be a real first-class citizen, then it must place the object model above all else.

That being said, I am of the opinion that C++ is the only language to get it right (see this, this, and this). Well almost, it still can’t do runtime template binding (think variant or Boost::Any) and I loathe the ‘friend‘ keyword (breeds bad code). Namespaces and classes are different concepts, so they are treated differently. When I place a function inside of a class definition, it becomes a member function. If I prepend ‘static‘, it becomes a class method (not a Class member like Java) and I can call it without a ‘this’ reference. If I want a function outside any class, just put it at the file scope and external storage. Calling a function defined within the scope (file or member) that has been included (imported) implies the ‘this’ invocant. All parameters are required unless given default values (Perl6 has 3 different syntaxes for this: required “!”, optional “?”, default value “=”). This is simple and direct. Templates were added for an additional level of flexibility, which worked out pretty well. In fact, C++0x will improve on this, assuming it’s ever released this year (otherwise it’ll be C++1x).

In bringing this up in the channel, I was really hoping to make people think about how Perl6 is doing it in hopes that someone who knows better than I would stop and think whether Perl6′s design is ideal. I was wrong and I think I may have agitated some of the developers. The beauty of open source is that I can do things my way if I see fit and likely will once the only usable implementation has support for macros.

2009
05.03

I finally took the plunge and requested a PAUSE user account. PAUSE, for those uninformed, is the Perl Authors Upload Server. Through this, users are able to upload modules, scripts, and source code to CPAN.

I’ve been writing perl programs for maybe 7 years now but have never found anything I wrote to be useful to anyone other than myself. Then I started reading Linus Torvalds’ book “Just for Fun: The Story of an Accidental Revolutionary“. I’m not done reading it yet, but I will post a review when I finish.

In the 10 years or so of using Linux, I’ve never really contributed much. I wrote a quick program over on SourceForge but found I never wanted to use it myself. I’ve contributed a handful of patches to various projects. I even wrote the arbitrary radix parsing code in Rakudo. Yet I’ve never contributed any significant amount of code. I’ve always felt guity about that; like a leech who used other people’s work for my own selfish ends.

Well, that book changed everything.

I’ve got a backlog of a ton of little programs that all solve some domain specific problem. Most have been for school; some are completely boring while some produce highly interesting results. It’s time I start contributing my code back to the world.

I’m not sure what I’ll contribute first. Most of them need to be cleaned up from their current state of “get it working” to publishable code. I’m going to start working on that over the summer. There are a few projects that I am working on, one of them being my thesis, which can all ultimately find their way on CPAN.