Blogged by Ujihisa. Standard methods of programming and thoughts including Clojure, Vim, LLVM, Haskell, Ruby and Mathematics written by a Japanese programmer. github/ujihisa

Monday, August 16, 2010

Regular Expression Benchmarks

Most modern programming languages have regular expression engines. Here I'll compare the speed of each regular expression engines with long text matching which has backtracking. I used Ruby, Perl and Python for this examination.

A time-consuming regular expression example

/(aa)*aa/ =~ 'aa' needs backtracking. The longer the length is, the longer it would take. Let m as the length of the characters and n as the number of repeat. For example the case when n = 2 and m = 3 is

/(aa)*aa(bb)*bb(cc)*cc/ =~ 'aabbcc'

Codes

Ruby:

def time
  t = Time.now
  yield
  Time.now - t
end

def ryutorion(m, n)
  chars = (0...n).map {|i| ('a'.ord + i % 26).chr }
  regex = Regexp.new chars.map {|c| "(#{c*m})*#{c*m}" }.join
  str = chars.map {|c| c*m }.join
  time { regex =~ str }
end

Python:

import re
import time

def timeof(f):
    t = time.time()
    f()
    return time.time() - t

def ryutorion(m, n):
    chars = [ chr(ord('a') + i % 26) for i in range(n) ]
    regex = re.compile(''.join([ '(' + c*m + ')*' + c*m for c in chars ]))
    str = ''.join([ c*m for c in chars ])
    return timeof(lambda: regex.match(str))

Perl:

use warnings;
use strict;
use Time::HiRes qw(gettimeofday tv_interval);

sub gen_regex {
    my ($m, $n) = @_;

    join "", map { 
        my $s = chr(ord('a') + ($_ % 26)) x $m;
        "($s)*$s"
    } (0 .. $n - 1);
}

sub gen_input {
    my ($m, $n) = @_;

    join "", map { chr(ord('a') + ($_ % 26)) x $m } (0 .. $n - 1);
}

my $r = gen_regex 1, 1;
my $i = gen_input 1, 1;
$r = qr/$r/;
my $ts = [gettimeofday];
my $te = [gettimeofday] if($i =~ /$r/);
my $elapsed = tv_interval $ts, $te;
print "$elapsed\n";

for my $N (10, 20) {
    for my $M (10000, 20000) {
        $r = gen_regex $M, $N;
        $i = gen_input $M, $N;
        $r = qr/$r/;
        $ts = [gettimeofday];
        $te = [gettimeofday] if($i =~ /$r/);
        $elapsed = tv_interval $ts, $te;
        print "$elapsed\n";
    }
}

This perl example was written by Masahiro Kimoto.

Result

shorter test: n = {10000, 20000} = m = {10, 20}

  • Ruby

    p ryutorion(10000, 10) #=> 0.000278
    p ryutorion(20000, 10) #=> 0.000698
    p ryutorion(10000, 20) #=> 0.000538
    p ryutorion(20000, 20) #=> 0.001026
    
  • Python

    print ryutorion(10000, 10) #=> 0.00126099586487
    print ryutorion(20000, 10) #=> 0.00168895721436
    print ryutorion(10000, 20) #=> 0.00180792808533
    print ryutorion(20000, 20) #=> 0.00338697433472
    
  • Perl

    ($M = 10000, $N = 10) #=> 0.000066
    ($M = 20000, $N = 10) #=> 0.000185
    ($M = 10000, $N = 20) #=> 0.000163
    ($M = 20000, $N = 20) #=> 0.000294
    

longer test: n = {10000, 20000} = m = {100, 200}

  • Ruby (ruby 1.9.2)

    0.002721
    0.00672
    0.005441
    0.013364
    
  • Python (python 2.7)

    Traceback (most recent call last):
      File "a.py", line 16, in <module>
        print ryutorion(10000, 100)
      File "a.py", line 11, in ryutorion
        regex = re.compile(''.join([ '(' + c*m + ')*' + c*m for c in chars ]))
      File "/usr/local/Cellar/python/2.7/lib/python2.7/re.py", line 190, in compile
        return _compile(pattern, flags)
      File "/usr/local/Cellar/python/2.7/lib/python2.7/re.py", line 243, in _compile
        p = sre_compile.compile(pattern, flags)
      File "/usr/local/Cellar/python/2.7/lib/python2.7/sre_compile.py", line 511, in compile
        "sorry, but this version only supports 100 named groups"
    AssertionError: sorry, but this version only supports 100 named groups
    
  • Perl (perl 5.10.0)

    0.000981
    0.00269
    0.004181
    0.005484
    

Summary

Perl is the strongest. Use Perl.

No comments:

Post a Comment

Followers