Today I got a shocker.
I tried the not-too-complicated regular expression:
href=['"](?<link>[^?'">]*\??[^'" >]*)[^>]*(?<displayed>[^>]*)</a> |
I worked in the excellent RegexBuddy, and ran the above regex on a normal size HTML page (the regex aims to find all links in a page). The regex hung, and I got the following message:
The match attempt was aborted early because the regular expression is too complex. The regex engine you plan to use it with may not be able to handle it at all and crash. Look up "catastrophic backtracking" in the help file to learn how to avoid this situation. |
I looked up “catastrophic backtracking”, and got that regexes such as “(x+x+)+y” are evil. Sure – but my regex does not contain nested repetition operations!
I then tried this regex on a short page, and it worked. This was another surprise, as I always thought most regex implementations are compiled, and then run in O(n) (I never took the time to learn all the regex flavors, I just assumed what I learned in the university was the general rule).
It turns out that one of the algorithms to implement regex uses backtracking, so a regex might work on a short string but fail on a larger one. It appears even simple expressions such as “(a|aa)*b” take exponential time in this implementation.
I looked around a bit, but failed to find a good description of the internal implementation of .NET’s regular expression engine.
BTW, the work-around I used here is modify the regex. It’s not exactly what I aimed for, but it’s close enough:
href=['"](?<link>[^'">]*)[^>]*>(?<displayed>[^>]*)</a> |
9 comments:
Along with other approaches, I suggest constructing your regex with non-greedy operators. (In Python these are "+?" and "*?").
I also found that this is what I actually need in most cases. For example,
"x.*?x" will match anything from the first x to the next x, without another x in the middle.
It's not always what you want. But in any case, the regex I composed initially had non-greedy operators - it was still too slow (== never finished). I tried some tweaking to make it faster, including replacing all * with {0, 200}, but it didn't do much for the speed.
There's a classic article on the effect of backtracking support on most modern regex implementations. I'm surprised you didn't know that.
It appears java & perl also suffer from this. It's beyond me why these languages do not implement the efficient NFA approach (the same O(n) algorithm that is taught in basic automata classes, and has been implemented in awk and grep years ago).
Backtracking should definitely not be used for regular expressions that do not contain back references.
From the link you posted:
Regular expression matching can be simple and fast, using finite automata-based techniques that have been known for decades. In contrast, Perl, PCRE, Python, Ruby, Java, and many other languages have regular expression implementations based on recursive backtracking that are simple but can be excruciatingly slow. With the exception of backreferences, the features provided by the slow backtracking implementations can be provided by the automata-based implementations at dramatically faster, more consistent speeds.
Good stuff. Regex implementations are a fascinating topic, and I've spent a few pleasant hours exploring the subject (bottom) a few years ago.
The link tomer gabel posted is excellent and has its place, but it is important to see the whole picture, which is why it can be very educational to read the ensuing discussion on Perlmonks, which explains *why* Perl (and others) chose what they chose, and how to avoid such problems.
Thanks Eli for the referral to the interesting and rather long discussion. My viewpoint - PERL and other relevant languages should provide Thompson NFA, because this problem is not limited to so called pathological cases, but does come up (albeit rarely) in practice.
Of course it's a Task to write an efficient implementation, and has to be prioritized, tested, integrated... but in a perfect world, my regexes would not get stuck, and I would not have to learn intricacies on which are the pathological cases (if the class of pathological cases can be described so simply, then it should be simple to test for them and use NFA only for these cases).
I did a little experiment.
I tried out your original regexp in Python (tweaked to conform to the Python re syntax), on http://qimmortal.blogspot.com/
Indeed, it got stuck.
Then, I rewrote your regexp like so:
r = 'href=[\'"](?P<link>.+?)[\'"](.*?)>(?P<displayed>.*?)</a>'
On the same input, my regexp runs fast, and re.finditer found all links without a problem.
I'm not 100% sure, but I think it's not exactly the same semantic as my regex. It could be "good enough", just like my own modified version.
This thread is going a bit off topic, but as lorg pointed out in a private conversation, his version of the regex works better for cases like "text". I still need to do a bit of work to find the optimal regex for this purpose.
Post a Comment