Regex to match odd numbers

Sat Aug 23 16:43:06 BST 2014

Paul Makepeace <paulm at paulm.com> wrote:
> Does anyone have any concrete examples where the locale affecting
> meaning/matching of \d causes real problems?

In my experience, it's not necessarily the locale so much as Unicode
characters that the programmer wasn't expecting, which then cause
surprising behaviour. For example, this looks more-or-less sensible:

say $arg + 17 if $arg =~ /\A\d+\z/

but if $arg consists of digits other than 0..9, Perl will treat it as 0
in the addition and (with warnings enabled) emit a "isn't numeric"
warning. That's particularly problematic if you also have fatal
warnings enabled.
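
Here's a minimal sketch of that failure mode. The input value is
made up for illustration (two Arabic-Indic digits), and the
__WARN__ handler is only there so the warning can be counted rather
than printed:

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical input: ARABIC-INDIC DIGIT ONE and DIGIT TWO.
# Both are category Nd, so \d matches them under Unicode rules.
my $arg = "\x{661}\x{662}";

# Collect warnings instead of printing them to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

my $sum;
if ($arg =~ /\A\d+\z/) {
    # Numeric conversion only understands ASCII digits, so $arg
    # is treated as 0 here and a "numeric" warning is raised.
    $sum = $arg + 17;
}

say "sum: $sum";
say "warnings raised: ", scalar @warnings;
```

With 'use warnings FATAL => "numeric"' in place of plain 'use warnings',
the addition would die instead of carrying on with 0.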

> I'm assuming the worst case is it matches too much, e.g. picks up
> spurious Chinese numerals, which seems like a wildly improbable edge
> case for most datasets+patterns.

"Improbable" sounds reasonable, but bear in mind that people often use
regexes containing things like \d and \w for validating input from
untrusted sources, so there's scope for significant brokenness there.
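
For validation specifically, one way to sidestep the issue is to ask
for ASCII digits explicitly. A rough sketch follows; the helper names
are mine, not anything standard, and the /a flag needs Perl 5.14 or
later:

```perl
use v5.14;
use warnings;

# Hypothetical helper: true only for a non-empty run of ASCII digits.
# An explicit [0-9] class is unaffected by locale or Unicode rules.
sub is_ascii_integer {
    my ($input) = @_;
    return (defined($input) && $input =~ /\A[0-9]+\z/) ? 1 : 0;
}

# Same check, using the /a flag to restrict \d to ASCII digits.
sub is_ascii_integer_a {
    my ($input) = @_;
    return (defined($input) && $input =~ /\A\d+\z/a) ? 1 : 0;
}

for my $candidate ("42", "\x{661}\x{662}", "", "42\n") {
    printf "%d %d\n",
        is_ascii_integer($candidate),
        is_ascii_integer_a($candidate);
}
```

Both versions anchor with \A and \z, so a trailing newline is rejected
as well.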

> Presumably there isn't a situation
> where \d _doesn't_ match [0-9] at least? In other words [0-9] is a
> subset of \d for all locales.

For all *sane* locales, sure. :-)

One of the many unpleasant things about locales is that you never
really know what you're going to get, and there's no shortage of OSes
with broken locale definitions.

> $ export LC_CTYPE=zh_CN.utf-8
> $ perl -Mlocale -Mutf8 -le 'print "一" =~ /\d/'  # 1
>
> Doesn't print 1 - why?

I don't know what the expected behaviour is for the zh_CN.utf-8
locale, but that behaviour doesn't surprise me for Unicode: the hanzi
numerals aren't decimal digits as far as Unicode is concerned, and
decimal digits are what \d matches under Unicode rules. More
specifically, their general category is Lo ("other letter"), rather
than (say) Nd ("decimal digit"):

$ perl -MUnicode::UCD=charinfo -E \
> 'say charinfo($_)->{category}, " ", chr =~ /\d/u for 0x4e00, 0x661'
Lo
Nd 1

(U+0661 is ARABIC-INDIC DIGIT ONE.)
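
The same distinction can be checked directly with property matches.
This is just a sketch, with a helper name of my own invention; it
compares \p{Nd}, \p{Lo}, \d and ASCII-only \d (the /a flag, Perl 5.14
or later) for three codepoints:

```perl
use v5.14;
use warnings;

# Hypothetical helper: report which classes a codepoint falls into.
sub digit_flags {
    my ($codepoint) = @_;
    my $char = chr $codepoint;
    return {
        nd      => ($char =~ /\p{Nd}/ ? 1 : 0),  # general category Nd
        lo      => ($char =~ /\p{Lo}/ ? 1 : 0),  # general category Lo
        d       => ($char =~ /\d/     ? 1 : 0),  # \d, Unicode rules
        d_ascii => ($char =~ /\d/a    ? 1 : 0),  # \d limited to 0-9
    };
}

# DIGIT ONE, ARABIC-INDIC DIGIT ONE, and the hanzi for "one".
for my $codepoint (0x31, 0x661, 0x4e00) {
    my $flags = digit_flags($codepoint);
    printf "U+%04X Nd=%d Lo=%d d=%d d/a=%d\n",
        $codepoint, @{$flags}{qw(nd lo d d_ascii)};
}
```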

> $ export LC_CTYPE=en_US.utf-8
> $ perl -Mlocale -Mutf8 -le 'print "三" =~ /[一-六]/'
> 1
>
> Why is it still 1?

That's because /[一-六]/ matches the set of characters whose codepoints
are in the range 0x4E00 through 0x516D (regardless of locale), and 三
is U+4E09 (which is in that range). Adding 'use re "debug"' to your
program reveals more information about what's going on there:

$ perl -Mlocale -Mutf8 -le 'use re "debug"; print "三" =~ /[一-六]/'
Compiling REx "[%x{4e00}-%x{516d}]"
Final program:
   1: ANYOF{loc}[{unicode}4E00-516D] (12)
  12: END (0)
stclass ANYOF{loc}[{unicode}4E00-516D] minlen 1
Matching REx "[%x{4e00}-%x{516d}]" against "%x{4e09}"
UTF-8 pattern and string...
Matching stclass ANYOF{loc}[{unicode}4E00-516D] against "%x{4e09}" (3 bytes)
   0 <> <%x{4e09}>           |  1:ANYOF{loc}[{unicode}4E00-516D](12)
   3 <%x{4e09}> <>           | 12:END(0)
Match successful!
1
Freeing REx: "[%x{4e00}-%x{516d}]"
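
Since the range is by codepoint rather than by numeric value, it
doesn't line up with "one to six" at all. As a quick illustration
(assuming the source file is saved as UTF-8), this runs the same class
against the hanzi for one to ten:

```perl
use strict;
use warnings;
use utf8;

binmode STDOUT, ':encoding(UTF-8)';

# The hanzi numerals for one to ten, in numeric order.
my @numerals = qw(一 二 三 四 五 六 七 八 九 十);

# Keep those whose codepoint lies in U+4E00 through U+516D.
my @in_range = grep { /[一-六]/ } @numerals;

for my $numeral (@numerals) {
    printf "%s U+%04X %s\n",
        $numeral, ord $numeral,
        ($numeral =~ /[一-六]/ ? "matches" : "no match");
}
```

Here 四 (U+56DB) and 十 (U+5341) fall outside the range, while 七
(U+4E03), 八 (U+516B) and 九 (U+4E5D) fall inside it.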

-- 
Aaron Crane ** http://aaroncrane.co.uk/