regex

Jonathan Peterson JPeterson at bmjgroup.com
Wed Mar 1 10:32:23 GMT 2006


> > Our server has no modules so I would have to do it like this:
> > 
> > /<a [^<]*href=["|\']?([^ "\']*)["|\']?[^>].*>([^<]*)</a>/i
> 
> Ooh! Ooh! Can I be the first to go "Don't use a regex, use an actual
> parser as indicated in the FAQ"? Huh?? Can I??

I notice perldoc.com is down. But the FAQ is here too:

http://faq.perl.org/perlfaq6.html#Can_I_use_Perl_regul

What the FAQ doesn't say is that if you have a good knowledge of, and 
perhaps even control over, the data you are dealing with, a regex solution 
is often acceptable.

Looking at your regex above, it might be that you are unaware of 
'non-greedy quantifiers'. These are very useful (especially in your 
situation) and can often remove the need for complicated negated character 
classes and such. Here's a little program that I think does what you want:

#!/usr/bin/perl
# warning flag and use strict deliberately omitted
# to wind people up

my $str = qq! This is an <a href="http://www.foo.com/bar.html"> elephant </a> I think.!;
$str =~ s!<a .*?>(.*?)</a>!$1!i;
print $str;
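
In case the greedy/non-greedy difference isn't obvious, here is a rough 
sketch that runs both forms against the same string. The two-link input 
and the variable names are made up for illustration; they are not from 
the original question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical input: two links on one line make the difference visible.
my $html = q{<a href="/one">first</a> and <a href="/two">second</a>};

# Greedy: the first .* grabs as much as it can, so the "tag" part
# stretches from the first <a all the way into the second link.
my ($greedy) = $html =~ m!<a .*>(.*)</a>!i;

# Non-greedy: each .*? stops at the first place where the rest of
# the pattern can match, so we stay inside the first link.
my ($lazy) = $html =~ m!<a .*?>(.*?)</a>!i;

print "greedy:     $greedy\n";   # prints "greedy:     second"
print "non-greedy: $lazy\n";     # prints "non-greedy: first"
```

The greedy version skips straight past the first link, which is almost 
never what you want when pulling things out of markup.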

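There are many kinds of HTML that this simple regex will not modify 
correctly: it only touches the first link (there is no /g), it will not 
match a link whose tag or text contains a newline (there is no /s), and a 
'>' inside an attribute value will confuse it. You'll have to try it on 
your own data and see if it's good enough.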
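
As an illustration of where the simple regex falls short, here is a 
small sketch with a second link whose text runs over a line break. The 
input is invented, and adding the /g and /s modifiers is just one 
possible tweak, assuming your links never contain nested tags or a '>' 
inside an attribute.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical input: two links, the second with its text split over two lines.
my $html = qq{See <a href="/one">one</a> and <a href="/two">number\ntwo</a>.};

# The substitution as posted: replaces the first link only.
(my $plain = $html) =~ s!<a .*?>(.*?)</a>!$1!i;

# Same pattern with /g (replace every match) and /s (let . match newlines).
(my $flags = $html) =~ s!<a .*?>(.*?)</a>!$1!gis;

print "as posted:\n$plain\n\n";
print "with /g and /s:\n$flags\n";
```

The first result still contains the second <a> tag; the second result 
has both links reduced to their text. Anything messier than that is 
where the parser advice from the FAQ starts to look attractive.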



