regex
Jonathan Peterson
JPeterson at bmjgroup.com
Wed Mar 1 10:32:23 GMT 2006
> > Our server has no modules so I would have to do it like this:
> >
> > /<a [^<]*href=["|\']?([^ "\']*)["|\']?[^>].*>([^<]*)</a>/i
>
> Ooh! Ooh! Can I be the first to go "Don't use a regex, use an actual
> parser as indicated in the FAQ"? Huh?? Can I??
I notice perldoc.com is down. But the FAQ is here too:
http://faq.perl.org/perlfaq6.html#Can_I_use_Perl_regul
What the FAQ doesn't say is that if you have good knowledge of, and
perhaps even control over, the data you are dealing with, regex solutions
are often acceptable.
Looking at your regex above, it might be that you are unaware of
'non-greedy quantifiers'. These are very useful (especially in your
situation) and can often remove the need for complicated negated character
classes and such. Here's a little program that I think does what you want:
#!/usr/bin/perl
# warning flag and use strict deliberately omitted
# to wind people up
my $str = qq! This is an <a href="http://www.foo.com/bar.html"> elephant </a> I think.!;
$str =~ s!<a .*?>(.*?)</a>!$1!i;
print $str;
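
To make the greedy/non-greedy difference concrete, here is a rough sketch
that runs both forms over the same input. The sample string and the sub
names strip_greedy and strip_lazy are made up for illustration, and I've
added /g so every link on the line is handled, which the program above
doesn't do.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample input: two links on one line.
my $html = q{<a href="one.html">one</a> and <a href="two.html">two</a>};

# Greedy: .* grabs as much as it can, then backs off only as far as needed.
sub strip_greedy {
    my ($s) = @_;
    $s =~ s!<a .*>(.*)</a>!$1!gi;
    return $s;
}

# Non-greedy: .*? takes as little as it can while still allowing a match.
sub strip_lazy {
    my ($s) = @_;
    $s =~ s!<a .*?>(.*?)</a>!$1!gi;
    return $s;
}

print strip_greedy($html), "\n";   # prints "two"
print strip_lazy($html), "\n";     # prints "one and two"
```

The greedy version swallows everything from the first "<a " to the last
"</a>" and keeps only the text of the final link, which is almost never
what you want.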
There are many kinds of HTML that will not be correctly modified by this
simple regex. You'll have to try it and see if it's good enough.
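
For instance, here is a small sketch of two inputs that trip it up. The
sample strings and the sub name strip_links are invented for the example;
the pattern is the one from the program above with /g added.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same substitution as above, applied to every match in the string.
sub strip_links {
    my ($s) = @_;
    $s =~ s!<a .*?>(.*?)</a>!$1!gi;
    return $s;
}

# Case 1: the link text spans several lines. Without the /s modifier,
# "." does not match a newline, so nothing is replaced.
my $multiline = qq{<a href="x.html">\nelephant\n</a>};

# Case 2: a ">" inside an attribute value. The pattern treats it as the
# end of the opening tag and leaves part of the attribute behind.
my $gt_in_attr = q{<a href="x" title="a > b">text</a>};

print strip_links($multiline) eq $multiline ? "unchanged\n" : "changed\n";
print strip_links($gt_in_attr), "\n";   # prints: ' b">text' (without the single quotes)
```

Adding the /s modifier should take care of the first case. The second is
the sort of thing that really wants a parser, so whether the regex is good
enough depends on what your HTML actually looks like.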