regex
Matt Lawrence
matt.lawrence at virgin.net
Wed Mar 1 13:59:53 GMT 2006
Jonathan Peterson wrote:
>>>Our server has no modules so I would have to do it like this:
>>>
>>>/<a [^<]*href=["|\']?([^ "\']*)["|\']?[^>].*>([^<]*)</a>/i
>>
>>Ooh! Ooh! Can I be the first to go "Don't use a regex, use an actual
>>parser as indicated in the FAQ"? Huh?? Can I??
>
>
> I notice perldoc.com is down. But the FAQ is here too:
>
> http://faq.perl.org/perlfaq6.html#Can_I_use_Perl_regul
>
> What the faq doesn't say is that if you have a good knowledge of, and
> perhaps even control over, the data you are dealing with, regex solutions
> are often acceptable.
>
> Looking at your regex above, it might be that you are unaware of
> 'non-greedy quantifiers'. These are very useful (especially in your
> situation) and can often remove the need for complicated negated character
> classes and such. Here's a little program that I think does what you want:
>
> #!/usr/bin/perl
> # warning flag and use strict deliberately omitted
> # to wind people up
>
> my $str = qq! This is an <a href="http://www.foo.com/bar.html"> elephant
> </a> I
> think.!;
> $str =~ s!<a .*?>(.*?)</a>!$1!i;
> print $str;
>
> There are many kinds of HTML that will not be correctly modified by this
> simple regex. You'll have to try it and see if it's good enough.
>
It'll catch more if you give it the s flag, which lets . match newlines too:
s!<a .*?>(.*?)</a>!$1!is
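
To see the difference, here is a minimal sketch that runs the substitution with and without the s flag. The sample string is a hypothetical variation on the one quoted above, with a line break inside the link text, and the variable names are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample: the link text spans a line break.
my $html = qq{This is an <a href="http://www.foo.com/bar.html"> elephant\n</a> I think.};

# Without /s, "." does not match "\n", so the pattern cannot
# reach </a> and the string is left unchanged.
(my $without_s = $html) =~ s!<a .*?>(.*?)</a>!$1!i;

# With /s, "." also matches "\n", so the link is stripped.
(my $with_s = $html) =~ s!<a .*?>(.*?)</a>!$1!is;

print "without /s: $without_s\n";
print "with /s:    $with_s\n";
```

This still only handles simple links, as the caveat quoted above says.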
Matt