regex

Wed Mar 1 13:59:53 GMT 2006

Jonathan Peterson wrote:
>>>Our server has no modules so I would have to do it like this:
>>>
>>>/<a [^<]*href=["|\']?([^ "\']*)["|\']?[^>].*>([^<]*)</a>/i
>>
>>Ooh! Ooh! Can I be the first to go "Don't use a regex, use an actual
>>parser as indicated in the FAQ"? Huh?? Can I??
> 
> 
> I notice perldoc.com is down. But the FAQ is here too:
> 
> http://faq.perl.org/perlfaq6.html#Can_I_use_Perl_regul
> 
> What the FAQ doesn't say is that if you have a good knowledge of, and 
> perhaps even control over, the data you are dealing with, regex solutions 
> are often acceptable.
> 
> Looking at your regex above, it might be that you are unaware of 
> 'non-greedy quantifiers'. These are very useful (especially in your 
> situation) and can often remove the need for complicated negated character 
> classes and such. Here's a little program that I think does what you want:
> 
> #!/usr/bin/perl
> # warning flag and use strict deliberately omitted
> # to wind people up
> 
> my $str = qq! This is an <a href="http://www.foo.com/bar.html"> elephant 
> </a> I
> think.!;
> $str =~ s!<a .*?>(.*?)</a>!$1!i;
> print $str;
> 
> There are many kinds of HTML that will not be correctly modified by this 
> simple regex. You'll have to try it and see if it's good enough.
> 

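It'll catch more if you give it the s flag. Without it, "." won't match a
newline, so a link whose tag or text spans two lines is left untouched.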

s!<a .*?>(.*?)</a>!$1!is
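
A rough sketch of the difference, assuming the link text contains a line
break. The sample string and the variable names are made up for
illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample: a newline sits between the link text and </a>.
my $html = qq{This is an <a href="http://www.foo.com/bar.html"> elephant\n</a> I think.};

# Without /s, "." does not match "\n", so (.*?) can never reach </a>
# and the substitution leaves the string alone.
(my $without_s = $html) =~ s!<a .*?>(.*?)</a>!$1!i;

# With /s, "." matches "\n" as well, so both tags are stripped.
(my $with_s = $html) =~ s!<a .*?>(.*?)</a>!$1!is;

print "without /s: $without_s\n";
print "with /s:    $with_s\n";
```

This only changes the first link in the string. Adding the g flag as well
would apply it to every link.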

Matt