Human-style URL recognition

3/12/2009

One of my pet peeves is that most regular expressions for matching URLs fall somewhat short of what I expect. This pattern from John Gruber is the best I’ve found so far but, like virtually every other implementation, it doesn’t match URLs without a protocol. Nobody expects to have to include “www” in a URL for it to work these days, and in daily conversation it’s rare to enunciate “aitch-tee-tee-pee-colon-slash-slash” when you refer to some website. So why is it so hard to match URLs without using these strings as crutches?


ActionScript 3 URL validator class

2/02/2009

A fairly common task in ActionScript projects is dealing with URLs in dynamically loaded text. Whenever I faced this prospect I usually ended up with some half-assed search for “http://”, then an indexOf(" ") to determine where the URL ended, and then smacked an <a href> tag on either side. Which pretty much meant that “http://___||/**.tasty_cheese_omelete” would be acceptable while “google.com” would not.
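To make the failure concrete, here is a minimal sketch of that naive approach, written in Python rather than ActionScript. The function name naive_linkify is made up for illustration and this is not the original code, just the same idea: find “http://”, run to the next space, wrap the result in an anchor tag.

```python
def naive_linkify(text: str) -> str:
    """Wrap every run that starts with "http://" and ends at the next space in an <a> tag."""
    out = []
    pos = 0
    while True:
        start = text.find("http://", pos)
        if start == -1:
            # No more candidates: keep the rest of the text untouched.
            out.append(text[pos:])
            break
        end = text.find(" ", start)
        if end == -1:
            end = len(text)  # the "URL" runs to the end of the text
        url = text[start:end]
        out.append(text[pos:start])
        out.append(f'<a href="{url}">{url}</a>')
        pos = end
    return "".join(out)
```

Fed the two strings from the paragraph above, it happily links the garbage one and leaves “google.com” as plain text, since nothing ever checks what comes after the protocol.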

Hence I always promised myself that next time I’d write a proper class to deal with URLs so that this bullshit wouldn’t pass anymore.

Regular Expressions have always scared the bejeezus out of me, but I realized there was probably no other option when looking for something that has a valid URL structure. It was surprisingly easy to find some fairly good examples for what I wanted to achieve, and surprisingly difficult to get any of them to work in ActionScript. I’m blaming it on the AS implementation of RegExp, and I’m sure ActionScript is blaming it on my implementation of dumb.

After some swearing, though, I managed to determine whether a string had the proper structure of a URL, but it still wasn’t perfect. I wanted my class to be intelligent enough to accept all valid Top Level Domains (there are 267 in use) but reject any invalid attempt to pass as a valid TLD. For example, egypt.eg should pass, but breakfast.egg should not. The solution ended up being to store all the TLDs in a Vector and loop through them to check whether the domain was valid.
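The two-step idea (check the shape with a regular expression, then look the TLD up in a list) can be sketched roughly as follows. This is Python, not the ActionScript class itself; the pattern, the names URL_SHAPE, KNOWN_TLDS and looks_like_url, and the six-entry TLD list are all assumptions made for illustration, and a set stands in for the Vector. The pattern is deliberately simple and ignores things like ports and IP addresses.

```python
import re

# Tiny illustrative subset; the real list would hold every TLD in use.
KNOWN_TLDS = frozenset({"com", "org", "net", "eg", "uk", "de"})

# Optional protocol, one or more dot-terminated host labels, an alphabetic
# final label (the TLD candidate), and an optional path/query/fragment tail.
URL_SHAPE = re.compile(
    r"^(?:https?://)?"
    r"((?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z]{2,})"
    r"(?:[/?#]\S*)?$",
    re.IGNORECASE,
)


def looks_like_url(candidate: str) -> bool:
    """Return True if candidate has a URL-like shape and ends in a known TLD."""
    match = URL_SHAPE.match(candidate)
    if match is None:
        return False
    # Group 1 is the host name; its last dot-separated label is the TLD.
    tld = match.group(1).rsplit(".", 1)[1].lower()
    return tld in KNOWN_TLDS
```

With this sketch, looks_like_url("egypt.eg") and looks_like_url("google.com") pass, while "breakfast.egg" fails the TLD lookup and "http://___||/**.tasty_cheese_omelete" never gets past the shape check.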

All in all the class now does a pretty good job, although there are probably still holes in it (let me know about them when you find them).

Check it out and grab it below, and then post a comment explaining to me slowly and carefully that I can achieve this exact functionality with the

String.giveMeAllTheURLs();

method or something similar, which I’m sure is what will happen. ;)

The code is freely available here. I would appreciate credit, and I’d like you to share whatever improvements you make with me and the rest of the world, but I have neither the time nor the energy to actually do any enforcing, so knock yourself out.
