Josh's Quick Intro to RegEx

You may be a new programmer, or a web designer, or just someone who's heard the word 'RegEx', and asked: What is a Regex? How do I use it? And why does it hurt my brain? Well, relax. The doctor is in. Here's two aspirin and some water.

What is it doctor?

Oh, it's two parts hydrogen and one part oxygen; but that's not important right now. We're here to talk about RegEx. RegEx is short for Regular Expressions, but that's also not important right now. Regexs are just patterns for matching bits of text. Whenever you hear RegEx, just think: pattern. In fact, I'll stop using the word regex right now, since it mainly sounds like the kind of medicine you'll need after trying to write a complex regex.

What are these patterns good for?

Patterns are mainly used for three things: to see if some text contains the pattern (matching), to replace part of the text with other text (replacement), and to pull out portions of the text for later use (extraction). Patterns contain a combination of regular letters and special symbols like ., *, ^, $, \w and \d. Most programming languages use pattern matching with a subset of the Perl syntax. My examples will use JavaScript, but the same pattern syntax should work in Java, Python, Ruby, Perl, and many other languages.

Matching

Suppose you want to know if the text "Sally has an apple and a banana." contains the word 'apple'. You would do it with the pattern 'apple'.

var text = "Sally has an apple and a banana.";
if(text.match(/apple/)) {
    console.log("It matches!");
}

Now suppose you want to know if the text begins with the word 'apple'. You'd change the pattern to '^apple'. The ^ is a special symbol meaning 'the start of the text'. So this will only match if the 'a' of apple is right after the start of the text. Call it the same way as before.

var text = "Sally has an apple and a banana.";
if(text.match(/^apple/)) {
    //this won't be called because the text doesn't start with apple
    console.log("It matches!");
}

Besides the ^ symbol for 'start of text', here are some other symbols worth knowing (there are far more than this, but these are the most important).

$ = end of the text
\s = any whitespace (spaces, tabs, newlines)
\S = anything *but* whitespace
\d = any digit (0-9)
\w = any word character (upper & lower case letters, digits, and the underscore _)
. = any single character (letter, number, symbol, whitespace) except a newline
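
Here's a quick sketch of a few of these symbols at work. The sample strings and variable names are made up for illustration; test() is the standard pattern method that simply answers true or false.

```javascript
// Sample text invented for this example.
var sentence = "Sally has 3 apples";

// $ anchors the pattern to the end of the text.
var endsWithApples = /apples$/.test(sentence);

// \d matches a single digit, like the 3 in the sentence.
var hasDigit = /\d/.test(sentence);

// \s matches one whitespace character, so this finds "has" followed by a space.
var hasThenSpace = /has\s/.test(sentence);

// \w matches one word character; three in a row need a word of 3+ characters.
var threeWordChars = /\w\w\w/.test("a b c");

// . matches any single character (other than a newline), here the dash.
var aDotC = /a.c/.test("a-c");

console.log(endsWithApples, hasDigit, hasThenSpace, threeWordChars, aDotC);
// prints: true true true false true
```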

If you want to match the same letter multiple times you can do this with a quantifier. For example, to match the letter q one or more times put the '+' symbol after the letter.

var text = "ppqqp";
if(text.match(/q+/)) console.log("there's at least one q");

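For zero or more times use the '*' symbol. Keep in mind that zero q's still counts as a match, so on its own this pattern matches any text, even text with no q in it.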

var text = "ppqqp";
if(text.match(/q*/)) console.log("there's zero or more q's");
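
The difference between + and * is easiest to see side by side. This is a small sketch with made-up strings; it looks at element [0] of the match result, which holds the text that was actually matched.

```javascript
// Sample strings made up for this example.
var withQ = "ppqqp";
var withoutQ = "ppp";

// q+ needs at least one q, so it finds "qq" in the first string...
var plusFound = withQ.match(/q+/)[0];
// ...and finds nothing at all (null) in the second.
var plusMissing = withoutQ.match(/q+/);

// q* is happy with zero q's, so it matches the empty text at the very start
// of both strings and never returns null.
var starOnQ = withQ.match(/q*/)[0];
var starOnNoQ = withoutQ.match(/q*/)[0];

console.log(plusFound, plusMissing, starOnQ === "", starOnNoQ === "");
// prints: qq null true true
```

That's why q* is usually combined with something else in a pattern rather than used alone.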

You can also group letters with parentheses:

var text = "ppqqp";
if(text.match(/(pq)+/)) console.log("found at least one 'pq' match");

So, to recap:

. = any character
x+ = match 'x' one or more times
x* = match 'x' zero or more times
ex: match foo zero or more times, followed by bar one or more times = (foo)*(bar)+
x|y = match x or y
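
To make the recap concrete, here's a rough sketch of the last two entries. The test strings are invented, and I've wrapped the first pattern in ^ and $ so it has to cover the whole text rather than just part of it.

```javascript
// (foo)*(bar)+ : "foo" zero or more times, then "bar" one or more times.
// The ^ and $ force the pattern to match the entire string.
var fooThenBar = /^(foo)*(bar)+$/;

var justBar = fooThenBar.test("bar");            // zero foo's, one bar
var manyOfEach = fooThenBar.test("foofoobarbar"); // two foo's, two bar's
var justFoo = fooThenBar.test("foo");            // no bar, so no match

// x|y : match either alternative.
var fruit = /apple|banana/;

var hasBanana = fruit.test("Sally has a banana");
var hasCherry = fruit.test("Sally has a cherry");

console.log(justBar, manyOfEach, justFoo, hasBanana, hasCherry);
// prints: true true false true false
```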

Replacing text

Now that you can match text, you can replace it. For example, swap 'Sally' for 'Billy', or replace every instance of 'ells' with 'ines'.

var text = "Sally sells seashells";
var text2 = text.replace(/Sally/,"Billy"); //turns into "Billy sells seashells"
var text3 = text.replace(/ells/g,"ines"); //turns into "Sally sines seashines" (the g modifier, covered below, replaces every match)

Modifiers

Most pattern APIs have a few modifiers to change how the search is executed. Here are the important ones:

Make the search case insensitive: text.match(/pattern/i)

Normally the patterns are case sensitive, meaning the pattern 'apple' won't match the word 'Apple'. Add the i modifier after the pattern's closing slash to make it case insensitive.
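
A tiny sketch of the difference, using a made-up string:

```javascript
// Sample text invented for this example.
var fruitWord = "Apple";

// Without i, the lowercase pattern does not match the capital A.
var strictMatch = /apple/.test(fruitWord);

// With i, upper and lower case letters are treated as the same.
var relaxedMatch = /apple/i.test(fruitWord);

console.log(strictMatch, relaxedMatch);
// prints: false true
```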

Make the search multiple lines: text.match(/pattern/m)

Normally ^ and $ only match at the very start and very end of the whole text. With the m modifier they also match at the start and end of every line, that is, just after and just before each newline character '\n'. Patterns without ^ or $ already search the entire string, newlines and all, so m makes no difference to them.
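
In JavaScript the m modifier only changes what ^ and $ mean: with it they also match at the start and end of each line inside the text. Here's a small sketch with a made-up two-line string to show the effect.

```javascript
// Sample two-line text invented for this example.
var twoLines = "first line\nsecond line";

// Without m, ^ only matches at the very start of the whole string.
var withoutM = /^second/.test(twoLines);

// With m, ^ also matches right after each newline.
var withM = /^second/m.test(twoLines);

// A pattern with no anchors finds text on any line, with or without m.
var noAnchors = /second/.test(twoLines);

console.log(withoutM, withM, noAnchors);
// prints: false true true
```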

Make a replace global: text.replace(/foo/g, "bar")

Normally the replace() function will only replace the first match. If you want to replace every match in the string use the g modifier. This means you could replace every copy of 'he' with 'she' in an entire book with a single replace() call.
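
Here's a short sketch of replace() with and without g, reusing the Sally sentence. The variable names are just for illustration. As a bonus, g also changes match(): it returns every match instead of only the first.

```javascript
var rhyme = "Sally sells seashells";

// Without g, only the first 'ells' is replaced.
var firstOnly = rhyme.replace(/ells/, "ines");

// With g, every 'ells' is replaced.
var everyMatch = rhyme.replace(/ells/g, "ines");

// With g, match() returns an array holding all of the matches.
var allMatches = rhyme.match(/ells/g);

console.log(firstOnly);         // prints: Sally sines seashells
console.log(everyMatch);        // prints: Sally sines seashines
console.log(allMatches.length); // prints: 2
```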

Substring Extraction

Another major use for patterns is string extraction. When you do a match, every group of parentheses becomes a submatch, which you can use individually. Suppose you have a text string with a date in it and you want to get the month, day, and year parts out of it. You could do it like this:

var text = "I was born on 08-31-1975 and I'm a Virgo.";
var parts = text.match(/(\d\d)-(\d\d)-(\d\d\d\d)/); //use a /pattern/ literal; inside a quoted string "\d" turns into a plain "d"
//pull out the matched parts
var month = parts[1]; //08
var day = parts[2]; //31
var year = parts[3]; //1975
//parts[0] would give you the entire match, eg: 08-31-1975
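
One thing to watch for: match() returns null when the pattern isn't found, so reading parts[1] would throw an error. Here's a minimal sketch of a safer version. The extractDate helper and its return shape are my own invention for illustration, not a built-in function.

```javascript
// Hypothetical helper: returns {month, day, year} as strings,
// or null when the text contains no MM-DD-YYYY date.
function extractDate(text) {
  var parts = text.match(/(\d\d)-(\d\d)-(\d\d\d\d)/);
  if (parts === null) {
    return null; // match() gives null when nothing matched
  }
  // parts[0] is the whole match; the groups start at index 1.
  return { month: parts[1], day: parts[2], year: parts[3] };
}

var found = extractDate("I was born on 08-31-1975 and I'm a Virgo.");
var missing = extractDate("No dates in here.");

console.log(found.month, found.day, found.year); // prints: 08 31 1975
console.log(missing);                            // prints: null
```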

The Cheat Sheet

The standard pattern syntax in most languages is expansive and complex, so I'll only list the ones that are actually useful. For a full list refer to the documentation for the programming language you are working with.

Match anywhere in the text: "Sally sells seashells".match(/ells/) (the pattern fits both sells and seashells; without the g modifier, match() returns only the first one)

Match beginning of text: "Sally sells seashells".match(/^Sally/)

Match end of text: "Sally sells seashells".match(/ells$/) (matches only the seashells at the end)

any word at least three letters long \w\w\w

anything .

anything followed by any letter .\w

the letter q followed by any letter q\w

the letter q followed by any white space q\s

the letter q one or more times q+

the letter q zero or more times q*

any single digit \d

any number with two digits: \d\d

any number at least two digits long: \d\d+ (plain \d+ would also match a single digit)

any decimal number \d+\.\d+ //ex: 5.0

any number with an optional decimal \d+(\.\d+)? //ex: 5.0 or 5 (? means zero or one time; using * here would also allow 5.0.1)

match the numbers in this date string: 2011-04-08 (\d\d\d\d)-(\d\d)-(\d\d)

also be able to match this date string: 2001-4-8 (\d\d\d\d)-(\d+)-(\d+)
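
To check a couple of these cheat sheet entries for yourself, here's a rough sketch that runs the two date patterns and the decimal pattern against sample strings. The variable names are mine, and the ^ and $ around the decimal pattern are added so it has to cover the whole string.

```javascript
// The two date patterns from the cheat sheet.
var strictDate = /(\d\d\d\d)-(\d\d)-(\d\d)/;
var looseDate = /(\d\d\d\d)-(\d+)-(\d+)/;

// The strict pattern needs two-digit month and day parts.
var paddedParts = "2011-04-08".match(strictDate);
var strictOnShort = "2001-4-8".match(strictDate);

// The loose pattern accepts one or more digits for month and day.
var shortParts = "2001-4-8".match(looseDate);

// Decimal number pattern, anchored to the whole string.
var isDecimal = /^\d+\.\d+$/.test("5.0");
var intIsDecimal = /^\d+\.\d+$/.test("5");

console.log(paddedParts[1], paddedParts[2], paddedParts[3]); // prints: 2011 04 08
console.log(strictOnShort);                                  // prints: null
console.log(shortParts[1], shortParts[2], shortParts[3]);    // prints: 2001 4 8
console.log(isDecimal, intIsDecimal);                        // prints: true false
```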

Conclusion

Patterns are a very complex subject so I've just tried to give you the basics. While complex, they are also incredibly powerful and useful. As you learn them you'll find you use them more and more for all sorts of cool things. For a more in-depth tutorial read the Mozilla JavaScript Regular Expression guide.

PS: Found a bug in the above code? Want to suggest more examples of useful regex's? Want to just shoot the breeze? I've turned off comments due to 99% of it being spam, so please tweet your suggestions to me instead: @joshmarinacci. Thanks for understanding!


Posted April 12th, 2011

Tagged: code programming