Naiteluo

home

Regular Expression in JavaScript

24 Sep 2012

Define Regular Expressions

In JavaScript, regular expressions are represented by RegExp objects.

We can create it in two ways: RegExp()constructor, or simply using literal syntax. For example:

var p = new RegExp('javascript');

var p = /javascript/;

Literal characters

All alphabetic characters and digits match themselves literally in regular expressions. JavaScript regular-expression syntax also supports certain noalphabetic characters through escape sequences that begin with backslash(\). The following table lists these characters.

Character Matches
alphabetic charecters itself
\0 the Null character (\u0000)
\t tab (\u0009)
\n Newline (\u0000A)
\v Vertical tabn (\u0000B)
\f Form feed (\u000C)
\r Carriage return (\u000D)
\x nn The Latin character specified by the hexadecimal number nn; for example, \x0A is the same as \n
\u xxxx The Unicode character specified by the hexadecimal number xxxx; for example, \u0009 is the same as \t
\cX The control character ^ X; for example the \cJ is equivalent to the newline character \n

Some punctuation characters have special meaning in regular expression.

	^ $ . * + ? = ! : | \ / ( )[ ] { }

if we want them to be treated literally in regualr expressions, we can precede them with a backslash(\).

Character class

Individual literal characters can be combined into character classes by placing them within square brackets.

/[abc]/		// match any one of the characters a, b or c

A negated character class is specified by placing a caret (^) as the first character inside the left bracket.

/[^abc]/	// match any one characters other than a, b or c

Character classes can use a hyphen to indicate a range of characters.

/[a-z]/		// match any one lowercase character from the Latin alphabet

We can also combine ranges.

/[a-zA-Z0-9]/	// match any letter or digit from the Latin alphabet

Because certain character classes are commonly used, the JavaScript regular-expression syntax includes special characters and escape sequences to represent these common classes.

Character Matches
[…] Any one character between the brackets.
[^…] Any one character not between the brackets.
. Any character except newline or another Unicode
\w Any ASCII word character. Equivalent to [a-zA-Z0-9_].
\W Any character that is not an ASCII word character. Equivalent to [^a-zA-Z0-9_].
\s Any Unicode whitespace character.
\S Any character that is not Unicode whitespace. Note that \w and \S are not the same thing. \d Any ASCII digit. Equivalent to [0-9].
\D Any character other than an ASCII digit. Equivalent to [^0-9].
[\b] A literal backspace (special case).

We can use them within square brackets:

/[\s\d]/

Repetition

Regular expression repetition characters

Character Matches
{n,m} Match the previous item at least n times but no more than m times.
{n,} Match the previous item n or more times.
{n} Match exactly n occurrences of the previous item.
? Match zero or one occurrences of the previous item. That is, the previous item is optional. Equivalent to {0,1}.
+ Match one or more occurrences of the previous item. Equivalent to {1,}.
* Match zero or more occurrences of the previous item. Equivalent to {0,}.

Some examples:

/\d{2,4}/ // Match between two and four digits
/\w{3}\d?/ // Match exactly three word characters and an optional digit 
/\s+java\s+/ // Match "java" with one or more spaces before and after 
/[^(]*/ // Match zero or more characters that are not open parenthesis

Be careful when using the * and ? repetition characters. Since these characters may match zero instances of whatever precedes them, they are allowed to match nothing. For example:

/a*/ // matches string 'bbbb'

Nongreedy repetition

The repetition characters listed in Table 10-3 match as many times as possible while still allowing any following parts of the regular expression to match. It is also possible to specify that repetition should be done in a nongreedy way. Simply follow the repetition character or characters with a question mark(?), it will match as few characters as necessary.

'baaaaa'.match(/ba+/);		// ['baaaaa']
'baaaaa'.match(/ba+?/);		// ['ba']

Alternation, Grouping, and References

The | character separates alternatives. For example, /ab|cd|ef/ matches the string “ab” or the string “cd” or the string “ef”. And /\d{3}|[a-z]{4}/ matches either three digits or four lowercase letters.

Note that alternatives are considered left to right until a match is found.

'ab'.match(/a|ab/);		// ['a']

Parentheses have several purposes in regular expressions.

see more in 10.1.4 of <JavaScript the Definitive Guide>

Specifying Match Position

Other regular ex- pression elements match the positions between characters, instead of actual characters.

Here summarizes regular-expression anchors.

Character Meaning
^ Match the beginning of the string and, in multiline searches, the beginning of a line.
$ Match the end of the string and, in multiline searches, the end of a line.
\b Match a word boundary. That is, match the position between a \w character and a \W character or between a \w character and the beginning or end of a string. (Note, however, that [\b] matches backspace.)
\B Match a position that is not a word boundary.
(?=p) A positive lookahead assertion. Require that the following characters match the pattern p, but do not include those characters in the match.
(?!p) A negative lookahead assertion. Require that the following characters do not match the pattern p.

For example:

/^JavaScript$/	// match the word “JavaScript” on a line by itself
/\bJava\b/ 	// search for “Java” as a word by itself

The following pattern matches the word “JavaScript” in “JavaScript: The Definitive Guide”, but it does not match “Java” in “Java in a Nutshell”, because it is not followed by a colon.

/[Jj]ava([Ss]cript)?(?=\:)/

/Java(?! Script)([A-Z]\w*)/ matches “Java” followed by a capital letter and any number of additional ASCII word characters, as long as “Java” is not followed by “Script”. It matches “JavaBeans” but not “Javanese”, and it matches “JavaScrip” but not “Java- Script” or “JavaScripter”.

Flag

JavaScript supports three flags

Character Meaning
i Perform case-insensitive matching.
g Perform a global match—that is, find all matches rather than stopping after the first match.
m Multiline mode. ^ matches beginning of line or beginning of string, and $ matches end of line or end of string.

String Methods for Pattern Matching

Strings support four methods that use regular expressions.