Thursday, February 28, 2008

Nothing is Regular About Expressions..

And I say that because I just love regular expressions. They are dazzling and challenging, and there isn’t a text you can’t search and manipulate with a regular expression pattern.
I thought that I might share some of the useful things I have learnt while working with patterns to validate user input.
Now here are some of the basic concepts that you need to learn about patterns:
Alternation:
A vertical bar separates alternatives. For example, gray|grey can match "gray" or "grey".
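To see alternation at work, here is a minimal sketch in Python's re module. Python is my choice of language here, and the words gray and grey, the name COLOR_WORD and the helper find_spellings are just illustrations, not something the rest of this post depends on.

```python
import re

# Alternation: the vertical bar lets the pattern match either spelling.
COLOR_WORD = re.compile(r"gray|grey")


def find_spellings(text):
    """Return every occurrence of either alternative, in order of appearance."""
    return COLOR_WORD.findall(text)


print(find_spellings("A gray cat and a grey dog."))  # ['gray', 'grey']
```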
Grouping:
Parentheses are used to define the scope and precedence of the operators (among other uses). For example, gray|grey and gr(a|e)y are equivalent patterns which both describe the set of "gray" and "grey".
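A short Python sketch can make the effect of grouping on precedence visible. The pattern names and the helper matches_whole are hypothetical, chosen only for this illustration; the third pattern shows what happens when the parentheses are left out and the bar splits the whole pattern instead of one letter.

```python
import re

UNGROUPED = re.compile(r"gray|grey")
GROUPED = re.compile(r"gr(a|e)y")   # the group limits the alternation to one letter
NO_GROUP = re.compile(r"gra|ey")    # without the group: "gra" OR "ey"


def matches_whole(pattern, word):
    """True when the pattern matches the entire word."""
    return pattern.fullmatch(word) is not None


for word in ("gray", "grey", "ey"):
    print(word,
          matches_whole(UNGROUPED, word),
          matches_whole(GROUPED, word),
          matches_whole(NO_GROUP, word))

# The parentheses also capture the letter that was chosen.
print(GROUPED.fullmatch("grey").group(1))  # e
```

The first two patterns agree on every word, while the ungrouped gra|ey accepts only "gra" or "ey" as a whole match.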
Quantification:
A quantifier after a token (such as a character) or group specifies how often that preceding element is allowed to occur. The most common quantifiers are ?, *, and +.
  • ? The question mark indicates zero or one occurrence of the preceding element. For example, colou?r matches both "color" and "colour".

  • * The asterisk indicates zero or more occurrences of the preceding element. For example, ab*c matches "ac", "abc", "abbc", "abbbc", and so on.

  • + The plus sign indicates one or more occurrences of the preceding element. For example, ab+c matches "abc", "abbc", "abbbc", and so on, but not "ac".

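The three quantifiers can be checked quickly in Python. This is only a sketch: the helper name full and the extra test word "colouur" are my own additions, and re.fullmatch is used so that the whole word has to fit the pattern.

```python
import re


def full(pattern, text):
    """True when the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None


examples = {
    r"colou?r": ["color", "colour", "colouur"],  # ? allows zero or one "u"
    r"ab*c": ["ac", "abc", "abbbc"],             # * allows zero or more "b"
    r"ab+c": ["ac", "abc", "abbbc"],             # + needs at least one "b"
}

for pattern, words in examples.items():
    for word in words:
        print(pattern, word, full(pattern, word))
```

Only two of the nine checks fail: "colouur" has one "u" too many for colou?r, and "ac" has no "b" for ab+c.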
I am also including this Basic Syntax Reference, which will help you a lot in understanding patterns.


    Character: Any character except [\^$.|?*+()
    Description: All characters except the listed special characters match a single instance of themselves. { and } are literal characters, unless they're part of a valid regular expression token (e.g. the {n} quantifier).
    Example: a matches a

    Character: \ (backslash) followed by any of [\^$.|?*+(){}
    Description: A backslash escapes special characters to suppress their special meaning.
    Example: \+ matches +

    Character: \Q...\E
    Description: Matches the characters between \Q and \E literally, suppressing the meaning of special characters. Not every regex flavor supports it.
    Example: \Q+-*/\E matches +-*/

    Character: [ ] (square brackets)
    Description: Starts a character class. A character class matches a single character out of all the possibilities offered by the character class. Inside a character class, different rules apply. The three entries that follow are only valid inside character classes.
    Example: [abc] matches a, b or c

    Character: \ (backslash) followed by any of ^-]\
    Description: A backslash escapes special characters to suppress their special meaning.
    Example: [\^\]] matches ^ or ]

    Character: - (hyphen) except immediately after the opening [
    Description: Specifies a range of characters. (Specifies a hyphen if placed immediately after the opening [)
    Example: [a-zA-Z0-9] matches any letter or digit

    Character: ^ (caret) immediately after the opening [
    Description: Negates the character class, causing it to match a single character not listed in the character class. (Specifies a caret if placed anywhere except after the opening [)
    Example: [^a-d] matches x (any character except a, b, c or d)

    Character: \d, \w and \s
    Description: Shorthand character classes matching digits 0-9, word characters (letters, digits and underscore) and whitespace respectively. Can be used inside and outside character classes.
    Example: [\d\s] matches a character that is a digit or whitespace

    Character: \D, \W and \S
    Description: Negated versions of the above. Should be used only outside character classes. (Can be used inside, but that is confusing.)
    Example: \D matches a character that is not a digit

    Character: . (dot)
    Description: Matches any single character except line break characters \r and \n. Most regex flavors have an option to make the dot match line break characters too.
    Example: . matches x or (almost) any other character

    Character: \n, \r and \t
    Description: Match an LF character, CR character and a tab character respectively. Can be used in character classes.
    Example: \r\n matches a DOS/Windows CRLF line break.

    Character: ^ (caret)
    Description: Matches at the start of the string the regex pattern is applied to. Matches a position rather than a character. Most regex flavors have an option to make the caret match after line breaks (i.e. at the start of a line in a file) as well.
    Example: ^. matches a in abc\ndef. Also matches d in "multi-line" mode.

    Character: $ (dollar)
    Description: Matches at the end of the string the regex pattern is applied to. Matches a position rather than a character. Most regex flavors have an option to make the dollar match before line breaks (i.e. at the end of a line in a file) as well. Also matches before the very last line break if the string ends with a line break.
    Example: .$ matches f in abc\ndef. Also matches c in "multi-line" mode.

    Character: \A
    Description: Matches at the start of the string the regex pattern is applied to. Matches a position rather than a character. Never matches after line breaks.
    Example: \A. matches a in abc

    Character: \Z, \z
    Description: Match at the end of the string the regex pattern is applied to. Match a position rather than a character. \Z never matches before line breaks, except for the very last line break if the string ends with a line break; \z matches only at the very end of the string.
    Example: .\Z matches f in abc\ndef

    Character: \b
    Description: Matches at the position between a word character (anything matched by \w) and a non-word character (anything matched by [^\w] or \W) as well as at the start and/or end of the string if the first and/or last characters in the string are word characters.
    Example: .\b matches c in abc

    Character: \B
    Description: Matches at the position between two word characters (i.e. the position between \w\w) as well as at the position between two non-word characters (i.e. \W\W).
    Example: \B.\B matches b in abc

    Character: | (pipe)
    Description: Causes the regex engine to match either the part on the left side, or the part on the right side. Can be strung together into a series of options.
    Example: abc|def|xyz matches abc, def or xyz

    Character: | (pipe)
    Description: The pipe has the lowest precedence of all operators. Use grouping to alternate only part of the regular expression.
    Example: abc(def|xyz) matches abcdef or abcxyz

    Character: ? (question mark)
    Description: Makes the preceding item optional. Greedy, so the optional item is included in the match if possible.
    Example: abc? matches ab or abc

    Character: ??
    Description: Makes the preceding item optional. Lazy, so the optional item is excluded from the match if possible. This construct is often excluded from documentation because of its limited use.
    Example: abc?? matches ab or abc

    Character: * (star)
    Description: Repeats the previous item zero or more times. Greedy, so as many items as possible will be matched before trying permutations with fewer matches of the preceding item, up to the point where the preceding item is not matched at all.
    Example: ".*" matches "def" "ghi" in abc "def" "ghi" jkl

    Character: *? (lazy star)
    Description: Repeats the previous item zero or more times. Lazy, so the engine first attempts to skip the previous item, before trying permutations with ever increasing matches of the preceding item.
    Example: ".*?" matches "def" in abc "def" "ghi" jkl

    Character: + (plus)
    Description: Repeats the previous item once or more. Greedy, so as many items as possible will be matched before trying permutations with fewer matches of the preceding item, up to the point where the preceding item is matched only once.
    Example: ".+" matches "def" "ghi" in abc "def" "ghi" jkl

    Character: +? (lazy plus)
    Description: Repeats the previous item once or more. Lazy, so the engine first matches the previous item only once, before trying permutations with ever increasing matches of the preceding item.
    Example: ".+?" matches "def" in abc "def" "ghi" jkl

    Character: {n} where n is an integer >= 1
    Description: Repeats the previous item exactly n times.
    Example: a{3} matches aaa

    Character: {n,m} where n >= 0 and m >= n
    Description: Repeats the previous item between n and m times. Greedy, so repeating m times is tried before reducing the repetition to n times.
    Example: a{2,4} matches aaaa, aaa or aa

    Character: {n,m}? where n >= 0 and m >= n
    Description: Repeats the previous item between n and m times. Lazy, so repeating n times is tried before increasing the repetition to m times.
    Example: a{2,4}? matches aa, aaa or aaaa

    Character: {n,} where n >= 0
    Description: Repeats the previous item at least n times. Greedy, so as many items as possible will be matched before trying permutations with fewer matches of the preceding item, up to the point where the preceding item is matched only n times.
    Example: a{2,} matches aaaaa in aaaaa

    Character: {n,}? where n >= 0
    Description: Repeats the previous item at least n times. Lazy, so the engine first matches the previous item n times, before trying permutations with ever increasing matches of the preceding item.
    Example: a{2,}? matches aa in aaaaa
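A few of the entries above are easier to believe once you run them. The sketch below uses Python's re module, which is my own choice of flavor; the helper first_match is hypothetical and other engines may name their multi-line option differently.

```python
import re

text = 'abc "def" "ghi" jkl'


def first_match(pattern, subject, flags=0):
    """Return the text of the first match, or None when nothing matches."""
    found = re.search(pattern, subject, flags)
    return found.group() if found else None


# Greedy versus lazy quantifiers on the same input.
print(first_match(r'".*"', text))       # "def" "ghi"
print(first_match(r'".*?"', text))      # "def"
print(first_match(r"a{2,}", "aaaaa"))   # aaaaa
print(first_match(r"a{2,}?", "aaaaa"))  # aa

# Anchors with and without multi-line mode.
lines = "abc\ndef"
print(re.findall(r"^.", lines))                # ['a']
print(re.findall(r"^.", lines, re.MULTILINE))  # ['a', 'd']
print(re.findall(r".$", lines))                # ['f']
print(re.findall(r".$", lines, re.MULTILINE))  # ['c', 'f']
```

The greedy forms grab as much as they can and only give characters back when the rest of the pattern fails, which is why the lazy forms are usually the safer pick for quoted strings.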
    Here are some of the expressions which I came across or implemented and would like to share:

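    Expression: ^[a-zA-Z]+(([\'\,\.\-][a-zA-Z])?[a-zA-Z]*)*$
    Description: Any single word made up of letters, optionally with an apostrophe, comma, period or hyphen between letters.
    Usage: First Name or Family Name

    Expression: [^\<>~/^\"\'!@#$%\^&*()=+]+
    Description: Allows any series of words that may contain letters, digits, hyphens and commas; the listed special characters are rejected.
    Usage: Address

    Expression: ^[\.\w\s\\\:?]+\.(JPG|jpg|DOC|doc|DOCX|docx)$
    Description: Allows any text that ends with .jpg, .doc or .docx.
    Usage: Attachments

    Expression: \d{4,5}
    Description: Accepts no fewer than 4 and no more than 5 digits, provided the whole input has to match the pattern.
    Usage: Postal Code

    Expression: \b(?:\d{1,3}\.){3}\d{1,3}\b
    Description: Matches anything shaped like an IPv4 address (four groups of 1 to 3 digits separated by dots). It does not check that each group is 255 or less.
    Usage: IP address

    Expression: (0[1-9]|[12][0-9]|3[01])[- /.](0[1-9]|1[012])[- /.](19|20)\d\d
    Description: Matches a date in the dd-mm-yyyy format, with a hyphen, space, slash or dot as the separator and a year from 1900 to 2099.
    Usage: Dates

    Expression: [\d]*[\w]*
    Description: Matches any digits followed by any word characters (letters, digits or underscore); either part may be empty.
    Usage: Section1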
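Here is a rough sketch of how a few of these expressions could be used to validate input in Python. The flavor is my assumption (the expressions may have been written for another engine), the alternation bars inside the attachment and date patterns are written out explicitly, and the constant names and the is_valid helper are hypothetical. fullmatch is used so that the whole input must fit, which matters for the unanchored patterns.

```python
import re

POSTAL_CODE = re.compile(r"\d{4,5}")
ATTACHMENT = re.compile(r"^[\.\w\s\\\:?]+\.(JPG|jpg|DOC|doc|DOCX|docx)$")
DATE = re.compile(
    r"(0[1-9]|[12][0-9]|3[01])[- /.](0[1-9]|1[012])[- /.](19|20)\d\d"
)


def is_valid(pattern, value):
    """True only when the entire value matches the pattern."""
    return pattern.fullmatch(value) is not None


print(is_valid(POSTAL_CODE, "12345"))       # True
print(is_valid(POSTAL_CODE, "123456"))      # False
print(POSTAL_CODE.search("123456").group()) # 12345 (a plain search is too lenient)

print(is_valid(ATTACHMENT, "report.docx"))  # True
print(is_valid(ATTACHMENT, "notes.txt"))    # False

print(is_valid(DATE, "28-02-2008"))         # True
print(is_valid(DATE, "32-01-2008"))         # False
print(is_valid(DATE, "31-02-2008"))         # True: the pattern cannot tell that February has no 31st
```

The last line is worth keeping in mind: a pattern can check the shape of a date, but a real calendar check still needs code.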

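The IP address expression deserves a closer look, because it only checks the shape of the input. The sketch below, again in Python and with hypothetical helper names, pairs the pattern with a numeric range check; Python's standard ipaddress module is another option if you do not need a regex at all.

```python
import re

IP_SHAPE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def looks_like_ip(text):
    """True when the whole text has the dotted four-group shape."""
    return IP_SHAPE.fullmatch(text) is not None


def is_valid_ipv4(text):
    """Shape check plus a range check: every group must be 255 or less."""
    if not looks_like_ip(text):
        return False
    return all(int(part) <= 255 for part in text.split("."))


for candidate in ("192.168.0.1", "999.999.999.999", "1.2.3"):
    print(candidate, looks_like_ip(candidate), is_valid_ipv4(candidate))
```

Here "999.999.999.999" passes the pattern but fails the range check, and "1.2.3" fails both.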