Regular Expressions - matching rules


Basic pattern matching

Let us start with the basics. The pattern is the most basic element of a regular expression: a group of characters that describes the features of a string. A pattern can be very simple, consisting of an ordinary string, or very complex, often using special characters to represent a range of characters, repetition, or context. For example:

^once

This pattern contains the special character ^, which means the pattern matches only strings that begin with "once". For example, it matches the string "once upon a time" but does not match "There once was a man from NewYork". Just as the ^ symbol marks the beginning of a string, the $ symbol matches strings that end with the given pattern.

bucket$

This pattern matches "Who kept all of his cash in a bucket" but does not match "buckets". When ^ and $ are used together, they denote an exact match (the string must be identical to the pattern). For example:

^bucket$

This matches only the string "bucket". If a pattern includes neither ^ nor $, it matches any string that contains the pattern. For example, the pattern

once

and the string

There once was a man from NewYork
Who kept all of his cash in a bucket.

match.
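
The anchor behaviour described above can be checked with a short sketch. Python's standard re module is used here only as a convenient test bed (the tutorial is not tied to it), and the helper name matches is made up for illustration.

```python
import re

def matches(pattern, text):
    """Return True if the pattern is found anywhere in text."""
    return re.search(pattern, text) is not None

line1 = "There once was a man from NewYork"
line2 = "Who kept all of his cash in a bucket"

print(matches(r"^once", "once upon a time"))  # True: the string starts with "once"
print(matches(r"^once", line1))               # False: "once" is not at the start
print(matches(r"bucket$", line2))             # True: the string ends with "bucket"
print(matches(r"bucket$", "buckets"))         # False: an "s" follows "bucket"
print(matches(r"^bucket$", "bucket"))         # True: exact match
print(matches(r"once", line1))                # True: unanchored, found inside
```

Note that line2 is written without the final period: with the period, bucket$ would no longer match, because the string would end in "." rather than "bucket".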

The letters in this pattern (o, n, c, e) are literal characters: each one stands for the letter itself, and digits work the same way. Some slightly more complex characters, such as punctuation and whitespace characters (spaces, tabs, and so on), require escape sequences. All escape sequences begin with a backslash (\). The escape sequence for a tab is \t. So to test whether a string begins with a tab, you can use this pattern:

^\t

Similarly, \n represents a newline and \r a carriage return. Other special symbols can be escaped by putting a backslash in front of them: the backslash itself is written \\, a period is written \., and so on.
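
As a rough illustration of these escape sequences, here is a sketch using Python's re module (an assumption of this example, chosen only for testing). The variable names are hypothetical, and raw strings (r"...") are used so that the backslashes reach the regex engine unchanged.

```python
import re

starts_with_tab = re.compile(r"^\t")     # \t: a tab at the start of the string
literal_dot = re.compile(r"\.")          # \.: a real period, not a wildcard
literal_backslash = re.compile(r"\\")    # \\: a single backslash character

print(bool(starts_with_tab.search("\tindented")))     # True
print(bool(starts_with_tab.search("not indented")))   # False
print(bool(literal_dot.search("file.txt")))           # True
print(bool(literal_dot.search("filetxt")))            # False
print(bool(literal_backslash.search("C:\\temp")))     # True
```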

Character cluster

In Internet programming, regular expressions are often used to validate user input. When a user submits a form, you need to check whether the phone number, address, email address, or credit card number entered is valid, and matching on literal characters alone is not enough.

So we need a more flexible way to describe the patterns we want: character clusters. To create a character cluster that represents all vowels, put all the vowel characters inside square brackets:

[AaEeIiOoUu]

This pattern matches any vowel, but it represents only a single character. A hyphen can be used to represent a range of characters, for example:

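[a-z] // matches any lowercase letter
[A-Z] // matches any uppercase letter
[a-zA-Z] // matches any letter
[0-9] // matches any digit
[0-9\.\-] // matches any digit, period, or minus sign
[ \f\r\t\n] // matches any whitespace character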
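
To see that a cluster stands for one character at a time, the following sketch collects every individual character that a cluster matches. It assumes Python's re module, and the sample strings are invented for illustration.

```python
import re

# Each match is exactly one character drawn from the cluster.
vowels = re.findall(r"[AaEeIiOoUu]", "once upon a time")
print(vowels)  # ['o', 'e', 'u', 'o', 'a', 'i', 'e']

# Digits, periods and minus signs; everything else is skipped.
kept = "".join(re.findall(r"[0-9\.\-]", "tel: 555-1234 ext.9"))
print(kept)  # 555-1234.9
```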

Again, each of these represents only a single character, which is very important. If you want to match a string made up of one lowercase letter followed by one digit, such as "z2", "t6" or "g7", but not "ab2", "r2d3" or "b52", use this pattern:

^[a-z][0-9]$

Although [a-z] stands for a range of 26 letters, here it matches only one character: the first character of the string, which must be a lowercase letter.
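
The accepted and rejected strings listed above can be checked mechanically. This is a minimal sketch assuming Python's re module; the list names accepted and rejected are hypothetical.

```python
import re

pattern = re.compile(r"^[a-z][0-9]$")

accepted = ["z2", "t6", "g7"]       # one lowercase letter, then one digit
rejected = ["ab2", "r2d3", "b52"]   # too long for a two-character pattern

for s in accepted + rejected:
    print(s, bool(pattern.search(s)))
```

Each rejected string fails because the anchors leave room for exactly two characters.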

As mentioned earlier, ^ marks the beginning of a string, but it also has another meaning. When it appears as the first character inside a pair of square brackets, ^ means "not" or "exclude", and it is often used to rule out certain characters. Continuing the previous example, suppose we require that the first character must not be a digit:

^[^0-9][0-9]$

This pattern matches "&5", "g7" and "-2", but does not match "12" or "66". Here are a few examples that exclude specific characters:

[^a-z] // any character except lowercase letters
[^\\\/\^] // any character except (\), (/) and (^)
[^\"\'] // any character except double quotes (") and single quotes (')
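
The two meanings of ^ can be compared side by side in a short sketch. Python's re module is assumed here, and the variable names are made up for illustration.

```python
import re

# Outside the brackets ^ anchors the start; inside, as first character, it negates.
not_digit_then_digit = re.compile(r"^[^0-9][0-9]$")

for s in ["&5", "g7", "-2", "12", "66"]:
    print(repr(s), bool(not_digit_then_digit.search(s)))

# A single character that is neither a double nor a single quote.
no_quote = re.compile(r"^[^\"\']$")
print(bool(no_quote.search("a")))   # True
print(bool(no_quote.search('"')))   # False
print(bool(no_quote.search("'")))   # False
```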

In a regular expression, the special character "." (dot, period) represents any character except a newline. So the pattern "^.5$" matches any two-character string that ends with the digit 5 and begins with any character other than a newline. The pattern "." matches any string except the empty string and a string that consists of only a newline.

PHP regular expressions have some built-in general-purpose character clusters, listed below:
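
The behaviour of the dot can be sketched as follows, assuming Python's re module with its default flags (under which the dot also excludes the newline). The variable names are hypothetical.

```python
import re

two_chars_ending_in_5 = re.compile(r"^.5$")
print(bool(two_chars_ending_in_5.search("a5")))    # True
print(bool(two_chars_ending_in_5.search("55")))    # True
print(bool(two_chars_ending_in_5.search("5")))     # False: only one character
print(bool(two_chars_ending_in_5.search("\n5")))   # False: a newline is not matched by "."

any_char = re.compile(r".")
print(bool(any_char.search("x")))    # True
print(bool(any_char.search("")))     # False: nothing to match
print(bool(any_char.search("\n")))   # False: only a newline
```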

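Character cluster	Description
[[:alpha:]]	Any letter
[[:digit:]]	Any digit
[[:alnum:]]	Any letter or digit
[[:space:]]	Any whitespace character
[[:upper:]]	Any uppercase letter
[[:lower:]]	Any lowercase letter
[[:punct:]]	Any punctuation character
[[:xdigit:]]	Any hexadecimal digit, equivalent to [0-9a-fA-F]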
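
Not every regex engine understands these named clusters. Python's re module, for one, does not, so the sketch below expands them by hand into explicit ASCII ranges before compiling. The mapping POSIX_CLASSES and the helper expand_posix are assumptions of this example, not part of PHP or Python; [[:punct:]] is left out for brevity, and an unknown name raises KeyError.

```python
import re

# Hand-written ASCII equivalents of the named clusters (assumed for this sketch).
POSIX_CLASSES = {
    "alpha": "a-zA-Z",
    "digit": "0-9",
    "alnum": "a-zA-Z0-9",
    "space": r" \t\n\r\f\v",
    "upper": "A-Z",
    "lower": "a-z",
    "xdigit": "0-9a-fA-F",
}

def expand_posix(pattern):
    """Replace each [:name:] inside a bracket expression with explicit ranges."""
    return re.sub(r"\[:(\w+):\]", lambda m: POSIX_CLASSES[m.group(1)], pattern)

expanded = expand_posix("^[[:alpha:]]{3}$")
print(expanded)                             # ^[a-zA-Z]{3}$
print(bool(re.search(expanded, "cat")))     # True
print(bool(re.search(expanded, "c4t")))     # False
```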

Specifying repetition

So far you know how to match a single letter or digit, but in most cases you will want to match a word or a group of digits. A word consists of several letters, and a number consists of several digits. Curly braces ({}) placed after a character or character cluster specify how many times the preceding content is repeated.

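Pattern	Description
^[a-zA-Z_]$	Any single letter or underscore
^[[:alpha:]]{3}$	Any three-letter word
^a$	The letter a
^a{4}$	aaaa
^a{2,4}$	aa, aaa, or aaaa
^a{1,3}$	a, aa, or aaa
^a{2,}$	Any string made up of two or more a's
^a{2,}	For example aardvark and aaab, but not apple
a{2,}	For example baad and aaa, but not Nantucket
\t{2}	Two tab characters
.{2}	Any two characters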
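
Several rows of the table can be verified directly. The sketch below assumes Python's re module; the cases list is a hypothetical set of (pattern, string, expected result) triples drawn from the table.

```python
import re

cases = [
    (r"^a{4}$", "aaaa", True),
    (r"^a{4}$", "aaa", False),
    (r"^a{2,4}$", "aaa", True),
    (r"^a{2,4}$", "aaaaa", False),
    (r"^a{2,}$", "aaaaaa", True),
    (r"^a{2,}", "aardvark", True),
    (r"^a{2,}", "apple", False),
    (r"a{2,}", "baad", True),
    (r"a{2,}", "Nantucket", False),
    (r"\t{2}", "a\t\tb", True),
]

for pattern, text, expected in cases:
    found = re.search(pattern, text) is not None
    print(f"{pattern!r:12} {text!r:12} {found}")
    assert found == expected
```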

The examples in the table illustrate three different uses of curly braces. A single number, {x}, means "the preceding character or character cluster appears exactly x times"; a number followed by a comma, {x,}, means "the preceding content appears x or more times"; two comma-separated numbers, {x,y}, mean "the preceding content appears at least x times but no more than y times". We can extend these patterns to match more words or numbers:

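^[a-zA-Z0-9_]{1,}$ // any string of one or more letters, digits, or underscores
^[1-9][0-9]*$ // all positive integers (no leading zeros)
^\-{0,1}[0-9]{1,}$ // all integers
^[-]?[0-9]+\.?[0-9]+$ // floating-point numbers (the decimal point is optional, so 42 also matches)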

The last example is not easy to understand, is it? Look at it this way: it matches every string that begins (^) with an optional minus sign ([-]?), followed by one or more digits ([0-9]+), then an optional decimal point (\.?), then one or more further digits ([0-9]+), with nothing else after them ($). Below you will see a simpler notation that can be used.

The special character "?" is equivalent to {0,1}; both mean "zero or one of the preceding content", or in other words "the preceding content is optional". So the example just given can be simplified to:

^\-?[0-9]{1,}\.?[0-9]{1,}$
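
It is worth testing this pattern against a few inputs, because it requires at least one digit on each side of the optional decimal point. The sketch below assumes Python's re module, and the sample inputs are invented for illustration.

```python
import re

number = re.compile(r"^\-?[0-9]{1,}\.?[0-9]{1,}$")

print(bool(number.search("-3.14")))  # True
print(bool(number.search("0.5")))    # True
print(bool(number.search("42")))     # True: the decimal point is optional
print(bool(number.search("5")))      # False: two digit groups need at least two digits
print(bool(number.search("3.")))     # False: no digit after the point
print(bool(number.search(".5")))     # False: no digit before the point
```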

The special character "*" is equivalent to {0,}; both mean "zero or more of the preceding content". Finally, the character "+" is equivalent to {1,}, meaning "one or more of the preceding content", so the four examples above can be written as:

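^[a-zA-Z0-9_]+$ // any string of one or more letters, digits, or underscores
^[0-9]+$ // all non-negative integers (unlike ^[1-9][0-9]*$, this also accepts 0 and leading zeros)
^\-?[0-9]+$ // all integers
^\-?[0-9]*\.?[0-9]*$ // floating-point numbers (every part is optional, so the empty string also matches)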

Of course, this does not fundamentally reduce the technical complexity of regular expressions, but it does make them easier to read.
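
The equivalence between the brace forms and the shorthand characters can be spot-checked on a handful of inputs. This is only a sketch under stated assumptions: it uses Python's re module, the helper same_result and the samples list are hypothetical, and agreement on a few samples is an illustration, not a proof.

```python
import re

def same_result(p1, p2, samples):
    """True if both patterns accept and reject exactly the same samples."""
    return all(bool(re.search(p1, s)) == bool(re.search(p2, s)) for s in samples)

samples = ["", "7", "42", "-42", "abc_1", "a b", "--1"]

print(same_result(r"^[a-zA-Z0-9_]{1,}$", r"^[a-zA-Z0-9_]+$", samples))  # True
print(same_result(r"^\-{0,1}[0-9]{1,}$", r"^\-?[0-9]+$", samples))      # True

# The all-optional float pattern is very permissive: each of these matches.
loose_float = re.compile(r"^\-?[0-9]*\.?[0-9]*$")
for s in ["1.5", "-0.25", "", "-", "."]:
    print(repr(s), bool(loose_float.search(s)))
```

Because every part of the last pattern is optional, it accepts the empty string, a lone minus sign, and a lone period, so it may need tightening before being used for validation.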
