Regular Expressions - Syntax


A regular expression (regex) describes a pattern that matches a set of strings. It can be used to check whether a string contains a certain substring, to replace matching substrings, or to extract the substrings that meet some condition.

When you list a directory, the *.txt in dir *.txt or ls *.txt is not a regular expression, because * there has a different meaning from * in a regular expression.
Regular expressions are constructed the same way arithmetic expressions are: metacharacters and operators combine small expressions into larger ones. A component of a regular expression can be a single character, a set of characters, a range of characters, a choice between characters, or any combination of these.

A regular expression is a text pattern made up of ordinary characters (such as the letters a through z) and special characters (called "metacharacters"). The pattern describes one or more strings to match when searching text. The regular expression acts as a template that matches a character pattern against the string being searched.

Ordinary characters

Ordinary characters are all printable and non-printing characters that are not explicitly designated as metacharacters. This includes all uppercase and lowercase letters, all digits, all punctuation marks, and some other symbols.

Non-printing characters

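Non-printing characters can also be part of a regular expression. The following table lists the escape sequences that represent non-printing characters:

Character	Description
\cx	Matches the control character indicated by x. For example, \cM matches Control-M, a carriage return. x must be a letter in A-Z or a-z; otherwise, c is treated as a literal 'c' character.
\f	Matches a form feed. Equivalent to \x0c and \cL.
\n	Matches a newline. Equivalent to \x0a and \cJ.
\r	Matches a carriage return. Equivalent to \x0d and \cM.
\s	Matches any whitespace character, including space, tab, form feed, and so on. Equivalent to [ \f\n\r\t\v].
\S	Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
\t	Matches a tab. Equivalent to \x09 and \cI.
\v	Matches a vertical tab. Equivalent to \x0b and \cK.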
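
As a rough illustration of the escape sequences in the table above, the following sketch assumes a JavaScript engine such as Node.js. The sample string and variable names are made up for the example.

```javascript
// Sample input (hypothetical): two fields separated by a tab, ending in CR LF
const line = "name\tvalue\r\n";

const hasTab = /\t/.test(line);              // true: the tab between the fields
const hasCarriageReturn = /\cM/.test(line);  // true: \cM is another way to write \r
const fields = line.trim().split(/\s+/);     // ["name", "value"]
const nonSpace = "a b\tc".match(/\S/g);      // ["a", "b", "c"]

console.log(hasTab, hasCarriageReturn, fields, nonSpace);
```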

Special characters

Special characters are characters that carry a special meaning, such as the * in "*.txt", which there stands for any string. To find files whose names contain a literal *, you need to escape the *, that is, put a \ in front of it: ls \*.txt.

Many metacharacters require special treatment when you try to match them literally. To match one of these special characters, you must first escape it, that is, precede it with a backslash (\). The following table lists the regular expression special characters:

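Special character	Description
$	Matches the position at the end of the input string. If the RegExp object's Multiline property is set, $ also matches the position before '\n' or '\r'. To match the $ character itself, use \$.
( )	Marks the start and end of a subexpression. Subexpressions can be captured for later use. To match these characters, use \( and \).
*	Matches the preceding subexpression zero or more times. To match the * character, use \*.
+	Matches the preceding subexpression one or more times. To match the + character, use \+.
.	Matches any single character except the newline \n. To match ., use \. .
[	Marks the start of a bracket expression. To match [, use \[.
?	Matches the preceding subexpression zero or one time, or marks a quantifier as non-greedy. To match the ? character, use \?.
\	Marks the next character as a special character, a literal character, a backreference, or an octal escape. For example, 'n' matches the character 'n'. '\n' matches a newline. The sequence '\\' matches "\", and '\(' matches "(".
^	Matches the position at the start of the input string, unless it is used in a bracket expression, where it negates the character set. To match the ^ character itself, use \^.
{	Marks the start of a quantifier expression. To match {, use \{.
|	Indicates a choice between two items. To match |, use \|.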
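
The escaping rule can be sketched as follows, assuming JavaScript. The helper escapeRegExp is hypothetical (it is not a built-in function); it simply puts a backslash in front of every special character so that arbitrary text can be matched literally.

```javascript
// Hypothetical helper: backslash-escape every regex special character in text
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// A literal "*" followed by ".txt" at the end of the string
const literalStar = /\*\.txt$/;
// Built from escaped text, so "$" and "." lose their special meaning
const price = new RegExp(escapeRegExp("$9.99"));

const starMatches = literalStar.test("*.txt");    // true
const plainName = literalStar.test("a.txt");      // false: no literal "*"
const exactPrice = price.test("cost: $9.99");     // true
const wrongPrice = price.test("cost: $9x99");     // false: "." is literal here

console.log(starMatches, plainName, exactPrice, wrongPrice);
```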

Quantifiers

A quantifier specifies how many times a given component of a regular expression must occur for the match to succeed. There are six: *, +, ?, {n}, {n,}, and {n,m}.

The regular expression quantifiers are:

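Character	Description
*	Matches the preceding subexpression zero or more times. For example, zo* matches "z" as well as "zoo". * is equivalent to {0,}.
+	Matches the preceding subexpression one or more times. For example, 'zo+' matches "zo" and "zoo" but not "z". + is equivalent to {1,}.
?	Matches the preceding subexpression zero or one time. For example, "do(es)?" matches "do" and "does". ? is equivalent to {0,1}.
{n}	n is a non-negative integer. Matches exactly n times. For example, 'o{2}' does not match the 'o' in "Bob" but matches the two o's in "food".
{n,}	n is a non-negative integer. Matches at least n times. For example, 'o{2,}' does not match the 'o' in "Bob" but matches all the o's in "foooood". 'o{1,}' is equivalent to 'o+'. 'o{0,}' is equivalent to 'o*'.
{n,m}	m and n are non-negative integers, where n <= m. Matches at least n and at most m times. For example, "o{1,3}" matches the first three o's in "fooooood". 'o{0,1}' is equivalent to 'o?'. Note that there must be no space between the comma and the numbers.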
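
The examples from the table above can be checked with a short sketch, assuming JavaScript. The ^ and $ anchors are added here only so that test() has to match the whole string; the variable names are illustrative.

```javascript
const starZ = /^zo*$/.test("z");          // true: zero o's is allowed
const starZoo = /^zo*$/.test("zoo");      // true
const plusZ = /^zo+$/.test("z");          // false: at least one o is required
const plusZoo = /^zo+$/.test("zoo");      // true
const optionalDo = /^do(es)?$/.test("do");     // true
const optionalDoes = /^do(es)?$/.test("does"); // true

const exactBob = /o{2}/.test("Bob");                // false: only one o
const exactFood = "food".match(/o{2}/)[0];          // "oo"
const atLeastTwo = "foooood".match(/o{2,}/)[0];     // "ooooo"
const oneToThree = "fooooood".match(/o{1,3}/)[0];   // "ooo"

console.log(exactFood, atLeastTwo, oneToThree);
```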

Because chapter numbers in a large input document will likely exceed nine, you need a way to handle two- or three-digit chapter numbers. Quantifiers give you this ability. The following regular expression matches chapter headings with any number of digits:

/Chapter [1-9][0-9]*/

Note that the quantifier appears after the range expression. It therefore applies to the entire range expression, which in this case specifies only the digits 0 through 9, inclusive.

The + quantifier is not used here because a digit is not necessarily required in the second or any later position. The ? character is not used either, because it would limit chapter numbers to two digits. You need to match at least one digit after "Chapter" and a space character.

If you know that chapter numbers are limited to 99, you can use the following expression to require at least one but at most two digits.

/Chapter [0-9]{1,2}/

The disadvantage of the expression above is that for a chapter number greater than 99 it still matches only the first two digits. Another disadvantage is that Chapter 0 also matches. Better expressions for matching at most two digits are:

/Chapter [1-9][0-9]?/

/Chapter [1-9][0-9]{0,1}/
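
To compare the chapter expressions above side by side, here is a small sketch assuming JavaScript. The sample headings and variable names are invented for the illustration.

```javascript
const anyDigits = /Chapter [1-9][0-9]*/;   // any number of digits, no leading 0
const oneOrTwo = /Chapter [0-9]{1,2}/;     // one or two digits, 0 allowed
const strictTwo = /Chapter [1-9][0-9]?/;   // one or two digits, no leading 0

const full = "Chapter 123".match(anyDigits)[0];      // "Chapter 123"
const truncated = "Chapter 123".match(oneOrTwo)[0];  // "Chapter 12"
const zero = oneOrTwo.test("Chapter 0");             // true (the drawback noted above)
const zeroStrict = strictTwo.test("Chapter 0");      // false
const fortyTwo = "Chapter 42".match(strictTwo)[0];   // "Chapter 42"

console.log(full, truncated, zero, zeroStrict, fortyTwo);
```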

The *, +, and ? quantifiers are greedy: they match as much text as possible. Placing a ? after one of them makes the match non-greedy, or minimal.

For example, you might search an HTML document for section headings enclosed in H1 tags. The text in the document looks like this:

<H1>Chapter 1 – Introduction to Regular Expressions</H1>

The following expression matches everything from the opening less-than sign (<) to the greater-than sign (>) that closes the H1 tag.

/<.*>/

If you only need to match the opening H1 tag, the following "non-greedy" expression matches only <H1>.

/<.*?>/

Placing a ? after a *, +, or ? quantifier switches the expression from greedy to non-greedy, or minimal, matching.
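
The difference between the two H1 expressions can be seen in a brief sketch, assuming JavaScript; the variable names are illustrative.

```javascript
const heading = "<H1>Chapter 1 – Introduction to Regular Expressions</H1>";

// Greedy: .* runs to the last ">" on the line
const greedy = heading.match(/<.*>/)[0];
// Non-greedy: .*? stops at the first ">"
const lazy = heading.match(/<.*?>/)[0];

console.log(greedy === heading);  // true: the whole line was matched
console.log(lazy);                // <H1>
```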

Anchors

Anchors let you fix a regular expression to the beginning or end of a line. They also let you create regular expressions that match within a word, at the beginning of a word, or at the end of a word.

Anchors describe string or word boundaries: ^ and $ refer to the beginning and end of the string, \b describes the boundary before or after a word, and \B denotes a non-word boundary.

The regular expression anchors are:

Character	Description
^	Matches the position at the start of the input string. If the RegExp object's Multiline property is set, ^ also matches the position after \n or \r.
$	Matches the position at the end of the input string. If the RegExp object's Multiline property is set, $ also matches the position before \n or \r.
\b	Matches a word boundary, that is, the position between a word and a space.
\B	Matches a non-word boundary.

Note: You cannot use a quantifier with an anchor. Because there cannot be more than one position immediately before or after a line break or word boundary, expressions such as ^* are not allowed.
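
A quick way to observe this restriction is to try compiling such patterns. The sketch below assumes a JavaScript (ECMAScript) engine, where a quantified ^ or \b is rejected as a syntax error; other regex flavors may behave differently. The helper name compiles is hypothetical.

```javascript
// Hypothetical helper: reports whether a pattern string compiles
function compiles(source) {
  try {
    new RegExp(source);
    return true;
  } catch (error) {
    return false;
  }
}

const anchorOnly = compiles("^Chapter");       // true: a plain anchor is fine
const quantifiedAnchor = compiles("^*");       // false: nothing to repeat
const quantifiedBoundary = compiles("\\b+");   // false: \b cannot be quantified

console.log(anchorOnly, quantifiedAnchor, quantifiedBoundary);
```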

To match text at the beginning of a line, use the ^ character at the start of the regular expression. Do not confuse this use of ^ with its use inside a bracket expression.

To match text at the end of a line, use the $ character at the end of the regular expression.

To use anchors when searching for chapter headings, the following regular expression matches a chapter heading that contains no more than two trailing digits and occurs at the beginning of a line:

/^Chapter [1-9][0-9]{0,1}/

A true chapter heading not only occurs at the beginning of a line, it is also the only text on that line: it starts at the beginning of the line and stops at the end of the same line. The following expression ensures that the match covers only chapter headings and not cross-references. It does so by matching only at both the beginning and the end of a line of text.

/^Chapter [1-9][0-9]{0,1}$/
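
The two anchored expressions can be tried out as follows. This is a sketch assuming JavaScript, where the m flag is taken to play the role of the Multiline property mentioned in the table; the sample text is invented.

```javascript
const startsWithChapter = /^Chapter [1-9][0-9]{0,1}/;
const wholeLine = /^Chapter [1-9][0-9]{0,1}$/;

const headingStart = startsWithChapter.test("Chapter 12 begins here"); // true
const crossReference = startsWithChapter.test("See Chapter 12");        // false
const exactHeading = wholeLine.test("Chapter 12");                      // true
const trailingText = wholeLine.test("Chapter 12 begins here");          // false

// With the g and m flags, ^ and $ apply to every line of the text
const text = "Intro\nChapter 3\nSee Chapter 3 for details";
const headings = text.match(/^Chapter [1-9][0-9]{0,1}$/gm); // ["Chapter 3"]

console.log(headingStart, crossReference, exactHeading, trailingText, headings);
```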

Matching word boundaries is a little different, but it adds a very important capability to regular expressions. A word boundary is the position between a word and a space. A non-word boundary is any other position. The following expression matches the first three characters of the word Chapter, because those characters appear right after a word boundary:

/\bCha/

The position of the \b operator is critical. If it is at the beginning of the string to be matched, it looks for the match at the beginning of a word. If it is at the end of the string, it looks for the match at the end of a word. For example, the following expression matches the string ter in the word Chapter, because it appears right before a word boundary:

/ter\b/

The following expression matches the string apt as it occurs in Chapter, but not as it occurs in aptitude:

/\Bapt/

The string apt occurs at a non-word boundary in the word Chapter but at a word boundary in the word aptitude. Unlike \b, the \B non-word boundary operator does not tie the match to the beginning or end of a word; it only requires that the position is not a word boundary.
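
The three boundary expressions above can be verified with a short sketch, assuming JavaScript. The word "terminal" is an extra sample added for contrast.

```javascript
const startOfWord = "Chapter".match(/\bCha/)[0];   // "Cha"
const endOfWord = /ter\b/.test("Chapter");         // true: "ter" ends the word
const notAtEnd = /ter\b/.test("terminal");         // false: "ter" is followed by "m"
const insideWord = /\Bapt/.test("Chapter");        // true: "apt" starts inside the word
const atWordStart = /\Bapt/.test("aptitude");      // false: "apt" starts the word

console.log(startOfWord, endOfWord, notAtEnd, insideWord, atWordStart);
```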

Alternation

Enclose all the alternatives in parentheses, and separate adjacent alternatives with |. Using parentheses has a side effect: the associated match is captured and stored. Placing ?: before the first alternative removes this side effect.

?: is one of the non-capturing elements. The other two are ?= and ?!, which carry additional meaning. The former is a positive lookahead: it matches the search string at any position where the pattern inside the parentheses begins to match. The latter is a negative lookahead: it matches the search string at any position where that pattern does not begin to match.
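
The following sketch, assuming JavaScript, contrasts a capturing group with the three non-capturing forms. The gray/grey and Windows patterns are illustrative examples chosen for this sketch, not taken from the text above.

```javascript
// Capturing alternation: the chosen alternative is stored as group 1
const capturing = "gray".match(/gr(a|e)y/);
// Non-capturing alternation: same match, nothing stored
const nonCapturing = "grey".match(/gr(?:a|e)y/);

// Positive lookahead: "Windows" only when one of the versions follows
const positive = "Windows2000".match(/Windows(?=95|98|NT|2000)/);
// Negative lookahead: "Windows" only when none of the versions follows
const negative = /Windows(?!95|98|NT|2000)/;

console.log(capturing[1]);                 // a
console.log(nonCapturing.length);          // 1: only the full match
console.log(positive[0]);                  // Windows (the lookahead is not consumed)
console.log(negative.test("Windows2000")); // false
console.log(negative.test("Windows3.1"));  // true
```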

Backreferences

Placing parentheses around a regular expression pattern or part of a pattern causes the associated match to be stored in a temporary buffer. Each captured submatch is stored in the order in which it appears, from left to right, in the pattern. Buffer numbers start at 1, and up to 99 captured subexpressions can be stored. Each buffer can be accessed with \n, where n is a one- or two-digit decimal number identifying the buffer.

You can use the non-capturing metacharacters ?:, ?=, or ?! to override capturing and skip saving the associated match.

One of the simplest and most useful applications of backreferences is finding two identical adjacent words in text. Take the following sentence:

Is is the cost of of gasoline going up up?

This sentence clearly contains several repeated words. It would be useful to have a way to fix the sentence without having to look for duplicates of every single word. The following regular expression uses a single subexpression to do that:

/\b([a-z]+) \1\b/gi

The captured expression, as specified by [a-z]+, consists of one or more letters. The second part of the regular expression is a reference to the previously captured submatch, that is, the second occurrence of the word just matched by the parenthesized expression. \1 specifies the first submatch. The word boundary metacharacters ensure that only whole words are detected. Otherwise, phrases such as "is issued" or "this is" would be incorrectly identified by this expression.

The global flag (g) after the regular expression indicates that the expression is applied to the input string as many times as it can find a match. The case-insensitivity flag (i) at the end of the expression makes the match ignore case. The multiline flag (m), which is not used in this expression, specifies that potential matches may occur on either side of a newline character.
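
Applied to the sample sentence, the expression behaves as sketched below, assuming JavaScript. The replace() step is one possible way to drop the duplicates, added for illustration; "$1" in the replacement string refers to the first captured word.

```javascript
const sentence = "Is is the cost of of gasoline going up up?";
const duplicate = /\b([a-z]+) \1\b/gi;

// Every pair of adjacent identical words
const found = sentence.match(duplicate);        // ["Is is", "of of", "up up"]
// Keep only the first word of each pair
const fixed = sentence.replace(duplicate, "$1");

// Without \b, the "is is" hidden inside "this is" would be reported
const withBoundaries = /\b([a-z]+) \1\b/i.test("this is");   // false
const withoutBoundaries = /([a-z]+) \1/i.test("this is");    // true

console.log(found);
console.log(fixed);  // Is the cost of gasoline going up?
```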

Backreferences can also break a Universal Resource Indicator (URI) into its component parts. Suppose you want to break the following URI down into the protocol (ftp, http, and so on), the domain address, and the page/path:

http://www.w3cschool.cc:80/html/html-tutorial.html

The following regular expression provides this functionality:

/(\w+):\/\/([^/:]+)(:\d*)?([^# ]*)/

The first parenthesized subexpression captures the protocol part of the web address. It matches any word that precedes a colon and two forward slashes. The second parenthesized subexpression captures the domain address part. It matches one or more characters other than / and :. The third parenthesized subexpression captures the port number, if one is specified. It matches zero or more digits following a colon, and the subexpression can occur at most once. Finally, the fourth parenthesized subexpression captures the path and/or page specified by the web address. It matches any sequence of characters that does not include # or a space.

When the regular expression is applied to the URI above, the submatches contain the following:

The first parenthetical subexpression contains "http"
The second parenthetical subexpression contains "www.w3cschool.cc"
The third parenthetical subexpression contains ":80"
The fourth parenthetical subexpression contains "/html/html-tutorial.html"
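
The decomposition listed above can be reproduced with a minimal sketch, assuming JavaScript. The names protocol, domain, port, and path are labels chosen for this example.

```javascript
const uri = "http://www.w3cschool.cc:80/html/html-tutorial.html";
const parts = uri.match(/(\w+):\/\/([^/:]+)(:\d*)?([^# ]*)/);

// parts[0] is the full match; parts[1] to parts[4] are the four subexpressions
const [, protocol, domain, port, path] = parts;

console.log(protocol);  // http
console.log(domain);    // www.w3cschool.cc
console.log(port);      // :80
console.log(path);      // /html/html-tutorial.html
```

If the URI has no port, the third group does not take part in the match, and port is undefined in JavaScript.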