Rational Expression
A rational expression, or regular expression, is in computer science a character string, called a pattern, that describes a set of character strings according to a precise syntax. These notations are used by many text editors and utilities to scan a document automatically for pieces of text that match the search pattern, and optionally to perform an insertion, a substitution or a deletion.
Regular expressions were invented at a time when characters and bytes were one and the same. Variants exist in Bash, Perl and ICU (Unicode, where characters are encoded on 2, 4 or a variable number of bytes).
Use
The Unix shells (Bash, Ksh, csh, sh, etc.) natively use this kind of expression in their file searches. Other programs (Python, Awk, Perl, sed, etc.) also use such expressions; to avoid interpretation by the shell, it is then necessary to protect each special character of the expression with a \, or more simply to enclose the whole expression in a pair of apostrophes ('regexp').
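The quoting rule described above can be sketched with Python's standard shlex module, which wraps a string in apostrophes when it contains characters the shell would interpret. The file name words.txt and the grep invocation are assumptions used only for illustration; the command is built as text and not executed.

```python
import shlex

# A pattern containing characters that a shell would otherwise interpret.
pattern = "[st]ac$"

# shlex.quote encloses the pattern in apostrophes so the shell passes it through unchanged.
quoted = shlex.quote(pattern)

# words.txt is a hypothetical file name; the command line is only assembled here.
command = f"grep -E {quoted} words.txt"
print(command)
```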
Origin
The origin and the mathematical justification of regular expressions lie in the theory of automata and formal languages. These fields of study cover models of computation (automata) and ways of describing and classifying formal languages. A formal language is defined here simply as a set of character strings.
In the 1940s, Warren McCulloch and Walter Pitts described the nervous system by modelling neurons as simple automata. The logician Stephen Cole Kleene then described these models in terms of regular sets, a concept he introduced together with a notation. In 1959, Michael Rabin and Dana Scott gave the first rigorous mathematical treatment of these concepts in a famous paper that earned them the Turing Award and helped launch the study of these languages.
Ken Thompson implemented this notation in the editor qed, then in the editor ed under Unix, and finally in grep. Since then, regular expressions have been widely used in utilities such as lex and in the programming languages born under Unix, such as expr, Awk, Perl, Tcl, Python… They rely on the regex library, or on the more powerful PCRE library.
Basic principles
A regular expression is a sequence of typographic characters, more simply called a "pattern", that describes a character string in order to find it in a block of text and apply an automated treatment to it, such as an insertion, a replacement or a deletion. For example, the set of words "ex-équo, ex-equo, ex-aequo and ex-æquo" can be condensed into the single pattern "ex-(a?e|æ|é)quo". The basic mechanisms for building such expressions rely on special characters for alternation, grouping and quantification.
A vertical bar generally separates two alternative expressions: "equo|aequo" denotes either equo or aequo. Parentheses can be used to define the scope and the priority of matching, "(ae|e)quo" denoting the same set as "aequo|equo", and the groups in a pattern can be quantified by appending quantification characters to their right.
The most common quantifiers are:
- ? which defines a group that occurs zero times or once: toto? matches "tot" or "toto";
- * which defines a group that occurs zero, one or several times: toto* matches "tot", "toto", "totoo", "totooo", etc.;
- + which defines a group that occurs one or several times: toto+ matches "toto", "totoo", "totooo", etc.
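The alternation, grouping and quantification mechanisms described above can be tried out with a short sketch. It uses Python's standard re module as an assumption (the text itself is not tied to any one engine), and the helper matches is a hypothetical name introduced only for this example.

```python
import re

# One pattern covering the four spellings, using grouping and alternation.
variant = re.compile(r"ex-(a?e|æ|é)quo")
spellings = ["ex-équo", "ex-equo", "ex-aequo", "ex-æquo"]
all_match = all(variant.fullmatch(s) for s in spellings)

def matches(pattern, candidates):
    """Return the candidates that the pattern matches in full (hypothetical helper)."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

candidates = ["tot", "toto", "totoo", "totooo"]
optional = matches(r"toto?", candidates)     # last "o" zero times or once
zero_or_more = matches(r"toto*", candidates)  # last "o" any number of times
one_or_more = matches(r"toto+", candidates)   # last "o" at least once
```

Note that a quantifier applies only to the element immediately to its left, here the final "o", unless parentheses widen its scope.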
Theory
See also: rational language
Regular expressions correspond to the type-3 grammars (see formal grammar) of the Chomsky hierarchy; they can thus be used to describe the morphology of a language.
Notations
Simplified notation of Unix and Linux shells
One example is the free shell Bash, used under Linux. The basic regular expressions of Unix and Linux shells omit set union (generally called the "choice operator" or "alternation" here). They are matched only against the whole name of the files referenced on the command lines that the shell interprets, and they are automatically expanded into the matching file names (given the simplest regular expression "a" on a command line, the shell can find only the file named "a", and none of the files named "ab", "ba" or "a.out", for example). These shells support:
Examples
- ?ac: matches the strings of 3 characters that end in "ac".
- [a-z]: matches any lowercase (unaccented) letter.
- [^a-z]: matches any character that is not an unaccented lowercase letter.
- [st]ac: matches, among others, "sac" and "tac".
- [^f]ac: matches the three-letter words that end in "ac" and do not start with "f".
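Whole-name matching of shell patterns can be sketched with Python's standard fnmatch module, used here as a stand-in for a shell. Two assumptions apply: in this glob syntax the single-character wildcard is written ?, and fnmatch writes the complement of a class as [!...] rather than [^...]. The file names are invented for the example.

```python
from fnmatch import fnmatchcase

# Hypothetical file names used only for illustration.
names = ["a", "ab", "ba", "a.out", "sac", "tac", "fac"]

# The pattern must match the whole name, so "a" selects only the file "a".
only_a = [n for n in names if fnmatchcase(n, "a")]

# "?" stands for exactly one arbitrary character.
three_chars = [n for n in names if fnmatchcase(n, "?ac")]

# A bracketed class lists the allowed characters; "!" negates it in fnmatch.
s_or_t = [n for n in names if fnmatchcase(n, "[st]ac")]
not_f = [n for n in names if fnmatchcase(n, "[!f]ac")]
```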
Usual notation of grep, ed, sed and vi
The Unix search utility grep uses the same regular expressions, but uses them to find occurrences anywhere in a text. It shares its search features with the simplest editor of the Unix world, ed (which works line by line on a text file), and with ed's derivatives: sed, its command-line version for automated processing, and the more advanced editor vi, which extends ed with an interactive insertion mode and full-screen editing.
These tools also extend the preceding list, to set additional conditions or to make searching easier, with:
Examples
- chat|chien: matches the word "chat" or the word "chien" (and only those); however, it can find them anywhere in the text, and will thus find "chat" in "chatte".
- [cC]hat|[cC]hien: matches one of the words "chat", "Chat", "chien" or "Chien" (and only those); however, it can find them anywhere in the text, and will thus find "Chat" in "Chats et chiens".
- ch+t: matches "cht", "chht", "chhht", etc. anywhere in the text.
- a[ou]+: matches "aou", "ao", "auuu", "aououuuoou", etc. anywhere in the text.
- peu[xt]?: matches one of the words "peu", "peux" and "peut" (and only those) anywhere in the text. The search returns the longest possible match when several matches are possible at the same position.
- ^[st]ac: matches the words "sac" and "tac" at the beginning of a line.
- [st]ac$: matches the words "sac" (or "ressac", etc.) and "tac" only if they are at the end of the line or text.
- ^trax$: matches only the word "trax" alone on a line.
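The behaviour of these patterns can be checked with a small sketch. It assumes Python's re module rather than grep itself: the syntax of the examples is accepted unchanged, but re is a backtracking engine, so the "longest match" shown for peu[xt]? comes from the greedy quantifier and not from a POSIX leftmost-longest rule. The sample sentences are invented for the example.

```python
import re

# An unanchored search finds the pattern anywhere in the text.
in_word = re.search(r"chat|chien", "une chatte").group()
capitalised = re.search(r"[cC]hat|[cC]hien", "Chats et chiens").group()

# "+" repeats the bracketed class one or more times.
runs = re.findall(r"a[ou]+", "ao aou auuu")

# The optional final letter is taken when it is present.
longest = re.search(r"peu[xt]?", "il peut").group()

# "^" and "$" anchor the match to the start and the end of the line.
at_start = re.search(r"^[st]ac", "sac de toile") is not None
not_at_start = re.search(r"^[st]ac", "un sac") is None
at_end = re.search(r"[st]ac$", "le ressac").group()
whole_line = [s for s in ["trax", "trax!", "un trax"] if re.search(r"^trax$", s)]
```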
Extended notation in vim and emacs
Similar extensions are used in the alternative editor Emacs, which uses a different command set but takes up the same regular expressions while adding an extended notation. Extended regular expressions are now also supported in Vim, the improved version of vi.
In addition, many other escape sequences are added to denote predefined character classes. They are specific to each utility, and sometimes vary with the version or the platform (however, they have long been stable in Emacs, which was a precursor of these extensions; other authors implemented them only partially, or in a limited or different way).
Examples
- p\(ss\)+t: matches "psst", "psssst", "psssssst", etc. anywhere in the text, but not "pst" or "pssst", etc. (because the latter contain an odd number of "s").
POSIX extended regular expressions
The POSIX standard sought to remedy the proliferation of syntaxes and features by offering a standard for configurable regular expressions. An overview can be obtained by reading the regex manual page under most Unix dialects, including GNU/Linux. However, even this standard does not include all the features added to regular expressions by Perl.
POSIX extended regular expressions are often supported in the utilities of Unix and Linux distributions by including the -E flag on the command line that invokes them. However, the metacharacters above are now often recognized in Unix and Linux utilities as well, the POSIX standard defining the minimum set of metacharacters to be supported.
The extended regular expressions (ERE) of POSIX, as defined by the IEEE standard POSIX 1003.2, are similar in syntax to Unix regular expressions (see grep above), with some exceptions (for the moment this standard is not yet an ISO international standard, and current national standards may also differ on the exact definition of certain essential or optional features).
Moreover, some backslashes are dropped from the grouping delimiters used in Unix or Linux utilities: \{…\} becomes {…} and \(…\) becomes (…). The special operators { and }, used in certain Unix regular expressions for extended (bounded) quantifiers or as non-capturing grouping delimiters, are not recognized directly in the POSIX version.
Lastly, POSIX adds support for platforms whose character set is not based on ASCII, in particular EBCDIC, and partial locale support for certain metacharacters.
The following metacharacters were added to the usual Unix notation (that of grep, ed, sed and vi, not that of the extended notation):
Examples:
- [hc]+at matches "hat", "cat", "hhat", "chat", "hcat", "ccchat", etc. but not "at".
- [hc]?at matches "hat", "cat" and "at".
- ch(at|ien) matches "chat" or "chien" (in addition, the parentheses delimit a capturing group that records the actual value of the alternative matched, so that this capture can be used in automatic replacement operations after a regular expression search).
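The three extended-syntax examples above, including the use of a capture in a replacement, can be sketched as follows. The sketch assumes Python's re module, whose syntax for these particular constructs coincides with POSIX extended notation; the replacement template with angle brackets is an arbitrary choice for illustration.

```python
import re

words = ["at", "hat", "cat", "hhat", "chat", "hcat", "ccchat"]

# "+" requires at least one "h" or "c" before "at"; "?" allows at most one.
one_or_more = [w for w in words if re.fullmatch(r"[hc]+at", w)]
zero_or_one = [w for w in words if re.fullmatch(r"[hc]?at", w)]

# The parentheses capture which alternative was actually matched.
captured = re.fullmatch(r"ch(at|ien)", "chien").group(1)

# The capture can be reused in a replacement through the back-reference \1.
replaced = re.sub(r"ch(at|ien)", r"<\1>", "chat et chien")
```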
POSIX escape sequences
Since the characters (, ), [, ], ., *, ?, +, ^, |, $ and \ are used as special symbols, they must be written as an escape sequence when they are meant to denote the corresponding character literally. This is done by preceding them with a backslash \ (which must itself be escaped in the same way to be matched literally).
The following escape sequences are thus supported:
Examples:
- \*: matches only the string "*" (the \ makes the * that follows it literal).
- \\*: matches the empty string, or the strings "\", "\\", "\\\", etc. (the first \ makes the second \ literal; the * keeps its meaning of Kleene closure).
- .\.(\(|\)): matches the strings "a.)", "a.(" and "b.)" and other strings of 3 characters (the first character can be any space or graphic character, the second character must be a literal dot, and the third is one of the two parentheses taken literally).
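The escaping examples above can be verified with a brief sketch, assuming Python's re module, which follows the same backslash convention for these characters. Raw string literals (r"...") are used so that Python itself does not interpret the backslashes; re.escape is shown as one way to escape a literal string automatically.

```python
import re

# An escaped star matches only a literal "*".
literal_star = re.fullmatch(r"\*", "*") is not None

# An escaped backslash followed by "*": zero or more literal backslashes.
samples = ["", "\\", "\\\\", "\\\\\\", "*"]
backslash_runs = [s for s in samples if re.fullmatch(r"\\*", s)]

# Any character, then a literal dot, then a literal parenthesis.
dot_paren = [s for s in ["a.)", "a.(", "b.)", "a,)"] if re.fullmatch(r".\.(\(|\))", s)]

# re.escape builds a pattern that matches the given text literally.
escaped = re.escape("a.(")
matches_itself = re.fullmatch(escaped, "a.(") is not None
rejects_other = re.fullmatch(escaped, "ab(") is None
```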
It should be noted that POSIX defines no standard way of denoting characters literally by their numeric code in character sets of more than 8 bits (for example Unicode). Consequently, many POSIX-compatible implementations that support Unicode or ISO 10646 also accept the sequences \uNNNN (where NNNN gives, on 4 hexadecimal digits, the Unicode code point of a character of the Basic Multilingual Plane) or \UNNNNNNNN (where NNNNNNNN gives, on 8 hexadecimal digits, the Unicode code point of any character of the set).
Nor does the standard specify whether characters denoted by a hexadecimal code refer to those of the source file, or whether their code results from transcoding the input coded character set to a common set (such as Unicode). Unicode or the basic ASCII set is almost always used as the internal encoding, but this is not always true on EBCDIC-based systems with POSIX regular expressions.
Moreover, 8-bit character sets can differ widely, in particular in the high (non-ASCII) range and in the interpretation of control characters (depending on the system used). This is an interoperability problem, generally solved in text-processing utilities by using a single common internal character set based on Unicode and transcoding the input character set to this common internal encoding: with this scheme, regular expressions can become independent of the coded character sets used in the various documents.
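As an illustration of the \uNNNN and \UNNNNNNNN notations mentioned in this section, the sketch below uses Python's re module, which is one implementation that accepts both forms in text patterns; this is an assumption about Python, not a statement about POSIX tools. The two code points are arbitrary examples, one inside and one outside the Basic Multilingual Plane.

```python
import re

# Four hexadecimal digits: a code point of the Basic Multilingual Plane (U+00E9).
bmp_match = re.fullmatch(r"\u00e9", chr(0xE9)) is not None

# Eight hexadecimal digits: any code point, here U+1F600 outside the BMP.
astral_match = re.fullmatch(r"\U0001F600", chr(0x1F600)) is not None

# The pattern works on characters, not on the bytes of a particular encoding.
utf8_length = len(chr(0x1F600).encode("utf-8"))
single_char = len(re.findall(r".", chr(0x1F600)))
```

Because the engine operates on decoded characters, the four UTF-8 bytes of the second example still count as a single match for ".".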
POSIX character classes
Since many locale-dependent subsets and ranges of characters are in use (for example, in some configurations the letters are ordered as abc…zABC…Z, but as aAbBcC…zZ in others), the POSIX standard defines certain classes or categories of characters, as shown in the table below:
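{|
! POSIX class !! Description !! ASCII equivalent
|-
| [:alnum:] || Alphanumeric characters || [A-Za-z0-9]
|-
| [:digit:] || Decimal digits || [0-9]
|-
| [:xdigit:] || Hexadecimal digits || [0-9A-Fa-f]
|-
| [:alpha:] || Alphabetic characters || [A-Za-z]
|-
| [:lower:] || Lowercase letters || [a-z]
|-
| [:upper:] || Uppercase letters || [A-Z]
|}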
For example, [[:upper:]ab] matches one character from the set formed by the union of the lowercase letters "a" and "b" and the set of uppercase letters.
It should be noted that in Perl regular expressions, the class is defined differently and corresponds to the union (Perl thus includes tabs and line or paragraph separators there, unlike POSIX).
An additional non-POSIX class, supported by certain tools, is [:word:].
The POSIX-compatible elements are the following:
The differences from POSIX are the following:
The non-POSIX extensions are the following:
Examples:
The Perl 6 specifications regularize and extend the mechanism of the regular expression system.
It is also better integrated into the language than in Perl 5, and backtracking control is very fine-grained. The Perl 6 regex system is powerful enough to write parsers without the help of external parsing modules. Regular expressions are a form of subroutine there, and grammars a form of class. The mechanism is implemented in Parrot assembly by the PGE module in the Parrot implementation of Perl 6, and in Haskell in the Pugs implementation. These implementations are a big step towards a complete Perl 6 compiler. Some features of Perl 6 regexes, such as named captures, are integrated into the upcoming Perl 5.10.
Other utilities often add their own conventions. Their expressive power then often exceeds that of the regular expressions defined above, i.e. they become able to describe sets of character strings that are out of reach of the "normal" regular expressions presented in the preceding sections on POSIX regular expressions (supported in the module
One of the shortcomings PHP is criticized for is its limited support for character strings, even though it is mainly used to process text: text can be represented only in a character set coded on 8 bits, with no way of stating clearly which encoding is used. In practice, PHP must therefore be combined with support libraries for encoding and decoding text, if only to represent it in UTF-8. However, even in UTF-8, the problem arises immediately with the semantics of regular expressions, since characters then have a variable encoded length, which makes the regular expressions more complex. Optional PHP extensions are therefore being developed to create a new data type for text, in order to make it easier to process (and to be compatible in the long term with Perl 6 which, like Haskell, will natively have full Unicode support). Thus, the integration of ICU (below) as a PHP plug-in was tested successfully, and will be integrated into PHP 6.
The regular expressions usable in ICU take up the characteristics of Perl regular expressions, but supplement them with full support for the Unicode character set (see the following section for questions relating to the standardization still in progress). They also clarify their meaning by making regular expressions independent of the coded character set used in the documents, since the Unicode character set is used as the internal pivot encoding.
Indeed, Perl (or PCRE) regular expressions are not portable across documents that use different coded character sets, and do not correctly support either multi-byte coded character sets (of variable length, such as ISO 2022, Shift-JIS or UTF-8) or sets coded on one or more code units of more than 8 bits (for example UTF-16), since the actual encoding of these sets as byte sequences depends on the platform used for processing (the storage order of the bytes in a word of more than 8 bits).
ICU solves this by processing text internally in a single 32-bit set that covers the whole universal character set (UCS), as defined in the ISO/IEC 10646 standard and semantically specified in the Unicode standard. Unicode adds to that standard the support of informative or normative character properties, and recommendations for the automatic processing of text; some of these recommendations are optional or informative, others have become standard and been integrated into the Unicode standard itself, and others have acquired the status of international standard at ISO or of national standard in certain countries.
ICU supports the following extensions, directly in regular expressions or in the regular expression of a character class (between
ICU regular expressions are currently among the most powerful and most expressive for processing multilingual documents. They largely underlie the standardization (still in progress) of Unicode regular expressions (see below), and a subset is natively supported in the standard library of the Java language (which internally uses a portable variable-length encoding based on UTF-16 with extensions, whose code units are 16 bits wide).
ICU is a library that is still evolving. In theory, it should also adopt all the extensions announced in Perl (in particular named captures), in order to ensure interoperability with Perl 5, Perl 6 and PCRE, and with the growing number of other languages that use this extended syntax; the authors of ICU and Perl are working together to define a common notation. However, ICU gives priority to the extensions adopted in the regular expressions described in the Unicode standard, since ICU serves as the main reference in that annex of the Unicode standard.
However, there is not yet any standard or technical norm for dealing with certain important aspects of regular expressions in a multilingual context, in particular:
To specify these missing aspects, additional metacharacters would be needed to control or filter the occurrences found, or a standardized order would have to be imposed on the list of occurrences returned. Application authors must therefore be vigilant on these points and make sure to read all the occurrences found, not only the first, in order to decide which occurrence best suits a given operation.
A first question is which internal representation format of Unicode is supported. All command-line regular expression engines expect UTF-8; among libraries, some also expect UTF-8, but others expect only a set coded in UCS-2 (or its extension UTF-16, which also restricts the valid sequences), or only in UCS-4 (or its standardized restriction UTF-32).
A second question is whether the entire range of values of a given Unicode version is supported. Many engines support only the Basic Multilingual Plane, i.e. the characters that can be encoded on 16 bits. Only some engines can (since 2006) handle the ranges of Unicode values on 21 bits.
A third question is how ASCII constructs are extended to Unicode.
However, in practice this is often not the case:
Another area in which variations exist is the interpretation of case-insensitivity flags.
Another response to Unicode was the introduction of character classes for Unicode blocks and for the general properties of Unicode characters:
Notes:
The [:word:] class is generally defined as [:alnum:] plus the underscore; this reflects the fact that in many programming languages these are the characters that can be used in an identifier. The text editor Vim further distinguishes word and word-head classes (also using the supported notations \w and \h), since in many programming languages the characters usable at the beginning of an identifier are not the same as those usable in other positions.
SQL
Python
Python uses regular expressions based on POSIX regular expressions, with some extensions or differences.
- [ ]: character class specification (e.g. [a-z] denotes a letter in the range a to z).
- .: the predefined class of graphic, whitespace or control characters (other than line-break characters).
- *: quantifier for zero, one or more occurrences of what precedes; equivalent to {0,}.
- ?: quantifier for at most one occurrence of what precedes; equivalent to {0,1}.
- +: quantifier for one or more occurrences of what precedes; equivalent to {1,}.
- |: alternation: either what precedes or what follows.
- ( ): group delimiters (with capture).
- \other: one of the special characters defined above, interpreted literally.
- \t: horizontal tab.
- \n: line feed.
- \v: vertical tab.
- \f: form feed.
- \r: carriage return.
- \ooo: literal character whose octal code (between 0 and 377, on 1 to 3 digits) is ooo.
- \xNN: literal character whose hexadecimal code is NN (on 2 digits).
- \b: backspace character, 0x08 in an ASCII-compatible encoding (as in POSIX, but only inside a character class).
- \b: condition true at the boundary of a word (except inside a character class).
- \B: condition true everywhere except at the boundary of a word, the opposite of \b (not recognized inside a character class).
- \w: a letter, digit or underscore; equivalent to [a-zA-Z0-9_] for ASCII text (the POSIX class [:alnum:] plus the underscore).
- \W: a character that is not a word character, the complement of \w.
- \s: a whitespace character; equivalent to [ \t\n\r\f\v] for ASCII text (the POSIX class [:space:]).
- \S: a non-whitespace character, the complement of \s.
- \d: a digit; equivalent to [0-9] for ASCII text (the POSIX class [:digit:]).
- \D: a non-digit character, the complement of \d.
- {m,n}: bounded quantifier for at least m and at most n occurrences of what precedes.
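A few of the elements listed above can be exercised directly with Python's standard re module. This is a minimal sketch; the sample strings are invented for the example.

```python
import re

# \b is a word boundary outside a class; \B is its opposite.
whole_words = re.findall(r"\bcat\b", "cat concat cats cat.")
inside_word = re.findall(r"\Bcat", "cat concat")

# Inside a character class, \b denotes the backspace character 0x08.
backspace = re.fullmatch(r"[\b]", "\x08") is not None

# {m,n} bounds the number of repetitions; \d matches a digit.
numbers = re.findall(r"\d{2,4}", "7 42 2024 123456")

# \w matches letters, digits and the underscore; \s matches whitespace.
identifiers = re.findall(r"\w+", "foo_bar baz-9")
fields = re.split(r"\s+", "a b\tc\nd")
```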
Tcl
Tcl integrates the regular expression engine developed by Henry Spencer.
*: zero or more occurrences of the preceding atom
Perl
Perl offers a particularly rich set of extensions. This programming language owes much of its success to the regular expression operators built into the language itself. The extensions it offers are also available to other programs under the name libpcre (Perl-Compatible Regular Expressions, literally a library of regular expressions compatible with Perl). This library was originally written for the Exim mail server, but is now used by other projects such as Python, Apache, Postfix, KDE, Analog and Ferite.
PHP
PHP supports two forms of notation: the POSIX syntax (POSIX 1003.2) and the much richer and more powerful syntax of the PCRE library (Perl-Compatible Regular Expressions).
regex, PHP's default syntax, in all its versions) and Perl regular expressions (supported in PHP's optional pcre module, present in all recent versions from PHP 4 onwards).
ICU and Java
ICU defines a portable library for international text processing. This library was initially developed for the C language (the version named ICU4C) and for the Java platform (the version named ICU4J). Ports (or adaptations) of this library are also available in many other languages, using the library developed for C (or C++).
):
- \uhhhh: matches a character whose code point (according to ISO/IEC 10646 and Unicode) has the hexadecimal value hhhh.
- \Uhhhhhhhh: matches a character whose code point (according to ISO/IEC 10646 and Unicode) has the hexadecimal value hhhhhhhh; exactly eight hexadecimal digits must be provided, even though the largest code point accepted is \U0010ffff.
- \N{UNICODE CHARACTER NAME}: matches the character denoted by its standardized name, i.e. as defined irrevocably in the ISO/IEC 10646 standard (and included in the Unicode standard). This syntax is a simplified version of the following syntax, which makes it possible to refer to other character properties:
- \p{Unicode property code}: matches a character that has the specified Unicode property.
- \P{Unicode property code}: matches a character that does not have the specified Unicode property.
- \s: matches a separator character; a separator is defined as .
). Currently, ICU still supports only ranges in the binary order of Unicode code points, an order that is not at all suited to the correct processing of many languages, since it contravenes their standard collation order.
Regular expressions and Unicode
Regular expressions were originally used with ASCII characters. Many regular expression engines can now handle Unicode. In several respects the coded character set used makes no difference, but certain problems arise when regular expressions are extended to Unicode.
In Java's java.util.regex, categories of the form \p{InX} match the characters of block X and \P{InX} matches the complement. For example, \p{Arabic} matches any character of the Arabic script (in any of the standardized Unicode/ISO/IEC 10646 blocks where such characters are present). \p{X} matches any character having the general category property X, and \P{X} the complement. For example, \p{Lu} matches any capital letter (uppercase letter). \p{prop=value} and its complement \P{prop=value} are also available, where prop is the code of a character property and value is the value assigned to each character for that property.