Regex pronunciation: “Redj-ex”
C++11 onwards (<regex> header), Python, Java, JavaScript, PHP, etc…
In C++, regex is written in the form of a string inside double-quotes ("")
regex foo("Geek[a-zA-Z]+"); //foo is the object of regex
In JavaScript, regex is written within forward slashes (/regex/)
var str = 'cat';
if (str.match(/a/)) {
  console.log("matched 'a'");
}
if (str.match(/x/)) {
  console.log("matched 'x'");
}
import re
txt = "The rain in Spain"
x = re.search("^The.*Spain$", txt)
Reserved chars. Use \ to escape. Ex - ., [], etc…
Prepend a metacharacter with a backslash \ to use it as a matching char. Ex: ^\^exponent.
A dot . matches any single character (except the newline char \n)
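A minimal Python sketch of escaping and of the dot, using the standard `re` module. The sample strings are made up for illustration.

```python
import re

# Unescaped, ^ is an anchor; escaped as \^ it matches a literal caret.
caret = re.search(r"^\^exponent", "^exponent")
print(caret.group())  # ^exponent

# Unescaped dot: any single character except a newline.
dot_matches = re.findall(r"c.t", "cat c9t c.t c\nt")
print(dot_matches)  # ['cat', 'c9t', 'c.t']

# Escaped dot: only a literal dot.
literal_dot_matches = re.findall(r"c\.t", "cat c9t c.t c\nt")
print(literal_dot_matches)  # ['c.t']
```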
aka Character Class. They work character-by-character, i.e. [ctpj]ar is the same as (c|t|p|j)ar. Note that | is a literal character inside [], so [c|t|p|j]ar would additionally match |ar.
Specified inside square brackets [].
A dot . inside a Char class means a literal dot.
Any char/charset written within [] and preceded by ^ (placed right after the opening [) will be negated.
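The character-set rules can be tried out with a short Python sketch. The word list is an invented sample; the Geek[a-zA-Z]+ pattern is the one from the C++ snippet earlier.

```python
import re

words = "car tar par jar bar"

# One character from the set, then "ar".
in_set = re.findall(r"[ctpj]ar", words)
print(in_set)  # ['car', 'tar', 'par', 'jar']

# Leading ^ inside the brackets negates the set.
negated = re.findall(r"[^ctpj]ar", words)
print(negated)  # ['bar']

# A dot inside a set is a literal dot.
dots = re.findall(r"[.]", "a.b")
print(dots)  # ['.']

# Ranges: "Geek" followed by one or more ASCII letters.
geeks = re.findall(r"Geek[a-zA-Z]+", "GeeksForGeeks Geek123")
print(geeks)  # ['GeeksForGeeks']
```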
Symbols *, +, and ?
Asterisk *: No. of occurrences of char/charset preceding it must be >=0
Plus +: No. of occurrences of char/charset preceding it must be >=1
Question Mark ?: No. of occurrences of char/charset preceding it must be 0 or 1 (strictly zero or one)
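A small Python sketch comparing the three symbols on the same inputs. The helper `full_matches` and the sample strings are hypothetical names chosen for this example.

```python
import re

samples = ["ct", "cat", "caat"]

def full_matches(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

star = full_matches(r"ca*t", samples)      # zero or more 'a'
plus = full_matches(r"ca+t", samples)      # one or more 'a'
optional = full_matches(r"ca?t", samples)  # zero or one 'a'

print(star)      # ['ct', 'cat', 'caat']
print(plus)      # ['cat', 'caat']
print(optional)  # ['ct', 'cat']
```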
aka “Quantifiers”, written only after a char/charset.
Can be applied to both Char and Charset.
Usage Styles: {n1,n2} (at least n1 and at most n2 occurrences), {n1,} (at least n1 occurrences), {n} (exactly n occurrences)
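The three brace styles, sketched in Python on a charset and on a single char. The helper `brace_matches` and the inputs are assumptions made for the example.

```python
import re

candidates = ["1", "12", "123", "1234"]

def brace_matches(pattern):
    """Return the candidates that the pattern matches in full."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

between = brace_matches(r"[0-9]{2,3}")  # 2 to 3 digits
at_least = brace_matches(r"[0-9]{2,}")  # 2 or more digits
exactly = brace_matches(r"[0-9]{3}")    # exactly 3 digits

print(between)   # ['12', '123']
print(at_least)  # ['12', '123', '1234']
print(exactly)   # ['123']

# Braces work on a single char as well as on a charset.
triple_a = re.fullmatch(r"a{3}", "aaa")
print(triple_a is not None)  # True
```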
Uses () to create a group and capture the match. Ex - (ab)*
Can use alternations inside of it as (c|r|m|e|f)at: Ex#0
Grouping and applying quantifiers is also possible: Ex#1
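A hedged Python sketch of both ideas: group(0) is the whole match and group(1) is what the first pair of parentheses captured. The input strings are invented.

```python
import re

# Alternation inside a capture group.
m = re.search(r"(c|r|m|e|f)at", "The rat sat")
whole = m.group(0)
first = m.group(1)
print(whole, first)  # rat r

# A quantifier applied to a group repeats the whole group.
repeated = re.fullmatch(r"(ab)*", "ababab")
print(repeated.group(0))  # ababab
# The group keeps only the text of its last repetition.
print(repeated.group(1))  # ab
```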
Uses ? followed by a : within () to create a group but not capture the match. Ex - (?:c|r|m|e|f)at
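One place where the difference shows up in Python is `re.findall`, which returns the captured groups when the pattern has any, and the whole match otherwise. A small sketch with made-up input:

```python
import re

text = "cat rat"

# Capturing group: findall returns what the group captured.
captured = re.findall(r"(c|r|m|e|f)at", text)
print(captured)  # ['c', 'r']

# Non-capturing group: nothing is captured, so findall returns whole matches.
uncaptured = re.findall(r"(?:c|r|m|e|f)at", text)
print(uncaptured)  # ['cat', 'rat']

# The match object confirms there are no groups to reference.
no_groups = re.search(r"(?:c|r|m|e|f)at", text).groups()
print(no_groups)  # ()
```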
| works like a logical OR operator.
Question: Isn’t [Tt]he the same as (T|t)he?
Ans: In the above case it’s the same. But alternations work at the expression level and charsets at the char level only.
We can alternate between expressions (multiple-chars/string) using | as abhi(shek|manyu) but not as abhi[shekmanyu].
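The abhi example can be checked with a few lines of Python. The third name in the list is a made-up input added to show what the charset version actually accepts.

```python
import re

names = ["abhishek", "abhimanyu", "abhis"]

# Alternation chooses between whole expressions.
by_alternation = [n for n in names if re.fullmatch(r"abhi(shek|manyu)", n)]
print(by_alternation)  # ['abhishek', 'abhimanyu']

# A charset matches exactly one character from the set.
by_charset = [n for n in names if re.fullmatch(r"abhi[shekmanyu]", n)]
print(by_charset)  # ['abhis']
```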
Caret ^ and Dollar $
^ is used to specify the start position of the input string. It does not match a character but the position at the start of the input. Ex - String input that starts with T is matched by ^T
$ is used to specify the end position of the input string. It does not match a character but the position at the end of the input. Ex - String that ends with e is matched by e$
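A quick Python sketch of ^ and $ with invented inputs. The last lines show that an anchor on its own matches a position of width zero.

```python
import re

starts_with_t = re.search(r"^T", "The rate") is not None
t_not_at_start = re.search(r"^T", "At The") is not None
ends_with_e = re.search(r"e$", "The rate") is not None
e_not_at_end = re.search(r"e$", "The rat") is not None

print(starts_with_t, t_not_at_start)  # True False
print(ends_with_e, e_not_at_end)      # True False

# An anchor alone matches a position, not a character.
start_span = re.search(r"^", "abc").span()
print(start_span)  # (0, 0)
```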
\b and \B are non-character-consuming anchors too.

Shorthand | Description |
---|---|
. | Any character except new line |
\w | Matches alphanumeric characters: [a-zA-Z0-9_] |
\W | Matches non-alphanumeric characters: [^\w] |
\d | Matches digits: [0-9] |
\D | Matches non-digits: [^\d] |
\s | Matches whitespace characters: [\t\n\f\r\p{Z}] |
\S | Matches non-whitespace characters: [^\s] |
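The shorthands in the table can be explored in Python as below. The sample text is invented. One caveat that is specific to Python rather than taken from the table: for `str` patterns \w, \d and \s are Unicode-aware by default, and the `re.ASCII` flag restricts them to the ASCII sets shown above.

```python
import re

text = "Room 42, floor_3!"

digits = re.findall(r"\d", text)
word_runs = re.findall(r"\w+", text)
non_word = re.findall(r"\W", text)
spaces = re.findall(r"\s", text)
non_digit_runs = re.findall(r"\D+", "a1b22c")

print(digits)          # ['4', '2', '3']
print(word_runs)       # ['Room', '42', 'floor_3']
print(non_word)        # [' ', ',', ' ', '!']
print(spaces)          # [' ', ' ']
print(non_digit_runs)  # ['a', 'b', 'c']
```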
A word boundary, denoted by \b, is, in most regex dialects, a position between \w and \W (non-word char), or at the beginning or end of a string if it begins or ends (respectively) with a word character ([0-9A-Za-z_]). So, in the string “-12”, it would match before the 1 or after the 2. The dash is not a word character.
Ex#0
Note that the start and end of the input (the positions matched by ^ and $) also count as word boundaries when the adjacent character is a word character.
Also note that \b and \B match without consuming any char.
Non-Word Boundary: \B matches, without consuming any characters, at every position where \b does not, e.g. at the position between two characters matched by \w.
Ex#1
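A minimal Python sketch of \b and \B. It reuses the “-12” string from the explanation above; the other inputs are made up.

```python
import re

# \b on both sides: "cat" only as a whole word.
whole_words = re.findall(r"\bcat\b", "cat concat cats cat.")
print(whole_words)  # ['cat', 'cat']

# Boundary positions in "-12": before the 1 and after the 2.
boundary_positions = [m.start() for m in re.finditer(r"\b", "-12")]
print(boundary_positions)  # [1, 3]

# \B: "cat" only when it does NOT start at a word boundary.
inner_start = re.search(r"\Bcat", "cat concat").start()
print(inner_start)  # 7
```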
When we make a capture group, the regex engine implicitly numbers its result in order of first occurrence (left to right, starting at 1), and we can use that reference with \ref_num to match it again.
Ex#0
Note that capturing is necessary for backreferencing to work.
Also note that backreferences match the same text as most recently matched by the capturing group they reference (i.e. its resultant match 33) and not just the group pattern \d\d.
Ex#1
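A short Python sketch of the 33 case: \1 has to repeat the exact text that group 1 captured, not just any two digits. The pattern with the dash and the doubled-word example are assumptions made for illustration.

```python
import re

pattern = r"(\d\d)-\1"

same_text = re.fullmatch(pattern, "33-33")
different_text = re.fullmatch(pattern, "33-44")

print(same_text is not None)       # True
print(different_text is not None)  # False

# A common use: find a word that is immediately repeated.
doubled = re.search(r"\b(\w+) \1\b", "it is is fine").group(1)
print(doubled)  # is
```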
Backreferences to failed groups: there is an important distinction here. In the first Ex, “b participated but failed, the (empty) result was then captured using ()”. In the second Ex, the group “never participated at all”: since the entire group was optional, the engine skipped the (b) group altogether.
TL;DR - We can make a capture group optional using ?, and in most flavors any backreference to it won’t match if the capture group didn’t take part in the match (JavaScript is a notable exception: there such a backreference matches the empty string).
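The distinction can be reproduced in Python, where a backreference to a group that did not participate makes the match fail. The two patterns below are assumptions modeled on the description above, not the original linked examples.

```python
import re

# Optional content INSIDE the group: the group participates and captures "".
inside_optional = re.fullmatch(r"(b?)o\1", "o")
print(inside_optional is not None)      # True
print(repr(inside_optional.group(1)))   # ''

# The whole group is optional: it never participates, so \1 cannot match.
group_optional = re.fullmatch(r"(b)?o\1", "o")
print(group_optional is not None)  # False

# When the group does participate, the backreference works as usual.
participated = re.fullmatch(r"(b)?o\1", "bob")
print(participated is not None)  # True
```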
Branch reset group is supported by Perl, PHP, Delphi and R
Syntax: (?|(regex1)|(regex2)|(regex3)|.....)
Explanation: It allows us to choose any one from among the regex1, regex2, regex3 capture groups, capture the result once, and reference it later on. Note that the alternatives inside (?|regex) share the same group number, so the whole group counts as one occurrence of whichever single capture group matched.
Forward reference is supported by JGsoft, .NET, Java, Perl, PCRE, PHP, Delphi and Ruby
Explanation: It allows you to use a backreference to a group that appears later in the regex. Forward references are obviously only useful if they’re inside a repeated group. Ex#0 Ex#1
Also called “Assertions”.
Two types: Lookaheads and Lookbehinds.
They do not consume or capture the text that is being checked (the pattern inside ()).
Symbol | Name | Description |
---|---|---|
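(?= ) | Positive Lookahead | Used on RHS of main regex; matches if the lookahead pattern follows |
(?! ) | Negative Lookahead | Used on RHS of main regex; matches if the lookahead pattern does not follow |
(?<= ) | Positive Lookbehind | Used on LHS of main regex; matches if the lookbehind pattern precedes |
(?<! ) | Negative Lookbehind | Used on LHS of main regex; matches if the lookbehind pattern does not precede |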
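A hedged Python sketch of all four lookarounds. The currency strings are invented. The \b in the negative cases is there to stop the engine from backtracking into a shorter run of digits.

```python
import re

prices = "10 USD, 20 EUR"

# Positive lookahead: digits followed by " USD" (the " USD" is not consumed).
followed = re.findall(r"\d+(?= USD)", prices)
print(followed)  # ['10']

# Negative lookahead: whole numbers NOT followed by " USD".
not_followed = re.findall(r"\b\d+\b(?! USD)", prices)
print(not_followed)  # ['20']

amounts = "$30 and 40"

# Positive lookbehind: digits preceded by a literal $.
preceded = re.findall(r"(?<=\$)\d+", amounts)
print(preceded)  # ['30']

# Negative lookbehind: whole numbers NOT preceded by $.
not_preceded = re.findall(r"(?<!\$)\b\d+", amounts)
print(not_preceded)  # ['40']
```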
Also known as “Modifiers”. Heavily dependent on regex engine being used.
Flag | Description |
---|---|
i | Case insensitive: Match will be case-insensitive. |
g | Global Search: Match all instances, not just the first. |
m | Multiline: Anchor meta characters work on each line. By default, ^ and $ match at the start and end of the whole input, but the m flag makes them match at the start and end of every line. |
We can use them together: /^regex$/gm
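Since flags depend on the engine, here is how the same three ideas look in Python, as a sketch with made-up text. Python has no g flag: `re.search` returns only the first match, while `re.findall` returns all of them. The i and m flags correspond to `re.IGNORECASE` and `re.MULTILINE`.

```python
import re

text = "The cat\nthe hat"

# No flags: case-sensitive.
case_sensitive = re.findall(r"the", text)
print(case_sensitive)  # ['the']

# i: case-insensitive.
ignore_case = re.findall(r"the", text, flags=re.IGNORECASE)
print(ignore_case)  # ['The', 'the']

# First match only (no "global" behavior).
first_only = re.search(r"the", text, flags=re.IGNORECASE).group()
print(first_only)  # The

# m: ^ and $ work on every line, and flags can be combined with |.
line_starts = re.findall(r"^the", text, flags=re.IGNORECASE | re.MULTILINE)
print(line_starts)  # ['The', 'the']

end_default = re.findall(r"at$", text)
end_multiline = re.findall(r"at$", text, flags=re.MULTILINE)
print(end_default, end_multiline)  # ['at'] ['at', 'at']
```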
Use ? with any of the six quantifiers (*, +, ?, {n1,n2}, {n1,}, {n}) to match in a lazy way.
‘Greedy’ means match longest possible string.
‘Lazy’ means match shortest possible string.
For example, the greedy h.+l matches ‘hell’ in ‘hello’ but the lazy h.+?l matches ‘hel’.
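The hello example, checked in Python, plus one more invented case (angle-bracket tags) where the difference is easier to see:

```python
import re

greedy = re.search(r"h.+l", "hello").group()
lazy = re.search(r"h.+?l", "hello").group()
print(greedy)  # hell
print(lazy)    # hel

# Greedy swallows everything up to the last '>', lazy stops at the first one.
greedy_tags = re.findall(r"<.+>", "<a><b>")
lazy_tags = re.findall(r"<.+?>", "<a><b>")
print(greedy_tags)  # ['<a><b>']
print(lazy_tags)    # ['<a>', '<b>']
```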
Good explanation taken from here.
Good Example here.