<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
  <title>perlre - perldoc.perl.org</title>
  <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
  <meta http-equiv="Content-Language" content="en-gb">
  <link rel="search" type="application/opensearchdescription+xml" title="Search perldoc.perl.org" href="/static/search.xml"/>
  <link href="static/css-20100830.css" rel="stylesheet" rev="stylesheet" type="text/css" media="screen">
  <link href="static/exploreperl.css" rel="stylesheet" rev="stylesheet" type="text/css">
</head>

<body onLoad="perldoc.startup();" onPageShow="if (event.persisted) perldoc.startup();">
    <div id="page">
      
      <div id="header">
	<div id="homepage_link">
	  <a href="index.html"></a>
	</div>
	<div id="strapline">
	  Perl Programming Documentation
	</div>
	<div id="download_link" class="download">
	  <a href="http://www.perl.org/get.html">Download Perl</a>
	</div>
	<div id="explore_link" class="download">
	  <a id="explore_anchor" href="#">Explore</a>
	</div>
      </div>
      
      <div id="body">
        <div id="left_column">
          <div class="side_group">
            
	    <div class="side_panel doc_panel">
              <p>Manual</p>
              <ul>
                <li><a href="index-overview.html">Overview</a>
                <li><a href="index-tutorials.html">Tutorials</a>
                <li><a href="index-faq.html">FAQs</a>
                <li><a href="index-history.html">History / Changes</a>
                <li><a href="index-licence.html">License</a>
              </ul>
            </div>
            <div class="side_panel doc_panel">
              <p>Reference</p>
              <ul>
                <li><a href="index-language.html">Language</a>
                <li><a href="index-functions.html">Functions</a>
                <li><a href="perlop.html">Operators</a>
                <li><a href="perlvar.html">Special Variables</a>
                <li><a href="index-pragmas.html">Pragmas</a>
                <li><a href="index-utilities.html">Utilities</a>
                <li><a href="index-internals.html">Internals</a>

                <li><a href="index-platforms.html">Platform Specific</a>
              </ul>
            </div>
            <div class="side_panel doc_panel">
              <p>Modules</p>
              <ul>
		<li>
		
                
                  
		    
		  
		
                  
		    
		  
		
                  
		    
		  
		
                  
		    
		  
		
                  
		    
		  
		
                  
		    
		  
		
                  
		    
		  
		
                  
		    
		  
		
                  
		    
		  
		
                  
		
                  
		
                  
		    
		  
		
                  
		    
		  
		
                  
		    
		  
		
                  
		    
		  
		
                  
		    
		  
		
                  
		
                  
		
                  
		    
		  
		
                  
		    
		  
		
                  
		    
		  
		
                  
		
                  
		
                  
		    
		  
		
                  
		
                  
		
		
                    <a href="index-modules-A.html">A</a>
                    
                      
                        &bull;
                      
                    
                
                    <a href="index-modules-B.html">B</a>
                    
                      
                        &bull;
                      
                    
                
                    <a href="index-modules-C.html">C</a>
                    
                      
                        &bull;
                      
                    
                
                    <a href="index-modules-D.html">D</a>
                    
                      
                        &bull;
                      
                    
                
                    <a href="index-modules-E.html">E</a>
                    
                      
                        <li>
                      
                    
                
                    <a href="index-modules-F.html">F</a>
                    
                      
                        &bull;
                      
                    
                
                    <a href="index-modules-G.html">G</a>
                    
                      
                        &bull;
                      
                    
                
                    <a href="index-modules-H.html">H</a>
                    
                      
                        &bull;
                      
                    
                
                    <a href="index-modules-I.html">I</a>
                    
                      
                        &bull;
                      
                    
                
                    <a href="index-modules-L.html">L</a>
                    
                      
                        <li>
                      
                    
                
                    <a href="index-modules-M.html">M</a>
                    
                      
                        &bull;
                      
                    
                
                    <a href="index-modules-N.html">N</a>
                    
                      
                        &bull;
                      
                    
                
                    <a href="index-modules-O.html">O</a>
                    
                      
                        &bull;
                      
                    
                
                    <a href="index-modules-P.html">P</a>
                    
                      
                        &bull;
                      
                    
                
                    <a href="index-modules-S.html">S</a>
                    
                      
                        <li>
                      
                    
                
                    <a href="index-modules-T.html">T</a>
                    
                      
                        &bull;
                      
                    
                
                    <a href="index-modules-U.html">U</a>
                    
                      
                        &bull;
                      
                    
                
                    <a href="index-modules-X.html">X</a>
                    
                
              </ul>
            </div>
            
	      <div class="side_panel doc_panel">
		<p>Tools</p>
		<ul>
		  <li><a href="preferences.html">Preferences</a>
		</ul>
	      </div>
            
          </div>
        </div>
        <div id="centre_column">
          <div id="content_header">
            <div id="title_bar">
              <div id="page_name">
                <h1>perlre</h1>
              </div>
              <div id="perl_version">
                Perl 5 version 26.0 documentation
              </div>
              <div class="page_links" id="page_links_top">
                <a href="#" onClick="toolbar.goToTop();return false;">Go to top</a>
		
              </div>
	      <div class="page_links" id="page_links_bottom">
		
                  <a href="#" id="page_index_toggle">Show page index</a> &bull;
		
                <a href="#" id="recent_pages_toggle">Show recent pages</a>		
	      </div>
	      <div id="search_form">
		<form action="search.html" method="GET" id="search">
		  <input type="text" name="q" id="search_box" alt="Search">
		</form>
	      </div>
            </div>
            <div id="breadcrumbs">
                
    <a href="index.html">Home</a> &gt;
    
      
        <a href="index-language.html">Language reference</a> &gt;
      
    
    perlre
  

            </div>
          </div>
          <div id="content_body">
	    <!--[if lt IE 7]>
 <div class="noscript">
   <p>
     <strong>It looks like you're using Internet Explorer 6. This is a very old
     browser which does not offer full support for modern websites.</strong>
   </p>
   <p>
     Unfortunately this means that this website will not work on
     your computer.
   </p>
   <p>
     Don't miss out though! To view the site (and get a better experience from
     many other websites), simply upgrade to
     <a href="http://www.microsoft.com/windows/Internet-explorer/default.aspx">Internet
Explorer 8</a>
     or download an alternative browser such as
     <a href="http://www.mozilla.com/en-US/firefox/firefox.html">Firefox</a>,
     <a href="http://www.apple.com/safari/download/">Safari</a>, or
     <a href="http://www.google.co.uk/chrome">Google Chrome</a>.
   </p>
   <p>
     All of these browsers are free. If you're using a PC at work, you may
     need to contact your IT administrator.
   </p>
 </div>
<![endif]-->
	    <noscript>
	      <div class="noscript">
	      <p>
                <strong>Please note: Many features of this site require JavaScript. You appear to have JavaScript disabled,
	        or are running a non-JavaScript capable web browser.</strong>
	      </p>
	      <p>
		To get the best experience, please enable JavaScript or download a modern web browser such as <a href="http://www.microsoft.com/windows/Internet-explorer/default.aspx">Internet Explorer 8</a>, <a href="http://www.mozilla.com/en-US/firefox/firefox.html">Firefox</a>, <a href="http://www.apple.com/safari/download/">Safari</a>, or <a href="http://www.google.co.uk/chrome">Google Chrome</a>.
              </p>
	      </div>
	    </noscript>

	    <div id="recent_pages" class="hud_container">
	      <div id="recent_pages_header" class="hud_header">
		<div id="recent_pages_close" class="hud_close"><a href="#" onClick="recentPages.hide();return false;"></a></div>
		<div id="recent_pages_title" class="hud_title"><span class="hud_span_top">Recently read</span></div>
		<div id="recent_pages_topright" class="hud_topright"></div>
	      </div>
	      <div id="recent_pages_content" class="hud_content">
	      </div>
	      <div id="recent_pages_footer" class="hud_footer">
		<div id="recent_pages_bottomleft" class="hud_bottomleft"></div>
		<div id="recent_pages_bottom" class="hud_bottom"><span class="hud_span_bottom"></span></div>
		<div id="recent_pages_resize" class="hud_resize"></div>
	      </div>
	    </div>
  
	    <div id="from_search"></div>
            <h1>perlre</h1>


  <!--    -->
<ul><li><a href="#NAME">NAME
  </a><li><a href="#DESCRIPTION">DESCRIPTION</a><ul><li><a href="#The-Basics">The Basics
  </a><li><a href="#Modifiers">Modifiers</a><li><a href="#Regular-Expressions">Regular Expressions</a><li><a href="#Quoting-metacharacters">Quoting metacharacters</a><li><a href="#Extended-Patterns">Extended Patterns</a><li><a href="#Backtracking">Backtracking
 </a><li><a href="#Special-Backtracking-Control-Verbs">Special Backtracking Control Verbs</a><li><a href="#Warning-on-%5c1-Instead-of-%241">Warning on \1 Instead of $1</a><li><a href="#Repeated-Patterns-Matching-a-Zero-length-Substring">Repeated Patterns Matching a Zero-length Substring</a><li><a href="#Combining-RE-Pieces">Combining RE Pieces</a><li><a href="#Creating-Custom-RE-Engines">Creating Custom RE Engines</a><li><a href="#Embedded-Code-Execution-Frequency">Embedded Code Execution Frequency</a><li><a href="#PCRE%2fPython-Support">PCRE/Python Support</a></ul><li><a href="#BUGS">BUGS</a><li><a href="#SEE-ALSO">SEE ALSO</a></ul><a name="NAME"></a><h1>NAME
  </h1>
<p>perlre - Perl regular expressions</p>
<a name="DESCRIPTION"></a><h1>DESCRIPTION</h1>
<p>This page describes the syntax of regular expressions in Perl.</p>
<p>If you haven't used regular expressions before, a tutorial introduction
is available in <a href="perlretut.html">perlretut</a>.  If you know just a little about them,
a quick-start introduction is available in <a href="perlrequick.html">perlrequick</a>.</p>
<p>Except for the <a href="#The-Basics">The Basics</a> section, this page assumes you are familiar
with regular expression basics, such as what a "pattern" is, what it
looks like, and how it is basically used.  For a reference on how they
are used, plus various examples of the same, see discussions of <code class="inline"><a class="l_k" href="functions/m.html">m//</a></code>,
<code class="inline"><a class="l_k" href="functions/s.html">s///</a></code>, <code class="inline"><a class="l_k" href="functions/qr.html">qr//</a></code> and <code class="inline"><span class="q">&quot;??&quot;</span></code>
 in <a href="perlop.html#Regexp-Quote-Like-Operators">Regexp Quote-Like Operators in perlop</a>.</p>
<p>New in v5.22, <a href="re.html#'strict'-mode">use re &#39;strict&#39;</a> applies stricter
rules than otherwise when compiling regular expression patterns.  It can
find things that, while legal, may not be what you intended.</p>

<a name="The-Basics"></a><h2>The Basics
  </h2>
<p>Regular expressions are strings with the very particular syntax and
meaning described in this document and auxiliary documents referred to
by this one.  The strings are called "patterns".  Patterns are used to
determine if some other string, called the "target", has (or doesn't
have) the characteristics specified by the pattern.  We call this
"matching" the target string against the pattern.  Usually the match is
done by having the target be the first operand, and the pattern be the
second operand, of one of the two binary operators <code class="inline">=~</code>
 and <code class="inline">!~</code>
,
listed in <a href="perlop.html#Binding-Operators">Binding Operators in perlop</a>; and the pattern will have been
converted from an ordinary string by one of the operators in
<a href="perlop.html#Regexp-Quote-Like-Operators">Regexp Quote-Like Operators in perlop</a>, like so:</p>
<pre class="verbatim"><ol><li> <span class="i">$foo</span> =~ <span class="q">m/abc/</span></li></ol></pre><p>This evaluates to true if and only if the string in the variable <code class="inline"><span class="i">$foo</span></code>

contains somewhere in it, the sequence of characters "a", "b", then "c".
(The <code class="inline">=~ <span class="q">m</span></code>
, or match operator, is described in
<a href="perlop.html#m%2fPATTERN%2fmsixpodualngc">m/PATTERN/msixpodualngc in perlop</a>.)</p>
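<p>As a minimal sketch of the two binding operators (the target string and the
variable names <code class="inline">$has_abc</code> and <code class="inline">$lacks_xyz</code> are made up for
illustration, not taken from this page):</p>
<pre class="verbatim"><ol><li> use strict; use warnings;</li><li> my $foo = &quot;xxabcxx&quot;;                       # hypothetical target string</li><li> my $has_abc   = ($foo =~ m/abc/) ? 1 : 0;  # 1: &quot;abc&quot; occurs somewhere in $foo</li><li> my $lacks_xyz = ($foo !~ m/xyz/) ? 1 : 0;  # 1: !~ is true when the match fails</li><li> print &quot;has_abc=$has_abc lacks_xyz=$lacks_xyz\n&quot;;</li></ol></pre>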
<p>Patterns that aren't already stored in some variable must be delimited,
at both ends, by delimiter characters.  These are often, as in the
example above, forward slashes, and the typical way a pattern is written
in documentation is with those slashes.  In most cases, the delimiter
is the same character, fore and aft, but there are a few cases where a
character looks like it has a mirror-image mate, where the opening
version is the beginning delimiter, and the closing one is the ending
delimiter, like</p>
<pre class="verbatim"><ol><li> <span class="i">$foo</span> =~ <span class="q">m&lt;abc&gt;</span></li></ol></pre><p>Most times, the pattern is evaluated in double-quotish context, but it
is possible to choose delimiters to force single-quotish, like</p>
<pre class="verbatim"><ol><li> <span class="i">$foo</span> =~ <span class="q">m&#39;abc&#39;</span></li></ol></pre><p>If the pattern contains its delimiter within it, that delimiter must be
escaped.  Prefixing it with a backslash (<i>e.g.</i>, <code class="inline"><span class="q">&quot;/foo\/bar/&quot;</span></code>
)
serves this purpose.</p>
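<p>The delimiter choices described above can be sketched as follows.  The
target strings and variable names are hypothetical; the single-quotish
case is shown with an <code class="inline">&quot;@&quot;</code> because that is where the lack of
interpolation is easiest to see.</p>
<pre class="verbatim"><ol><li> use strict; use warnings;</li><li> my $path = &quot;/usr/bin&quot;;                       # hypothetical target string</li><li> my $escaped = $path =~ /usr\/bin/ ? 1 : 0;   # 1: the &quot;/&quot; delimiter is escaped inside the pattern</li><li> my $braces  = $path =~ m{usr/bin} ? 1 : 0;   # 1: with {} as delimiters, &quot;/&quot; needs no escape</li><li> # single-quotish context: @example is not interpolated</li><li> my $single  = &#39;user@example.org&#39; =~ m&#39;user@example&#39; ? 1 : 0;   # 1</li></ol></pre>
<p>Choosing a delimiter that does not occur in the pattern usually reads
better than escaping every occurrence.</p>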
<p>Any single character in a pattern matches that same character in the
target string, unless the character is a <i>metacharacter</i> with a special
meaning described in this document.  A sequence of non-metacharacters
matches the same sequence in the target string, as we saw above with
<code class="inline"><a class="l_k" href="functions/m.html">m/abc/</a></code>.</p>
<p>Only a few characters (all of them being ASCII punctuation characters)
are metacharacters.  The most commonly used one is a dot <code class="inline"><span class="q">&quot;.&quot;</span></code>
, which
normally matches almost any character (including a dot itself).</p>
<p>You can cause characters that normally function as metacharacters to be
interpreted literally by prefixing them with a <code class="inline"><span class="q">&quot;\&quot;</span></code>
, just like the
pattern's delimiter must be escaped if it also occurs within the
pattern.  Thus, <code class="inline"><span class="q">&quot;\.&quot;</span></code>
 matches just a literal dot, <code class="inline"><span class="q">&quot;.&quot;</span></code>
 instead of
its normal meaning.  This means that the backslash is also a
metacharacter, so <code class="inline"><span class="q">&quot;\\&quot;</span></code>
 matches a single <code class="inline"><span class="q">&quot;\&quot;</span></code>
.  And a sequence that
contains an escaped metacharacter matches the same sequence (but without
the escape) in the target string.  So, the pattern <code class="inline"><span class="q">/blur\\fl/</span></code>
 would
match any target string that contains the sequence <code class="inline"><span class="q">&quot;blur\fl&quot;</span></code>
.</p>
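<p>A short sketch of the escaping rules just described, using made-up target
strings.  The last target is written in single quotes so that it really
contains a backslash:</p>
<pre class="verbatim"><ol><li> use strict; use warnings;</li><li> my $dot_any     = &quot;a+b&quot; =~ /a.b/  ? 1 : 0;          # 1: an unescaped &quot;.&quot; matches the &quot;+&quot;</li><li> my $dot_literal = &quot;a+b&quot; =~ /a\.b/ ? 1 : 0;          # 0: &quot;\.&quot; matches only a literal dot</li><li> my $backslash   = &#39;blur\fl&#39; =~ /blur\\fl/ ? 1 : 0;  # 1: &quot;\\&quot; matches a single backslash</li></ol></pre>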
<p>The metacharacter <code class="inline"><span class="q">&quot;|&quot;</span></code>
 is used to match one thing or another.  Thus</p>
<pre class="verbatim"><ol><li> <span class="i">$foo</span> =~ <span class="q">m/this|that/</span></li></ol></pre><p>is TRUE if and only if <code class="inline"><span class="i">$foo</span></code>
 contains either the sequence <code class="inline"><span class="q">&quot;this&quot;</span></code>
 or
the sequence <code class="inline"><span class="q">&quot;that&quot;</span></code>
.  Like all metacharacters, prefixing the <code class="inline"><span class="q">&quot;|&quot;</span></code>

with a backslash makes it match the plain punctuation character; in its
case, the VERTICAL LINE.</p>
<pre class="verbatim"><ol><li> <span class="i">$foo</span> =~ <span class="q">m/this\|that/</span></li></ol></pre><p>is TRUE if and only if <code class="inline"><span class="i">$foo</span></code>
 contains the sequence <code class="inline"><span class="q">&quot;this|that&quot;</span></code>
.</p>
<p>You aren't limited to just a single <code class="inline"><span class="q">&quot;|&quot;</span></code>
.</p>
<pre class="verbatim"><ol><li> <span class="i">$foo</span> =~ <span class="q">m/fee|fie|foe|fum/</span></li></ol></pre><p>is TRUE if and only if <code class="inline"><span class="i">$foo</span></code>
 contains any of those 4 sequences from
the children's story "Jack and the Beanstalk".</p>
<p>As you can see, the <code class="inline"><span class="q">&quot;|&quot;</span></code>
 binds less tightly than a sequence of
ordinary characters.  We can override this by using the grouping
metacharacters, the parentheses <code class="inline"><span class="q">&quot;(&quot;</span></code>
 and <code class="inline"><span class="q">&quot;)&quot;</span></code>
.</p>
<pre class="verbatim"><ol><li> <span class="i">$foo</span> =~ <span class="q">m/th(is|at) thing/</span></li></ol></pre><p>is TRUE if and only if <code class="inline"><span class="i">$foo</span></code>
 contains either the sequence <code class="inline"><span class="q">&quot;this</span>
<span class="q">thing&quot;</span></code>
 or the sequence <code class="inline"><span class="q">&quot;that thing&quot;</span></code>
.  The portions of the string
that match the portions of the pattern enclosed in parentheses are
normally made available separately for use later in the pattern,
substitution, or program.  This is called "capturing", and it can get
complicated.  See <a href="#Capture-groups">Capture groups</a>.</p>
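<p>To illustrate how grouping changes what the alternation covers, here is a
small sketch with hypothetical target strings.  In list context the match
returns the captured text, which is how <code class="inline">$group</code> is filled in:</p>
<pre class="verbatim"><ol><li> use strict; use warnings;</li><li> my ($group)   = &quot;I want that thing&quot; =~ m/th(is|at) thing/;  # $group is &quot;at&quot;</li><li> my $ungrouped = &quot;this&quot; =~ m/this|that thing/  ? 1 : 0;     # 1: the alternatives are &quot;this&quot; and &quot;that thing&quot;</li><li> my $grouped   = &quot;this&quot; =~ m/th(is|at) thing/ ? 1 : 0;     # 0: &quot; thing&quot; is required after either alternative</li></ol></pre>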
<p>The first alternative includes everything from the last pattern
delimiter (<code class="inline"><span class="q">&quot;(&quot;</span></code>
, <code class="inline"><span class="q">&quot;(?:&quot;</span></code>
 (described later), <i>etc</i>. or the beginning
of the pattern) up to the first <code class="inline"><span class="q">&quot;|&quot;</span></code>
, and the last alternative
contains everything from the last <code class="inline"><span class="q">&quot;|&quot;</span></code>
 to the next closing pattern
delimiter.  That's why it's common practice to include alternatives in
parentheses: to minimize confusion about where they start and end.</p>
<p>Alternatives are tried from left to right, so the first
alternative found for which the entire expression matches is the one that
is chosen. This means that alternatives are not necessarily greedy. For
example: when matching <code class="inline"><span class="w">foo</span>|<span class="w">foot</span></code>
 against <code class="inline"><span class="q">&quot;barefoot&quot;</span></code>
, only the <code class="inline"><span class="q">&quot;foo&quot;</span></code>

part will match, as that is the first alternative tried, and it successfully
matches the target string. (This might not seem important, but it is
important when you are capturing matched text using parentheses.)</p>
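<p>The left-to-right rule can be checked with captures.  This is only a
sketch; the variable names are illustrative.  The third pattern shows that
a later alternative is still used when the first one cannot make the whole
expression match:</p>
<pre class="verbatim"><ol><li> use strict; use warnings;</li><li> my ($first)    = &quot;barefoot&quot; =~ /(foo|foot)/;   # &quot;foo&quot;: the first alternative already succeeds</li><li> my ($second)   = &quot;barefoot&quot; =~ /(foot|foo)/;   # &quot;foot&quot;: order of the alternatives matters</li><li> my ($anchored) = &quot;barefoot&quot; =~ /(foo|foot)$/;  # &quot;foot&quot;: &quot;foo&quot; is tried first but &quot;$&quot; then fails</li></ol></pre>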
<p>Besides taking away the special meaning of a metacharacter, a prefixed
backslash changes some letter and digit characters away from matching
just themselves to instead have special meaning.  These are called
"escape sequences", and all such are described in <a href="perlrebackslash.html">perlrebackslash</a>.  A
backslash sequence (of a letter or digit) that doesn't currently have
special meaning to Perl will raise a warning if warnings are enabled,
as those are reserved for potential future use.</p>
<p>One such sequence is <code class="inline">\<span class="w">b</span></code>
, which matches a boundary of some sort.
<code class="inline">\<span class="i">b</span><span class="s">{</span><span class="w">wb</span><span class="s">}</span></code>
 and a few others give specialized types of boundaries.
(They are all described in detail starting at
<a href="perlrebackslash.html#%5cb%7b%7d%2c-%5cb%2c-%5cB%7b%7d%2c-%5cB">\b{}, \b, \B{}, \B in perlrebackslash</a>.)  Note that these don't match
characters, but the zero-width spaces between characters.  They are an
example of a <a href="#Assertions">zero-width assertion</a>.  Consider again,</p>
<pre class="verbatim"><ol><li> <span class="i">$foo</span> =~ <span class="q">m/fee|fie|foe|fum/</span></li></ol></pre><p>It evaluates to TRUE if, besides those 4 words, any of the sequences
"feed", "field", "Defoe", "fume", and many others are in <code class="inline"><span class="i">$foo</span></code>
.  By
judicious use of <code class="inline">\<span class="w">b</span></code>
 (or, better, <code class="inline">\<span class="i">b</span><span class="s">{</span><span class="w">wb</span><span class="s">}</span></code>,
which is designed to handle natural language), we can make sure that only the Giant's
words are matched:</p>
<pre class="verbatim"><ol><li> <span class="i">$foo</span> =~ <span class="q">m/\b(fee|fie|foe|fum)\b/</span></li><li> <span class="i">$foo</span> =~ <span class="q">m/\b{wb}(fee|fie|foe|fum)\b{wb}/</span></li></ol></pre><p>The final example shows that the characters <code class="inline"><span class="q">&quot;{&quot;</span></code>
 and <code class="inline"><span class="q">&quot;}&quot;</span></code>
 are
metacharacters.</p>
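<p>A quick way to see the effect of the boundaries is to filter a list of
candidate strings.  The list below is hypothetical, and only the plain
<code class="inline">\b</code> form is used so that the sketch does not depend on
<code class="inline">\b{wb}</code> being available:</p>
<pre class="verbatim"><ol><li> use strict; use warnings;</li><li> my @targets = (&quot;fee&quot;, &quot;feed&quot;, &quot;Defoe&quot;, &quot;fie on you&quot;, &quot;fume&quot;);  # hypothetical inputs</li><li> my @loose   = grep { /fee|fie|foe|fum/ } @targets;          # all five strings match</li><li> my @bounded = grep { /\b(fee|fie|foe|fum)\b/ } @targets;    # only &quot;fee&quot; and &quot;fie on you&quot;</li></ol></pre>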
<p>Another use for escape sequences is to specify characters that cannot
(or which you prefer not to) be written literally.  These are described
in detail in <a href="perlrebackslash.html#Character-Escapes">Character Escapes in perlrebackslash</a>, but the next three
paragraphs briefly describe some of them.</p>
<p>Various control characters can be written in C language style: <code class="inline"><span class="q">&quot;\n&quot;</span></code>

matches a newline, <code class="inline"><span class="q">&quot;\t&quot;</span></code>
 a tab, <code class="inline"><span class="q">&quot;\r&quot;</span></code>
 a carriage return, <code class="inline"><span class="q">&quot;\f&quot;</span></code>
 a
form feed, <i>etc</i>.</p>
<p>More generally, <code class="inline">\<i>nnn</i></code>, where <i>nnn</i> is a string of three octal
digits, matches the character whose native code point is <i>nnn</i>.  You
can easily run into trouble if you don't have exactly three digits.  So
always use three, or since Perl 5.14, you can use <code class="inline">\<span class="i">o</span><span class="s">{</span>...<span class="s">}</span></code>
 to specify
any number of octal digits.</p>
<p>Similarly, <code class="inline">\x<i>nn</i></code>, where <i>nn</i> are hexadecimal digits, matches the
character whose native ordinal is <i>nn</i>.  Again, not using exactly two
digits is a recipe for disaster, but you can use <code class="inline">\<span class="i">x</span><span class="s">{</span>...<span class="s">}</span></code>
 to specify
any number of hex digits.</p>
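<p>The escape forms from the last three paragraphs, as a brief sketch.  It
assumes an ASCII platform, where <code class="inline">&quot;A&quot;</code> has code point 65 (octal 101, hex
41); the variable names are illustrative only:</p>
<pre class="verbatim"><ol><li> use strict; use warnings;</li><li> my $tab   = &quot;a\tb&quot; =~ /a\tb/          ? 1 : 0;  # 1: &quot;\t&quot; in the pattern matches a tab</li><li> my $octal = &quot;A&quot;    =~ /\101/          ? 1 : 0;  # 1: exactly three octal digits (ASCII assumed)</li><li> my $hex   = &quot;A&quot;    =~ /\x41/          ? 1 : 0;  # 1: exactly two hex digits</li><li> my $brace = &quot;AA&quot;   =~ /\x{41}\o{101}/ ? 1 : 0;  # 1: braced forms; \o{...} needs Perl 5.14 or later</li></ol></pre>
<p>The braced forms avoid any ambiguity about where the digits end.</p>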
<p>Besides being a metacharacter, the <code class="inline"><span class="q">&quot;.&quot;</span></code>
 is an example of a "character
class", something that can match any single character of a given set of
them.  In its case, the set is just about all possible characters.  Perl
predefines several character classes besides the <code class="inline"><span class="q">&quot;.&quot;</span></code>
; there is a
separate reference page about just these, <a href="perlrecharclass.html">perlrecharclass</a>.</p>
<p>You can define your own custom character classes, by putting into your
pattern in the appropriate place(s), a list of all the characters you
want in the set.  You do this by enclosing the list within <code class="inline"><span class="s">[</span><span class="s">]</span></code>
 bracket
characters.  These are called "bracketed character classes" when we are
being precise, but often the word "bracketed" is dropped.  (Dropping it
usually doesn't cause confusion.)  This means that the <code class="inline"><span class="q">&quot;[&quot;</span></code>
 character
is another metacharacter.  It doesn't match anything just by itself; it
is used only to tell Perl that what follows it is a bracketed character
class.  If you want to match a literal left square bracket, you must
escape it, like <code class="inline"><span class="q">&quot;\[&quot;</span></code>
.  The matching <code class="inline"><span class="q">&quot;]&quot;</span></code>
 is also a metacharacter;
again it doesn't match anything by itself, but just marks the end of
your custom class to Perl.  It is an example of a "sometimes
metacharacter".  It isn't a metacharacter if there is no corresponding
<code class="inline"><span class="q">&quot;[&quot;</span></code>
, and matches its literal self:</p>
<pre class="verbatim"><ol><li> <a class="l_k" href="functions/print.html">print</a> <span class="q">&quot;]&quot;</span> =~ <span class="q">/]/</span><span class="sc">;</span>  <span class="c"># prints 1</span></li></ol></pre><p>The list of characters within the character class gives the set of
characters matched by the class.  <code class="inline"><span class="q">&quot;[abc]&quot;</span></code>
 matches a single "a" or "b"
or "c".  But if the first character after the <code class="inline"><span class="q">&quot;[&quot;</span></code>
 is <code class="inline"><span class="q">&quot;^&quot;</span></code>
, the
class matches any character not in the list.  Within a list, the <code class="inline"><span class="q">&quot;-&quot;</span></code>

character specifies a range of characters, so that <code class="inline"><span class="w">a</span>-z</code>
 represents all
characters between "a" and "z", inclusive.  If you want either <code class="inline"><span class="q">&quot;-&quot;</span></code>
 or
<code class="inline"><span class="q">&quot;]&quot;</span></code>
 itself to be a member of a class, put it at the start of the list
(possibly after a <code class="inline"><span class="q">&quot;^&quot;</span></code>
), or escape it with a backslash.  <code class="inline"><span class="q">&quot;-&quot;</span></code>
 is
also taken literally when it is at the end of the list, just before the
closing <code class="inline"><span class="q">&quot;]&quot;</span></code>
.  (The following all specify the same class of three
characters: <code class="inline"><span class="s">[</span>-<span class="w">az</span><span class="s">]</span></code>
, <code class="inline"><span class="s">[</span><span class="w">az</span>-<span class="s">]</span></code>
, and <code class="inline"><span class="s">[</span><span class="w">a</span>\-z<span class="s">]</span></code>
.  All are different from
<code class="inline"><span class="s">[</span><span class="w">a</span>-z<span class="s">]</span></code>
, which specifies a class containing twenty-six characters, even
on EBCDIC-based character sets.)</p>
<p>There is lots more to bracketed character classes; full details are in
<a href="perlrecharclass.html#Bracketed-Character-Classes">Bracketed Character Classes in perlrecharclass</a>.</p>
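<p>The class rules above can be summarised in a small sketch (single-character
targets chosen purely for illustration):</p>
<pre class="verbatim"><ol><li> use strict; use warnings;</li><li> my $member   = &quot;b&quot; =~ /[abc]/  ? 1 : 0;  # 1: &quot;b&quot; is in the list</li><li> my $negated  = &quot;b&quot; =~ /[^abc]/ ? 1 : 0;  # 0: a leading &quot;^&quot; complements the class</li><li> my $in_range = &quot;m&quot; =~ /[a-z]/  ? 1 : 0;  # 1: &quot;-&quot; between two characters forms a range</li><li> my $escaped  = &quot;m&quot; =~ /[a\-z]/ ? 1 : 0;  # 0: the class holds just &quot;a&quot;, &quot;-&quot; and &quot;z&quot;</li><li> my $dash     = &quot;-&quot; =~ /[-az]/  ? 1 : 0;  # 1: a leading &quot;-&quot; is taken literally</li></ol></pre>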
<a name="Metacharacters"></a><h3>Metacharacters

        </h3>
<p><a href="#The-Basics">The Basics</a> introduced some of the metacharacters.  This section
gives them all.  Most of them have the same meaning as in the <i>egrep</i>
command.</p>
<p>Only the <code class="inline"><span class="q">&quot;\&quot;</span></code>
 is always a metacharacter.  The others are metacharacters
just sometimes.  The following table lists all of them, summarizes
their use, and gives the contexts where they are metacharacters.
Outside those contexts or if prefixed by a <code class="inline"><span class="q">&quot;\&quot;</span></code>
, they match their
corresponding punctuation character.  In some cases, their meaning
varies depending on various pattern modifiers that alter the default
behaviors.  See <a href="#Modifiers">Modifiers</a>.</p>
<pre class="verbatim"><ol><li>            PURPOSE                                  WHERE</li><li> \   Escape the next character                    Always, except when</li><li>                                                  escaped by another \</li><li> ^   Match the beginning of the string            Not in []</li><li>       (or line, if /m is used)</li><li> ^   Complement the [] class                      At the beginning of []</li><li> .   Match any single character except newline    Not in []</li><li>       (under /s, includes newline)</li><li> $   Match the end of the string                  Not in [], but can</li><li>       (or before newline at the end of the       mean interpolate a</li><li>       string; or before any newline if /m is     scalar</li><li>       used)</li><li> |   Alternation                                  Not in []</li><li> ()  Grouping                                     Not in []</li><li> [   Start Bracketed Character class              Not in []</li><li> ]   End Bracketed Character class                Only in [], and</li><li>                                                    not first</li><li> *   Matches the preceding element 0 or more      Not in []</li><li>       times</li><li> +   Matches the preceding element 1 or more      Not in []</li><li>       times</li><li> ?   Matches the preceding element 0 or 1         Not in []</li><li>       times</li><li> {   Starts a sequence that gives number(s)       Not in []</li><li>       of times the preceding element can be</li><li>       matched</li><li> {   when following certain escape sequences</li><li>       starts a modifier to the meaning of the</li><li>       sequence</li><li> }   End sequence started by {</li><li> -   Indicates a range                            Only in [] interior</li></ol></pre><p>Notice that most of the metacharacters lose their special meaning when
they occur in a bracketed character class, except <code class="inline"><span class="q">&quot;^&quot;</span></code>
 has a different
meaning when it is at the beginning of such a class.  And <code class="inline"><span class="q">&quot;-&quot;</span></code>
 and <code class="inline"><span class="q">&quot;]&quot;</span></code>

are metacharacters only at restricted positions within bracketed
character classes; while <code class="inline"><span class="q">&quot;}&quot;</span></code>
 is a metacharacter only when closing a
special construct started by <code class="inline"><span class="q">&quot;{&quot;</span></code>
.</p>
<p>In double-quotish context, as is usually the case,  you need to be
careful about <code class="inline"><span class="q">&quot;$&quot;</span></code>
 and the non-metacharacter <code class="inline"><span class="q">&quot;@&quot;</span></code>
.  Those could
interpolate variables, which may or may not be what you intended.</p>
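<p>The interpolation caveat above can be sketched as follows.  This is only an illustration: the strings and variable names are made up for the example, and <code class="inline">\Q...\E</code> is shown as one common way to neutralize metacharacters in an interpolated variable.</p>

```perl
use strict;
use warnings;

my $addr  = 'user@example.com';
my $price = 'cost: $5';

# "@" is not a regex metacharacter, but left unescaped it could be read
# as the start of an array to interpolate, so it is backslashed here.
my $at_ok = $addr =~ /user\@example\.com/ ? 1 : 0;

# "\$" matches a literal dollar sign; an unescaped "$5" would be read
# as a variable instead.
my $dollar_ok = $price =~ /\$5/ ? 1 : 0;

# Interpolation is often what you want.  \Q...\E quotes any
# metacharacters that the interpolated value happens to contain.
my $literal = 'a.c';
my $quoted  = 'abc' =~ /^\Q$literal\E$/ ? 1 : 0;   # "." is literal: no match
my $raw     = 'abc' =~ /^$literal$/     ? 1 : 0;   # "." is a wildcard: match

print "at=$at_ok dollar=$dollar_ok quoted=$quoted raw=$raw\n";
```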
<p>These rules were designed for compactness of expression, rather than
legibility and maintainability.  The <a href="#%2fx-and-%2fxx">/x and /xx</a> pattern
modifiers allow you to insert white space to improve readability.  And
use of <code class="inline"><a href="re.html#'strict'-mode">re 'strict'</a></code> adds extra checking to
catch some typos that might silently compile into something unintended.</p>
<p>By default, the <code class="inline"><span class="q">&quot;^&quot;</span></code>
 character is guaranteed to match only the
beginning of the string, the <code class="inline"><span class="q">&quot;$&quot;</span></code>
 character only the end (or before the
newline at the end), and Perl does certain optimizations with the
assumption that the string contains only one line.  Embedded newlines
will not be matched by <code class="inline"><span class="q">&quot;^&quot;</span></code>
 or <code class="inline"><span class="q">&quot;$&quot;</span></code>
.  You may, however, wish to treat a
string as a multi-line buffer, such that the <code class="inline"><span class="q">&quot;^&quot;</span></code>
 will match after any
newline within the string (except if the newline is the last character in
the string), and <code class="inline"><span class="q">&quot;$&quot;</span></code>
 will match before any newline.  At the
cost of a little more overhead, you can do this by using the
<a href="#%2fm">/m</a> modifier on the pattern match operator.  (Older programs
did this by setting <code class="inline"><span class="i">$*</span></code>
, but this option was removed in perl 5.10.)
  </p>
<p>To simplify multi-line substitutions, the <code class="inline"><span class="q">&quot;.&quot;</span></code>
 character never matches a
newline unless you use the <a href="#s">/s</a> modifier, which in effect tells
Perl to pretend the string is a single line--even if it isn't.
 </p>
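<p>As a rough illustration of the default anchoring rules and of what <code class="inline">/m</code> and <code class="inline">/s</code> change, consider the sketch below.  The sample text and variable names are invented for the example.</p>

```perl
use strict;
use warnings;

my $text = "first\nsecond\nthird\n";

# Default: "^" anchors only at the very start of the string.
my @default = $text =~ /^(\w+)/g;

# /m: "^" also matches just after each embedded newline.
my @multi = $text =~ /^(\w+)/mg;

# Default: "." stops at a newline.  /s lets it cross newlines.
my ($line) = $text =~ /(.*)/;
my ($all)  = $text =~ /(.*)/s;

# "$" matches before the final newline by default, and before any
# newline only under /m.
my $end   = $text =~ /third$/   ? 1 : 0;
my $mid   = $text =~ /second$/  ? 1 : 0;
my $mid_m = $text =~ /second$/m ? 1 : 0;

print "default: @default\n";
print "multi:   @multi\n";
print "end=$end mid=$mid mid_m=$mid_m\n";
```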
<a name="Modifiers"></a><h2>Modifiers</h2>
<a name="Overview"></a><h3>Overview</h3>
<p>The default behavior for matching can be changed, using various
modifiers.  Modifiers that relate to the interpretation of the pattern
are listed just below.  Modifiers that alter the way a pattern is used
by Perl are detailed in <a href="perlop.html#Regexp-Quote-Like-Operators">Regexp Quote-Like Operators in perlop</a> and
<a href="perlop.html#Gory-details-of-parsing-quoted-constructs">Gory details of parsing quoted constructs in perlop</a>.</p>
<ul>
<li><a name="*m*"></a><b><b><code class="inline"><a class="l_k" href="functions/m.html">m</a></code></b>
   </b>
<p>Treat the string being matched against as multiple lines.  That is, change <code class="inline"><span class="q">&quot;^&quot;</span></code>
 and <code class="inline"><span class="q">&quot;$&quot;</span></code>
 from matching
the start of the string's first line and the end of its last line to
matching the start and end of each line within the string.</p>
</li>
<li><a name="*s*"></a><b><b><code class="inline"><a class="l_k" href="functions/s.html">s</a></code></b>
  
</b>
<p>Treat the string as a single line.  That is, change <code class="inline"><span class="q">&quot;.&quot;</span></code>
 to match any character
whatsoever, even a newline, which normally it would not match.</p>
<p>Used together, as <code class="inline">/ms</code>, they let the <code class="inline"><span class="q">&quot;.&quot;</span></code>
 match any character whatsoever,
while still allowing <code class="inline"><span class="q">&quot;^&quot;</span></code>
 and <code class="inline"><span class="q">&quot;$&quot;</span></code>
 to match, respectively, just after
and just before newlines within the string.</p>
</li>
<li><a name="*i*"></a><b><b><code class="inline"><span class="w">i</span></code>
</b>
  
</b>
<p>Do case-insensitive pattern matching.  For example, "A" will match "a"
under <code class="inline">/i</code>.</p>
<p>If locale matching rules are in effect, the case map is taken from the
current
locale for code points less than 256, and from Unicode rules for larger
code points.  However, matches that would cross the Unicode
rules/non-Unicode rules boundary (ords 255/256) will not succeed, unless
the locale is a UTF-8 one.  See <a href="perllocale.html">perllocale</a>.</p>
<p>There are a number of Unicode characters that match a sequence of
multiple characters under <code class="inline">/i</code>.  For example,
<code class="inline"><span class="w">LATIN</span> <span class="w">SMALL</span> <span class="w">LIGATURE</span> <span class="w">FI</span></code>
 should match the sequence <code class="inline"><span class="w">fi</span></code>
.  Perl is not
currently able to do this when the multiple characters are in the pattern and
are split between groupings, or when one or more are quantified.  Thus</p>
<pre class="verbatim"><ol><li> <span class="q">&quot;\N{LATIN SMALL LIGATURE FI}&quot;</span> =~ <span class="q">/fi/i</span><span class="sc">;</span>          <span class="c"># Matches</span></li><li> <span class="q">&quot;\N{LATIN SMALL LIGATURE FI}&quot;</span> =~ <span class="q">/[fi][fi]/i</span><span class="sc">;</span>    <span class="c"># Doesn&#39;t match!</span></li><li> <span class="q">&quot;\N{LATIN SMALL LIGATURE FI}&quot;</span> =~ <span class="q">/fi*/i</span><span class="sc">;</span>         <span class="c"># Doesn&#39;t match!</span></li><li></li><li> <span class="c"># The below doesn&#39;t match, and it isn&#39;t clear what $1 and $2 would</span></li><li> <span class="c"># be even if it did!!</span></li><li> <span class="q">&quot;\N{LATIN SMALL LIGATURE FI}&quot;</span> =~ <span class="q">/(f)(i)/i</span><span class="sc">;</span>      <span class="c"># Doesn&#39;t match!</span></li></ol></pre><p>Perl doesn't match multiple characters in a bracketed
character class unless the character that maps to them is explicitly
mentioned, and it doesn't match them at all if the character class is
inverted, which otherwise could be highly confusing.  See
<a href="perlrecharclass.html#Bracketed-Character-Classes">Bracketed Character Classes in perlrecharclass</a>, and
<a href="perlrecharclass.html#Negation">Negation in perlrecharclass</a>.</p>
</li>
<li><a name="*x*-and-*xx*"></a><b><b><code class="inline"><span class="w">x</span></code>
</b> and <b><code class="inline"><span class="w">xx</span></code>
</b>
</b>
<p>Extend your pattern's legibility by permitting whitespace and comments.
Details in <a href="#%2fx-and-%2fxx">/x and /xx</a>.</p>
</li>
<li><a name="*p*"></a><b><b><code class="inline"><span class="w">p</span></code>
</b>
  </b>
<p>Preserve the string matched such that <code class="inline"><span class="i">$</span>{<span class="w">^PREMATCH</span>}</code>
, <code class="inline"><span class="i">$</span>{<span class="w">^MATCH</span>}</code>
, and
<code class="inline"><span class="i">$</span>{<span class="w">^POSTMATCH</span>}</code>
 are available for use after matching.</p>
<p>In Perl 5.20 and higher this is ignored. Due to a new copy-on-write
mechanism, <code class="inline"><span class="i">$</span>{<span class="w">^PREMATCH</span>}</code>
, <code class="inline"><span class="i">$</span>{<span class="w">^MATCH</span>}</code>
, and <code class="inline"><span class="i">$</span>{<span class="w">^POSTMATCH</span>}</code>
 will be available
after the match regardless of the modifier.</p>
</li>
<li><a name="*a*%2c-*d*%2c-*l*%2c-and-*u*"></a><b><b><code class="inline"><span class="w">a</span></code>
</b>, <b><code class="inline"><span class="w">d</span></code>
</b>, <b><code class="inline"><span class="w">l</span></code>
</b>, and <b><code class="inline"><span class="w">u</span></code>
</b>
   </b>
<p>These modifiers, all new in 5.14, affect which character-set rules
(Unicode, <i>etc</i>.) are used, as described below in
<a href="#Character-set-modifiers">Character set modifiers</a>.</p>
</li>
<li><a name="*n*"></a><b><b><code class="inline"><span class="w">n</span></code>
</b>
  
</b>
<p>Prevent the grouping metacharacters <code class="inline"><span class="s">(</span><span class="s">)</span></code>
 from capturing. This modifier,
new in 5.22, will stop <code class="inline"><span class="i">$1</span></code>
, <code class="inline"><span class="i">$2</span></code>
, <i>etc</i>. from being filled in.</p>
<pre class="verbatim"><ol><li>  <span class="q">&quot;hello&quot;</span> =~ <span class="q">/(hi|hello)/</span><span class="sc">;</span>   <span class="c"># $1 is &quot;hello&quot;</span></li><li>  <span class="q">&quot;hello&quot;</span> =~ <span class="q">/(hi|hello)/</span><span class="w">n</span><span class="sc">;</span>  <span class="c"># $1 is undef</span></li></ol></pre><p>This is equivalent to putting <code class="inline">?:</code> at the beginning of every capturing group:</p>
<pre class="verbatim"><ol><li>  <span class="q">&quot;hello&quot;</span> =~ <span class="q">/(?:hi|hello)/</span><span class="sc">;</span> <span class="c"># $1 is undef</span></li></ol></pre><p><code class="inline"><span class="q">/n</span></code>
 can be negated on a per-group basis. Alternatively, named captures
may still be used.</p>
<pre class="verbatim"><ol><li>  <span class="q">&quot;hello&quot;</span> =~ <span class="q">/(?-n:(hi|hello))/</span><span class="w">n</span><span class="sc">;</span>   <span class="c"># $1 is &quot;hello&quot;</span></li><li>  <span class="q">&quot;hello&quot;</span> =~ <span class="q">/(?&lt;greet&gt;hi|hello)/</span><span class="w">n</span><span class="sc">;</span> <span class="c"># $1 is &quot;hello&quot;, $+{greet} is</span></li><li>                                    <span class="c"># &quot;hello&quot;</span></li></ol></pre></li>
<li><a name="Other-Modifiers"></a><b>Other Modifiers</b>
<p>There are a number of flags that can be found at the end of regular
expression constructs that are <i>not</i> generic regular expression flags, but
apply to the operation being performed, like matching or substitution (<code class="inline"><a class="l_k" href="functions/m.html">m//</a></code>
or <code class="inline"><a class="l_k" href="functions/s.html">s///</a></code> respectively).</p>
<p>Flags described further in
<a href="perlretut.html#Using-regular-expressions-in-Perl">Using regular expressions in Perl in perlretut</a> are:</p>
<pre class="verbatim"><ol><li>  <span class="w">c</span>  - <span class="w">keep</span> <span class="w">the</span> <span class="w">current</span> <span class="w">position</span> <span class="w">during</span> <span class="w">repeated</span> <span class="w">matching</span></li><li>  <span class="w">g</span>  - <span class="w">globally</span> <span class="w">match</span> <span class="w">the</span> <span class="w">pattern</span> <span class="w">repeatedly</span> <span class="w">in</span> <span class="w">the</span> <span class="w">string</span></li></ol></pre><p>Substitution-specific modifiers described in
<a href="perlop.html#s%2fPATTERN%2fREPLACEMENT%2fmsixpodualngcer">s/PATTERN/REPLACEMENT/msixpodualngcer in perlop</a> are:</p>
<pre class="verbatim"><ol><li>  <span class="w">e</span>  - <span class="w">evaluate</span> <span class="w">the</span> <span class="w">right</span>-<span class="w">hand</span> <span class="w">side</span> <span class="w">as</span> <span class="w">an</span> <span class="w">expression</span></li><li>  <span class="w">ee</span> - <span class="w">evaluate</span> <span class="w">the</span> <span class="w">right</span> <span class="w">side</span> <span class="w">as</span> <span class="w">a</span> <span class="w">string</span> <span class="w">then</span> <a class="l_k" href="functions/eval.html">eval</a> <span class="w">the</span> <span class="w">result</span></li><li>  <span class="w">o</span>  - <span class="w">pretend</span> <span class="w">to</span> <span class="w">optimize</span> <span class="w">your</span> <span class="w">code</span><span class="cm">,</span> <span class="w">but</span> <span class="w">actually</span> <span class="w">introduce</span> <span class="w">bugs</span></li><li>  <span class="w">r</span>  - <span class="w">perform</span> <span class="w">non</span>-<span class="w">destructive</span> <span class="w">substitution</span> <a class="l_k" href="functions/and.html">and</a> <a class="l_k" href="functions/return.html">return</a> <span class="w">the</span> <span class="w">new</span> <span class="w">value</span></li></ol></pre></li>
</ul>
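<p>The operation-level flags listed under "Other Modifiers" (<code class="inline">g</code>, <code class="inline">c</code>, <code class="inline">e</code>, and <code class="inline">r</code>) might be exercised as in this sketch.  The input string and the small tokenizer loop are hypothetical, chosen only to show each flag once; <code class="inline">/r</code> assumes Perl 5.14 or later.</p>

```perl
use strict;
use warnings;
use v5.14;    # s///r is assumed to need 5.14 or later

my $csv = "a=1,b=2,c=3";

# /g in list context returns every capture.
my @keys = $csv =~ /(\w)=/g;

# /e evaluates the replacement as a Perl expression.
(my $doubled = $csv) =~ s/(\d+)/$1 * 2/ge;

# /r returns the modified copy and leaves $csv untouched.
my $upper = $csv =~ s/([a-z])/\U$1/gr;

# /g in scalar context remembers pos(); /c keeps pos() where it is
# when a match fails, so several patterns can take turns on one string.
my @tokens;
pos($csv) = 0;
for ($csv) {
    while (1) {
        if    (/\G(\w)=/gc) { push @tokens, "key:$1" }
        elsif (/\G(\d+)/gc) { push @tokens, "val:$1" }
        elsif (/\G,/gc)     { next }
        else                { last }
    }
}

print "keys=@keys doubled=$doubled upper=$upper\n";
print "tokens=@tokens\n";
```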
<p>Regular expression modifiers are usually written in documentation
as <i>e.g.</i>, "the <code class="inline">/x</code> modifier", even though the delimiter
in question might not really be a slash.  The modifiers <code class="inline"><span class="q">/imnsxadlup</span></code>

may also be embedded within the regular expression itself using
the <code class="inline">(?...)</code> construct, see <a href="#Extended-Patterns">Extended Patterns</a> below.</p>
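<p>A minimal sketch of embedding modifiers inside the pattern follows.  The strings are invented, and <code class="inline">$from_config</code> is a hypothetical variable standing in for a pattern that arrives as plain text, where no trailing modifier can be attached.</p>

```perl
use strict;
use warnings;

# (?i) switches on case-insensitivity for the rest of the enclosing
# group, which here is the whole pattern.
my $whole = "HELLO world" =~ /(?i)hello world/ ? 1 : 0;

# (?i:...) limits the modifier to the parenthesized part.
my $scoped = "HELLO world" =~ /(?i:hello) world/ ? 1 : 0;
my $miss   = "HELLO WORLD" =~ /(?i:hello) world/ ? 1 : 0;

# A pattern held in a string can carry its own modifiers.
my $from_config = '(?x) \d+ \s* kg';
my $cfg = "12 kg" =~ /^$from_config\z/ ? 1 : 0;

print "whole=$whole scoped=$scoped miss=$miss cfg=$cfg\n";
```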
<a name="Details-on-some-modifiers"></a><h3>Details on some modifiers</h3>
<p>Some of the modifiers require more explanation than given in the
<a href="#Overview">Overview</a> above.</p>
<h4><code class="inline">/x</code> and  <code class="inline"><span class="q">/xx</span></code>
</h4>
<p>A single <code class="inline">/x</code> tells
the regular expression parser to ignore most whitespace that is neither
backslashed nor within a bracketed character class.  You can use this to
break up your regular expression into more readable parts.
Also, the <code class="inline"><span class="q">&quot;#&quot;</span></code>
 character is treated as a metacharacter introducing a
comment that runs up to the pattern's closing delimiter, or to the end
of the current line if the pattern extends onto the next line.  Hence,
this is very much like an ordinary Perl code comment.  (You can include
the closing delimiter within the comment only if you precede it with a
backslash, so be careful!)</p>
<p>Use of <code class="inline">/x</code> means that if you want real
whitespace or <code class="inline"><span class="q">&quot;#&quot;</span></code>
 characters in the pattern (outside a bracketed character
class, which is unaffected by <code class="inline">/x</code>), then you'll either have to
escape them (using backslashes or <code class="inline">\<span class="w">Q</span>...\<span class="w">E</span></code>
) or encode them using octal,
hex, or <code class="inline">\<span class="w">N</span><span class="s">{</span><span class="s">}</span></code>
 escapes.
It is ineffective to try to continue a comment onto the next line by
escaping the <code class="inline">\<span class="w">n</span></code>
 with a backslash or <code class="inline">\<span class="w">Q</span></code>
.</p>
<p>You can use <a href="#(%3f%23text)">(?#text)</a> to create a comment that ends earlier than the
end of the current line, but <code class="inline"><span class="w">text</span></code>
 also can't contain the closing
delimiter unless escaped with a backslash.</p>
<p>A common pitfall is to forget that <code class="inline"><span class="q">&quot;#&quot;</span></code>
 characters begin a comment under
<code class="inline">/x</code> and are not matched literally.  Just keep that in mind when trying
to puzzle out why a particular <code class="inline">/x</code> pattern isn't working as expected.</p>
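<p>The <code class="inline"><span class="q">&quot;#&quot;</span></code> pitfall just described can be reproduced with a short sketch like the one below.  The sample strings are made up; the point is only that an unescaped <code class="inline"><span class="q">&quot;#&quot;</span></code> silently truncates the pattern.</p>

```perl
use strict;
use warnings;

# Pitfall: the unescaped "#" starts a comment, so everything after it
# is dropped and the effective pattern is just /^issue\s/.
my $loose = "issue none" =~ /^ issue \s #\d+ \z/x ? 1 : 0;

# Escaping the "#" makes it a literal character again.
my $strict = "issue none" =~ /^ issue \s \#\d+ \z/x ? 1 : 0;

# A bracketed character class is unaffected by /x, so [#] also works.
my $ok = "issue #42" =~ /^ issue \s [#] \d+ \z/x ? 1 : 0;

print "loose=$loose strict=$strict ok=$ok\n";
```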
<p>Starting in Perl v5.26, if the modifier has a second <code class="inline"><span class="q">&quot;x&quot;</span></code>
 within it,
it does everything that a single <code class="inline">/x</code> does, but additionally
non-backslashed SPACE and TAB characters within bracketed character
classes are also generally ignored, and hence can be added to make the
classes more readable.</p>
<pre class="verbatim"><ol><li>    <span class="q">/ [d-e g-i 3-7]/xx</span></li><li>    /<span class="s">[</span> ! <span class="i">@ &quot;</span> <span class="c"># $ % ^ &amp; * () = ? &lt;&gt; &#39; ]/xx</span></li></ol></pre><p>may be easier to grasp than the squashed equivalents</p>
<pre class="verbatim"><ol><li>    <span class="q">/[d-eg-i3-7]/</span></li><li>    /<span class="s">[</span>!<span class="i">@&quot;</span><span class="c">#$%^&amp;*()=?&lt;&gt;&#39;]/</span></li></ol></pre><p>Taken together, these features go a long way towards
making Perl's regular expressions more readable.  Here's an example:</p>
<pre class="verbatim"><ol><li>    <span class="c"># Delete (most) C comments.</span></li><li>    <span class="i">$program</span> =~ <span class="q">s {</span></li><li>	<span class="q">	/\*	# Match the opening delimiter.</span></li><li>	<span class="q">	.*?	# Match a minimal number of characters.</span></li><li>	<span class="q">	\*/	# Match the closing delimiter.</span></li><li>    <span class="q">    } []gsx</span><span class="sc">;</span></li></ol></pre><p>Note that anything inside
a <code class="inline">\<span class="w">Q</span>...\<span class="w">E</span></code>
 stays unaffected by <code class="inline">/x</code>.  And note that <code class="inline">/x</code> doesn't affect
space interpretation within a single multi-character construct.  For
example in <code class="inline">\<span class="i">x</span><span class="s">{</span>...<span class="s">}</span></code>
, regardless of the <code class="inline">/x</code> modifier, there can be no
spaces.  Same for a <a href="#Quantifiers">quantifier</a> such as <code class="inline"><span class="s">{</span><span class="n">3</span><span class="s">}</span></code>
 or
<code class="inline"><span class="s">{</span><span class="n">5</span><span class="cm">,</span><span class="s">}</span></code>
.  Similarly, <code class="inline">(?:...)</code> can't have a space between the <code class="inline"><span class="q">&quot;(&quot;</span></code>
,
<code class="inline"><span class="q">&quot;?&quot;</span></code>
, and <code class="inline"><span class="q">&quot;:&quot;</span></code>
.  Within any delimiters for such a
construct, allowed spaces are not affected by <code class="inline">/x</code>, and depend on the
construct.  For example, <code class="inline">\<span class="i">x</span><span class="s">{</span>...<span class="s">}</span></code>
 can't have spaces because hexadecimal
numbers don't have spaces in them.  But, Unicode properties can have spaces, so
in <code class="inline">\<span class="i">p</span><span class="s">{</span>...<span class="s">}</span></code>
 there can be spaces that follow the Unicode rules, for which see
<a href="perluniprops.html#Properties-accessible-through-%5cp%7b%7d-and-%5cP%7b%7d">Properties accessible through \p{} and \P{} in perluniprops</a>.
</p>
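<p>To make the whitespace rules for <code class="inline">/x</code> and <code class="inline">/xx</code> concrete, here is a small sketch.  It assumes Perl 5.26 or later (needed for <code class="inline">/xx</code>), and the test strings are arbitrary.</p>

```perl
use strict;
use warnings;
use v5.26;    # /xx is assumed to need 5.26 or later

# Under /x, unescaped whitespace outside [] is ignored...
my $x_ignored = "ab" =~ / a b /x ? 1 : 0;

# ...but a backslashed space, or a space inside [], still matches
# a literal space.
my $x_escaped  = "a b" =~ / a \  b /x  ? 1 : 0;
my $x_in_class = "a b" =~ / a [ ] b /x ? 1 : 0;

# Under /xx, blanks inside [] are ignored too, so this class is the
# same as [d-eg-i] and no longer contains a space.
my $xx_class = "h" =~ / ^ [d-e g-i] \z /xx ? 1 : 0;
my $xx_space = " " =~ / ^ [d-e g-i] \z /xx ? 1 : 0;

print "x: $x_ignored $x_escaped $x_in_class  xx: $xx_class $xx_space\n";
```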
<p>The set of characters that are deemed whitespace are those that Unicode
calls "Pattern White Space", namely:</p>
<pre class="verbatim"><ol><li> <span class="w">U</span>+<span class="n">0009</span> <span class="w">CHARACTER</span> <span class="w">TABULATION</span></li><li> <span class="w">U</span>+<span class="n">000</span><span class="w">A</span> <span class="w">LINE</span> <span class="w">FEED</span></li><li> <span class="w">U</span>+<span class="n">000</span><span class="w">B</span> <span class="w">LINE</span> <span class="w">TABULATION</span></li><li> <span class="w">U</span>+<span class="n">000</span><span class="w">C</span> <span class="w">FORM</span> <span class="w">FEED</span></li><li> <span class="w">U</span>+<span class="n">000</span><span class="w">D</span> <span class="w">CARRIAGE</span> <span class="w">RETURN</span></li><li> <span class="w">U</span>+<span class="n">0020</span> <span class="w">SPACE</span></li><li> <span class="w">U</span>+<span class="n">0085</span> <span class="w">NEXT</span> <span class="w">LINE</span></li><li> <span class="w">U</span>+<span class="n">200</span><span class="w">E</span> <span class="w">LEFT</span>-<span class="w">TO</span>-<span class="w">RIGHT</span> <span class="w">MARK</span></li><li> <span class="w">U</span>+<span class="n">200</span><span class="w">F</span> <span class="w">RIGHT</span>-<span class="w">TO</span>-<span class="w">LEFT</span> <span class="w">MARK</span></li><li> <span class="w">U</span>+<span class="n">2028</span> <span class="w">LINE</span> <span class="w">SEPARATOR</span></li><li> <span class="w">U</span>+<span class="n">2029</span> <span class="w">PARAGRAPH</span> <span class="w">SEPARATOR</span></li></ol></pre><h4>Character set modifiers</h4>
<p><code class="inline">/d</code>, <code class="inline"><span class="q">/u</span></code>
, <code class="inline"><span class="q">/a</span></code>
, and <code class="inline"><span class="q">/l</span></code>
, available starting in 5.14, are called
the character set modifiers; they affect the character set rules
used for the regular expression.</p>
<p>The <code class="inline">/d</code>, <code class="inline"><span class="q">/u</span></code>
, and <code class="inline"><span class="q">/l</span></code>
 modifiers are not likely to be of much use
to you, and so you need not worry about them very much.  They exist for
Perl's internal use, so that complex regular expression data structures
can be automatically serialized and later exactly reconstituted,
including all their nuances.  But, since Perl can't keep a secret, and
there may be rare instances where they are useful, they are documented
here.</p>
<p>The <code class="inline"><span class="q">/a</span></code>
 modifier, on the other hand, may be useful.  Its purpose is to
allow code that is to work mostly on ASCII data to not have to concern
itself with Unicode.</p>
<p>Briefly, <code class="inline"><span class="q">/l</span></code>
 sets the character set to that of whatever <b>L</b>ocale is in
effect at the time of the execution of the pattern match.</p>
<p><code class="inline"><span class="q">/u</span></code>
 sets the character set to <b>U</b>nicode.</p>
<p><code class="inline"><span class="q">/a</span></code>
 also sets the character set to Unicode, BUT adds several
restrictions for <b>A</b>SCII-safe matching.</p>
<p><code class="inline">/d</code> is the old, problematic, pre-5.14 <b>D</b>efault character set
behavior.  Its only use is to force that old behavior.</p>
<p>At any given time, exactly one of these modifiers is in effect.  Their
existence allows Perl to keep the originally compiled behavior of a
regular expression, regardless of what rules are in effect when it is
actually executed.  And if it is interpolated into a larger regex, the
original's rules continue to apply to it, and only it.</p>
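<p>The way a compiled pattern keeps its own character set rules when interpolated might be sketched like this.  The names <code class="inline">$ascii_word</code> and <code class="inline">$cafe</code> are invented for the example.</p>

```perl
use strict;
use warnings;

# Compiled once with /a: \w here means ASCII word characters only.
my $ascii_word = qr/\w+/a;

# "caf" followed by LATIN SMALL LETTER E WITH ACUTE (0xE9).
my $cafe = "caf\x{E9}";

# The outer pattern uses /u, where 0xE9 counts as a word character.
my $outer = $cafe =~ /^\w+\z/u ? 1 : 0;

# The interpolated piece still follows /a, so it stops before 0xE9.
my $inner = $cafe =~ /^$ascii_word\z/u ? 1 : 0;

# Only the part outside the interpolated piece follows /u.
my $both = $cafe =~ /^$ascii_word\w\z/u ? 1 : 0;

print "outer=$outer inner=$inner both=$both\n";
```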
<p>The <code class="inline"><span class="q">/l</span></code>
 and <code class="inline"><span class="q">/u</span></code>
 modifiers are automatically selected for
regular expressions compiled within the scope of various pragmas,
and we recommend that in general, you use those pragmas instead of
specifying these modifiers explicitly.  For one thing, the modifiers
affect only pattern matching, and do not extend even to any replacement
done, whereas using the pragmas gives consistent results for all
appropriate operations within their scopes.  For example,</p>
<pre class="verbatim"><ol><li> <span class="q">s/foo/\Ubar/il</span></li></ol></pre><p>will match "foo" using the locale's rules for case-insensitive matching,
but the <code class="inline"><span class="q">/l</span></code>
 does not affect how the <code class="inline">\<span class="w">U</span></code>
 operates.  Most likely you
want both of them to use locale rules.  To do this, instead compile the
regular expression within the scope of <code class="inline"><a class="l_k" href="functions/use.html">use</a> <span class="w">locale</span></code>
.  This both
implicitly adds the <code class="inline"><span class="q">/l</span></code>
, and applies locale rules to the <code class="inline">\<span class="w">U</span></code>
.   The
lesson is to <code class="inline"><a class="l_k" href="functions/use.html">use</a> <span class="w">locale</span></code>
, and not <code class="inline"><span class="q">/l</span></code>
 explicitly.</p>
<p>Similarly, it would be better to use <code class="inline"><a class="l_k" href="functions/use.html">use</a> <span class="w">feature</span> <span class="q">&#39;unicode_strings&#39;</span></code>

instead of</p>
<pre class="verbatim"><ol><li> <span class="q">s/foo/\Lbar/iu</span></li></ol></pre><p>to get Unicode rules, as the <code class="inline">\<span class="w">L</span></code>
 in the former (but not necessarily
the latter) would also use Unicode rules.</p>
<p>More detail on each of the modifiers follows.  Most likely you don't
need to know this detail for <code class="inline"><span class="q">/l</span></code>
, <code class="inline"><span class="q">/u</span></code>
, and <code class="inline">/d</code>, and can skip ahead
to <a href="#%2fa-(and-%2faa)">/a</a>.</p>
<h4>/l</h4>
<p>means to use the current locale's rules (see <a href="perllocale.html">perllocale</a>) when pattern
matching.  For example, <code class="inline">\<span class="w">w</span></code>
 will match the "word" characters of that
locale, and <code class="inline"><span class="q">&quot;/i&quot;</span></code>
 case-insensitive matching will match according to
the locale's case folding rules.  The locale used will be the one in
effect at the time of execution of the pattern match.  This may not be
the same as the compilation-time locale, and can differ from one match
to another if there is an intervening call of the
<a href="perllocale.html#The-setlocale-function">setlocale() function</a>.</p>
<p>Prior to v5.20, Perl did not support multi-byte locales.  Starting then,
UTF-8 locales are supported.  No other multi-byte locales are ever
likely to be supported.  However, in all locales, one can have code
points above 255 and these will always be treated as Unicode no matter
what locale is in effect.</p>
<p>Under Unicode rules, there are a few case-insensitive matches that cross
the 255/256 boundary.  Except for UTF-8 locales in Perls v5.20 and
later, these are disallowed under <code class="inline"><span class="q">/l</span></code>
.  For example, 0xFF (on ASCII
platforms) does not caselessly match the character at 0x178, <code class="inline"><span class="w">LATIN</span>
<span class="w">CAPITAL</span> <span class="w">LETTER</span> <span class="w">Y</span> <span class="w">WITH</span> <span class="w">DIAERESIS</span></code>
, because 0xFF may not be <code class="inline"><span class="w">LATIN</span> <span class="w">SMALL</span>
<span class="w">LETTER</span> <span class="w">Y</span> <span class="w">WITH</span> <span class="w">DIAERESIS</span></code>
 in the current locale, and Perl has no way of
knowing if that character even exists in the locale, much less what code
point it is.</p>
<p>In a UTF-8 locale in v5.20 and later, the only visible difference
between locale and non-locale in regular expressions should be tainting
(see <a href="perlsec.html">perlsec</a>).</p>
<p>This modifier may be specified to be the default by <code class="inline"><a class="l_k" href="functions/use.html">use</a> <span class="w">locale</span></code>
, but
see <a href="#Which-character-set-modifier-is-in-effect%3f">Which character set modifier is in effect?</a>.
</p>
<h4>/u</h4>
<p>means to use Unicode rules when pattern matching.  On ASCII platforms,
this means that the code points between 128 and 255 take on their
Latin-1 (ISO-8859-1) meanings (which are the same as Unicode's).
(Otherwise Perl considers their meanings to be undefined.)  Thus,
under this modifier, the ASCII platform effectively becomes a Unicode
platform; and hence, for example, <code class="inline">\<span class="w">w</span></code>
 will match any of the more than
100_000 word characters in Unicode.</p>
<p>Unlike most locales, which are specific to a language and country pair,
Unicode classifies all the characters that are letters <i>somewhere</i> in
the world as
<code class="inline">\<span class="w">w</span></code>
.  For example, your locale might not think that <code class="inline"><span class="w">LATIN</span> <span class="w">SMALL</span>
<span class="w">LETTER</span> <span class="w">ETH</span></code>
 is a letter (unless you happen to speak Icelandic), but
Unicode does.  Similarly, all the characters that are decimal digits
somewhere in the world will match <code class="inline">\<span class="w">d</span></code>
; this is hundreds, not 10,
possible matches.  And some of those digits look like some of the 10
ASCII digits, but mean a different number, so a human could easily think
a number is a different quantity than it really is.  For example,
<code class="inline"><span class="w">BENGALI</span> <span class="w">DIGIT</span> <span class="w">FOUR</span></code>
 (U+09EA) looks very much like an
<code class="inline"><span class="w">ASCII</span> <span class="w">DIGIT</span> <span class="w">EIGHT</span></code>
 (U+0038).  And <code class="inline">\<span class="w">d</span>+</code>
 may match strings of digits
that are a mixture from different writing systems, creating a security
issue.  <a href="Unicode/UCD.html#num()">num() in Unicode::UCD</a> can be used to sort
this out.  Or the <code class="inline"><span class="q">/a</span></code>
 modifier can be used to force <code class="inline">\<span class="w">d</span></code>
 to match
just the ASCII 0 through 9.</p>
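<p>The mixed-script risk described above can be sketched as follows.  The two
helper names are hypothetical and exist only for this illustration; the
sketch assumes Perl 5.14 or later, where <code class="inline">/u</code> and <code class="inline">/a</code> are available.</p>

```perl
use strict;
use warnings;

# Hypothetical helpers, named only for this illustration.
sub is_unicode_digits { my ($s) = @_; return $s =~ /\A\d+\z/u ? 1 : 0 }
sub is_ascii_digits   { my ($s) = @_; return $s =~ /\A\d+\z/a ? 1 : 0 }

# ASCII "4" followed by BENGALI DIGIT FOUR (U+09EA):
# two writing systems inside one apparent number.
my $mixed = "4\x{09EA}";

printf "/u accepts mixed digits: %d\n", is_unicode_digits($mixed);  # 1
printf "/a accepts mixed digits: %d\n", is_ascii_digits($mixed);    # 0
printf "/a accepts ASCII digits: %d\n", is_ascii_digits("48");      # 1
```

<p>Validating input with the <code class="inline">/a</code> form is one way to keep such strings out
before they are converted to numbers.</p>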
<p>Also, under this modifier, case-insensitive matching works on the full
set of Unicode
characters.  The <code class="inline"><span class="w">KELVIN</span> <span class="w">SIGN</span></code>
, for example, matches the letters "k" and
"K"; and <code class="inline"><span class="w">LATIN</span> <span class="w">SMALL</span> <span class="w">LIGATURE</span> <span class="w">FF</span></code>
 matches the sequence "ff", which,
if you're not prepared, might make it look like a hexadecimal constant,
presenting another potential security issue.  See
<a href="http://unicode.org/reports/tr36">http://unicode.org/reports/tr36</a> for a detailed discussion of Unicode
security issues.</p>
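<p>A minimal sketch of the two folds mentioned above, using code point
escapes so that the source file itself stays ASCII.  The helper names are
hypothetical.</p>

```perl
use strict;
use warnings;

my $kelvin = "\x{212A}";   # KELVIN SIGN
my $ff     = "\x{FB00}";   # LATIN SMALL LIGATURE FF

# Hypothetical helpers: case-insensitive matching under Unicode rules.
sub folds_to_k  { my ($s) = @_; return $s =~ /\Ak\z/iu  ? 1 : 0 }
sub folds_to_ff { my ($s) = @_; return $s =~ /\Aff\z/iu ? 1 : 0 }

printf "KELVIN SIGN matches /k/iu:  %d\n", folds_to_k($kelvin);   # 1
printf "ligature ff matches /ff/iu: %d\n", folds_to_ff($ff);      # 1
```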
<p>This modifier may be specified to be the default by <code class="inline"><a class="l_k" href="functions/use.html">use</a> <span class="w">feature</span>
<span class="q">&#39;unicode_strings&#39;</span></code>
, <code class="inline"><a class="l_k" href="functions/use.html">use</a> <span class="w">locale</span> <span class="q">&#39;:not_characters&#39;</span></code>
, or
<code class="inline"><a href="functions/use.html">use 5.012</a></code> (or higher),
but see <a href="#Which-character-set-modifier-is-in-effect%3f">Which character set modifier is in effect?</a>.
</p>
<h4>/d</h4>
<p>This modifier means to use the "Default" native rules of the platform
except when there is cause to use Unicode rules instead, as follows:</p>
<dl>
<dt>1</dt><dd>
<p>the target string is encoded in UTF-8; or</p>
</dd>
<dt>2</dt><dd>
<p>the pattern is encoded in UTF-8; or</p>
</dd>
<dt>3</dt><dd>
<p>the pattern explicitly mentions a code point that is above 255 (say by
<code class="inline">\<span class="i">x</span><span class="s">{</span><span class="n">100</span><span class="s">}</span></code>
); or</p>
</dd>
<dt>4</dt><dd>
<p>the pattern uses a Unicode name (<code class="inline">\<span class="i">N</span><span class="s">{</span>...<span class="s">}</span></code>
);  or</p>
</dd>
<dt>5</dt><dd>
<p>the pattern uses a Unicode property (<code class="inline">\<span class="i">p</span><span class="s">{</span>...<span class="s">}</span></code>
 or <code class="inline">\<span class="i">P</span><span class="s">{</span>...<span class="s">}</span></code>
); or</p>
</dd>
<dt>6</dt><dd>
<p>the pattern uses a Unicode break (<code class="inline">\<span class="i">b</span><span class="s">{</span>...<span class="s">}</span></code>
 or <code class="inline">\<span class="i">B</span><span class="s">{</span>...<span class="s">}</span></code>
); or</p>
</dd>
<dt>7</dt><dd>
<p>the pattern uses <a href="#(%3f%5b-%5d)">(?[ ])</a></p>
</dd>
</dl>
<p>Another mnemonic for this modifier is "Depends", as the rules actually
used depend on various things, and as a result you can get unexpected
results.  See <a href="perlunicode.html#The-%22Unicode-Bug%22">The Unicode Bug in perlunicode</a>.  The Unicode Bug has
become rather infamous, leading to yet another (printable) name for this
modifier, "Dodgy".</p>
<p>Unless the pattern or the string is encoded in UTF-8, only ASCII characters
can match positively.</p>
<p>Here are some examples of how that works on an ASCII platform:</p>
<pre class="verbatim"><ol><li> <span class="i">$str</span> =  <span class="q">&quot;\xDF&quot;</span><span class="sc">;</span>      <span class="c"># $str is not in UTF-8 format.</span></li><li> <span class="i">$str</span> =~ <span class="q">/^\w/</span><span class="sc">;</span>       <span class="c"># No match, as $str isn&#39;t in UTF-8 format.</span></li><li> <span class="i">$str</span> .= <span class="q">&quot;\x{0e0b}&quot;</span><span class="sc">;</span>  <span class="c"># Now $str is in UTF-8 format.</span></li><li> <span class="i">$str</span> =~ <span class="q">/^\w/</span><span class="sc">;</span>       <span class="c"># Match! $str is now in UTF-8 format.</span></li><li> <a class="l_k" href="functions/chop.html">chop</a> <span class="i">$str</span><span class="sc">;</span></li><li> <span class="i">$str</span> =~ <span class="q">/^\w/</span><span class="sc">;</span>       <span class="c"># Still a match! $str remains in UTF-8 format.</span></li></ol></pre><p>This modifier is automatically selected by default when none of the
others are, so yet another name for it is "Default".</p>
<p>Because of the unexpected behaviors associated with this modifier, you
probably should only explicitly use it to maintain weird backward
compatibilities.</p>
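<p>To make the dependence on internal encoding concrete, here is a small
sketch that compares <code class="inline">/d</code> with <code class="inline">/u</code> on the same character stored two
ways.  It assumes an ASCII platform, and the helper names are
hypothetical.</p>

```perl
use strict;
use warnings;

# Hypothetical helpers: the same test under /d and under /u.
sub word_d { my ($s) = @_; return $s =~ /\A\w/d ? 1 : 0 }
sub word_u { my ($s) = @_; return $s =~ /\A\w/u ? 1 : 0 }

my $native   = "\xDF";      # LATIN SMALL LETTER SHARP S, not stored as UTF-8
my $upgraded = $native;
utf8::upgrade($upgraded);   # same character, now stored internally as UTF-8

printf "strings are equal:  %d\n", ($native eq $upgraded) ? 1 : 0;   # 1
printf "/d on native:       %d\n", word_d($native);                  # 0
printf "/d on upgraded:     %d\n", word_d($upgraded);                # 1
printf "/u on native:       %d\n", word_u($native);                  # 1
printf "/u on upgraded:     %d\n", word_u($upgraded);                # 1
```

<p>The two strings compare equal, yet <code class="inline">/d</code> treats them differently;
<code class="inline">/u</code> gives the same answer for both.</p>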
<h4>/a (and /aa)</h4>
<p>This modifier stands for ASCII-restrict (or ASCII-safe).  This modifier
may be doubled-up to increase its effect.</p>
<p>When it appears singly, it causes the sequences <code class="inline">\<span class="w">d</span></code>
, <code class="inline">\s</code>, <code class="inline">\<span class="w">w</span></code>
, and
the Posix character classes to match only in the ASCII range.  They thus
revert to their pre-5.6, pre-Unicode meanings.  Under <code class="inline"><span class="q">/a</span></code>
,  <code class="inline">\<span class="w">d</span></code>

always means precisely the digits <code class="inline"><span class="q">&quot;0&quot;</span></code>
 to <code class="inline"><span class="q">&quot;9&quot;</span></code>
; <code class="inline">\s</code> means the five
characters <code class="inline"><span class="s">[</span> \<span class="w">f</span>\<span class="w">n</span>\<span class="w">r</span>\<span class="w">t</span><span class="s">]</span></code>
, and starting in Perl v5.18, the vertical tab;
<code class="inline">\<span class="w">w</span></code>
 means the 63 characters
<code class="inline"><span class="s">[</span><span class="w">A</span>-<span class="w">Za</span>-<span class="w">z0</span>-<span class="n">9_</span><span class="s">]</span></code>
; and likewise, all the Posix classes such as
<code class="inline">[[:print:]]</code> match only the appropriate ASCII-range characters.</p>
<p>This modifier is useful for people who only incidentally use Unicode,
and who do not wish to be burdened with its complexities and security
concerns.</p>
<p>With <code class="inline"><span class="q">/a</span></code>
, one can write <code class="inline">\<span class="w">d</span></code>
 with confidence that it will only match
ASCII characters, and should the need arise to match beyond ASCII, you
can instead use <code class="inline">\<span class="i">p</span><span class="s">{</span><span class="w">Digit</span><span class="s">}</span></code>
 (or <code class="inline">\<span class="i">p</span><span class="s">{</span><span class="w">Word</span><span class="s">}</span></code>
 for <code class="inline">\<span class="w">w</span></code>
).  There are
similar <code class="inline">\<span class="i">p</span><span class="s">{</span>...<span class="s">}</span></code>
 constructs that can match beyond ASCII both white
space (see <a href="perlrecharclass.html#Whitespace">Whitespace in perlrecharclass</a>), and Posix classes (see
<a href="perlrecharclass.html#POSIX-Character-Classes">POSIX Character Classes in perlrecharclass</a>).  Thus, this modifier
doesn't mean you can't use Unicode, it means that to get Unicode
matching you must explicitly use a construct (<code class="inline">\<span class="w">p</span><span class="s">{</span><span class="s">}</span></code>
, <code class="inline">\<span class="w">P</span><span class="s">{</span><span class="s">}</span></code>
) that
signals Unicode.</p>
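<p>As a rough illustration of that split, the sketch below keeps
<code class="inline">\w</code> ASCII-only under <code class="inline">/a</code> while <code class="inline">\p{Word}</code> in the same kind of
pattern still matches beyond ASCII.  The helper names are hypothetical.</p>

```perl
use strict;
use warnings;

# Hypothetical helpers: both patterns are compiled under /a.
sub ascii_word   { my ($s) = @_; return $s =~ /\A\w+\z/a        ? 1 : 0 }
sub unicode_word { my ($s) = @_; return $s =~ /\A\p{Word}+\z/a  ? 1 : 0 }

my $cafe_accent = "caf\xE9";   # ends in LATIN SMALL LETTER E WITH ACUTE

printf "\\w under /a:        %d\n", ascii_word($cafe_accent);     # 0
printf "\\p{Word} under /a:  %d\n", unicode_word($cafe_accent);   # 1
printf "\\w on plain ASCII:  %d\n", ascii_word("cafe");           # 1
```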
<p>As you would expect, this modifier causes, for example, <code class="inline">\<span class="w">D</span></code>
 to mean
the same thing as <code class="inline"><span class="s">[</span>^<span class="n">0</span>-<span class="n">9</span><span class="s">]</span></code>
; in fact, all non-ASCII characters match
<code class="inline">\<span class="w">D</span></code>
, <code class="inline">\<span class="w">S</span></code>
, and <code class="inline">\<span class="w">W</span></code>
.  <code class="inline">\<span class="w">b</span></code>
 still means to match at the boundary
between <code class="inline">\<span class="w">w</span></code>
 and <code class="inline">\<span class="w">W</span></code>
, using the <code class="inline"><span class="q">/a</span></code>
 definitions of them (similarly
for <code class="inline">\<span class="w">B</span></code>
).</p>
<p>Otherwise, <code class="inline"><span class="q">/a</span></code>
 behaves like the <code class="inline"><span class="q">/u</span></code>
 modifier, in that
case-insensitive matching uses Unicode rules; for example, "k" will
match the Unicode <code class="inline">\<span class="i">N</span><span class="s">{</span><span class="w">KELVIN</span> <span class="w">SIGN</span><span class="s">}</span></code>
 under <code class="inline">/i</code> matching, and code
points in the Latin1 range above ASCII use Unicode rules when it
comes to case-insensitive matching.</p>
<p>To forbid ASCII/non-ASCII matches (like "k" with <code class="inline">\<span class="i">N</span><span class="s">{</span><span class="w">KELVIN</span> <span class="w">SIGN</span><span class="s">}</span></code>
),
specify the <code class="inline"><span class="q">&quot;a&quot;</span></code>
 twice, for example <code class="inline"><span class="q">/aai</span></code>
 or <code class="inline"><span class="q">/aia</span></code>
.  (The first
occurrence of <code class="inline"><span class="q">&quot;a&quot;</span></code>
 restricts the <code class="inline">\<span class="w">d</span></code>
, <i>etc</i>., and the second occurrence
adds the <code class="inline">/i</code> restrictions.)  But, note that code points outside the
ASCII range will use Unicode rules for <code class="inline">/i</code> matching, so the modifier
doesn't really restrict things to just ASCII; it just forbids the
intermixing of ASCII and non-ASCII.</p>
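<p>The difference between one and two <code class="inline">"a"</code>s under <code class="inline">/i</code> can be sketched like
this.  The helper names are hypothetical.</p>

```perl
use strict;
use warnings;

# Hypothetical helpers: the same pattern with /ai and with /aai.
sub k_under_ai        { my ($s) = @_; return $s =~ /\Ak\z/ai      ? 1 : 0 }
sub k_under_aai       { my ($s) = @_; return $s =~ /\Ak\z/aai     ? 1 : 0 }
sub e_acute_under_aai { my ($s) = @_; return $s =~ /\A\x{E9}\z/aai ? 1 : 0 }

my $kelvin = "\x{212A}";   # KELVIN SIGN, a non-ASCII character

printf "/ai:  k vs KELVIN SIGN:        %d\n", k_under_ai($kelvin);          # 1
printf "/aai: k vs KELVIN SIGN:        %d\n", k_under_aai($kelvin);         # 0
printf "/aai: k vs K:                  %d\n", k_under_aai("K");             # 1
printf "/aai: e-acute vs E-acute:      %d\n", e_acute_under_aai("\x{C9}");  # 1
```

<p>The last line shows that two non-ASCII characters still fold together
under <code class="inline">/aai</code>; only the ASCII/non-ASCII pairing is forbidden.</p>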
<p>To summarize, this modifier provides protection for applications that
don't wish to be exposed to all of Unicode.  Specifying it twice
gives added protection.</p>
<p>This modifier may be specified to be the default by <code class="inline"><a class="l_k" href="functions/use.html">use</a> <span class="w">re</span> <span class="q">&#39;/a&#39;</span></code>

or <code class="inline"><a class="l_k" href="functions/use.html">use</a> <span class="w">re</span> <span class="q">&#39;/aa&#39;</span></code>
.  If you do so, you may actually have occasion to use
the <code class="inline"><span class="q">/u</span></code>
 modifier explicitly if there are a few regular expressions
where you do want full Unicode rules (but even here, it is best if
everything is under feature <code class="inline"><span class="q">&quot;unicode_strings&quot;</span></code>
, along with the
<code class="inline"><a class="l_k" href="functions/use.html">use</a> <span class="w">re</span> <span class="q">&#39;/aa&#39;</span></code>
).  Also see <a href="#Which-character-set-modifier-is-in-effect%3f">Which character set modifier is in effect?</a>.
</p>
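<p>A minimal sketch of setting the default with the pragma and overriding it
for a single pattern.  The block and helper names are hypothetical; the
pragma is lexically scoped, so it only affects regexes compiled inside the
braces.</p>

```perl
use strict;
use warnings;

{
    use re '/aa';   # default for every regex compiled in this block

    # Hypothetical helpers: one relies on the pragma, one overrides it.
    sub digit_with_pragma_default { my ($s) = @_; return $s =~ /\A\d\z/  ? 1 : 0 }
    sub digit_with_explicit_u     { my ($s) = @_; return $s =~ /\A\d\z/u ? 1 : 0 }
}

my $bengali_four = "\x{09EA}";   # BENGALI DIGIT FOUR

printf "pragma default (/aa): %d\n", digit_with_pragma_default($bengali_four);  # 0
printf "explicit /u:          %d\n", digit_with_explicit_u($bengali_four);      # 1
```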
<h4>Which character set modifier is in effect?</h4>
<p>Which of these modifiers is in effect at any given point in a regular
expression depends on a fairly complex set of interactions.  These have
been designed so that in general you don't have to worry about it, but
this section gives the gory details.  As
explained below in <a href="#Extended-Patterns">Extended Patterns</a> it is possible to explicitly
specify modifiers that apply only to portions of a regular expression.
The innermost always has priority over any outer ones, and one applying
to the whole expression has priority over any of the default settings that are
described in the remainder of this section.</p>
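<p>The "innermost wins" rule can be sketched with an inline modifier group.
In this hypothetical helper the whole pattern is under <code class="inline">/u</code>, but the first
<code class="inline">\d</code> sits inside <code class="inline">(?a:...)</code> and is therefore ASCII-only.</p>

```perl
use strict;
use warnings;

# Hypothetical helper: first digit must be ASCII, second may be any Unicode digit.
sub ascii_then_any_digit {
    my ($s) = @_;
    return $s =~ /\A(?a:\d)\d\z/u ? 1 : 0;
}

my $bengali_four = "\x{09EA}";   # BENGALI DIGIT FOUR

printf "ASCII then Bengali: %d\n", ascii_then_any_digit("1" . $bengali_four);  # 1
printf "Bengali then ASCII: %d\n", ascii_then_any_digit($bengali_four . "1");  # 0
```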
<p>The <code class="inline"><a href="re.html#'%2fflags'-mode">use re '/foo'</a></code> pragma can be used to set
default modifiers (including these) for regular expressions compiled
within its scope.  This pragma has precedence over the other pragmas
listed below that also change the defaults.</p>
<p>Otherwise, <code class="inline"><a href="perllocale.html">use locale</a></code> sets the default modifier to <code class="inline"><span class="q">/l</span></code>
;
and <code class="inline"><a href="feature.html">use feature 'unicode_strings'</a></code>, or
<code class="inline"><a href="functions/use.html">use 5.012</a></code> (or higher) set the default to
<code class="inline"><span class="q">/u</span></code>
 when not in the same scope as either <code class="inline"><a href="perllocale.html">use locale</a></code>
or <code class="inline"><a href="bytes.html">use bytes</a></code>.
(<code class="inline"><a href="perllocale.html#Unicode-and-UTF-8">use locale ':not_characters'</a></code> also
sets the default to <code class="inline"><span class="q">/u</span></code>
, overriding any plain <code class="inline"><a class="l_k" href="functions/use.html">use</a> <span class="w">locale</span></code>
.)
Unlike the mechanisms mentioned above, these
affect operations besides regular expression pattern matching, and so
give more consistent results with other operators, including using
<code class="inline">\<span class="w">U</span></code>
, <code class="inline">\<span class="w">l</span></code>
, <i>etc</i>. in substitution replacements.</p>
<p>If none of the above apply, for backwards compatibility reasons, the
<code class="inline">/d</code> modifier is the one in effect by default.  As this can lead to
unexpected results, it is best to specify which other rule set should be
used.</p>
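<p>The effect of the defaults can be sketched by compiling the same
unmodified pattern in two lexical scopes.  The helper names are
hypothetical, an ASCII platform is assumed, and the feature is switched
off explicitly in the first block only so that the sketch behaves the same
whatever the enclosing scope has enabled.</p>

```perl
use strict;
use warnings;

{
    no feature 'unicode_strings';   # nothing selects a rule set: /d applies
    sub word_by_default { my ($s) = @_; return $s =~ /\A\w\z/ ? 1 : 0 }
}

{
    use feature 'unicode_strings';  # default becomes /u in this block
    sub word_with_feature { my ($s) = @_; return $s =~ /\A\w\z/ ? 1 : 0 }
}

my $e_acute = "\xE9";   # LATIN SMALL LETTER E WITH ACUTE, not stored as UTF-8

printf "default (/d):         %d\n", word_by_default($e_acute);    # 0
printf "unicode_strings (/u): %d\n", word_with_feature($e_acute);  # 1
```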
<h4>Character set modifier behavior prior to Perl 5.14</h4>
<p>Prior to 5.14, there were no explicit modifiers, but <code class="inline"><span class="q">/l</span></code>
 was implied
for regexes compiled within the scope of <code class="inline"><a class="l_k" href="functions/use.html">use</a> <span class="w">locale</span></code>
, and <code class="inline">/d</code> was
implied otherwise.  However, interpolating a regex into a larger regex
would ignore the original compilation in favor of whatever was in effect
at the time of the second compilation.  There were a number of
inconsistencies (bugs) with the <code class="inline">/d</code> modifier, where Unicode rules
would be used when inappropriate, and vice versa.  <code class="inline">\<span class="w">p</span><span class="s">{</span><span class="s">}</span></code>
 did not imply
Unicode rules, and neither did all occurrences of <code class="inline">\<span class="w">N</span><span class="s">{</span><span class="s">}</span></code>
, until 5.12.</p>
<a name="Regular-Expressions"></a><h2>Regular Expressions</h2>
<a name="Quantifiers"></a><h3>Quantifiers</h3>
<p>Quantifiers are used when a particular portion of a pattern needs to
match a certain number (or numbers) of times.  If there is no
quantifier, the number of times to match is exactly one.  The following
standard quantifiers are recognized:
       </p>
<pre class="verbatim"><ol><li>    <span class="i">*           Match</span> <span class="n">0</span> or <span class="w">more</span> <a class="l_k" href="functions/times.html">times</a></li><li>    +           <span class="w">Match</span> <span class="n">1</span> or <span class="w">more</span> <a class="l_k" href="functions/times.html">times</a></li><li>    ?           <span class="w">Match</span> <span class="n">1</span> or <span class="n">0</span> <a class="l_k" href="functions/times.html">times</a></li><li>    <span class="s">{</span><span class="w">n</span><span class="s">}</span>         <span class="w">Match</span> <span class="w">exactly</span> <span class="w">n</span> <a class="l_k" href="functions/times.html">times</a></li><li>    <span class="s">{</span><span class="w">n</span><span class="cm">,</span><span class="s">}</span>        <span class="w">Match</span> <span class="w">at</span> <span class="w">least</span> <span class="w">n</span> <a class="l_k" href="functions/times.html">times</a></li><li>    <span class="s">{</span><span class="w">n</span><span class="cm">,</span><span class="q">m}       Match at least n but not more than m times</span></li></ol></pre><p>(If a non-escaped curly bracket occurs in a context other than one of
the quantifiers listed above, where it does not form part of a
backslashed sequence like <code class="inline">\<span class="i">x</span><span class="s">{</span>...<span class="s">}</span></code>
, it is either a fatal syntax error,
or treated as a regular character, generally with a deprecation warning
raised.  To escape it, you can precede it with a backslash (<code class="inline"><span class="q">&quot;\{&quot;</span></code>
) or
enclose it within square brackets  (<code class="inline"><span class="q">&quot;[{]&quot;</span></code>
).
This change will allow for future syntax extensions (like making the
lower bound of a quantifier optional), and better error checking of
quantifiers).</p>
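<p>The two ways of writing a literal left brace can be sketched as follows.
The helper names are hypothetical.</p>

```perl
use strict;
use warnings;

# Hypothetical helpers: a literal "{" written two ways.
sub brace_via_backslash { my ($s) = @_; return $s =~ /a\{b/  ? 1 : 0 }
sub brace_via_class     { my ($s) = @_; return $s =~ /a[{]b/ ? 1 : 0 }

printf "backslash form on 'a{b': %d\n", brace_via_backslash("a{b");  # 1
printf "class form on 'a{b':     %d\n", brace_via_class("a{b");      # 1
printf "backslash form on 'ab':  %d\n", brace_via_backslash("ab");   # 0
```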
<p>The <code class="inline"><span class="q">&quot;*&quot;</span></code>
 quantifier is equivalent to <code class="inline"><span class="s">{</span><span class="n">0</span><span class="cm">,</span><span class="s">}</span></code>
, the <code class="inline"><span class="q">&quot;+&quot;</span></code>

quantifier to <code class="inline"><span class="s">{</span><span class="n">1</span><span class="cm">,</span><span class="s">}</span></code>
, and the <code class="inline"><span class="q">&quot;?&quot;</span></code>
 quantifier to <code class="inline"><span class="s">{</span><span class="n">0</span><span class="cm">,</span><span class="n">1</span><span class="s">}</span></code>
.  <i>n</i> and <i>m</i> are limited
to non-negative integral values less than a preset limit defined when perl is built.
This is usually 32766 on the most common platforms.  The actual limit can
be seen in the error message generated by code such as this:</p>
<pre class="verbatim"><ol><li>    <span class="i">$_</span> **= <span class="i">$_</span> <span class="cm">,</span> <span class="q">/ {$_} /</span> for <span class="n">2</span> .. <span class="n">42</span><span class="sc">;</span></li></ol></pre><p>By default, a quantified subpattern is "greedy", that is, it will match as
many times as possible (given a particular starting location) while still
allowing the rest of the pattern to match.  If you want it to match the
minimum number of times possible, follow the quantifier with a <code class="inline"><span class="q">&quot;?&quot;</span></code>
.  Note
that the meanings don't change, just the "greediness":
  
      </p>
<pre class="verbatim"><ol><li>    <span class="i">*?</span>        <span class="w">Match</span> <span class="n">0</span> or <span class="w">more</span> <a class="l_k" href="functions/times.html">times</a><span class="cm">,</span> not <span class="w">greedily</span></li><li>    +<span class="q">?        Match 1 or more times, not greedily</span></li><li>    <span class="q">?</span>?        <span class="w">Match</span> <span class="n">0</span> or <span class="n">1</span> <a class="l_k" href="functions/time.html">time</a><span class="cm">,</span> not <span class="w">greedily</span></li><li>    <span class="s">{</span><span class="w">n</span><span class="s">}</span><span class="q">?      Match exactly n times, not greedily (redundant)</span></li><li>    <span class="q">{n,}?</span>     <span class="w">Match</span> <span class="w">at</span> <span class="w">least</span> <span class="w">n</span> <a class="l_k" href="functions/times.html">times</a><span class="cm">,</span> not <span class="w">greedily</span></li><li>    <span class="s">{</span><span class="w">n</span><span class="cm">,</span><span class="q">m}?    Match at least n but not more than m times, not greedily</span></li></ol></pre><p>Normally when a quantified subpattern does not allow the rest of the
overall pattern to match, Perl will backtrack. However, this behaviour is
sometimes undesirable. Thus Perl provides the "possessive" quantifier form
as well.</p>
<pre class="verbatim"><ol><li> <span class="i">*+</span>     <span class="w">Match</span> <span class="n">0</span> or <span class="w">more</span> <a class="l_k" href="functions/times.html">times</a> and <span class="w">give</span> <span class="w">nothing</span> <span class="w">back</span></li><li> ++     <span class="w">Match</span> <span class="n">1</span> or <span class="w">more</span> <a class="l_k" href="functions/times.html">times</a> and <span class="w">give</span> <span class="w">nothing</span> <span class="w">back</span></li><li> ?+     <span class="w">Match</span> <span class="n">0</span> or <span class="n">1</span> <a class="l_k" href="functions/time.html">time</a> and <span class="w">give</span> <span class="w">nothing</span> <span class="w">back</span></li><li> <span class="s">{</span><span class="w">n</span><span class="s">}</span>+   <span class="w">Match</span> <span class="w">exactly</span> <span class="w">n</span> <a class="l_k" href="functions/times.html">times</a> and <span class="w">give</span> <span class="w">nothing</span> <span class="w">back</span> <span class="s">(</span><span class="w">redundant</span><span class="s">)</span></li><li> <span class="s">{</span><span class="w">n</span><span class="cm">,</span><span class="s">}</span>+  <span class="w">Match</span> <span class="w">at</span> <span class="w">least</span> <span class="w">n</span> <a class="l_k" href="functions/times.html">times</a> and <span class="w">give</span> <span class="w">nothing</span> <span class="w">back</span></li><li> <span class="s">{</span><span class="w">n</span><span class="cm">,</span><span class="q">m}+ Match at least n but not more than m times and give nothing back</span></li></ol></pre><p>For instance,</p>
<pre class="verbatim"><ol><li>   <span class="q">'aaaa'</span> =~ <span class="q">/a++a/</span></li></ol></pre><p>will never match, as the <code class="inline"><span class="w">a</span>++</code>
 will gobble up all the <code class="inline"><span class="q">&quot;a&quot;</span></code>
's in the
string and won't leave any for the remaining part of the pattern. This
feature can be extremely useful to give perl hints about where it
shouldn't backtrack. For instance, the typical "match a double-quoted
string" problem can be most efficiently performed when written as:</p>
<pre class="verbatim"><ol><li>   <span class="q">/&quot;(?:[^&quot;\\]++|\\.)*+&quot;/</span></li></ol></pre><p>as we know that if the final quote does not match, backtracking will not
help. See the independent subexpression
<a href="#(%3f%3epattern)">(?&gt;pattern)</a> for more details;
possessive quantifiers are just syntactic sugar for that construct. For
instance the above example could also be written as follows:</p>
<pre class="verbatim"><ol><li>   <span class="q">/&quot;(?&gt;(?:(?&gt;[^&quot;\\]+)|\\.)*)&quot;/</span></li></ol></pre><p>Note that the possessive quantifier modifier cannot be combined
with the non-greedy modifier, because the combination would make no sense.
Consider the following equivalency table:</p>
<pre class="verbatim"><ol><li>    <span class="w">Illegal</span>         <span class="w">Legal</span></li><li>    ------------    ------</li><li>    <span class="w">X</span>?<span class="q">?+            X{0}</span></li><li>    <span class="q">X+?</span>+            <span class="i">X</span><span class="s">{</span><span class="n">1</span><span class="s">}</span></li><li>    <span class="w">X</span><span class="s">{</span><span class="w">min</span><span class="cm">,</span><span class="w">max</span><span class="s">}</span><span class="q">?+    X{min}</span></li></ol></pre><a name="Escape-sequences"></a><h3>Escape sequences</h3>
<p>Because patterns are processed as double-quoted strings, the following
also work:</p>
<pre class="verbatim"><ol><li> \<span class="w">t</span>          <span class="w">tab</span>                   <span class="s">(</span><span class="w">HT</span><span class="cm">,</span> <span class="w">TAB</span><span class="s">)</span></li><li> \<span class="w">n</span>          <span class="w">newline</span>               <span class="s">(</span><span class="w">LF</span><span class="cm">,</span> <span class="w">NL</span><span class="s">)</span></li><li> \<span class="w">r</span>          <a class="l_k" href="functions/return.html">return</a>                <span class="s">(</span><span class="w">CR</span><span class="s">)</span></li><li> \<span class="w">f</span>          <span class="w">form</span> <span class="w">feed</span>             <span class="s">(</span><span class="w">FF</span><span class="s">)</span></li><li> \<span class="w">a</span>          <a class="l_k" href="functions/alarm.html">alarm</a> <span class="s">(</span><span class="w">bell</span><span class="s">)</span>          <span class="s">(</span><span class="w">BEL</span><span class="s">)</span></li><li> \<span class="w">e</span>          <span class="w">escape</span> <span class="s">(</span><span class="w">think</span> <span class="w">troff</span><span class="s">)</span>  <span class="s">(</span><span class="w">ESC</span><span class="s">)</span></li><li> \<span class="w">cK</span>         <span class="w">control</span> <span class="w">char</span>          <span class="s">(</span><span class="w">example</span><span class="co">:</span> <span class="w">VT</span><span class="s">)</span></li><li> \<span class="w">x</span><span class="s">{</span><span class="s">}</span><span class="cm">,</span> \<span class="w">x00</span>  <span class="w">character</span> <span class="w">whose</span> <span class="w">ordinal</span> <span class="w">is</span> <span class="w">the</span> <a class="l_k" href="functions/given.html">given</a> <span class="w">hexadecimal</span> <span class="w">number</span></li><li> \<span 
class="i">N</span><span class="s">{</span><span class="w">name</span><span class="s">}</span>    <span class="w">named</span> <span class="w">Unicode</span> <span class="w">character</span> <a class="l_k" href="functions/or.html">or</a> <span class="w">character</span> <span class="w">sequence</span></li><li> \<span class="i">N</span><span class="s">{</span><span class="w">U</span>+<span class="n">263</span><span class="w">D</span><span class="s">}</span>  <span class="w">Unicode</span> <span class="w">character</span>     <span class="s">(</span><span class="w">example</span><span class="co">:</span> <span class="w">FIRST</span> <span class="w">QUARTER</span> <span class="w">MOON</span><span class="s">)</span></li><li> \<span class="w">o</span><span class="s">{</span><span class="s">}</span><span class="cm">,</span> \<span class="n">000</span>  <span class="w">character</span> <span class="w">whose</span> <span class="w">ordinal</span> <span class="w">is</span> <span class="w">the</span> <a class="l_k" href="functions/given.html">given</a> <span class="w">octal</span> <span class="w">number</span></li><li> \<span class="w">l</span>          <span class="w">lowercase</span> <a class="l_k" href="functions/next.html">next</a> <span class="j">char</span> <span class="s">(</span><span class="w">think</span> <span class="w">vi</span><span class="s">)</span></li><li> \<span class="w">u</span>          <span class="w">uppercase</span> <a class="l_k" href="functions/next.html">next</a> <span class="j">char</span> <span class="s">(</span><span class="w">think</span> <span class="w">vi</span><span class="s">)</span></li><li> \<span class="w">L</span>          <span class="w">lowercase</span> <a class="l_k" href="functions/until.html">until</a> \<span class="w">E</span> <span class="s">(</span><span class="w">think</span> <span class="w">vi</span><span class="s">)</span></li><li> \<span class="w">U</span>          <span class="w">uppercase</span> <a class="l_k" 
href="functions/until.html">until</a> \<span class="w">E</span> <span class="s">(</span><span class="w">think</span> <span class="w">vi</span><span class="s">)</span></li><li> \<span class="w">Q</span>          <span class="w">quote</span> <span class="s">(</span><span class="w">disable</span><span class="s">)</span> <span class="w">pattern</span> <span class="w">metacharacters</span> <a class="l_k" href="functions/until.html">until</a> \<span class="w">E</span></li><li> \<span class="w">E</span>          <span class="w">end</span> <span class="w">either</span> case <span class="w">modification</span> <a class="l_k" href="functions/or.html">or</a> <span class="w">quoted</span> <span class="w">section</span><span class="cm">,</span> <span class="w">think</span> <span class="w">vi</span></li></ol></pre><p>Details are in <a href="perlop.html#Quote-and-Quote-like-Operators">Quote and Quote-like Operators in perlop</a>.</p>
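<p>As one illustration of these escapes, the sketch below uses
<code class="inline">\Q</code> and <code class="inline">\E</code> to match an interpolated string literally instead of as
a pattern.  The helper names are hypothetical.</p>

```perl
use strict;
use warnings;

# Hypothetical helpers: interpolate $piece with and without \Q...\E.
sub matches_literally {
    my ($text, $piece) = @_;
    return $text =~ /\A\Q$piece\E\z/ ? 1 : 0;   # metacharacters in $piece are quoted
}

sub matches_as_pattern {
    my ($text, $piece) = @_;
    return $text =~ /\A$piece\z/ ? 1 : 0;       # "." in $piece matches any character
}

printf "literal 'a.b' vs 'a.b': %d\n", matches_literally("a.b", "a.b");   # 1
printf "literal 'a.b' vs 'axb': %d\n", matches_literally("axb", "a.b");   # 0
printf "pattern 'a.b' vs 'axb': %d\n", matches_as_pattern("axb", "a.b");  # 1
```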
<a name="Character-Classes-and-other-Special-Escapes"></a><h3>Character Classes and other Special Escapes</h3>
<p>In addition, Perl defines the following:
   </p>
<pre class="verbatim"><ol><li> <span class="w">Sequence</span>   <span class="w">Note</span>    <span class="w">Description</span></li><li>  <span class="s">[</span>...<span class="s">]</span>     <span class="s">[</span><span class="n">1</span><span class="s">]</span>  <span class="w">Match</span> <span class="w">a</span> <span class="w">character</span> <span class="w">according</span> <span class="w">to</span> <span class="w">the</span> <span class="w">rules</span> <span class="w">of</span> <span class="w">the</span></li><li>                   <span class="w">bracketed</span> <span class="w">character</span> <span class="w">class</span> <a class="l_k" href="functions/defined.html">defined</a> <span class="w">by</span> <span class="w">the</span> <span class="q">&quot;...&quot;</span>.</li><li>                   <span class="w">Example</span><span class="co">:</span> <span class="s">[</span><span class="w">a</span>-z<span class="s">]</span> <span class="w">matches</span> <span class="q">&quot;a&quot;</span> <a class="l_k" href="functions/or.html">or</a> <span class="q">&quot;b&quot;</span> <a class="l_k" href="functions/or.html">or</a> <span class="q">&quot;c&quot;</span> ... 
<a class="l_k" href="functions/or.html">or</a> <span class="q">&quot;z&quot;</span></li><li>  <span class="s">[</span><span class="s">[</span><span class="co">:</span>...<span class="co">:</span><span class="s">]</span><span class="s">]</span> <span class="s">[</span><span class="n">2</span><span class="s">]</span>  <span class="w">Match</span> <span class="w">a</span> <span class="w">character</span> <span class="w">according</span> <span class="w">to</span> <span class="w">the</span> <span class="w">rules</span> <span class="w">of</span> <span class="w">the</span> <span class="w">POSIX</span></li><li>                   <span class="w">character</span> <span class="w">class</span> <span class="q">&quot;...&quot;</span> <span class="w">within</span> <span class="w">the</span> <span class="w">outer</span> <span class="w">bracketed</span></li><li>                   <span class="w">character</span> <span class="w">class</span>.  <span class="w">Example</span><span class="co">:</span> <span class="s">[</span><span class="s">[</span><span class="co">:</span><span class="w">upper</span><span class="co">:</span><span class="s">]</span><span class="s">]</span> <span class="w">matches</span> <span class="w">any</span></li><li>                   <span class="w">uppercase</span> <span class="w">character</span>.</li><li>  <span class="s">(</span><span class="q">?[...])  [8]  Extended bracketed character class</span></li><li>  <span class="q">  \w        [3]  Match a &quot;word&quot; character (alphanumeric plus &quot;_&quot;, plus</span></li><li>                   <span class="q">                   other connector punctuation chars plus Unicode</span></li><li>                   <span class="q">                   marks)</span></li><li>  <span class="q">  \W        [3]  Match a non-&quot;word&quot; character</span></li><li>  <span class="q">  \s        [3]  Match a whitespace character</span></li><li>  <span class="q">  \S        [3]  Match a non-whitespace 
character</span></li><li>  <span class="q">  \d        [3]  Match a decimal digit character</span></li><li>  <span class="q">  \D        [3]  Match a non-digit character</span></li><li>  <span class="q">  \pP       [3]  Match P, named property.  Use \p{Prop} for longer names</span></li><li>  <span class="q">  \PP       [3]  Match non-P</span></li><li>  <span class="q">  \X        [4]  Match Unicode &quot;eXtended grapheme cluster&quot;</span></li><li>  <span class="q">  \1        [5]  Backreference to a specific capture group or buffer.</span></li><li>                   <span class="q">                   &#39;1&#39; may actually be any positive integer.</span></li><li>  <span class="q">  \g1       [5]  Backreference to a specific or previous group,</span></li><li>  <span class="q">  \g{-1}    [5]  The number may be negative indicating a relative</span></li><li>                   <span class="q">                   previous group and may optionally be wrapped in</span></li><li>                   <span class="q">                   curly brackets for safer parsing.</span></li><li>  <span class="q">  \g{name}  [5]  Named backreference</span></li><li>  <span class="q">  \k&lt;name&gt;  [5]  Named backreference</span></li><li>  <span class="q">  \K        [6]  Keep the stuff left of the \K, don&#39;t include it in $&amp;</span></li><li>  <span class="q">  \N        [7]  Any character but \n.  Not affected by /s modifier</span></li><li>  <span class="q">  \v        [3]  Vertical whitespace</span></li><li>  <span class="q">  \V        [3]  Not vertical whitespace</span></li><li>  <span class="q">  \h        [3]  Horizontal whitespace</span></li><li>  <span class="q">  \H        [3]  Not horizontal whitespace</span></li><li>  <span class="q">  \R        [4]  Linebreak</span></li></ol></pre><ul>
<li><a name="%5b1%5d"></a><b>[1]</b>
<p>See <a href="perlrecharclass.html#Bracketed-Character-Classes">Bracketed Character Classes in perlrecharclass</a> for details.</p>
</li>
<li><a name="%5b2%5d"></a><b>[2]</b>
<p>See <a href="perlrecharclass.html#POSIX-Character-Classes">POSIX Character Classes in perlrecharclass</a> for details.</p>
</li>
<li><a name="%5b3%5d"></a><b>[3]</b>
<p>See <a href="perlrecharclass.html#Backslash-sequences">Backslash sequences in perlrecharclass</a> for details.</p>
</li>
<li><a name="%5b4%5d"></a><b>[4]</b>
<p>See <a href="perlrebackslash.html#Misc">Misc in perlrebackslash</a> for details.</p>
</li>
<li><a name="%5b5%5d"></a><b>[5]</b>
<p>See <a href="#Capture-groups">Capture groups</a> below for details.</p>
</li>
<li><a name="%5b6%5d"></a><b>[6]</b>
<p>See <a href="#Extended-Patterns">Extended Patterns</a> below for details.</p>
</li>
<li><a name="%5b7%5d"></a><b>[7]</b>
<p>Note that <code class="inline">\<span class="w">N</span></code>
 has two meanings.  When of the form <code class="inline">\<span class="i">N</span><span class="s">{</span><span class="w">NAME</span><span class="s">}</span></code>
, it matches the
character or character sequence whose name is <code class="inline"><span class="w">NAME</span></code>
; and similarly
when of the form <code class="inline">\N{U+<i>hex</i>}</code>, it matches the character whose Unicode
code point is <i>hex</i>.  Otherwise it matches any character but <code class="inline">\<span class="w">n</span></code>
.</p>
</li>
<li><a name="%5b8%5d"></a><b>[8]</b>
<p>See <a href="perlrecharclass.html#Extended-Bracketed-Character-Classes">Extended Bracketed Character Classes in perlrecharclass</a> for details.</p>
</li>
</ul>
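<p>As a brief illustration of a few sequences from the table above, the following sketch uses <code class="inline">\K</code>, <code class="inline">\h</code> and <code class="inline">\R</code>.  The strings and variable names are invented for the example.</p>
<pre class="verbatim"><ol><li>    use strict;</li><li>    use warnings;</li><li></li><li>    # \K keeps &quot;price: &quot; out of the matched text, so only the digits are replaced.</li><li>    my $label = &quot;price: 42 USD&quot;;</li><li>    $label =~ s/price:\h*\K\d+/100/;</li><li>    print &quot;$label\n&quot;;                       # price: 100 USD</li><li></li><li>    # \R treats &quot;\r\n&quot; as a single linebreak and also matches a lone &quot;\n&quot;.</li><li>    my @lines = split /\R/, &quot;a\r\nb\nc&quot;;</li><li>    print scalar(@lines), &quot; lines\n&quot;;       # 3 lines</li></ol></pre>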
<a name="Assertions"></a><h3>Assertions</h3>
<p>Besides <a href="#Metacharacters">&quot;^&quot; and &quot;$&quot;</a>, Perl defines the following
zero-width assertions:
     </p>
<pre class="verbatim"><ol><li> \<span class="w">b</span><span class="s">{</span><span class="s">}</span>   <span class="w">Match</span> <span class="w">at</span> <span class="w">Unicode</span> <span class="w">boundary</span> <span class="w">of</span> <span class="w">specified</span> <span class="w">type</span></li><li> \<span class="w">B</span><span class="s">{</span><span class="s">}</span>   <span class="w">Match</span> <span class="w">where</span> <span class="w">corresponding</span> \<span class="w">b</span><span class="s">{</span><span class="s">}</span> <span class="w">doesn&#39;t</span> <span class="w">match</span></li><li> \<span class="w">b</span>     <span class="w">Match</span> <span class="w">a</span> \<span class="w">w</span>\<span class="w">W</span> <a class="l_k" href="functions/or.html">or</a> \<span class="w">W</span>\<span class="w">w</span> <span class="w">boundary</span></li><li> \<span class="w">B</span>     <span class="w">Match</span> <span class="w">except</span> <span class="w">at</span> <span class="w">a</span> \<span class="w">w</span>\<span class="w">W</span> <a class="l_k" href="functions/or.html">or</a> \<span class="w">W</span>\<span class="w">w</span> <span class="w">boundary</span></li><li> \<span class="w">A</span>     <span class="w">Match</span> <span class="w">only</span> <span class="w">at</span> <span class="w">beginning</span> <span class="w">of</span> <span class="w">string</span></li><li> \<span class="w">Z</span>     <span class="w">Match</span> <span class="w">only</span> <span class="w">at</span> <span class="w">end</span> <span class="w">of</span> <span class="w">string</span><span class="cm">,</span> <a class="l_k" href="functions/or.html">or</a> <span class="w">before</span> <span class="w">newline</span> <span class="w">at</span> <span class="w">the</span> <span class="w">end</span></li><li> \<span class="w">z</span>     <span class="w">Match</span> <span class="w">only</span> <span class="w">at</span> <span 
class="w">end</span> <span class="w">of</span> <span class="w">string</span></li><li> \<span class="w">G</span>     <span class="w">Match</span> <span class="w">only</span> <span class="w">at</span> <a class="l_k" href="functions/pos.html">pos</a><span class="s">(</span><span class="s">)</span> <span class="s">(</span><span class="w">e</span>.<span class="w">g</span>. <span class="w">at</span> <span class="w">the</span> <span class="w">end</span>-<span class="w">of</span>-<span class="w">match</span> <span class="w">position</span></li><li>        <span class="w">of</span> <span class="w">prior</span> <span class="q">m//g</span><span class="s">)</span></li></ol></pre><p>A Unicode boundary (<code class="inline">\<span class="w">b</span><span class="s">{</span><span class="s">}</span></code>
), available starting in v5.22, is a spot
between two characters, or before the first character in the string, or
after the final character in the string where certain criteria defined
by Unicode are met.  See <a href="perlrebackslash.html#%5cb%7b%7d%2c-%5cb%2c-%5cB%7b%7d%2c-%5cB">\b{}, \b, \B{}, \B in perlrebackslash</a> for
details.</p>
<p>A word boundary (<code class="inline">\<span class="w">b</span></code>
) is a spot between two characters
that has a <code class="inline">\<span class="w">w</span></code>
 on one side of it and a <code class="inline">\<span class="w">W</span></code>
 on the other side
of it (in either order), counting the imaginary characters off the
beginning and end of the string as matching a <code class="inline">\<span class="w">W</span></code>
.  (Within
character classes <code class="inline">\<span class="w">b</span></code>
 represents backspace rather than a word
boundary, just as it normally does in any double-quoted string.)
The <code class="inline">\<span class="w">A</span></code>
 and <code class="inline">\<span class="w">Z</span></code>
 are just like <code class="inline"><span class="q">&quot;^&quot;</span></code>
 and <code class="inline"><span class="q">&quot;$&quot;</span></code>
, except that they
won't match multiple times when the <code class="inline">/m</code> modifier is used, while
<code class="inline"><span class="q">&quot;^&quot;</span></code>
 and <code class="inline"><span class="q">&quot;$&quot;</span></code>
 will match at every internal line boundary.  To match
the actual end of the string and not ignore an optional trailing
newline, use <code class="inline">\<span class="w">z</span></code>
.
    </p>
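<p>The differences among these anchors can be seen in a minimal sketch such as the following.  The sample string and variable names are assumptions made for illustration only.</p>
<pre class="verbatim"><ol><li>    use strict;</li><li>    use warnings;</li><li></li><li>    my $text = &quot;first\nsecond\n&quot;;</li><li></li><li>    # Under /m, &quot;^&quot; matches at the start of every line, but \A does not.</li><li>    my @caret = $text =~ /^(\w+)/mg;</li><li>    my @start = $text =~ /\A(\w+)/mg;</li><li>    print &quot;@caret\n&quot;;                                   # first second</li><li>    print &quot;@start\n&quot;;                                   # first</li><li></li><li>    # \Z tolerates one trailing newline; \z insists on the true end.</li><li>    print &quot;Z\n&quot; if $text =~ /second\Z/;                 # prints Z</li><li>    print &quot;z\n&quot; if $text =~ /second\z/;                 # prints nothing</li><li>    print &quot;z after newline\n&quot; if $text =~ /second\n\z/; # prints z after newline</li></ol></pre>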
<p>The <code class="inline">\<span class="w">G</span></code>
 assertion can be used to chain global matches (using
<code class="inline"><a class="l_k" href="functions/m.html">m//g</a></code>), as described in <a href="perlop.html#Regexp-Quote-Like-Operators">Regexp Quote-Like Operators in perlop</a>.
It is also useful when writing <code class="inline"><span class="w">lex</span></code>
-like scanners, when you have
several patterns that you want to match against consecutive substrings
of your string; see the previous reference.  The actual location
where <code class="inline">\<span class="w">G</span></code>
 will match can also be influenced by using <code class="inline"><a class="l_k" href="functions/pos.html">pos()</a></code> as
an lvalue: see <a href="functions/pos.html">pos</a>. Note that the rule for zero-length
matches (see <a href="#Repeated-Patterns-Matching-a-Zero-length-Substring">Repeated Patterns Matching a Zero-length Substring</a>)
is modified somewhat, in that contents to the left of <code class="inline">\<span class="w">G</span></code>
 are
not counted when determining the length of the match. Thus the following
will not match forever:
</p>
<pre class="verbatim"><ol><li>     <a class="l_k" href="functions/my.html">my</a> <span class="i">$string</span> = <span class="q">&#39;ABC&#39;</span><span class="sc">;</span></li><li>     <a class="l_k" href="functions/pos.html">pos</a><span class="s">(</span><span class="i">$string</span><span class="s">)</span> = <span class="n">1</span><span class="sc">;</span></li><li>     while <span class="s">(</span><span class="i">$string</span> =~ <span class="q">/(.\G)/g</span><span class="s">)</span> <span class="s">{</span></li><li>         <a class="l_k" href="functions/print.html">print</a> <span class="i">$1</span><span class="sc">;</span></li><li>     <span class="s">}</span></li></ol></pre><p>It will print 'A' and then terminate, as it considers the match to
be zero-width, and thus will not match at the same position twice in a
row.</p>
<p>It is worth noting that <code class="inline">\<span class="w">G</span></code>
 improperly used can result in an infinite
loop. Take care when using patterns that include <code class="inline">\<span class="w">G</span></code>
 in an alternation.</p>
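<p>A <code class="inline">lex</code>-like scanner of the kind mentioned above might be sketched as follows.  It relies on the <code class="inline">/c</code> modifier described in perlop, which keeps <code class="inline">pos()</code> unchanged when a match fails; the token names and the input string are invented for the example.  Placing <code class="inline">\G</code> at the start of each pattern, rather than inside an alternation, keeps every attempt anchored at the current position.</p>
<pre class="verbatim"><ol><li>    use strict;</li><li>    use warnings;</li><li></li><li>    my $input = &quot;x = 42 + y&quot;;</li><li>    my @tokens;</li><li></li><li>    pos($input) = 0;</li><li>    while (pos($input) &lt; length $input) {</li><li>        # Each pattern is tried at the current position only, thanks to \G.</li><li>        if    ($input =~ /\G\s+/gc)            { next }</li><li>        elsif ($input =~ /\G(\d+)/gc)          { push @tokens, &quot;NUM($1)&quot; }</li><li>        elsif ($input =~ /\G([A-Za-z_]\w*)/gc) { push @tokens, &quot;ID($1)&quot; }</li><li>        elsif ($input =~ /\G([=+])/gc)         { push @tokens, &quot;OP($1)&quot; }</li><li>        else  { die &quot;Unexpected character at offset &quot;, pos($input), &quot;\n&quot; }</li><li>    }</li><li>    print &quot;@tokens\n&quot;;    # ID(x) OP(=) NUM(42) OP(+) ID(y)</li></ol></pre>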
<p>Note also that <code class="inline"><a class="l_k" href="functions/s.html">s///</a></code> will refuse to overwrite part of a substitution
that has already been replaced; so for example this will stop after the
first iteration, rather than iterating its way backwards through the
string:</p>
<pre class="verbatim"><ol><li>    <span class="i">$_</span> = <span class="q">&quot;123456789&quot;</span><span class="sc">;</span></li><li>    <a class="l_k" href="functions/pos.html">pos</a> = <span class="n">6</span><span class="sc">;</span></li><li>    <span class="q">s/.(?=.\G)/X/g</span><span class="sc">;</span></li><li>    <a class="l_k" href="functions/print.html">print</a><span class="sc">;</span> 	<span class="c"># prints 1234X6789, not XXXXX6789</span></li></ol></pre><a name="Capture-groups"></a><h3>Capture groups</h3>
<p>The grouping construct <code class="inline"><span class="s">(</span> ... <span class="s">)</span></code>
 creates capture groups (also referred to as
capture buffers). To refer to the current contents of a group later on, within
the same pattern, use <code class="inline">\<span class="w">g1</span></code>
 (or <code class="inline">\<span class="i">g</span><span class="s">{</span><span class="n">1</span><span class="s">}</span></code>
) for the first, <code class="inline">\<span class="w">g2</span></code>
 (or <code class="inline">\<span class="i">g</span><span class="s">{</span><span class="n">2</span><span class="s">}</span></code>
)
for the second, and so on.
This is called a <i>backreference</i>.
There is no limit to the number of captured substrings that you may use.
Groups are numbered with the leftmost open parenthesis being number 1, <i>etc</i>.  If
a group did not match, the associated backreference won't match either. (This
can happen if the group is optional, or in a different branch of an
alternation.)
You can omit the <code class="inline"><span class="q">&quot;g&quot;</span></code>
, and write <code class="inline"><span class="q">&quot;\1&quot;</span></code>
, <i>etc</i>., but there are some issues with
this form, described below.</p>
<p>You can also refer to capture groups relatively, by using a negative number, so
that <code class="inline">\<span class="w">g</span>-<span class="n">1</span></code>
 and <code class="inline">\<span class="i">g</span><span class="s">{</span><span class="n">-1</span><span class="s">}</span></code>
 both refer to the immediately preceding capture
group, and <code class="inline">\<span class="w">g</span>-<span class="n">2</span></code>
 and <code class="inline">\<span class="i">g</span><span class="s">{</span><span class="n">-2</span><span class="s">}</span></code>
 both refer to the group before it.  For
example:</p>
<pre class="verbatim"><ol><li>        <span class="q">/</span></li><li>         <span class="q">         (Y)            # group 1</span></li><li>         <span class="q">         (              # group 2</span></li><li>            <span class="q">            (X)         # group 3</span></li><li>            <span class="q">            \g{-1}      # backref to group 3</span></li><li>            <span class="q">            \g{-3}      # backref to group 1</span></li><li>         <span class="q">         )</span></li><li>        <span class="q">        /x</span></li></ol></pre><p>would match the same as <code class="inline"><span class="q">/(Y) ( (X) \g3 \g1 )/x</span></code>
.  This allows you to
interpolate regexes into larger regexes and not have to worry about the
capture groups being renumbered.</p>
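<p>The benefit for interpolation can be sketched as follows.  The fragment names <code class="inline">$relative</code> and <code class="inline">$absolute</code> and the test string are hypothetical.</p>
<pre class="verbatim"><ol><li>    use strict;</li><li>    use warnings;</li><li></li><li>    # A reusable fragment: any character followed by itself.</li><li>    my $relative = qr/(.)\g{-1}/;</li><li>    my $absolute = qr/(.)\g{1}/;</li><li></li><li>    # Embedded after another group, the fragment&#39;s own group becomes number 2.</li><li>    # \g{-1} still refers to it, while \g{1} now refers to the (\w+) group.</li><li>    print &quot;relative matches\n&quot; if &quot;ab:cc&quot; =~ /(\w+):$relative/;   # prints</li><li>    print &quot;absolute matches\n&quot; if &quot;ab:cc&quot; =~ /(\w+):$absolute/;   # prints nothing</li></ol></pre>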
<p>You can dispense with numbers altogether and create named capture groups.
The notation is <code class="inline">(?&lt;<i>name</i>&gt;...)</code> to declare and <code class="inline">\g{<i>name</i>}</code> to
reference.  (To be compatible with .NET regular expressions, <code class="inline">\g{<i>name</i>}</code> may
also be written as <code class="inline">\k{<i>name</i>}</code>, <code class="inline">\k&lt;<i>name</i>&gt;</code> or <code class="inline">\k'<i>name</i>'</code>.)
<i>name</i> must not begin with a number, nor contain hyphens.
When different groups within the same pattern have the same name, any reference
to that name assumes the leftmost defined group.  Named groups count in
absolute and relative numbering, and so can also be referred to by those
numbers.
(It's possible to do things with named capture groups that would otherwise
require <code class="inline"><span class="s">(</span><span class="q">??</span><span class="s">{</span><span class="s">}</span><span class="s">)</span></code>
.)</p>
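<p>A short sketch of declaring and referencing named groups follows.  The group names and sample strings are chosen purely for illustration.</p>
<pre class="verbatim"><ol><li>    use strict;</li><li>    use warnings;</li><li></li><li>    # Declare with (?&lt;name&gt;...) and refer back with \g{name} inside the pattern.</li><li>    if (&quot;hello&quot; =~ /(?&lt;char&gt;.)\g{char}/) {</li><li>        print &quot;doubled: $+{char}\n&quot;;        # doubled: l</li><li>    }</li><li></li><li>    # Named groups still take part in ordinary numbering.</li><li>    if (&quot;2024-05&quot; =~ /(?&lt;year&gt;\d{4})-(?&lt;month&gt;\d\d)/) {</li><li>        print &quot;$+{year} is also \$1: $1\n&quot;; # 2024 is also $1: 2024</li><li>    }</li></ol></pre>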
<p>Capture group contents are dynamically scoped and available to you outside the
pattern until the end of the enclosing block or until the next successful
match, whichever comes first.  (See <a href="perlsyn.html#Compound-Statements">Compound Statements in perlsyn</a>.)
You can refer to them by absolute number (using <code class="inline"><span class="q">&quot;$1&quot;</span></code>
 instead of <code class="inline"><span class="q">&quot;\g1&quot;</span></code>
,
<i>etc</i>); or by name via the <code class="inline"><span class="i">%+</span></code>
 hash, using <code class="inline">"$+{<i>name</i>}"</code>.</p>
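<p>The dynamic scoping of capture group contents can be observed with nested blocks, as in this minimal sketch (the strings are arbitrary):</p>
<pre class="verbatim"><ol><li>    use strict;</li><li>    use warnings;</li><li></li><li>    if (&quot;abc&quot; =~ /(b)/) {</li><li>        {</li><li>            # A successful match in an inner block sets its own $1 ...</li><li>            if (&quot;xyz&quot; =~ /(y)/) {</li><li>                print &quot;inner: $1\n&quot;;        # inner: y</li><li>            }</li><li>        }</li><li>        # ... and the outer value is visible again once that block ends.</li><li>        print &quot;outer: $1\n&quot;;                # outer: b</li><li>    }</li></ol></pre>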
<p>Braces are required in referring to named capture groups, but are optional for
absolute or relative numbered ones.  Braces are safer when creating a regex by
concatenating smaller strings.  For example if you have <code class="inline"><a class="l_k" href="functions/qr.html">qr/$a$b/</a></code>, and <code class="inline"><span class="i">$a</span></code>

contained <code class="inline"><span class="q">&quot;\g1&quot;</span></code>
, and <code class="inline"><span class="i">$b</span></code>
 contained <code class="inline"><span class="q">&quot;37&quot;</span></code>
, you would get <code class="inline"><span class="q">/\g137/</span></code>
 which
is probably not what you intended.</p>
<p>The <code class="inline">\<span class="w">g</span></code>
 and <code class="inline">\<span class="w">k</span></code>
 notations were introduced in Perl 5.10.0.  Prior to that
there were neither named nor relative numbered capture groups.  Absolute numbered
groups were referred to using <code class="inline">\<span class="n">1</span></code>
,
<code class="inline">\<span class="n">2</span></code>
, <i>etc</i>., and this notation is still
accepted (and likely always will be).  But it leads to some ambiguities if
there are more than 9 capture groups, as <code class="inline">\<span class="n">10</span></code>
 could mean either the tenth
capture group, or the character whose ordinal in octal is 010 (a backspace in
ASCII).  Perl resolves this ambiguity by interpreting <code class="inline">\<span class="n">10</span></code>
 as a backreference
only if at least 10 left parentheses have opened before it.  Likewise <code class="inline">\<span class="n">11</span></code>
 is
a backreference only if at least 11 left parentheses have opened before it.
And so on.  <code class="inline">\<span class="n">1</span></code>
 through <code class="inline">\<span class="n">9</span></code>
 are always interpreted as backreferences.
There are several examples below that illustrate these perils.  You can avoid
the ambiguity by always using <code class="inline">\<span class="w">g</span><span class="s">{</span><span class="s">}</span></code>
 or <code class="inline">\<span class="w">g</span></code>
 if you mean capturing groups;
and for octal constants always using <code class="inline">\<span class="w">o</span><span class="s">{</span><span class="s">}</span></code>
, or for <code class="inline">\<span class="n">077</span></code>
 and below, using 3
digits padded with leading zeros, since a leading zero implies an octal
constant.</p>
<p>The <code class="inline">\<i>digit</i></code> notation also works in certain circumstances outside
the pattern.  See <a href="#Warning-on-%5c1-Instead-of-%241">Warning on \1 Instead of $1</a> below for details.</p>
<p>Examples:</p>
<pre class="verbatim"><ol><li>    <span class="q">s/^([^ ]*) *([^ ]*)/$2 $1/</span><span class="sc">;</span>     <span class="c"># swap first two words</span></li><li></li><li>    <span class="q">/(.)\g1/</span>                        <span class="c"># find first doubled char</span></li><li>         and <a class="l_k" href="functions/print.html">print</a> <span class="q">&quot;&#39;$1&#39; is the first doubled character\n&quot;</span><span class="sc">;</span></li><li></li><li>    <span class="q">/(?&lt;char&gt;.)\k&lt;char&gt;/</span>            <span class="c"># ... a different way</span></li><li>         and <a class="l_k" href="functions/print.html">print</a> <span class="q">&quot;&#39;$+{char}&#39; is the first doubled character\n&quot;</span><span class="sc">;</span></li><li></li><li>    <span class="q">/(?&#39;char&#39;.)\g1/</span>                 <span class="c"># ... mix and match</span></li><li>         and <a class="l_k" href="functions/print.html">print</a> <span class="q">&quot;&#39;$1&#39; is the first doubled character\n&quot;</span><span class="sc">;</span></li><li></li><li>    if <span class="s">(</span><span class="q">/Time: (..):(..):(..)/</span><span class="s">)</span> <span class="s">{</span>   <span class="c"># parse out values</span></li><li>        <span class="i">$hours</span> = <span class="i">$1</span><span class="sc">;</span></li><li>        <span class="i">$minutes</span> = <span class="i">$2</span><span class="sc">;</span></li><li>        <span class="i">$seconds</span> = <span class="i">$3</span><span class="sc">;</span></li><li>    <span class="s">}</span></li><li></li><li>    <span class="q">/(.)(.)(.)(.)(.)(.)(.)(.)(.)\g10/</span>   <span class="c"># \g10 is a backreference</span></li><li>    /<span class="s">(</span>.<span class="s">)</span><span class="s">(</span>.<span class="s">)</span><span class="s">(</span>.<span class="s">)</span><span class="s">(</span>.<span class="s">)</span><span class="s">(</span>.<span 
class="s">)</span><span class="s">(</span>.<span class="s">)</span><span class="s">(</span>.<span class="s">)</span><span class="s">(</span>.<span class="s">)</span><span class="s">(</span>.<span class="s">)</span>\<span class="n">10</span>/    <span class="c"># \10 is octal</span></li><li>    <span class="q">/((.)(.)(.)(.)(.)(.)(.)(.)(.))\10/</span>  <span class="c"># \10 is a backreference</span></li><li>    /<span class="s">(</span><span class="s">(</span>.<span class="s">)</span><span class="s">(</span>.<span class="s">)</span><span class="s">(</span>.<span class="s">)</span><span class="s">(</span>.<span class="s">)</span><span class="s">(</span>.<span class="s">)</span><span class="s">(</span>.<span class="s">)</span><span class="s">(</span>.<span class="s">)</span><span class="s">(</span>.<span class="s">)</span><span class="s">(</span>.<span class="s">)</span><span class="s">)</span>\<span class="n">010</span>/ <span class="c"># \010 is octal</span></li><li></li><li>    <span class="i">$a</span> = <span class="q">&#39;(.)\1&#39;</span><span class="sc">;</span>        <span class="c"># Creates problems when concatenated.</span></li><li>    <span class="i">$b</span> = <span class="q">&#39;(.)\g{1}&#39;</span><span class="sc">;</span>     <span class="c"># Avoids the problems.</span></li><li>    <span class="q">&quot;aa&quot;</span> =~ <span class="q">/${a}/</span><span class="sc">;</span>      <span class="c"># True</span></li><li>    <span class="q">&quot;aa&quot;</span> =~ <span class="q">/${b}/</span><span class="sc">;</span>      <span class="c"># True</span></li><li>    <span class="q">&quot;aa0&quot;</span> =~ <span class="q">/${a}0/</span><span class="sc">;</span>    <span class="c"># False!</span></li><li>    <span class="q">&quot;aa0&quot;</span> =~ <span class="q">/${b}0/</span><span class="sc">;</span>    <span class="c"># True</span></li><li>    <span class="q">&quot;aa\x08&quot;</span> =~ <span class="q">/${a}0/</span><span class="sc">;</span>  
<span class="c"># True!</span></li><li>    <span class="q">&quot;aa\x08&quot;</span> =~ <span class="q">/${b}0/</span><span class="sc">;</span>  <span class="c"># False</span></li></ol></pre><p>Several special variables also refer back to portions of the previous
match.  <code class="inline"><span class="i">$+</span></code>
 returns whatever the last bracket match matched.
<code class="inline"><span class="i">$&amp;</span></code>
 returns the entire matched string.  (At one point <code class="inline"><span class="i">$0</span></code>
 did
also, but now it returns the name of the program.)  <code class="inline"><span class="i">$`</span></code>
 returns
everything before the matched string.  <code class="inline"><span class="i">$&#39;</span></code>
 returns everything
after the matched string. And <code class="inline"><span class="i">$^N</span></code>
 contains whatever was matched by
the most-recently closed group (submatch). <code class="inline"><span class="i">$^N</span></code>
 can be used in
extended patterns (see below), for example to assign a submatch to a
variable.
    </p>
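<p>The following sketch prints each of these variables after a single match.  The input string is made up, and the nested grouping is there only to show that <code class="inline">$+</code> and <code class="inline">$^N</code> can differ.</p>
<pre class="verbatim"><ol><li>    use strict;</li><li>    use warnings;</li><li></li><li>    # Group 1 wraps groups 2 and 3, so group 1 is the last one to close.</li><li>    if (&quot;key: foo=bar;&quot; =~ /((\w+)=(\w+))/) {</li><li>        print &quot;before: [&quot;, $`,  &quot;]\n&quot;;      # before: [key: ]</li><li>        print &quot;match:  [&quot;, $&amp;,  &quot;]\n&quot;;      # match:  [foo=bar]</li><li>        print &quot;after:  [&quot;, $&#39;,  &quot;]\n&quot;;      # after:  [;]</li><li>        print &quot;last:   [&quot;, $+,  &quot;]\n&quot;;      # last:   [bar]</li><li>        print &quot;closed: [&quot;, $^N, &quot;]\n&quot;;      # closed: [foo=bar]</li><li>    }</li></ol></pre>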