Files
GopherGate/target/doc/regex_syntax/utf8/index.html
2026-02-26 12:00:21 -05:00

55 lines
7.8 KiB
HTML
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
<!DOCTYPE html><html lang="en"><head><meta charset="utf-8"><meta name="viewport" content="width=device-width, initial-scale=1.0"><meta name="generator" content="rustdoc"><meta name="description" content="Converts ranges of Unicode scalar values to equivalent ranges of UTF-8 bytes."><title>regex_syntax::utf8 - Rust</title><script>if(window.location.protocol!=="file:")document.head.insertAdjacentHTML("beforeend","SourceSerif4-Regular-6b053e98.ttf.woff2,FiraSans-Italic-81dc35de.woff2,FiraSans-Regular-0fe48ade.woff2,FiraSans-MediumItalic-ccf7e434.woff2,FiraSans-Medium-e1aa3f0a.woff2,SourceCodePro-Regular-8badfe75.ttf.woff2,SourceCodePro-Semibold-aa29a496.ttf.woff2".split(",").map(f=>`<link rel="preload" as="font" type="font/woff2"href="../../static.files/${f}">`).join(""))</script><link rel="stylesheet" href="../../static.files/normalize-9960930a.css"><link rel="stylesheet" href="../../static.files/rustdoc-ca0dd0c4.css"><meta name="rustdoc-vars" data-root-path="../../" data-static-root-path="../../static.files/" data-current-crate="regex_syntax" data-themes="" data-resource-suffix="" data-rustdoc-version="1.93.1 (01f6ddf75 2026-02-11) (Arch Linux rust 1:1.93.1-1)" data-channel="1.93.1" data-search-js="search-9e2438ea.js" data-stringdex-js="stringdex-a3946164.js" data-settings-js="settings-c38705f0.js" ><script src="../../static.files/storage-e2aeef58.js"></script><script defer src="../sidebar-items.js"></script><script defer src="../../static.files/main-a410ff4d.js"></script><noscript><link rel="stylesheet" href="../../static.files/noscript-263c88ec.css"></noscript><link rel="alternate icon" type="image/png" href="../../static.files/favicon-32x32-eab170b8.png"><link rel="icon" type="image/svg+xml" href="../../static.files/favicon-044be391.svg"></head><body class="rustdoc mod"><!--[if lte IE 11]><div class="warning">This old browser is unsupported and will most likely display funky things.</div><![endif]--><rustdoc-topbar><h2><a href="#">Module utf8</a></h2></rustdoc-topbar><nav class="sidebar"><div class="sidebar-crate"><h2><a href="../../regex_syntax/index.html">regex_<wbr>syntax</a><span class="version">0.8.10</span></h2></div><div class="sidebar-elems"><section id="rustdoc-toc"><h2 class="location"><a href="#">Module utf8</a></h2><h3><a href="#">Sections</a></h3><ul class="block top-toc"><li><a href="#wait-what-is-this" title="Wait, what is this?">Wait, what is this?</a></li><li><a href="#lineage" title="Lineage">Lineage</a></li></ul><h3><a href="#structs">Module Items</a></h3><ul class="block"><li><a href="#structs" title="Structs">Structs</a></li><li><a href="#enums" title="Enums">Enums</a></li></ul></section><div id="rustdoc-modnav"><h2 class="in-crate"><a href="../index.html">In crate regex_<wbr>syntax</a></h2></div></div></nav><div class="sidebar-resizer" title="Drag to resize sidebar"></div><main><div class="width-limiter"><section id="main-content" class="content"><div class="main-heading"><div class="rustdoc-breadcrumbs"><a href="../index.html">regex_syntax</a></div><h1>Module <span>utf8</span>&nbsp;<button id="copy-path" title="Copy item path to clipboard">Copy item path</button></h1><rustdoc-toolbar></rustdoc-toolbar><span class="sub-heading"><a class="src" href="../../src/regex_syntax/utf8.rs.html#1-592">Source</a> </span></div><details class="toggle top-doc" open><summary class="hideme"><span>Expand description</span></summary><div class="docblock"><p>Converts ranges of Unicode scalar values to equivalent ranges of UTF-8 bytes.</p>
<p>This is sub-module is useful for constructing byte based automatons that need
to embed UTF-8 decoding. The most common use of this module is in conjunction
with the <a href="../hir/struct.ClassUnicodeRange.html" title="struct regex_syntax::hir::ClassUnicodeRange"><code>hir::ClassUnicodeRange</code></a> type.</p>
<p>See the documentation on the <code>Utf8Sequences</code> iterator for more details and
an example.</p>
<h2 id="wait-what-is-this"><a class="doc-anchor" href="#wait-what-is-this">§</a>Wait, what is this?</h2>
<p>This is simplest to explain with an example. Lets say you wanted to test
whether a particular byte sequence was a Cyrillic character. One possible
scalar value range is <code>[0400-04FF]</code>. The set of allowed bytes for this
range can be expressed as a sequence of byte ranges:</p>
<div class="example-wrap"><pre class="language-text"><code>[D0-D3][80-BF]</code></pre></div>
<p>This is simple enough: simply encode the boundaries, <code>0400</code> encodes to
<code>D0 80</code> and <code>04FF</code> encodes to <code>D3 BF</code>, and create ranges from each
corresponding pair of bytes: <code>D0</code> to <code>D3</code> and <code>80</code> to <code>BF</code>.</p>
<p>However, what if you wanted to add the Cyrillic Supplementary characters to
your range? Your range might then become <code>[0400-052F]</code>. The same procedure
as above doesnt quite work because <code>052F</code> encodes to <code>D4 AF</code>. The byte ranges
youd get from the previous transformation would be <code>[D0-D4][80-AF]</code>. However,
this isnt quite correct because this range doesnt capture many characters,
for example, <code>04FF</code> (because its last byte, <code>BF</code> isnt in the range <code>80-AF</code>).</p>
<p>Instead, you need multiple sequences of byte ranges:</p>
<div class="example-wrap"><pre class="language-text"><code>[D0-D3][80-BF] # matches codepoints 0400-04FF
[D4][80-AF] # matches codepoints 0500-052F</code></pre></div>
<p>This gets even more complicated if you want bigger ranges, particularly if
they naively contain surrogate codepoints. For example, the sequence of byte
ranges for the basic multilingual plane (<code>[0000-FFFF]</code>) look like this:</p>
<div class="example-wrap"><pre class="language-text"><code>[0-7F]
[C2-DF][80-BF]
[E0][A0-BF][80-BF]
[E1-EC][80-BF][80-BF]
[ED][80-9F][80-BF]
[EE-EF][80-BF][80-BF]</code></pre></div>
<p>Note that the byte ranges above will <em>not</em> match any erroneous encoding of
UTF-8, including encodings of surrogate codepoints.</p>
<p>And, of course, for all of Unicode (<code>[000000-10FFFF]</code>):</p>
<div class="example-wrap"><pre class="language-text"><code>[0-7F]
[C2-DF][80-BF]
[E0][A0-BF][80-BF]
[E1-EC][80-BF][80-BF]
[ED][80-9F][80-BF]
[EE-EF][80-BF][80-BF]
[F0][90-BF][80-BF][80-BF]
[F1-F3][80-BF][80-BF][80-BF]
[F4][80-8F][80-BF][80-BF]</code></pre></div>
<p>This module automates the process of creating these byte ranges from ranges of
Unicode scalar values.</p>
<h2 id="lineage"><a class="doc-anchor" href="#lineage">§</a>Lineage</h2>
<p>I got the idea and general implementation strategy from Russ Cox in his
<a href="https://web.archive.org/web/20160404141123/https://swtch.com/~rsc/regexp/regexp3.html">article on regexps</a> and RE2.
Russ Cox got it from Ken Thompsons <code>grep</code> (no source, folk lore?).
I also got the idea from
<a href="https://github.com/apache/lucene-solr/blob/ae93f4e7ac6a3908046391de35d4f50a0d3c59ca/lucene/core/src/java/org/apache/lucene/util/automaton/UTF32ToUTF8.java">Lucene</a>,
which uses it for executing automata on their term index.</p>
</div></details><h2 id="structs" class="section-header">Structs<a href="#structs" class="anchor">§</a></h2><dl class="item-table"><dt><a class="struct" href="struct.Utf8Range.html" title="struct regex_syntax::utf8::Utf8Range">Utf8<wbr>Range</a></dt><dd>A single inclusive range of UTF-8 bytes.</dd><dt><a class="struct" href="struct.Utf8Sequences.html" title="struct regex_syntax::utf8::Utf8Sequences">Utf8<wbr>Sequences</a></dt><dd>An iterator over ranges of matching UTF-8 byte sequences.</dd></dl><h2 id="enums" class="section-header">Enums<a href="#enums" class="anchor">§</a></h2><dl class="item-table"><dt><a class="enum" href="enum.Utf8Sequence.html" title="enum regex_syntax::utf8::Utf8Sequence">Utf8<wbr>Sequence</a></dt><dd>Utf8Sequence represents a sequence of byte ranges.</dd></dl></section></div></main></body></html>