helper.regex

Helper functions with regex capabilities

from fastcore.test import *
from trouver.helper.tests import _test_directory

find_regex_in_text

 find_regex_in_text (text:str, pattern:Union[str,Pattern[str]])

Return ranges in text where pattern occurs.

	Type	Details
text	str	Text in which to find the regex pattern
pattern	Union	The regex pattern
Returns	list	Each tuple is of the form `(a,b)` where `text[a:b]` is the regex match.

The following example finds an occurrence of a Markdown footnote marker:

regex_pattern = r'\[\^\d\]'
text = '[^1]: asdf'

output = find_regex_in_text(text, regex_pattern)
test_eq(output, [(0,4)])

start, end = output[0]
test_eq(text[start:end], '[^1]')

If the regex pattern has multiple matches, then all of them are included in the returned list.

regex_pattern = r'\d+'  # Searches for one or more consecutive digits
text = '9000 is a big number. But you know what is bigger? 9001.'

output = find_regex_in_text(text, regex_pattern)
test_eq(len(output), 2)

start, end = output[0]
test_eq(text[start:end], '9000')

start, end = output[1]
test_eq(text[start:end], '9001')
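
The behavior shown in these examples can be approximated with the standard library alone. The following is a minimal sketch, assuming `re.finditer` semantics (non-overlapping matches, scanned left to right); `find_regex_in_text_sketch` is a hypothetical name used for illustration and is not necessarily how the library implements `find_regex_in_text`:

```python
import re
from typing import Pattern, Union


def find_regex_in_text_sketch(
        text: str, pattern: Union[str, Pattern[str]]) -> list[tuple[int, int]]:
    """Return the (start, end) span of each non-overlapping match of `pattern`."""
    # re.finditer accepts either a pattern string or a compiled pattern.
    return [match.span() for match in re.finditer(pattern, text)]


spans = find_regex_in_text_sketch('[^1]: asdf', r'\[\^\d\]')
print(spans)  # [(0, 4)]
```

Returning spans rather than matched strings keeps the result usable for later slicing or replacement by index.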

The following example detects YAML frontmatter text as used in Obsidian. This regex pattern is also used in markdown.markdown.file.find_front_matter_meta_in_markdown_text.

The regex pattern detects the frontmatter even when it is empty.

sample_regex = r'---\n([\S\s]*?)?(?(1)\n|)---'
sample_str = '---\n---'
sample_output = find_regex_in_text(sample_str, sample_regex)
assert sample_output == [(0,7)]

sample_str = '---\naliases: [this_is_an_aliases_for_the_Obsidian_note]\n---'
sample_output = find_regex_in_text(sample_str, sample_regex)
assert sample_output == [(0, len(sample_str))]  # The entire sample_str is detected.

Contrast the regex pattern above with the pattern ---\n[\S\s]*?\n---, which does not detect empty YAML frontmatter text.

sample_regex = r'---\n[\S\s]*?\n---'
sample_str = '---\n---'
sample_output = find_regex_in_text(sample_str, sample_regex)
assert not sample_output

separate_indices_from_str

 separate_indices_from_str (text:str, indices:list[tuple[int,int]])

*Divide text into parts along the substrings specified by indices.

Assumes that the pairs of indices specified by indices are in order from first to last and the ranges specified by these pairs are all disjoint.

''.join(output) should recover text.*

	Type	Details
text	str
indices	list	The indices for substrings in `text` to separate.
Returns	list	Each str is a substring of `text`: either one specified by `indices`, or one lying before, between, or after the substrings specified by `indices`.

Here is a basic example of separate_indices_from_str:

text = 'hello asdf asdf'
sample_output = separate_indices_from_str(text, [(0,5), (10,11)])
print(sample_output)
test_eq(''.join(sample_output), text)

['', 'hello', ' asdf', ' ', 'asdf']
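
The output above alternates between text outside the given ranges and text inside them, starting with the (possibly empty) text before the first range. A minimal sketch of that splitting logic, assuming ordered and disjoint ranges as stated above; `separate_by_indices_sketch` is a hypothetical stand-in, not the library's implementation:

```python
def separate_by_indices_sketch(
        text: str, indices: list[tuple[int, int]]) -> list[str]:
    """Split `text` into pieces outside and inside each (start, end) range."""
    parts = []
    cursor = 0  # End of the previously consumed range
    for start, end in indices:
        parts.append(text[cursor:start])  # Text before this range (may be '')
        parts.append(text[start:end])     # The range itself
        cursor = end
    parts.append(text[cursor:])           # Whatever remains after the last range
    return parts


pieces = separate_by_indices_sketch('hello asdf asdf', [(0, 5), (10, 11)])
print(pieces)  # ['', 'hello', ' asdf', ' ', 'asdf']
```

Because every character lands in exactly one piece, joining the pieces recovers the original text.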

replace_string_by_indices

 replace_string_by_indices (string:str,
                            replace_ranges:Sequence[Union[Sequence[int],int]],
                            replace_with:Union[Sequence[str],str])

*Replace parts of string at the specified locations.

Use this with find_regex_in_text.

Parameters

string - str
replace_ranges - Sequence[Sequence[int] | int]
- Either a list of lists/tuples of one or two ints, or a single such list/tuple. A list/tuple [a,b] or (a,b) means that string[a:b] is to be replaced. [a] or (a,) means that string[a:] is to be replaced. The ranges should not overlap and should be arranged in ascending order.
replace_with - Sequence[str] | str
- The strs which will replace the parts represented by replace_ranges. replace_ranges and replace_with must be both lists or both not lists. If they are lists, they must be of the same length.

Returns

str*

	Type	Details
string	str	String in which to make replacements
replace_ranges	Sequence	A list of lists/tuples of int’s or a single list/tuple of int’s. Each
replace_with	Union	The str(s) which will replace the substrings at `replace_ranges` in `string`. `replace_with` must be a str exactly when `replace_ranges` is a single Sequence of int.
Returns	str	The str obtained by replacing the substrings at `replace_ranges` in `string` by the strs specified by `replace_with`.

The following are basic examples of replace_string_by_indices:

test_eq(replace_string_by_indices('hello world', replace_ranges=(0,5), replace_with='hi'), 'hi world')
test_eq(replace_string_by_indices('hello somebody', replace_ranges=[(0,1), (6,10)], replace_with=['', '']), 'ello body')

If replace_ranges and replace_with have different lengths, then a ValueError is raised:

with ExceptionExpected(ex=ValueError, regex="are different"):
    replace_string_by_indices('hello world', replace_ranges = [(0,5), (6,10)], replace_with = [''])
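
The docstring suggests pairing this function with find_regex_in_text. The following self-contained sketch illustrates that find-then-replace workflow under stated assumptions: `replace_by_ranges_sketch` is a hypothetical stand-in that handles only the list-of-ranges case, and `re.finditer` stands in for find_regex_in_text. Neither is the library's actual code.

```python
import re


def replace_by_ranges_sketch(
        string: str, ranges: list[tuple[int, int]],
        replacements: list[str]) -> str:
    """Replace each string[start:end] with the corresponding replacement."""
    if len(ranges) != len(replacements):
        raise ValueError('ranges and replacements are different in length')
    pieces = []
    cursor = 0  # End of the previously replaced range
    for (start, end), new in zip(ranges, replacements):
        pieces.append(string[cursor:start])  # Keep the text before the range
        pieces.append(new)                   # Substitute the range itself
        cursor = end
    pieces.append(string[cursor:])           # Keep the text after the last range
    return ''.join(pieces)


text = 'Call 555 or 1234.'
# Stand-in for find_regex_in_text: spans of each run of digits.
ranges = [m.span() for m in re.finditer(r'\d+', text)]
print(replace_by_ranges_sketch(text, ranges, ['#'] * len(ranges)))  # Call # or #.
```

Building the result from slices of the original string means the ranges never need adjusting for length changes caused by earlier replacements.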

Finding LaTeX strings

inline_latex_indices

 inline_latex_indices (text:str)

*Returns the indices in the text containing inline LaTeX str surrounded by $$.

This may not work correctly if the text has a LaTeX formatting issue or if any LaTeX string has a dollar sign \$.

Parameters

text - str

Returns

tuple[int]
- Each tuple is of the form (start, end) where text[start:end] is a LaTeX string, including any leading and trailing dollar signs ($$).*

latex_indices

 latex_indices (text:str)

*Returns the indices in the text containing LaTeX str.

This may not work correctly if the text has a LaTeX formatting issue.

Parameters

text - str

Returns

tuple[int]
- Each tuple is of the form (start, end) where text[start:end] is a LaTeX string, including any leading and trailing dollar signs ($ or $$).*

Here are some basic uses of the latex_indices function:

text = r'$$5 \neq 7$$ is a LaTeX equation.'
listy = latex_indices(text)
assert len(listy) == 1
start, end = listy[0]
test_eq(text[start:end], r'$$5 \neq 7$$')

text = r'$\mathcal{O}_X$ denotes the structure sheaf.'
listy = latex_indices(text)
assert len(listy) == 1
start, end = listy[0]
test_eq(text[start:end], r'$\mathcal{O}_X$')

text = r'$$\n5 \neq 7\n$$'
listy = latex_indices(text)
assert len(listy) == 1

If there is a dollar sign symbol \$ outside of a LaTeX string, then the latex_indices function works as expected; the dollar signs are not considered to be part of any LaTeX string:

text = r'\$6.2.4 helo blah $15+6+21$'  # Avoid detecting \$ as latex start/end
listy = latex_indices(text)
start, end = listy[0]
test_eq(text[start:end], r'$15+6+21$')

In the following example, the text has escaped dollar signs \$, one of them inside a math mode string; neither is treated as a math mode delimiter:

text = r'\$6.2.4 helo blah $\$37$ are needed for stuff.' 
listy = latex_indices(text)
start, end = listy[0]
test_eq(len(listy), 1)
print(text[listy[0][0]:listy[0][1]])  # This prints `$\$37$`; the escaped `\$` inside does not end the math string.
test_eq(text[start:end], r'$\$37$')

$\$37$

In the following example, note that \$S.10 is (correctly) not recognized as a LaTeX math mode string. Moreover, multi-line math mode strings are also recognized.

text = r"""
\$S.10 We have some latex string $a$ $hi$

$$
asdf
$$
"""
latex_indices(text)

[(34, 37), (38, 42), (44, 54)]

print(text[34:37])
print(text[38:42])
print(text[44:54])

$a$
$hi$
$$
asdf
$$
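
One way to get the behavior shown in these examples is a regex that tries `$$...$$` before `$...$` and refuses to treat a backslash-escaped dollar sign as a delimiter. The following is a minimal sketch under those assumptions; the pattern and the name `latex_indices_sketch` are hypothetical and are not taken from the library. It does not handle malformed LaTeX or an escaped backslash directly before a dollar sign.

```python
import re

# Hypothetical pattern: try `$$...$$` first, then `$...$`.
# `(?<!\\)` rejects a dollar sign that is preceded by a backslash.
# re.DOTALL lets `.` cross newlines so multi-line math is matched.
LATEX_SKETCH = re.compile(
    r'(?<!\\)\$\$.*?(?<!\\)\$\$|(?<!\\)\$.*?(?<!\\)\$', re.DOTALL)


def latex_indices_sketch(text: str) -> list[tuple[int, int]]:
    """Return (start, end) spans of math mode strings, delimiters included."""
    return [match.span() for match in LATEX_SKETCH.finditer(text)]


sample = r"""
\$S.10 We have some latex string $a$ $hi$

$$
asdf
$$
"""
print(latex_indices_sketch(sample))  # [(34, 37), (38, 42), (44, 54)]
```

Listing the `$$` alternative first matters: otherwise the single-dollar alternative could match the two adjacent dollar signs of `$$` as an empty math string.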

The inline_latex_indices function finds the indices only for in-line LaTeX math mode strings (which are surrounded by $$):

text = r"""
\$S.10 We have some latex string $a$ $hi$

$$
asdf
$$
"""
inline_latex_indices(text)

[(44, 54)]

print(text[44:54])

$$
asdf
$$