helper.regex

Helper functions with regex capabilities
from fastcore.test import *
from trouver.helper.tests import _test_directory

find_regex_in_text

 find_regex_in_text (text:str, pattern:Union[str,Pattern[str]])

Return ranges in text where pattern occurs.

Type Details
text str Text in which to find the regex pattern
pattern Union The regex pattern
Returns list Each tuple is of the form (a,b) where text[a:b] is the regex match.
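
The behaviour described above can be approximated with the standard library alone. The following is a minimal sketch, assuming the returned ranges are the spans of the non-overlapping matches yielded by re.finditer. The name find_spans_sketch is hypothetical, and this is not necessarily how find_regex_in_text is implemented.

```python
import re
from typing import Pattern, Union


def find_spans_sketch(text: str, pattern: Union[str, Pattern[str]]) -> list[tuple[int, int]]:
    """Hypothetical stand-in: return (start, end) spans of all non-overlapping matches."""
    # re.finditer accepts either a pattern string or a compiled pattern.
    return [match.span() for match in re.finditer(pattern, text)]


spans = find_spans_sketch('a1b22', r'\d+')
print(spans)
```

Each returned pair can be used directly as a slice, so 'a1b22'[start:end] recovers the matched digits.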

The following example finds an occurrence of a Markdown footnote marker:

regex_pattern = r'\[\^\d\]'
text = '[^1]: asdf'

output = find_regex_in_text(text, regex_pattern)
test_eq(output, [(0,4)])

start, end = output[0]
test_eq(text[start:end], '[^1]')

If there are multiple matches for the regex pattern, then they are all included in the returned list.

regex_pattern = r'\d+'  # Searches for one or more consecutive digits
text = '9000 is a big number. But you know what is bigger? 9001.'

output = find_regex_in_text(text, regex_pattern)
test_eq(len(output), 2)

start, end = output[0]
test_eq(text[start:end], '9000')

start, end = output[1]
test_eq(text[start:end], '9001')

The following example detects YAML frontmatter text as used in Obsidian. This regex pattern is also used in markdown.markdown.file.find_front_matter_meta_in_markdown_text.

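The regex pattern used detects the frontmatter even when it is empty: the conditional (?(1)\n|) requires a newline before the closing --- only when the optional group 1 took part in the match.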

sample_regex = r'---\n([\S\s]*?)?(?(1)\n|)---'
sample_str = '---\n---'
sample_output = find_regex_in_text(sample_str, sample_regex)
assert sample_output == [(0,7)]

sample_str = '---\naliases: [this_is_an_aliases_for_the_Obsidian_note]\n---'
sample_output = find_regex_in_text(sample_str, sample_regex)
assert sample_output == [(0, len(sample_str))]  # The entire sample_str is detected.

Contrast the regex pattern above with the pattern ---\n[\S\s]*?\n---, which does not detect empty YAML frontmatter text.

sample_regex = r'---\n[\S\s]*?\n---'
sample_str = '---\n---'
sample_output = find_regex_in_text(sample_str, sample_regex)
assert not sample_output
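
The contrast between the two patterns can be reproduced with Python's re module, which supports the conditional construct (?(1)...|...). The following is a sketch only: the helper name spans and the sample strings are assumptions made for illustration, and the helper merely collects match spans rather than calling find_regex_in_text.

```python
import re

# Group 1 is optional; the conditional (?(1)\n|) demands a newline before the
# closing '---' only when group 1 took part in the match.
with_conditional = re.compile(r'---\n([\S\s]*?)?(?(1)\n|)---')
# Here a newline is always required between the body and the closing '---'.
without_conditional = re.compile(r'---\n[\S\s]*?\n---')


def spans(pattern: re.Pattern, text: str) -> list[tuple[int, int]]:
    """Hypothetical helper: spans of all matches of pattern in text."""
    return [m.span() for m in pattern.finditer(text)]


empty = '---\n---'
filled = '---\ntitle: x\n---'

for sample in (empty, filled):
    print(repr(sample), spans(with_conditional, sample), spans(without_conditional, sample))
```

Both patterns agree on non-empty frontmatter; they differ only on the empty case.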

separate_indices_from_str

 separate_indices_from_str (text:str, indices:list[tuple[int,int]])

*Divide text into parts along the substrings specified by indices.

Assumes that the pairs of indices specified by indices are in order from first to last and the ranges specified by these pairs are all disjoint.

''.join(output) should recover text.*

Type Details
text str
indices list The indices for substrings in text to separate.
Returns list Each str is a substring of text: either one specified by indices, or the text lying before, between, or after the substrings specified by indices.

Here is a basic example of separate_indices_from_str:

text = 'hello asdf asdf'
sample_output = separate_indices_from_str(text, [(0,5), (10,11)])
print(sample_output)
test_eq(''.join(sample_output), text)
['', 'hello', ' asdf', ' ', 'asdf']
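
The splitting rule can be sketched as a single left-to-right pass. This is a minimal illustration under the stated assumptions (ordered, disjoint ranges); the name separate_sketch is hypothetical and the code is not taken from the library. It always emits the text before, between, and after the ranges, even when that text is empty; whether separate_indices_from_str emits a trailing empty string is not stated here.

```python
def separate_sketch(text: str, indices: list[tuple[int, int]]) -> list[str]:
    """Hypothetical stand-in: split text along ordered, disjoint (start, end) ranges."""
    parts = []
    previous_end = 0
    for start, end in indices:
        parts.append(text[previous_end:start])  # Text before this range (may be empty)
        parts.append(text[start:end])           # The range itself
        previous_end = end
    parts.append(text[previous_end:])           # Whatever follows the last range
    return parts


text = 'hello asdf asdf'
parts = separate_sketch(text, [(0, 5), (10, 11)])
print(parts)
```

Because every character of text lands in exactly one part, joining the parts recovers the original string.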

replace_string_by_indices

 replace_string_by_indices (string:str,
                            replace_ranges:Sequence[Union[Sequence[int],int]],
                            replace_with:Union[Sequence[str],str])

*Replace parts of string at the specified locations.

Use this with find_regex_in_text.

Parameters

  • string - str
  • replace_ranges - Sequence[Sequence[int] | int]
    • Either a single list/tuple of one or two ints, or a list of such lists/tuples. A list/tuple [a,b] or (a,b) means that string[a:b] is to be replaced. [a] or (a,) means that string[a:] is to be replaced. The ranges should not overlap and should be arranged in ascending order.
  • replace_with - Sequence[str] | str
    • The str’s which will replace the parts represented by replace_ranges. replace_ranges and replace_with must be both lists or both not lists. If they are lists, they must be of the same length.

Returns

  • str*
Type Details
string str String in which to make replacements
replace_ranges Sequence A list of lists/tuples of int’s or a single list/tuple of int’s. Each
replace_with Union The str(s) which will replace the substrings at replace_ranges in string. replace_with must be a str exactly when replace_ranges is a Sequence of a single Sequence of int.
Returns str The str obtained by replacing the substrings at replace_ranges in string by the strs specified by replace_with.

The following are basic examples of replace_string_by_indices:

test_eq(replace_string_by_indices('hello world', replace_ranges=(0,5), replace_with='hi'), 'hi world')
test_eq(replace_string_by_indices('hello somebody', replace_ranges=[(0,1), (6,10)], replace_with=['', '']), 'ello body')

If replace_ranges and replace_with have different lengths, then a ValueError is raised:

with ExceptionExpected(ex=ValueError, regex="are different"):
    replace_string_by_indices('hello world', replace_ranges = [(0,5), (6,10)], replace_with = [''])
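
The suggestion to pair this function with find_regex_in_text can be illustrated with a self-contained sketch. Everything here is an assumption made for illustration: replace_spans_sketch is a hypothetical stand-in that handles only the list-of-ranges form, its error message is its own, and re.finditer is used in place of find_regex_in_text.

```python
import re


def replace_spans_sketch(string: str, spans: list[tuple[int, int]], replacements: list[str]) -> str:
    """Hypothetical stand-in: replace string[a:b] for each ordered, disjoint (a, b)."""
    if len(spans) != len(replacements):
        # The message wording here is this sketch's own.
        raise ValueError('The numbers of spans and replacements are different.')
    pieces = []
    previous_end = 0
    for (start, end), new in zip(spans, replacements):
        pieces.append(string[previous_end:start])  # Untouched text before the span
        pieces.append(new)                         # Replacement for string[start:end]
        previous_end = end
    pieces.append(string[previous_end:])           # Untouched tail
    return ''.join(pieces)


text = '9000 is a big number. But you know what is bigger? 9001.'
digit_spans = [m.span() for m in re.finditer(r'\d+', text)]
result = replace_spans_sketch(text, digit_spans, ['N'] * len(digit_spans))
print(result)
```

Building the result from left to right avoids the index shifting that would occur if the string were edited in place one range at a time.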

Finding LaTeX strings


inline_latex_indices

 inline_latex_indices (text:str)

*Returns the indices in the text containing inline LaTeX str surrounded by $$.

This may not work correctly if the text has a LaTeX formatting issue or if any LaTeX string contains an escaped dollar sign \$.

Parameters

  • text - str

Returns

  • tuple[int]
    • Each tuple is of the form (start, end) where text[start:end] is a LaTeX string, including any leading and trailing dollar signs ($$).*

latex_indices

 latex_indices (text:str)

*Returns the indices in the text containing LaTeX str.

This may not work correctly if the text has a LaTeX formatting issue.

Parameters

  • text - str

Returns

  • tuple[int]
    • Each tuple is of the form (start, end) where text[start:end] is a LaTeX string, including any leading and trailing dollar signs ($ or $$).*

Here are some basic uses of the latex_indices function:

text = r'$$5 \neq 7$$ is a LaTeX equation.'
listy = latex_indices(text)
assert len(listy) == 1
start, end = listy[0]
test_eq(text[start:end], r'$$5 \neq 7$$')

text = r'$\mathcal{O}_X$ denotes the structure sheaf.'
listy = latex_indices(text)
assert len(listy) == 1
start, end = listy[0]
test_eq(text[start:end], r'$\mathcal{O}_X$')

text = r'$$\n5 \neq 7\n$$'
listy = latex_indices(text)
assert len(listy) == 1

If there is a dollar sign symbol \$ outside of a LaTeX string, then the latex_indices function works as expected; the dollar signs are not considered to be part of any LaTeX string:

text = r'\$6.2.4 helo blah $15+6+21$'  # Avoid detecting \$ as latex start/end
listy = latex_indices(text)
start, end = listy[0]
test_eq(text[start:end], r'$15+6+21$')

In the following example, the text has dollar sign symbols \$ which do not surround math mode text; one of them lies inside the math mode string $\$37$, which is still detected in full:

text = r'\$6.2.4 helo blah $\$37$ are needed for stuff.' 
listy = latex_indices(text)
start, end = listy[0]
test_eq(len(listy), 1)
print(text[listy[0][0]:listy[0][1]])  # This prints `$\$37$`; the escaped `\$` inside does not end the match.
test_eq(text[start:end], r'$\$37$')
$\$37$

In the following example, note that \$S.10 is (correctly) not recognized as a LaTeX math mode string. Moreover, multi-line math mode strings are also recognized.

text = r"""
\$S.10 We have some latex string $a$ $hi$

$$
asdf
$$
"""
latex_indices(text)
[(34, 37), (38, 42), (44, 54)]
print(text[34:37])
print(text[38:42])
print(text[44:54])
$a$
$hi$
$$
asdf
$$
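
One way to obtain the behaviour shown in these examples is a regex with a negative lookbehind that skips escaped dollar signs. The following is a sketch under that assumption: the pattern LATEX_SKETCH and the function latex_spans_sketch are hypothetical and are not claimed to be what latex_indices uses internally.

```python
import re

# Assumed pattern for this sketch: try '$$...$$' first, then '$...$'. The
# lookbehind (?<!\\) rejects a dollar sign that is escaped as '\$'.
LATEX_SKETCH = re.compile(
    r'(?<!\\)\$\$.*?(?<!\\)\$\$'   # display math
    r'|(?<!\\)\$.*?(?<!\\)\$',     # inline math
    re.DOTALL,                     # let '.' cross newlines for multi-line math
)


def latex_spans_sketch(text: str) -> list[tuple[int, int]]:
    """Hypothetical stand-in: spans of unescaped $...$ and $$...$$ strings."""
    return [m.span() for m in LATEX_SKETCH.finditer(text)]


sample = r'\$6.2.4 helo blah $\$37$ are needed for stuff.'
start, end = latex_spans_sketch(sample)[0]
print(sample[start:end])
```

Like the documented functions, such a pattern can misbehave on malformed input, for example an unclosed $$.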

The inline_latex_indices function finds the indices only for in-line LaTeX math mode strings (which, in this module, are the ones surrounded by $$); the strings $a$ and $hi$ below are not returned:

text = r"""
\$S.10 We have some latex string $a$ $hi$

$$
asdf
$$
"""
inline_latex_indices(text)
[(44, 54)]
print(text[44:54])
$$
asdf
$$