from fastcore.test import *
from trouver.helper.tests import _test_directory
helper.regex
find_regex_in_text
find_regex_in_text (text:str, pattern:Union[str,Pattern[str]])
Return ranges in text where pattern occurs.
| | Type | Details |
|---|---|---|
| text | str | Text in which to find the regex pattern |
| pattern | Union[str, Pattern[str]] | The regex pattern |
| Returns | list | Each tuple is of the form (a,b) where text[a:b] is the regex match. |
The following example finds the occurrence of the Markdown footnote:
regex_pattern = r'\[\^\d\]'
text = '[^1]: asdf'
output = find_regex_in_text(text, regex_pattern)
test_eq(output, [(0,4)])
start, end = output[0]
test_eq(text[start:end], '[^1]')
If the regex pattern has multiple matches, then all of them are included in the returned list.
regex_pattern = r'\d+'  # Searches for one or more consecutive digits
text = '9000 is a big number. But you know what is bigger? 9001.'
output = find_regex_in_text(text, regex_pattern)
test_eq(len(output), 2)
start, end = output[0]
test_eq(text[start:end], '9000')
start, end = output[1]
test_eq(text[start:end], '9001')
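The behavior in the two examples above can be approximated with the standard library alone. The following is a minimal sketch, assuming that the returned ranges are the spans of the non-overlapping matches produced by re.finditer; find_spans is a hypothetical stand-in name and is not trouver's implementation.

```python
import re
from typing import Pattern, Union


def find_spans(text: str, pattern: Union[str, Pattern[str]]) -> list[tuple[int, int]]:
    """Hypothetical stand-in: return the (start, end) span of every match."""
    # re.finditer accepts either a pattern string or a compiled pattern.
    return [match.span() for match in re.finditer(pattern, text)]


text = '9000 is a big number. But you know what is bigger? 9001.'
spans = find_spans(text, re.compile(r'\d+'))
matched = [text[start:end] for start, end in spans]
print(matched)
```

Both a plain str and a compiled pattern can be passed, which mirrors the Union[str, Pattern[str]] annotation in the signature.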
The following example detects YAML frontmatter text as used in Obsidian. This regex pattern is also used in markdown.markdown.file.find_front_matter_meta_in_markdown_text.
The regex pattern used is able to detect the frontmatter even when it is empty.
sample_regex = r'---\n([\S\s]*?)?(?(1)\n|)---'
sample_str = '---\n---'
sample_output = find_regex_in_text(sample_str, sample_regex)
assert sample_output == [(0,7)]
sample_str = '---\naliases: [this_is_an_aliases_for_the_Obsidian_note]\n---'
sample_output = find_regex_in_text(sample_str, sample_regex)
assert sample_output == [(0, len(sample_str))]  # The entire sample_str is detected.
Contrast the regex pattern above with the pattern ---\n[\S\s]*?\n---, which does not detect empty YAML frontmatter text.
sample_regex = r'---\n[\S\s]*?\n---'
sample_str = '---\n---'
sample_output = find_regex_in_text(sample_str, sample_regex)
assert not sample_output
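The difference between the two patterns comes from the conditional construct (?(1)\n|), which requires a newline before the closing --- only when group 1 took part in the match. The sketch below checks this with Python's re module directly, assuming that find_regex_in_text reports the same spans as re.finditer; the spans helper and the shortened frontmatter string are illustrative only.

```python
import re

WITH_CONDITIONAL = r'---\n([\S\s]*?)?(?(1)\n|)---'
WITHOUT_CONDITIONAL = r'---\n[\S\s]*?\n---'


def spans(pattern: str, text: str) -> list[tuple[int, int]]:
    """Hypothetical helper: spans of all matches of pattern in text."""
    return [match.span() for match in re.finditer(pattern, text)]


empty_frontmatter = '---\n---'
filled_frontmatter = '---\naliases: [note]\n---'

# Group 1 is skipped for the empty frontmatter, so no extra newline is required.
print(spans(WITH_CONDITIONAL, empty_frontmatter))
# With content present, group 1 captures it and the newline before --- is required.
print(spans(WITH_CONDITIONAL, filled_frontmatter))
# Without the conditional, a newline before the closing --- is always required.
print(spans(WITHOUT_CONDITIONAL, empty_frontmatter))
```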
separate_indices_from_str
separate_indices_from_str (text:str, indices:list[tuple[int,int]])
Divide text into parts along the substrings specified by indices.
Assumes that the pairs of indices specified by indices are in order from first to last and the ranges specified by these pairs are all disjoint.
''.join(output) should recover text.
| | Type | Details |
|---|---|---|
| text | str | |
| indices | list | The indices for substrings in text to separate. |
| Returns | list | Each str is a substring of text: either a substring specified by indices, or a substring in between the substrings specified by indices. |
Here is a basic example of separate_indices_from_str:
text = 'hello asdf asdf'
sample_output = separate_indices_from_str(text, [(0,5), (10,11)])
print(sample_output)
test_eq(''.join(sample_output), text)
['', 'hello', ' asdf', ' ', 'asdf']
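The output above alternates between the text outside the given ranges and the text inside them, which is why joining the parts recovers the original string. A minimal sketch of that splitting logic follows, assuming ordered and disjoint ranges as the docstring requires; split_at_ranges is a hypothetical name and not trouver's code.

```python
def split_at_ranges(text: str, indices: list[tuple[int, int]]) -> list[str]:
    """Hypothetical sketch: split text into gap / range / gap / ... pieces."""
    parts = []
    cursor = 0  # End of the previously emitted range.
    for start, end in indices:
        parts.append(text[cursor:start])  # Text between ranges (may be empty).
        parts.append(text[start:end])     # Text inside the range.
        cursor = end
    parts.append(text[cursor:])           # Remaining text after the last range.
    return parts


text = 'hello asdf asdf'
parts = split_at_ranges(text, [(0, 5), (10, 11)])
print(parts)
```

Emitting the gap before every range, even when it is empty, keeps the positions predictable: pieces at odd positions are always the ones named by indices.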
replace_string_by_indices
replace_string_by_indices (string:str, replace_ranges:Sequence[Union[Sequence[int],int]], replace_with:Union[Sequence[str],str])
Replace parts of string at the specified locations.
Use this with find_regex_in_text.
Parameters
- string - str
- replace_ranges - Sequence[Sequence[int] | int]
  - Either a list of lists/tuples of one or two ints, or a single such list/tuple. A list/tuple [a,b] or (a,b) means that string[a:b] is to be replaced. [a] or (a,) means that string[a:] is to be replaced. The ranges should not overlap and should be arranged in ascending order.
- replace_with - Sequence[str] | str
  - The strs which will replace the parts represented by replace_ranges. replace_ranges and replace_with must be both lists or both not lists. If they are lists, they must be of the same length.
Returns
- str
| | Type | Details |
|---|---|---|
| string | str | String in which to make replacements |
| replace_ranges | Sequence | A list of lists/tuples of ints or a single list/tuple of ints. Each list/tuple [a,b] or (a,b) means that string[a:b] is to be replaced. |
| replace_with | Union | The str(s) which will replace the substrings at replace_ranges in string. replace_with must be a str exactly when replace_ranges is a single Sequence of int. |
| Returns | str | The str obtained by replacing the substrings at replace_ranges in string by the strs specified by replace_with. |
The following are basic examples of replace_string_by_indices:
test_eq(replace_string_by_indices('hello world', replace_ranges=(0,5), replace_with='hi'), 'hi world')
test_eq(replace_string_by_indices('hello somebody', replace_ranges=[(0,1), (6,10)], replace_with=['', '']), 'ello body')
If replace_ranges and replace_with are of different lengths, then a ValueError is raised:
with ExceptionExpected(ex=ValueError, regex="are different"):
    replace_string_by_indices('hello world', replace_ranges = [(0,5), (6,10)], replace_with = [''])
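Since the docstring suggests pairing this function with find_regex_in_text, the sketch below shows the find-then-replace workflow end to end using only the standard library. Both helpers (find_spans and replace_spans) are hypothetical stand-ins written for illustration, and the error message is made up; they are not trouver's implementations.

```python
import re


def find_spans(text: str, pattern: str) -> list[tuple[int, int]]:
    """Hypothetical stand-in for a regex range finder."""
    return [match.span() for match in re.finditer(pattern, text)]


def replace_spans(string: str, ranges: list[tuple[int, int]], replacements: list[str]) -> str:
    """Hypothetical stand-in: replace each string[a:b] by the matching replacement."""
    if len(ranges) != len(replacements):
        raise ValueError('The lengths of ranges and replacements are different.')
    pieces = []
    cursor = 0
    # Ranges are assumed to be disjoint and in ascending order.
    for (start, end), new in zip(ranges, replacements):
        pieces.append(string[cursor:start])  # Keep the text before the range.
        pieces.append(new)                   # Substitute the range itself.
        cursor = end
    pieces.append(string[cursor:])
    return ''.join(pieces)


text = '9000 is a big number. But you know what is bigger? 9001.'
ranges = find_spans(text, r'\d+')
result = replace_spans(text, ranges, ['N'] * len(ranges))
print(result)
```

Building the result from left to right with a cursor avoids having to adjust later indices after an earlier replacement changes the string length.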
Finding LaTeX string
inline_latex_indices
inline_latex_indices (text:str)
Returns the indices in the text containing inline LaTeX str surrounded by $$.
This may not work correctly if the text has a LaTeX formatting issue or if any LaTeX string has a dollar sign \$.
Parameters
- text - str
Returns
- tuple[int]
  - Each tuple is of the form (start, end) where text[start:end] is a LaTeX string, including any leading and trailing dollar signs ($$).
latex_indices
latex_indices (text:str)
Returns the indices in the text containing LaTeX str.
This may not work correctly if the text has a LaTeX formatting issue.
Parameters
- text - str
Returns
- tuple[int]
  - Each tuple is of the form (start, end) where text[start:end] is a LaTeX string, including any leading and trailing dollar signs ($ or $$).
Here are some basic uses of the latex_indices function:
text = r'$$5 \neq 7$$ is a LaTeX equation.'
listy = latex_indices(text)
assert len(listy) == 1
start, end = listy[0]
test_eq(text[start:end], r'$$5 \neq 7$$')
text = r'$\mathcal{O}_X$ denotes the structure sheaf.'
listy = latex_indices(text)
assert len(listy) == 1
start, end = listy[0]
test_eq(text[start:end], r'$\mathcal{O}_X$')
text = r'$$\n5 \neq 7\n$$'
listy = latex_indices(text)
assert len(listy) == 1
If there is a dollar sign symbol \$ outside of a LaTeX string, then the latex_indices function works as expected; the dollar signs are not considered to be part of any LaTeX string:
text = r'\$6.2.4 helo blah $15+6+21$'  # Avoid detecting \$ as latex start/end
listy = latex_indices(text)
start, end = listy[0]
test_eq(text[start:end], r'$15+6+21$')
In the following example, the text has dollar sign symbols \$ which do not surround math mode text:
text = r'\$6.2.4 helo blah $\$37$ are needed for stuff.'
listy = latex_indices(text)
start, end = listy[0]
test_eq(len(listy), 1)
print(text[listy[0][0]:listy[0][1]])  # This should print `$\$37$`; the `\$` inside it is not treated as a delimiter.
test_eq(text[start:end], r'$\$37$')
$\$37$
In the following example, note that \$S.10 is (correctly) not recognized as a LaTeX math mode string. Moreover, multi-line math mode strings are also recognized.
text = r"""
\$S.10 We have some latex string $a$ $hi$
$$
asdf
$$
"""
latex_indices(text)
[(34, 37), (38, 42), (44, 54)]
print(text[34:37])
print(text[38:42])
print(text[44:54])
$a$
$hi$
$$
asdf
$$
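One way to get the behavior shown in these examples is a regex with a negative lookbehind that rejects escaped dollar signs. The following is a rough sketch under that assumption; MATH_PATTERN and math_spans are hypothetical names, the pattern is not taken from trouver, and it does not handle every case (for example, a dollar sign preceded by an escaped backslash).

```python
import re

# Assumed pattern, not trouver's: try $$...$$ first, then $...$.
# (?<!\\) rejects a dollar sign that is immediately preceded by a backslash.
MATH_PATTERN = re.compile(
    r'(?<!\\)\$\$[\s\S]*?(?<!\\)\$\$'
    r'|(?<!\\)\$[\s\S]*?(?<!\\)\$'
)


def math_spans(text: str) -> list[tuple[int, int]]:
    """Hypothetical sketch: spans of $...$ and $$...$$ strings in text."""
    return [match.span() for match in MATH_PATTERN.finditer(text)]


text = r'\$S.10 costs $\$37$, and $a$ is inline.' + '\n$$\nasdf\n$$\n'
found = [text[start:end] for start, end in math_spans(text)]
print(found)

# Keeping only the spans that start with $$ gives the $$-delimited strings.
double_dollar = [s for s in found if s.startswith('$$')]
print(double_dollar)
```

Listing the $$ alternative first matters: otherwise the opening $$ of a multi-line block would be matched as an empty $...$ string.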
The inline_latex_indices function finds the indices only for in-line LaTeX math mode strings (which are surrounded by $$):
text = r"""
\$S.10 We have some latex string $a$ $hi$
$$
asdf
$$
"""
inline_latex_indices(text)
[(44, 54)]
print(text[44:54])
$$
asdf
$$