# TODO: Create a new module dedicated to definition and notation identification and move appropriate functions over there.
markdown.obsidian.personal.machine_learning.tokenize.def_and_notat_token_classification
Previously, trouver only had functionality for using ML models to identify newly introduced notations in text and for gathering data to train such models. Moreover, those models were merely classification models, and using them to identify newly introduced notations involved a lot of computational redundancy.
This module aims to provide the same functionalities for both definitions and notations by training and using token classification models instead.
from unittest import mock
import shutil
import tempfile
from datasets import ClassLabel, Dataset, Features, Sequence, Value
from transformers import AutoTokenizer
from fastcore.test import *
from trouver.helper.tests import _test_directory
Gather ML data from information notes
html_data_from_note
html_data_from_note (note_or_mf:Union[trouver.markdown.obsidian.vault.VaultNote,trouver.markdown.markdown.file.MarkdownFile], vault:Optional[os.PathLike]=None, note_name:Optional[str]=None)
*Obtain html data for token classification from the information note.
Currently, the token types mainly revolve around definitions and notations.
If note has the tag _auto/def_and_notat_identified, then the data in the note is assumed to be auto-generated and not verified, and None is returned.
Returns - Union[dict, None] - The key-value pairs are:
- "Note name" - The name of the note
- "Raw text" - The raw text to include in the data.
- "Tag data" - The list of HTML tags carrying definition/notation data and their locations in the raw text; see the second output of the function remove_html_tags_in_text. Each element of the list is a tuple consisting of a bs4.element.Tag and two ints.*
| | Type | Default | Details |
|---|---|---|---|
| note_or_mf | Union | | Either a [VaultNote](https://hyunjongkimmath.github.io/trouver/markdown.obsidian.vault.html#vaultnote) object pointing to a note or a [MarkdownFile](https://hyunjongkimmath.github.io/trouver/markdown.markdown.file.html#markdownfile) object from which to extract html data. |
| vault | Optional | None | The vault to use when processing the [MarkdownFile](https://hyunjongkimmath.github.io/trouver/markdown.markdown.file.html#markdownfile) objects (if note_or_mf is a [VaultNote](https://hyunjongkimmath.github.io/trouver/markdown.obsidian.vault.html#vaultnote), then the [MarkdownFile](https://hyunjongkimmath.github.io/trouver/markdown.markdown.file.html#markdownfile) object is created from the text of the note); cf. the [process_standard_information_note](https://hyunjongkimmath.github.io/trouver/markdown.obsidian.personal.note_processing.html#process_standard_information_note) function. |
| note_name | Optional | None | If note_or_mf is a [MarkdownFile](https://hyunjongkimmath.github.io/trouver/markdown.markdown.file.html#markdownfile), note_name should be the name of the note from which the [MarkdownFile](https://hyunjongkimmath.github.io/trouver/markdown.markdown.file.html#markdownfile) comes, if applicable. If note_or_mf is a [VaultNote](https://hyunjongkimmath.github.io/trouver/markdown.obsidian.vault.html#vaultnote) object, then note_name is ignored and note_or_mf.name is used instead. |
| **Returns** | **Optional** | | **The keys to the dict are "Note name", "Raw text", "Tag data". However, None is returned if note does not exist or the note is marked with auto-generated, unverified data.** |
In the following example, we mock a VaultNote whose content is that of mf in the example for the raw_text_with_html_tags_from_markdownfile function. Note that within mf there is some text surrounded by double asterisks ** and some text surrounded by HTML tags to indicate newly introduced definitions and notations.
mf = MarkdownFile.from_string(
r"""---
aliases: []
tags: []
---
# Galois group of a separable and normal finite field extension
Let $L/K$ be a separable and normal finite field extension. Its <b definition="Galois group of a separable and normal finite field extension">Galois group</b> <span notation="">$\operatorname{Gal}(L/K)$</span> is...
# Galois group of a separable and normal profinite field extension
In fact, the notion of a Galois group can be defined for profinite field extensions. Given a separable and normal profinite field extension $L/K$, say that
$L = \varinjlim_i L_i$ where $L_i/K$ are finite extensions. Its **Galois group** **$\operatorname{Gal}(L/K)$**
# See Also
# Meta
## References and Citations
""")
with (mock.patch('__main__.VaultNote') as mock_VaultNote,
      mock.patch('__main__.MarkdownFile.from_vault_note') as mock_from_vault_note,
      mock.patch('__main__.isinstance') as mock_isinstance):
    mock_VaultNote.exists.return_value = True
    mock_VaultNote.name = "Note's name"
    mock_from_vault_note.return_value = mf
    mock_isinstance.return_value = True

    print(f"The following is the text from mf:\n\n{str(mf)}")
    html_data = html_data_from_note(mock_VaultNote, None)
    print(html_data)

test_eq(html_data['Note name'], "Note's name")
assert '**' not in html_data['Raw text']
assert '<' not in html_data['Raw text'] # Test the lack of HTML tags in the raw text
print(html_data['Tag data'])
test_eq(len(html_data['Tag data']), 4)
assert isinstance(html_data['Tag data'][0][0], bs4.element.Tag)
assert html_data['Tag data'][0][0].has_attr('definition')
assert not html_data['Tag data'][0][0].has_attr('notation')
assert html_data['Tag data'][1][0].has_attr('notation')
assert not html_data['Tag data'][1][0].has_attr('definition')
assert html_data['Tag data'][2][0].has_attr('definition')
assert not html_data['Tag data'][2][0].has_attr('notation')
assert html_data['Tag data'][3][0].has_attr('notation')
assert not html_data['Tag data'][3][0].has_attr('definition')
The following is the text from mf:
---
aliases: []
tags: []
---
# Galois group of a separable and normal finite field extension
Let $L/K$ be a separable and normal finite field extension. Its <b definition="Galois group of a separable and normal finite field extension">Galois group</b> <span notation="">$\operatorname{Gal}(L/K)$</span> is...
# Galois group of a separable and normal profinite field extension
In fact, the notion of a Galois group can be defined for profinite field extensions. Given a separable and normal profinite field extension $L/K$, say that
$L = \varinjlim_i L_i$ where $L_i/K$ are finite extensions. Its **Galois group** **$\operatorname{Gal}(L/K)$**
# See Also
# Meta
## References and Citations
{'Note name': "Note's name", 'Raw text': 'Let $L/K$ be a separable and normal finite field extension. Its Galois group $\\operatorname{Gal}(L/K)$ is...\n\nIn fact, the notion of a Galois group can be defined for profinite field extensions. Given a separable and normal profinite field extension $L/K$, say that\n$L = \\varinjlim_i L_i$ where $L_i/K$ are finite extensions. Its Galois group $\\operatorname{Gal}(L/K)$\n', 'Tag data': [(<b definition="Galois group of a separable and normal finite field extension">Galois group</b>, 64, 76), (<span notation="">$\operatorname{Gal}(L/K)$</span>, 77, 102), (<b definition="">Galois group</b>, 330, 342), (<span notation="">$\operatorname{Gal}(L/K)$</span>, 343, 368)]}
[(<b definition="Galois group of a separable and normal finite field extension">Galois group</b>, 64, 76), (<span notation="">$\operatorname{Gal}(L/K)$</span>, 77, 102), (<b definition="">Galois group</b>, 330, 342), (<span notation="">$\operatorname{Gal}(L/K)$</span>, 343, 368)]
We can also just pass a MarkdownFile object instead of a VaultNote object. In this case, we can specify the note_name parameter to indicate which note the MarkdownFile object came from, if applicable.
html_data = html_data_from_note(mf, vault=None, note_name="Note's name")
print(html_data)
test_eq(html_data['Note name'], "Note's name")
assert '**' not in html_data['Raw text']
assert '<' not in html_data['Raw text'] # Test the lack of HTML tags in the raw text
print(html_data['Tag data'])
test_eq(len(html_data['Tag data']), 4)
assert isinstance(html_data['Tag data'][0][0], bs4.element.Tag)
assert html_data['Tag data'][0][0].has_attr('definition')
assert not html_data['Tag data'][0][0].has_attr('notation')
assert html_data['Tag data'][1][0].has_attr('notation')
assert not html_data['Tag data'][1][0].has_attr('definition')
assert html_data['Tag data'][2][0].has_attr('definition')
assert not html_data['Tag data'][2][0].has_attr('notation')
assert html_data['Tag data'][3][0].has_attr('notation')
assert not html_data['Tag data'][3][0].has_attr('definition')
{'Note name': "Note's name", 'Raw text': 'Let $L/K$ be a separable and normal finite field extension. Its Galois group $\\operatorname{Gal}(L/K)$ is...\n\nIn fact, the notion of a Galois group can be defined for profinite field extensions. Given a separable and normal profinite field extension $L/K$, say that\n$L = \\varinjlim_i L_i$ where $L_i/K$ are finite extensions. Its Galois group $\\operatorname{Gal}(L/K)$\n', 'Tag data': [(<b definition="Galois group of a separable and normal finite field extension">Galois group</b>, 64, 76), (<span notation="">$\operatorname{Gal}(L/K)$</span>, 77, 102), (<b definition="">Galois group</b>, 330, 342), (<span notation="">$\operatorname{Gal}(L/K)$</span>, 343, 368)]}
[(<b definition="Galois group of a separable and normal finite field extension">Galois group</b>, 64, 76), (<span notation="">$\operatorname{Gal}(L/K)$</span>, 77, 102), (<b definition="">Galois group</b>, 330, 342), (<span notation="">$\operatorname{Gal}(L/K)$</span>, 343, 368)]
If we do not specify note_name, then None is used for the 'Note name' key in the output:
html_data = html_data_from_note(mf, vault=None, note_name=None)
print(html_data)
assert html_data['Note name'] is None
{'Note name': None, 'Raw text': 'Let $L/K$ be a separable and normal finite field extension. Its Galois group $\\operatorname{Gal}(L/K)$ is...\n\nIn fact, the notion of a Galois group can be defined for profinite field extensions. Given a separable and normal profinite field extension $L/K$, say that\n$L = \\varinjlim_i L_i$ where $L_i/K$ are finite extensions. Its Galois group $\\operatorname{Gal}(L/K)$\n', 'Tag data': [(<b definition="Galois group of a separable and normal finite field extension">Galois group</b>, 64, 76), (<span notation="">$\operatorname{Gal}(L/K)$</span>, 77, 102), (<b definition="">Galois group</b>, 330, 342), (<span notation="">$\operatorname{Gal}(L/K)$</span>, 343, 368)]}
For the following example, the note has an HTML tag already with extra data (attributes other than 'definition'
or 'notation'
). We assert that the extra data is preserved.
with (mock.patch('__main__.VaultNote') as mock_VaultNote,
      mock.patch('__main__.MarkdownFile.from_vault_note') as mock_from_vault_note,
      mock.patch('__main__.isinstance') as mock_isinstance):
    mock_VaultNote.exists.return_value = True
    mock_VaultNote.name = "Note's name"
    mock_isinstance.return_value = True

    text = 'Let $X$ be a topological space and let $U \subseteq X$ be an subspace. The <b definition="Closure of a subspace of a topological space" typo="dosure of $U$">closure of $U$</b> is defined as...'
    mf = MarkdownFile.from_string(text)
    mock_from_vault_note.return_value = mf
    print(f"The following is the text of the mocked note: \n\n {text}\n\n")
    html_data = html_data_from_note(mock_VaultNote, None)
    print(html_data)

assert html_data['Tag data'][0][0].has_attr('typo')
test_eq(html_data['Tag data'][0][0].attrs['typo'], 'dosure of $U$')
The following is the text of the mocked note:
Let $X$ be a topological space and let $U \subseteq X$ be an subspace. The <b definition="Closure of a subspace of a topological space" typo="dosure of $U$">closure of $U$</b> is defined as...
{'Note name': "Note's name", 'Raw text': 'Let $X$ be a topological space and let $U \\subseteq X$ be an subspace. The closure of $U$ is defined as...', 'Tag data': [(<b definition="Closure of a subspace of a topological space" typo="dosure of $U$">closure of $U$</b>, 75, 89)]}
In the following example, the (mocked) note has the #_auto/def_and_notat_identified tag to indicate that its definition and notation markings were auto-generated by a model (trained with data processed by the tokenize_html_data function) using the auto_mark_def_and_notats function. In this case, the html_data_from_note function returns None to prevent gathering data that is unverified and auto-generated by a model.
# with (mock.patch('__main__.VaultNote') as mock_VaultNote,
#       mock.patch('__main__.MarkdownFile.from_vault_note') as mock_from_vault_note):
#     mock_VaultNote.exists.return_value = True
#     mock_VaultNote.name = "Note's name"

text = r'''---
tags: [_auto/def_and_notat_identified]
---
Let $X$ be a topological space and let $U \subseteq X$ be an subspace. The <b definition="Closure of a subspace of a topological space" typo="dosure of $U$">closure of $U$</b> is defined as...'''
mf = MarkdownFile.from_string(text)
# mock_from_vault_note.return_value = mf
print(f"The following is the text of the mocked note: \n\n{text}\n\n")
html_data = html_data_from_note(note_or_mf=mf)
assert html_data is None
The following is the text of the mocked note:
---
tags: [_auto/def_and_notat_identified]
---
Let $X$ be a topological space and let $U \subseteq X$ be an subspace. The <b definition="Closure of a subspace of a topological space" typo="dosure of $U$">closure of $U$</b> is defined as...
def_or_notat_from_html_tag
def_or_notat_from_html_tag (tag:bs4.element.Tag)
*Can be passed as the ner_tag_from_html_tag argument in tokenize_html_data for the purposes of compiling a dataset for definition and notation identification.
The strings f"I-{output}" and f"B-{output}", where output is an output of this function, are valid ner_tags.*
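To make the role of this function concrete, here is a minimal sketch of the kind of mapping it performs. This is an illustration, not the library's actual implementation: the helper name is ours, and a plain dict stands in for a `bs4.element.Tag`'s attributes.

```python
from typing import Optional

def ner_label_from_attrs(attrs: dict) -> Optional[str]:
    """Return the base entity name ('definition' or 'notation') that the
    'B-'/'I-' prefixes attach to, or None if the tag marks neither."""
    if 'definition' in attrs:
        return 'definition'
    if 'notation' in attrs:
        return 'notation'
    return None

print(ner_label_from_attrs({'definition': 'Galois group'}))  # definition
print(ner_label_from_attrs({'notation': ''}))                # notation
print(ner_label_from_attrs({'style': 'color:red'}))          # None
```

The returned base names are exactly what label2id dictionaries such as the one used below (with keys "B-definition", "I-definition", etc.) are built from.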
tokenize_html_data
tokenize_html_data (html_locus:dict, tokenizer:Union[transformers.tokenization_utils.PreTrainedTokenizer,transformers.tokenization_utils_fast.PreTrainedTokenizerFast], max_length:int, ner_tag_from_html_tag:callable, label2id:dict[str,int], default_label:str='O')
*Actually tokenize the html data outputted by html_data_from_note.
To account for the possibility that the raw text is long, this function uses the tokenizer.batch_encode_plus function to tokenize the text into sequences.*
| | Type | Default | Details |
|---|---|---|---|
| html_locus | dict | | An output of html_data_from_note |
| tokenizer | Union | | |
| max_length | int | | Max length for each sequence of tokens |
| ner_tag_from_html_tag | callable | | Takes in a bs4.element.Tag and outputs the ner_tag (as a string or None) |
| label2id | dict | | The keys are ner_tags of the form f"I-{output}" or f"B-{output}" where output is an output of ner_tag_from_html_tag. |
| default_label | str | O | The default label for the NER tagging. |
| **Returns** | **tuple** | | The first list consists of the tokens and the second list consists of the named entity recognition tags. |
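Under the hood, a function like this has to convert character-level tag locations ("Tag data") into token-level BIO labels. The following self-contained sketch illustrates the idea with a naive whitespace tokenizer standing in for a real subword tokenizer and its offset mapping; the helper name and the toy spans are ours, not the library's.

```python
import re
from typing import List, Tuple

def bio_labels_from_spans(text: str,
                          spans: List[Tuple[str, int, int]],
                          label2id: dict) -> Tuple[List[str], List[int]]:
    """Assign BIO ner_tag ids to whitespace-delimited tokens based on the
    character offsets of the entity spans."""
    tokens, ids = [], []
    for m in re.finditer(r'\S+', text):
        tokens.append(m.group())
        label = 'O'
        for name, start, end in spans:
            if start <= m.start() < end:
                # The token that opens a span gets 'B-'; later tokens get 'I-'.
                label = ('B-' if m.start() == start else 'I-') + name
                break
        ids.append(label2id[label])
    return tokens, ids

label2id = {"O": 0, "B-definition": 1, "I-definition": 2,
            "B-notation": 3, "I-notation": 4}
text = "Its Galois group $G$ is profinite."
spans = [("definition", 4, 16), ("notation", 17, 20)]  # (label, start, end)
tokens, ids = bio_labels_from_spans(text, spans, label2id)
print(tokens)  # ['Its', 'Galois', 'group', '$G$', 'is', 'profinite.']
print(ids)     # [0, 1, 2, 3, 0, 0]
```

A real subword tokenizer additionally splits words into multiple tokens (as in the distilbert examples below), but the span-to-label logic is the same.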
We continue with an example using the HTML data from the example for the html_data_from_note
function.
mf = MarkdownFile.from_string(
r"""---
aliases: []
tags: []
---
# Galois group of a separable and normal finite field extension
Let $L/K$ be a separable and normal finite field extension. Its <b definition="Galois group of a separable and normal finite field extension">Galois group</b> <span notation="">$\operatorname{Gal}(L/K)$</span> is...
# Galois group of a separable and normal profinite field extension
In fact, the notion of a Galois group can be defined for profinite field extensions. Given a separable and normal profinite field extension $L/K$, say that
$L = \varinjlim_i L_i$ where $L_i/K$ are finite extensions. Its **Galois group** **$\operatorname{Gal}(L/K)$**
# See Also
# Meta
## References and Citations
""")
html_data = html_data_from_note(mf, vault=None, note_name=None)
print(html_data)
# assert html_data['Note name'] is None
{'Note name': None, 'Raw text': 'Let $L/K$ be a separable and normal finite field extension. Its Galois group $\\operatorname{Gal}(L/K)$ is...\n\nIn fact, the notion of a Galois group can be defined for profinite field extensions. Given a separable and normal profinite field extension $L/K$, say that\n$L = \\varinjlim_i L_i$ where $L_i/K$ are finite extensions. Its Galois group $\\operatorname{Gal}(L/K)$\n', 'Tag data': [(<b definition="Galois group of a separable and normal finite field extension">Galois group</b>, 64, 76), (<span notation="">$\operatorname{Gal}(L/K)$</span>, 77, 102), (<b definition="">Galois group</b>, 330, 342), (<span notation="">$\operatorname{Gal}(L/K)$</span>, 343, 368)]}
html_data['Raw text']
html_data["Tag data"]
[(<b definition="Galois group of a separable and normal finite field extension">Galois group</b>,
64,
76),
(<span notation="">$\operatorname{Gal}(L/K)$</span>, 77, 102),
(<b definition="">Galois group</b>, 330, 342),
(<span notation="">$\operatorname{Gal}(L/K)$</span>, 343, 368)]
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
label2id = {
    "O": 0,
    "B-definition": 1,
    "I-definition": 2,
    "B-notation": 3,
    "I-notation": 4
}
tokens, ner_tag_ids = tokenize_html_data(html_data, tokenizer, 510, def_or_notat_from_html_tag, label2id)
For this example, max_length
is set to 510 (tokens). The string (“Raw text”) is not very long, so only one sequence should be present.
test_eq(len(tokens), 1)
test_eq(len(ner_tag_ids), 1)
Now let us see what has been tagged:
id2label = {value: key for key, value in label2id.items()}
id2label
{0: 'O',
1: 'B-definition',
2: 'I-definition',
3: 'B-notation',
4: 'I-notation'}
for token, ner_tag in zip(tokens[0], ner_tag_ids[0]):
if ner_tag != 0:
print(f"{token}\t\t{id2label[ner_tag]}")
gal B-definition
##ois I-definition
group I-definition
$ B-notation
\ I-notation
operator I-notation
##name I-notation
{ I-notation
gal I-notation
} I-notation
( I-notation
l I-notation
/ I-notation
k I-notation
) I-notation
$ I-notation
gal B-definition
##ois I-definition
group I-definition
$ B-notation
\ I-notation
operator I-notation
##name I-notation
{ I-notation
gal I-notation
} I-notation
( I-notation
l I-notation
/ I-notation
k I-notation
) I-notation
$ I-notation
Let us set max_length to be shorter to observe an example of a tokenization of a single text across multiple sequences (of course, in practice, the max token length would be set longer, say around 512 or 1024):
token_ids, ner_tag_ids = tokenize_html_data(html_data, tokenizer, 20, def_or_notat_from_html_tag, label2id)
print(len(token_ids))
print(len(ner_tag_ids))
7
7
ner_tag_ids
[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2],
[2, 2, 2, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 3, 4, 4],
[4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 0]]
Gathering data
The following is sample code to gather data for definition/notation identification:
# TODO: test
notes = [] # Replace with actual notes
vault = '' # Replace with actual vault

html_data = [html_data_from_note(note, vault) for note in notes]
max_length = 1022

tokenized_html_data = [tokenize_html_data(html_locus, tokenizer, max_length, def_or_notat_from_html_tag, label2id) for html_locus in html_data]
token_id_data = [token_ids for token_ids, _ in tokenized_html_data]
ner_tag_data = [ner_tag_ids for _, ner_tag_ids in tokenized_html_data]
token_seqs = [token_seq for token_ids in token_id_data for token_seq in token_ids]
ner_tag_seqs = [ner_tag_seq for ner_tag_ids in ner_tag_data for ner_tag_seq in ner_tag_ids]
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
max_length = 1022
label2id = {
    "O": 0,
    "B-definition": 1,
    "I-definition": 2,
    "B-notation": 3,
    "I-notation": 4
}
id2label = {value: key for key, value in label2id.items()}
note_names, token_seqs, ner_tag_seqs = [], [], []
for html_locus, (token_ids, ner_tag_ids) in zip(html_data, tokenized_html_data):
    note_names.extend([html_locus["Note name"]] * len(token_ids))
    token_seqs.extend(token_ids)
    ner_tag_seqs.extend(ner_tag_ids)
# ner_tags = ClassLabel(names=list(label2id))
# ds = Dataset.from_dict(
#     {"note_name": note_names,
#      "tokens": token_seqs,
#      "ner_tags": ner_tag_seqs},
#     features=Features(
#         {
#             "note_name": Value(dtype='string'),
#             "tokens": Sequence(Value(dtype='string')),
#             "ner_tags": Sequence(ner_tags)}
#     ))
# ds.save_to_disk(".")
# ds.load_from_disk(".")
Use the trained model
See https://huggingface.co/docs/transformers/tasks/token_classification for training a token classification model.
# Helper functions
soup = bs4.BeautifulSoup('', 'html.parser')
tag = soup.new_tag('b', style="border-width:1px;border-style:solid;padding:3px", definition="")
tag.string = 'hi'
tag
<b definition="" style="border-width:1px;border-style:solid;padding:3px">hi</b>
def_and_notat_preds_by_model
def_and_notat_preds_by_model (text:str, pipeline)
*Predict where definitions and notations occur in text.
This function uses some of the same helper functions as auto_mark_def_and_notats, but does not raise warning messages as auto_mark_def_and_notats does.*
| | Type | Details |
|---|---|---|
| text | str | |
| pipeline | | The pipeline object created using the token classification model and its tokenizer |
| **Returns** | **list** | Each tuple consists of an HTML tag carrying the data of the prediction and ints marking where in text the definition or notation is. |
# TODO: test
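Absent a test here, the following self-contained sketch illustrates the span-merging step that any function consuming raw token-classification output has to perform: a 'B-' prediction opens a span and subsequent 'I-' predictions of the same label extend it. The helper name and the input dicts are ours; we only assume the usual Hugging Face per-token prediction shape (dicts containing 'entity', 'start', and 'end', among other keys), not trouver's actual helper logic.

```python
def preds_to_spans(preds):
    """Merge per-token predictions into (label, start, end) spans."""
    spans = []
    for p in preds:
        label = p['entity'][2:]  # strip the 'B-'/'I-' prefix
        if p['entity'].startswith('B-') or not spans or spans[-1][0] != label:
            spans.append([label, p['start'], p['end']])  # open a new span
        else:
            spans[-1][2] = p['end']  # extend the current span
    return [tuple(s) for s in spans]

preds = [
    {'entity': 'B-definition', 'start': 4, 'end': 10},
    {'entity': 'I-definition', 'start': 11, 'end': 16},
    {'entity': 'B-notation', 'start': 17, 'end': 20},
]
spans = preds_to_spans(preds)
print(spans)  # [('definition', 4, 16), ('notation', 17, 20)]
```

The resulting character spans are what get wrapped in `<b definition="">` / `<span notation="">` tags; the real implementation additionally consolidates "invalid" predictions, e.g. spans that start or end inside LaTeX math mode.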
auto_mark_def_and_notats
auto_mark_def_and_notats (note:trouver.markdown.obsidian.vault.VaultNote, pipeline:transformers.pipelines.token_classification.TokenClassificationPipeline, excessive_space_threshold:int=2, add_boxing_attr_to_existing_def_and_notat_markings:bool=True)
*Predict and mark where definitions and notations occur in a note using a token classification ML model.
Assumes that the note is a standard information note that does not have a lot of "user modifications", such as footnotes, links, and HTML tags. If there are many modifications, then these might be deleted.
Assumes that the paragraphs in the text of the note are "not too long". Currently, this means that the number of tokens in each paragraph should (roughly) not exceed pipeline.tokenizer.model_max_length.
Existing markings for definition and notation data (i.e. by surrounding with double asterisks or by HTML tags) are preserved (and turned into HTML tags), unless the markings overlap with predictions, in which case the original is preserved (and still turned into an HTML tag if possible).
Since the model can make "invalid" predictions (mostly those which start or end within a LaTeX math mode string), the actual markings are not necessarily direct translations of the model's predictions. See the helper function _consolidate_token_preds for more details on how this is implemented.
Raises
Warning messages (UserWarning) are printed in the following situations:
- There are two consecutive tokens within the pipeline's predictions of different entity types (e.g. one is predicted to belong within a definition and the other within a notation), but the latter token's predicted 'entity' begins with 'I-' (i.e. is 'I-definition' or 'I-notation') as opposed to 'B-'. The note's name and path are included in the warning message in this case.
- There are two consecutive tokens within the pipeline's predictions which the pipeline predicts to belong to the same entity, and yet there is excessive space (specified by excessive_space_threshold) between the end of the first token and the start of the second.*
| | Type | Default | Details |
|---|---|---|---|
| note | VaultNote | | The standard information note in which to find the definitions and notations. |
| pipeline | TokenClassificationPipeline | | The token classification pipeline that is used to predict whether tokens are part of definitions or notations introduced in the text. |
| excessive_space_threshold | int | 2 | |
| add_boxing_attr_to_existing_def_and_notat_markings | bool | True | If True, then nice attributes are added to the existing notation HTML tags, if not already present. |
| **Returns** | **None** | | |
In the following examples, we mock pipeline objects instead of using actual ones.
In the example below, we run the auto_mark_def_and_notats function on a note that has double asterisks ** surrounding the parts of the text that introduce definitions or notations. In these cases, appropriate HTML tags replace the double asterisks.
with (tempfile.TemporaryDirectory(prefix='temp_dir', dir=os.getcwd()) as temp_dir,
      mock.patch('__main__.pipelines.token_classification.TokenClassificationPipeline') as mock_pipeline):
    temp_vault = Path(temp_dir) / 'test_vault_6'
    shutil.copytree(_test_directory() / 'test_vault_6', temp_vault)

    mock_pipeline.tokenizer.model_max_length = 512

    vn = VaultNote(temp_vault, name='reference_with_tag_labels_Definition 2')
    print("Text before:\n\n")
    print(vn.text())
    auto_mark_def_and_notats(vn, mock_pipeline)
    print("\n\n\nText after:\n")
    print(vn.text())
    mf = MarkdownFile.from_vault_note(vn)
    assert mf.has_tag('_auto/def_and_notat_identified')
Text before:
---
cssclass: clean-embeds
aliases: []
tags: [_meta/literature_note, _meta/definition, _meta/notation]
---
# Ring of integers modulo $n$[^1]
Let $n \geq 1$ be an integer. The **ring of integers modulo $n$**, denoted by **$\mathbb{Z}/n\mathbb{Z}$**, is, informally, the ring whose elements are represented by the integers with the understanding that $0$ and $n$ are equal.
More precisely, $\mathbb{Z}/n\mathbb{Z}$ has the elements $0,1,\ldots,n-1$.
...
# See Also
- [[reference_with_tag_labels_Exercise 1|reference_with_tag_labels_Z_nZ_is_a_ring]]
# Meta
## References
## Citations and Footnotes
[^1]: Kim, Definition 2
Text after:
---
cssclass: clean-embeds
aliases: []
tags: [_meta/notation, _auto/def_and_notat_identified, _meta/literature_note, _meta/definition]
---
# Ring of integers modulo $n$[^1]
Let $n \geq 1$ be an integer. The <b definition="" style="border-width:1px;border-style:solid;padding:3px">ring of integers modulo $n$</b>, denoted by <span notation="" style="border-width:1px;border-style:solid;padding:3px">$\mathbb{Z}/n\mathbb{Z}$</span>, is, informally, the ring whose elements are represented by the integers with the understanding that $0$ and $n$ are equal.
More precisely, $\mathbb{Z}/n\mathbb{Z}$ has the elements $0,1,\ldots,n-1$.
...
# See Also
- [[reference_with_tag_labels_Exercise 1|reference_with_tag_labels_Z_nZ_is_a_ring]]
# Meta
## References
## Citations and Footnotes
[^1]: Kim, Definition 2
# TODO: more examples with pipeline mocking actual outputs