markdown.obsidian.personal.machine_learning.tokenize.def_and_notat_token_classification

Functions for gathering and processing tokenization data and for using ML models trained with such data.

Previously, trouver only had functionality for using ML models to identify newly introduced notations in text and for gathering data to train such models. Moreover, those models were plain classification models, and using them to identify newly introduced notations involved substantial computational redundancy.

This module aims to provide the same functionalities for both definitions and notations by training and using token classification models instead.

# TODO: Create a new module dedicated to definition and notation identification and move appropriate functions over there.
from unittest import mock
import os
from pathlib import Path
import shutil
import tempfile

import bs4
from datasets import ClassLabel, Dataset, Features, Sequence, Value
from transformers import AutoTokenizer
from fastcore.test import *

from trouver.helper.tests import _test_directory
from trouver.markdown.markdown.file import MarkdownFile
from trouver.markdown.obsidian.vault import VaultNote

Gather ML data from information notes


source

convert_double_asterisks_to_html_tags

 convert_double_asterisks_to_html_tags (text:str)

Replace the double asterisks, which signify definitions and notations, in text with HTML tags.

print(convert_double_asterisks_to_html_tags("**hi**. Here is a notation **$asdf$**"))
test_eq(convert_double_asterisks_to_html_tags("**hi**. Here is a notation **$asdf$**"), '<b definition="">hi</b>. Here is a notation <span notation="">$asdf$</span>')
<b definition="">hi</b>. Here is a notation <span notation="">$asdf$</span>

source

raw_text_with_html_tags_from_markdownfile

 raw_text_with_html_tags_from_markdownfile
                                            (mf:trouver.markdown.markdown.
                                            file.MarkdownFile,
                                            vault:os.PathLike)

Process the MarkdownFile, replacing the double-asterisk-surrounded text indicating definitions and notations with HTML tags instead.

In the following example, let mf be the following MarkdownFile:

print(str(mf))
---
aliases: []
tags: []
---
# Galois group of a separable and normal finite field extension

Let $L/K$ be a separable and normal finite field extension. Its <b definition="Galois group of a separable and normal finite field extension">Galois group</b> <span notation="">$\operatorname{Gal}(L/K)$</span> is...

# Galois group of a separable and normal profinite field extension

In fact, the notion of a Galois group can be defined for profinite field extensions. Given a separable and normal profinite field extension $L/K$, say that
$L = \varinjlim_i L_i$ where $L_i/K$ are finite extensions. Its **Galois group** **$\operatorname{Gal}(L/K)$**

# See Also
# Meta
## References and Citations

The raw_text_with_html_tags_from_markdownfile function processes the MarkdownFile much in the same way as the process_standard_information_note function, except it 1. preserves HTML tags, and 2. replaces text surrounded by double asterisks ** with HTML tags signifying whether the text displays a definition or a notation.

In the below example, note that the vault parameter is set to None; this is fine here because the process_standard_information_note function only needs a vault argument when embedded links need to be replaced with text (via the MarkdownFile.replace_embedded_links_with_text function), and mf has no embedded links.

print(raw_text_with_html_tags_from_markdownfile(mf, None))
Let $L/K$ be a separable and normal finite field extension. Its <b definition="Galois group of a separable and normal finite field extension">Galois group</b> <span notation="">$\operatorname{Gal}(L/K)$</span> is...

In fact, the notion of a Galois group can be defined for profinite field extensions. Given a separable and normal profinite field extension $L/K$, say that
$L = \varinjlim_i L_i$ where $L_i/K$ are finite extensions. Its <b definition="">Galois group</b> <span notation="">$\operatorname{Gal}(L/K)$</span>

source

html_data_from_note

 html_data_from_note (note_or_mf:Union[trouver.markdown.obsidian.vault.Vau
                      ltNote,trouver.markdown.markdown.file.MarkdownFile],
                      vault:Optional[os.PathLike]=None,
                      note_name:Optional[str]=None)

*Obtain html data for token classification from the information note.

Currently, the token types mainly revolve around definitions and notations.

If note has the tag _auto/def_and_notat_identified, then the data in the note is assumed to be auto-generated and not verified and None is returned.

Returns - Union[dict, None] - The key-value pairs are:

- "Note name" - The name of the note.
- "Raw text" - The raw text to include in the data.
- "Tag data" - The list of HTML tags carrying definition/notation data and their locations in the raw text; see the second output of the function remove_html_tags_in_text. Each element of the list is a tuple consisting of a bs4.element.Tag and two ints.*

Type Default Details
note_or_mf Union Either a VaultNote object for a note or a MarkdownFile object from which to extract html data.
vault Optional None The vault to use when processing the MarkdownFile object (if note_or_mf is a VaultNote, then this MarkdownFile object is created from the text of the note), cf. the process_standard_information_note function.
note_name Optional None If note_or_mf is a MarkdownFile, note_name should be the name of the note from which the MarkdownFile comes, if applicable. If note_or_mf is a VaultNote object, then note_name is ignored and note_or_mf.name is used instead.
Returns Optional The keys to the dict are "Note name", "Raw text", and "Tag data". However, None is returned if note does not exist or the note is marked as containing auto-generated, unverified data.

In the following example, we mock a VaultNote whose content is that of mf in the example for the raw_text_with_html_tags_from_markdownfile function. Note that within mf there is some text surrounded by double asterisks ** and some text surrounded by HTML tags, both indicating definitions and notations introduced.

mf = MarkdownFile.from_string(
    r"""---
aliases: []
tags: []
---
# Galois group of a separable and normal finite field extension

Let $L/K$ be a separable and normal finite field extension. Its <b definition="Galois group of a separable and normal finite field extension">Galois group</b> <span notation="">$\operatorname{Gal}(L/K)$</span> is...

# Galois group of a separable and normal profinite field extension

In fact, the notion of a Galois group can be defined for profinite field extensions. Given a separable and normal profinite field extension $L/K$, say that
$L = \varinjlim_i L_i$ where $L_i/K$ are finite extensions. Its **Galois group** **$\operatorname{Gal}(L/K)$**

# See Also
# Meta
## References and Citations
""")

with (mock.patch('__main__.VaultNote') as mock_VaultNote,
      mock.patch('__main__.MarkdownFile.from_vault_note') as mock_from_vault_note,
      mock.patch('__main__.isinstance') as mock_isinstance):
    mock_VaultNote.exists.return_value = True
    mock_VaultNote.name = "Note's name"
    mock_from_vault_note.return_value = mf
    mock_isinstance.return_value = True

    print(f"The following is the text from mf:\n\n{str(mf)}")

    html_data = html_data_from_note(mock_VaultNote, None)
    print(html_data)

    test_eq(html_data['Note name'], "Note's name")
    assert '**' not in html_data['Raw text']
    assert '<' not in html_data['Raw text']  # Test the lack of HTML tags in the raw text

    print(html_data['Tag data'])
    test_eq(len(html_data['Tag data']), 4)
    assert isinstance(html_data['Tag data'][0][0], bs4.element.Tag)
    assert html_data['Tag data'][0][0].has_attr('definition')
    assert not html_data['Tag data'][0][0].has_attr('notation')
    assert html_data['Tag data'][1][0].has_attr('notation')
    assert not html_data['Tag data'][1][0].has_attr('definition')
    assert html_data['Tag data'][2][0].has_attr('definition')
    assert not html_data['Tag data'][2][0].has_attr('notation')
    assert html_data['Tag data'][3][0].has_attr('notation')
    assert not html_data['Tag data'][3][0].has_attr('definition')
The following is the text from mf:

---
aliases: []
tags: []
---
# Galois group of a separable and normal finite field extension

Let $L/K$ be a separable and normal finite field extension. Its <b definition="Galois group of a separable and normal finite field extension">Galois group</b> <span notation="">$\operatorname{Gal}(L/K)$</span> is...

# Galois group of a separable and normal profinite field extension

In fact, the notion of a Galois group can be defined for profinite field extensions. Given a separable and normal profinite field extension $L/K$, say that
$L = \varinjlim_i L_i$ where $L_i/K$ are finite extensions. Its **Galois group** **$\operatorname{Gal}(L/K)$**

# See Also
# Meta
## References and Citations
{'Note name': "Note's name", 'Raw text': 'Let $L/K$ be a separable and normal finite field extension. Its Galois group $\\operatorname{Gal}(L/K)$ is...\n\nIn fact, the notion of a Galois group can be defined for profinite field extensions. Given a separable and normal profinite field extension $L/K$, say that\n$L = \\varinjlim_i L_i$ where $L_i/K$ are finite extensions. Its Galois group $\\operatorname{Gal}(L/K)$\n', 'Tag data': [(<b definition="Galois group of a separable and normal finite field extension">Galois group</b>, 64, 76), (<span notation="">$\operatorname{Gal}(L/K)$</span>, 77, 102), (<b definition="">Galois group</b>, 330, 342), (<span notation="">$\operatorname{Gal}(L/K)$</span>, 343, 368)]}
[(<b definition="Galois group of a separable and normal finite field extension">Galois group</b>, 64, 76), (<span notation="">$\operatorname{Gal}(L/K)$</span>, 77, 102), (<b definition="">Galois group</b>, 330, 342), (<span notation="">$\operatorname{Gal}(L/K)$</span>, 343, 368)]

We can also just pass a MarkdownFile object instead of a VaultNote object. In this case, we can specify the note_name parameter to indicate which note the MarkdownFile object came from, if applicable.

html_data = html_data_from_note(mf, vault=None, note_name="Note's name")
print(html_data)

test_eq(html_data['Note name'], "Note's name")
assert '**' not in html_data['Raw text']
assert '<' not in html_data['Raw text']  # Test the lack of HTML tags in the raw text

print(html_data['Tag data'])
test_eq(len(html_data['Tag data']), 4)
assert isinstance(html_data['Tag data'][0][0], bs4.element.Tag)
assert html_data['Tag data'][0][0].has_attr('definition')
assert not html_data['Tag data'][0][0].has_attr('notation')
assert html_data['Tag data'][1][0].has_attr('notation')
assert not html_data['Tag data'][1][0].has_attr('definition')
assert html_data['Tag data'][2][0].has_attr('definition')
assert not html_data['Tag data'][2][0].has_attr('notation')
assert html_data['Tag data'][3][0].has_attr('notation')
assert not html_data['Tag data'][3][0].has_attr('definition')
{'Note name': "Note's name", 'Raw text': 'Let $L/K$ be a separable and normal finite field extension. Its Galois group $\\operatorname{Gal}(L/K)$ is...\n\nIn fact, the notion of a Galois group can be defined for profinite field extensions. Given a separable and normal profinite field extension $L/K$, say that\n$L = \\varinjlim_i L_i$ where $L_i/K$ are finite extensions. Its Galois group $\\operatorname{Gal}(L/K)$\n', 'Tag data': [(<b definition="Galois group of a separable and normal finite field extension">Galois group</b>, 64, 76), (<span notation="">$\operatorname{Gal}(L/K)$</span>, 77, 102), (<b definition="">Galois group</b>, 330, 342), (<span notation="">$\operatorname{Gal}(L/K)$</span>, 343, 368)]}
[(<b definition="Galois group of a separable and normal finite field extension">Galois group</b>, 64, 76), (<span notation="">$\operatorname{Gal}(L/K)$</span>, 77, 102), (<b definition="">Galois group</b>, 330, 342), (<span notation="">$\operatorname{Gal}(L/K)$</span>, 343, 368)]

If we do not specify note_name, then None is used for the 'Note name' key in the output:

html_data = html_data_from_note(mf, vault=None, note_name=None)
print(html_data)

assert html_data['Note name'] is None
{'Note name': None, 'Raw text': 'Let $L/K$ be a separable and normal finite field extension. Its Galois group $\\operatorname{Gal}(L/K)$ is...\n\nIn fact, the notion of a Galois group can be defined for profinite field extensions. Given a separable and normal profinite field extension $L/K$, say that\n$L = \\varinjlim_i L_i$ where $L_i/K$ are finite extensions. Its Galois group $\\operatorname{Gal}(L/K)$\n', 'Tag data': [(<b definition="Galois group of a separable and normal finite field extension">Galois group</b>, 64, 76), (<span notation="">$\operatorname{Gal}(L/K)$</span>, 77, 102), (<b definition="">Galois group</b>, 330, 342), (<span notation="">$\operatorname{Gal}(L/K)$</span>, 343, 368)]}

For the following example, the note has an HTML tag already with extra data (attributes other than 'definition' or 'notation'). We assert that the extra data is preserved.

with (mock.patch('__main__.VaultNote') as mock_VaultNote,
      mock.patch('__main__.MarkdownFile.from_vault_note') as mock_from_vault_note,
      mock.patch('__main__.isinstance') as mock_isinstance):
    mock_VaultNote.exists.return_value = True
    mock_VaultNote.name = "Note's name"
    mock_isinstance.return_value = True

    text = 'Let $X$ be a topological space and let $U \subseteq X$ be an subspace. The <b definition="Closure of a subspace of a topological space" typo="dosure of $U$">closure of $U$</b> is defined as...'
    mf = MarkdownFile.from_string(text)
    mock_from_vault_note.return_value = mf
    print(f"The following is the text of the mocked note: \n\n {text}\n\n")

    html_data = html_data_from_note(mock_VaultNote, None)
    print(html_data)
    assert html_data['Tag data'][0][0].has_attr('typo')
    test_eq(html_data['Tag data'][0][0].attrs['typo'], 'dosure of $U$')
The following is the text of the mocked note: 

 Let $X$ be a topological space and let $U \subseteq X$ be an subspace. The <b definition="Closure of a subspace of a topological space" typo="dosure of $U$">closure of $U$</b> is defined as...


{'Note name': "Note's name", 'Raw text': 'Let $X$ be a topological space and let $U \\subseteq X$ be an subspace. The closure of $U$ is defined as...', 'Tag data': [(<b definition="Closure of a subspace of a topological space" typo="dosure of $U$">closure of $U$</b>, 75, 89)]}

In the following example, the (mocked) note has the #_auto/def_and_notat_identified tag to indicate that its definition and notation markings were auto-generated by a model (trained with data processed by the tokenize_html_data function) using the auto_mark_def_and_notats function. In this case, the html_data_from_note function returns None to prevent gathering data that is unverified and auto-generated by a model.

text = r'''---
tags: [_auto/def_and_notat_identified]
---
Let $X$ be a topological space and let $U \subseteq X$ be an subspace. The <b definition="Closure of a subspace of a topological space" typo="dosure of $U$">closure of $U$</b> is defined as...'''

mf = MarkdownFile.from_string(text)
print(f"The following is the text of the mocked note: \n\n{text}\n\n")

html_data = html_data_from_note(note_or_mf=mf)
assert html_data is None
The following is the text of the mocked note: 

---
tags: [_auto/def_and_notat_identified]
---
Let $X$ be a topological space and let $U \subseteq X$ be an subspace. The <b definition="Closure of a subspace of a topological space" typo="dosure of $U$">closure of $U$</b> is defined as...


source

def_or_notat_from_html_tag

 def_or_notat_from_html_tag (tag:bs4.element.Tag)

*Can be passed as the ner_tag_from_html_tag argument in tokenize_html_data for the purposes of compiling a dataset for definition and notation identification.

The strings f”I-{output}” and f”B-{output}” are valid ner_tags.*


source

tokenize_html_data

 tokenize_html_data (html_locus:dict, tokenizer:Union[transformers.tokeniz
                     ation_utils.PreTrainedTokenizer,transformers.tokeniza
                     tion_utils_fast.PreTrainedTokenizerFast],
                     max_length:int, ner_tag_from_html_tag:callable,
                     label2id:dict[str,int], default_label:str='O')

*Actually tokenize the html data outputted by html_data_from_note.

To account for the possibility that the raw text is long, this function uses the tokenizer.batch_encode_plus function to tokenize the text into sequences.*

Type Default Details
html_locus dict An output of html_data_from_note
tokenizer Union
max_length int Max length for each sequence of tokens
ner_tag_from_html_tag callable takes in a bs4.element.Tag and outputs the ner_tag (as a string or None)
label2id dict The keys are ner_tags of the form f”I-{output}” or f”B-{output}”, where output is an output of ner_tag_from_html_tag.
default_label str O The default label for the NER tagging.
Returns tuple The first list consists of the tokens and the second list consists of the named entity recognition tags.
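The core labeling scheme can be shown in a stdlib-only miniature, where a whitespace tokenizer stands in for the real subword tokenizer: a token whose character span starts inside a tagged region gets a “B-” label if it begins the region and an “I-” label otherwise, and the default “O” label elsewhere. This is a sketch of the BIO scheme, not trouver's implementation:

```python
def bio_labels(text, spans, label2id, default="O"):
    """Assign BIO ner_tag ids to the whitespace tokens of `text`.

    `spans` is a list of (entity, start, end) character spans."""
    tokens, labels = [], []
    pos = 0
    for token in text.split():
        start = text.index(token, pos)  # character offset of this token
        end = start + len(token)
        pos = end
        label = default
        for entity, s, e in spans:
            if s <= start < e:
                # "B-" for the token that begins the span, "I-" otherwise
                label = ("B-" if start == s else "I-") + entity
                break
        tokens.append(token)
        labels.append(label2id[label])
    return tokens, labels

label2id = {"O": 0, "B-definition": 1, "I-definition": 2,
            "B-notation": 3, "I-notation": 4}
text = "Its Galois group $G$ is..."
spans = [("definition", 4, 16), ("notation", 17, 20)]
print(bio_labels(text, spans, label2id))
# (['Its', 'Galois', 'group', '$G$', 'is...'], [0, 1, 2, 3, 0])
```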

We continue with an example using the HTML data from the example for the html_data_from_note function.

mf = MarkdownFile.from_string(
    r"""---
aliases: []
tags: []
---
# Galois group of a separable and normal finite field extension

Let $L/K$ be a separable and normal finite field extension. Its <b definition="Galois group of a separable and normal finite field extension">Galois group</b> <span notation="">$\operatorname{Gal}(L/K)$</span> is...

# Galois group of a separable and normal profinite field extension

In fact, the notion of a Galois group can be defined for profinite field extensions. Given a separable and normal profinite field extension $L/K$, say that
$L = \varinjlim_i L_i$ where $L_i/K$ are finite extensions. Its **Galois group** **$\operatorname{Gal}(L/K)$**

# See Also
# Meta
## References and Citations
""")

html_data = html_data_from_note(mf, vault=None, note_name=None)
print(html_data)

assert html_data['Note name'] is None
{'Note name': None, 'Raw text': 'Let $L/K$ be a separable and normal finite field extension. Its Galois group $\\operatorname{Gal}(L/K)$ is...\n\nIn fact, the notion of a Galois group can be defined for profinite field extensions. Given a separable and normal profinite field extension $L/K$, say that\n$L = \\varinjlim_i L_i$ where $L_i/K$ are finite extensions. Its Galois group $\\operatorname{Gal}(L/K)$\n', 'Tag data': [(<b definition="Galois group of a separable and normal finite field extension">Galois group</b>, 64, 76), (<span notation="">$\operatorname{Gal}(L/K)$</span>, 77, 102), (<b definition="">Galois group</b>, 330, 342), (<span notation="">$\operatorname{Gal}(L/K)$</span>, 343, 368)]}
html_data["Tag data"]
[(<b definition="Galois group of a separable and normal finite field extension">Galois group</b>,
  64,
  76),
 (<span notation="">$\operatorname{Gal}(L/K)$</span>, 77, 102),
 (<b definition="">Galois group</b>, 330, 342),
 (<span notation="">$\operatorname{Gal}(L/K)$</span>, 343, 368)]
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
label2id = {
    "O": 0,
    "B-definition": 1,
    "I-definition": 2,
    "B-notation": 3,
    "I-notation": 4
}
tokens, ner_tag_ids = tokenize_html_data(html_data, tokenizer, 510, def_or_notat_from_html_tag, label2id)

For this example, max_length is set to 510 (tokens). The raw text is not very long, so only one sequence should be present.

test_eq(len(tokens), 1)
test_eq(len(ner_tag_ids), 1)

Now let us see what has been tagged:

id2label = {value: key for key, value in label2id.items()}
id2label
{0: 'O',
 1: 'B-definition',
 2: 'I-definition',
 3: 'B-notation',
 4: 'I-notation'}
for token, ner_tag in zip(tokens[0], ner_tag_ids[0]):
    if ner_tag != 0:
        print(f"{token}\t\t{id2label[ner_tag]}")
gal     B-definition
##ois       I-definition
group       I-definition
$       B-notation
\       I-notation
operator        I-notation
##name      I-notation
{       I-notation
gal     I-notation
}       I-notation
(       I-notation
l       I-notation
/       I-notation
k       I-notation
)       I-notation
$       I-notation
gal     B-definition
##ois       I-definition
group       I-definition
$       B-notation
\       I-notation
operator        I-notation
##name      I-notation
{       I-notation
gal     I-notation
}       I-notation
(       I-notation
l       I-notation
/       I-notation
k       I-notation
)       I-notation
$       I-notation

Let us set max_length shorter to observe an example of a single text tokenized across multiple sequences (of course, in practice, the max token length would be set longer, say around 512 or 1024):

token_ids, ner_tag_ids = tokenize_html_data(html_data, tokenizer, 20, def_or_notat_from_html_tag, label2id)
print(len(token_ids))
print(len(ner_tag_ids))
7
7
ner_tag_ids
[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2],
 [2, 2, 2, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 3, 4, 4],
 [4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 0]]
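The splitting into fixed-length sequences seen above can be sketched with a simple chunking helper (illustrative only; the real function delegates to tokenizer.batch_encode_plus):

```python
def chunk(seq, max_length):
    """Split a flat list into consecutive pieces of at most max_length items."""
    return [seq[i:i + max_length] for i in range(0, len(seq), max_length)]

tags = [0] * 125 + [1, 2, 2, 3, 4, 4]  # 131 tags in total
seqs = chunk(tags, 20)
print(len(seqs))       # 7 sequences, as in the output above
print(len(seqs[-1]))   # the last sequence holds the remaining 11 tags
```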

Gathering data

The following is sample code to gather data for definition/notation identification:

# TODO: test

notes = [] # Replace with actual notes
vault = '' # Replace with actual vault

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
max_length = 1022
label2id = {
    "O": 0,
    "B-definition": 1,
    "I-definition": 2,
    "B-notation": 3,
    "I-notation": 4
}
id2label = {value: key for key, value in label2id.items()}

html_data = [html_data_from_note(note, vault) for note in notes]
tokenized_html_data = [
    tokenize_html_data(html_locus, tokenizer, max_length, def_or_notat_from_html_tag, label2id)
    for html_locus in html_data]

note_names, token_seqs, ner_tag_seqs = [], [], []
for html_locus, (token_ids, ner_tag_ids) in zip(html_data, tokenized_html_data):
    note_names.extend([html_locus["Note name"]] * len(token_ids))
    token_seqs.extend(token_ids)
    ner_tag_seqs.extend(ner_tag_ids)
# ner_tags = ClassLabel(names=list(label2id))

# ds = Dataset.from_dict(
#         {"note_name": note_names,
#         "tokens": token_seqs,
#         "ner_tags": ner_tag_seqs},
#         features=Features(
#             {
#              "note_name": Value(dtype='string'),
#              "tokens": Sequence(Value(dtype='string')),
#              "ner_tags": Sequence(ner_tags)}
#         ))

# ds.save_to_disk(".")

# ds = Dataset.load_from_disk(".")

Use the trained model

See https://huggingface.co/docs/transformers/tasks/token_classification for training a token classification model.
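One detail worth noting when following that guide: subword tokenizers split words into multiple tokens, and the standard recipe labels only the first subword of each word, masking the rest (and the special tokens) with -100 so that the loss ignores them. A stdlib-only sketch of that alignment, assuming a word_ids list like the one transformers' fast tokenizers return:

```python
def align_labels(word_labels, word_ids):
    """Expand word-level labels to token-level labels, masking special
    tokens and non-initial subwords with -100 (ignored by the loss)."""
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None:            # special token such as [CLS] or [SEP]
            aligned.append(-100)
        elif word_id != previous:      # first subword of a word: keep its label
            aligned.append(word_labels[word_id])
        else:                          # later subword of the same word: mask it
            aligned.append(-100)
        previous = word_id
    return aligned

# "Galois" splits into "gal", "##ois"; word labels: B-definition=1, I-definition=2
word_ids = [None, 0, 0, 1, None]       # [CLS] gal ##ois group [SEP]
print(align_labels([1, 2], word_ids))  # [-100, 1, -100, 2, -100]
```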

# Helper functions
soup = bs4.BeautifulSoup('', 'html.parser')
tag = soup.new_tag('b', style="border-width:1px;border-style:solid;padding:3px", definition="")
tag.string = 'hi'
tag
<b definition="" style="border-width:1px;border-style:solid;padding:3px">hi</b>

source

def_and_notat_preds_by_model

 def_and_notat_preds_by_model (text:str, pipeline)

*Predict where definitions and notations occur in text

This function uses some of the same helper functions as auto_mark_def_and_notats, but does not raise warning messages as in auto_mark_def_and_notats.*

Type Details
text str
pipeline The pipeline object created using the token classification model and its tokenizer
Returns list Each tuple consists of an HTML tag carrying the data of the prediction and two ints marking where in text the definition or notation occurs.
# TODO: test
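The consolidation idea can be sketched as follows, assuming the pipeline returns per-token dicts with 'entity', 'start', and 'end' keys (as Hugging Face token-classification pipelines do without aggregation); consecutive B-/I- tokens of one entity type are merged into a single character span. This is illustrative, not the actual helper:

```python
def merge_preds(preds):
    """Merge per-token BIO predictions into (entity, start, end) spans."""
    spans = []
    for p in preds:
        prefix, entity = p["entity"].split("-", 1)
        if prefix == "B" or not spans or spans[-1][0] != entity:
            spans.append([entity, p["start"], p["end"]])  # open a new span
        else:
            spans[-1][2] = p["end"]                       # extend current span
    return [tuple(span) for span in spans]

preds = [
    {"entity": "B-definition", "start": 4, "end": 7},    # "gal"
    {"entity": "I-definition", "start": 7, "end": 10},   # "##ois"
    {"entity": "I-definition", "start": 11, "end": 16},  # "group"
    {"entity": "B-notation", "start": 17, "end": 18},    # "$"
    {"entity": "I-notation", "start": 18, "end": 20},    # rest of "$G$"
]
print(merge_preds(preds))  # [('definition', 4, 16), ('notation', 17, 20)]
```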

source

auto_mark_def_and_notats

 auto_mark_def_and_notats (note:trouver.markdown.obsidian.vault.VaultNote,
                           pipeline:transformers.pipelines.token_classific
                           ation.TokenClassificationPipeline,
                           excessive_space_threshold:int=2, add_boxing_att
                           r_to_existing_def_and_notat_markings:bool=True)

*Predict and mark where definitions and notation occur in a note using a token classification ML model.

Assumes that the note is a standard information note that does not have a lot of “user modifications”, such as footnotes, links, and HTML tags. If there are many modifications, then these might be deleted.

Assumes that the paragraphs in the text of the note are “not too long”. Currently, this means that the number of tokens in each paragraph of the note’s text should (roughly) not exceed pipeline.tokenizer.model_max_length.

Existing markings for definition and notation data (i.e. by surrounding with double asterisks or by HTML tags) are preserved (and turned into HTML tags), unless the markings overlap with predictions, in which case the original is preserved (and still turned into an HTML tag if possible)

Since the model can make “invalid” predictions (mostly those which start or end within a LaTeX math mode str), the actual markings are not necessarily direct translates from the model’s predictions. See the helper function _consolidate_token_preds for more details on how this is implemented.

Raises Warning messages (UserWarning) are printed in the following situations:

  • There are two consecutive tokens within the pipeline’s predictions of different entity types (e.g. one is predicted to belong within a definition and the other within a notation), but the latter token’s predicted 'entity' more specifically begins with 'I-' (i.e. is 'I-definition' or 'I-notation') as opposed to 'B-'.
    • note’s name, and path are included in the warning message in this case.
  • There are two consecutive tokens within the pipeline’s predictions which the pipeline predicts to belong to the same entity, and yet there is excessive space (specified by excessive_space_threshold) between the end of the first token and the start of the second.*
Type Default Details
note VaultNote The standard information note in which to find the definitions and notations.
pipeline TokenClassificationPipeline The token classification pipeline that is used to predict whether tokens are part of definitions or notations introduced in the text.
excessive_space_threshold int 2 The threshold number of space characters between consecutive same-entity tokens beyond which a warning message is printed.
add_boxing_attr_to_existing_def_and_notat_markings bool True If True, then nice attributes are added to the existing notation HTML tags, if not already present.
Returns None

In the following examples, we mock pipeline objects instead of using actual ones.

In the below example, we run the auto_mark_def_and_notats function on a note that has double asterisks ** surrounding parts of the text that introduced definitions or notations. In these cases, appropriate HTML tags replace the double asterisks instead.

with (tempfile.TemporaryDirectory(prefix='temp_dir', dir=os.getcwd()) as temp_dir,
      mock.patch('__main__.pipelines.token_classification.TokenClassificationPipeline') as mock_pipeline):
    temp_vault = Path(temp_dir) / 'test_vault_6'
    shutil.copytree(_test_directory() / 'test_vault_6', temp_vault)

    mock_pipeline.tokenizer.model_max_length = 512

    vn = VaultNote(temp_vault, name='reference_with_tag_labels_Definition 2')
    print("Text before:\n\n")
    print(vn.text())
    print("\n\n\nText after:\n")
    auto_mark_def_and_notats(vn, mock_pipeline)
    print(vn.text())
    mf = MarkdownFile.from_vault_note(vn)
    assert mf.has_tag('_auto/def_and_notat_identified')
Text before:


---
cssclass: clean-embeds
aliases: []
tags: [_meta/literature_note, _meta/definition, _meta/notation]
---
# Ring of integers modulo $n$[^1]

Let $n \geq 1$ be an integer. The **ring of integers modulo $n$**, denoted by **$\mathbb{Z}/n\mathbb{Z}$**, is, informally, the ring whose elements are represented by the integers with the understanding that $0$ and $n$ are equal.

More precisely, $\mathbb{Z}/n\mathbb{Z}$ has the elements $0,1,\ldots,n-1$.

...


# See Also
- [[reference_with_tag_labels_Exercise 1|reference_with_tag_labels_Z_nZ_is_a_ring]]
# Meta
## References

## Citations and Footnotes
[^1]: Kim, Definition 2



Text after:

---
cssclass: clean-embeds
aliases: []
tags: [_meta/notation, _auto/def_and_notat_identified, _meta/literature_note, _meta/definition]
---
# Ring of integers modulo $n$[^1]

Let $n \geq 1$ be an integer. The <b definition="" style="border-width:1px;border-style:solid;padding:3px">ring of integers modulo $n$</b>, denoted by <span notation="" style="border-width:1px;border-style:solid;padding:3px">$\mathbb{Z}/n\mathbb{Z}$</span>, is, informally, the ring whose elements are represented by the integers with the understanding that $0$ and $n$ are equal.

More precisely, $\mathbb{Z}/n\mathbb{Z}$ has the elements $0,1,\ldots,n-1$.

...


# See Also
- [[reference_with_tag_labels_Exercise 1|reference_with_tag_labels_Z_nZ_is_a_ring]]
# Meta
## References

## Citations and Footnotes
[^1]: Kim, Definition 2
# TODO: more examples with pipeline mocking actual outputs