from unittest import mock
import shutil
import tempfile
from fastcore.test import *
from pathvalidate import validate_filename
from torch import Tensor
from trouver.helper.tests import _test_directory
markdown.obisidian.personal.machine_learning.notation_identification
Get notation data
Given information notes with notations marked with double asterisks **, we extract the data marked by these double asterisks and organize it for machine learning.
Ultimately, we would like to have an ML model that can find the locations where notations are newly introduced in a note. The approach here is to train a categorization model which takes as input a text with a single double asterisk pair surrounding a LaTeX math mode string and outputs whether the LaTeX math mode string contains a notation. We then use the categorization model to check the LaTeX math mode strings in a note one by one and thereby find all of those that contain notations.
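The one-by-one approach described above can be sketched as follows. This is only an illustration under stated assumptions: `find_notations`, `MATH_MODE`, and `toy_classifier` are hypothetical names that do not come from the library, the regular expression is a simplification that ignores escaped dollar signs, and the toy rule merely stands in for a trained categorization model.

```python
import re
from typing import Callable

# Simplified pattern for LaTeX math mode strings: $$...$$ or $...$.
# Assumption: escaped dollar signs (\$) do not occur in the text.
MATH_MODE = re.compile(r"\$\$.+?\$\$|\$.+?\$", re.DOTALL)


def find_notations(text: str, classify: Callable[[str], bool]) -> list[str]:
    """Return the math mode strings that `classify` flags as notations."""
    found = []
    for match in MATH_MODE.finditer(text):
        start, end = match.span()
        # Build one candidate: the text with a single double asterisk pair
        # around exactly this math mode string.
        candidate = f"{text[:start]}**{text[start:end]}**{text[end:]}"
        if classify(candidate):
            found.append(match.group())
    return found


def toy_classifier(candidate: str) -> bool:
    """Hypothetical stand-in for a trained model (illustration only)."""
    before, _marked, _after = candidate.split("**")
    return before.endswith("denoted by ")


text = (r"Let $n \geq 1$. The ring of integers modulo $n$, "
        r"denoted by $\mathbb{Z}/n\mathbb{Z}$, is a ring.")
print(find_notations(text, toy_classifier))
```

Each math mode string yields its own candidate text, so the classifier only ever sees one double asterisk pair at a time.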
add_one_double_asts_to_line
add_one_double_asts_to_line (line:str, start:int, end:int)
*Return line with exactly one pair of double asterisks ** added around a substring.
Used in _definition_data_from_line*
| | Type | Details |
|---|---|---|
| line | str | The text to which to add the double asterisks ** |
| start | int | The first double asterisks are added in between line[start-1] and line[start]. |
| end | int | The second double asterisks are added in between line[end-1] and line[end]. |
| Returns | str | The str obtained from line by surrounding the substring line[start:end] with double asterisks. |
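The behavior documented in the table can be reproduced with plain string slicing. The following is a minimal sketch, not the library's implementation; `add_one_pair` is a hypothetical name chosen for illustration.

```python
def add_one_pair(line: str, start: int, end: int) -> str:
    """Surround line[start:end] with a single pair of double asterisks."""
    # The first pair goes between line[start-1] and line[start],
    # the second between line[end-1] and line[end].
    return f"{line[:start]}**{line[start:end]}**{line[end:]}"


print(add_one_pair("I will add just one double ast pair.", 2, 6))
```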
test_eq(add_one_double_asts_to_line("I will add just one double ast pair.", 2,6), 'I **will** add just one double ast pair.')
notation_data_from_text
notation_data_from_text (with_double_asts:str)
*Extracts data on the locations of notations in a text with double asterisks.
Used in notation_data_from_note
Returns
- tuple[str, list[tuple[int, int, bool]]]
  - The str is the str no_double_asts, which is the same as with_double_asts, except with the double asterisks removed.
  - Each tuple in the list represents a data point for a LaTeX math-mode string in no_double_asts and consists of
    - The indices start, end where the data point considers whether or not the LaTeX math-mode substring no_double_asts[start:end] is surrounded by double asterisks (and hence is supposed to introduce a notation).
    - A bool which is True if the data point represents a str with double asterisks surrounding a notation and False otherwise.*
| | Type | Details |
|---|---|---|
| with_double_asts | str | May or may not have double asterisks to signify definitions and notations |
| Returns | tuple | |
sample_output = notation_data_from_text(
r'**here is a double ast text**. It is not a LaTeX math mode string,'
r'so it will not be included as a data point.'
r'On the other hand, **$\operatorname{Gal}(L/K)$** and $\mathbb{Z}/2\mathbb{Z}$'
r'are both included LaTeX math mode strings and are included as data points.'
r'The bool for the former is `True`, whereas the bool for the latter is `False`.')
assert '**' not in sample_output[0]
start, end, is_notation = sample_output[1][0]
test_eq(sample_output[0][start:end], r'$\operatorname{Gal}(L/K)$')
start, end, is_notation = sample_output[1][1]
test_eq(sample_output[0][start:end], r'$\mathbb{Z}/2\mathbb{Z}$')
print(sample_output)
('here is a double ast text. It is not a LaTeX math mode string,so it will not be included as a data point.On the other hand, $\\operatorname{Gal}(L/K)$ and $\\mathbb{Z}/2\\mathbb{Z}$are both included LaTeX math mode strings and are included as data points.The bool for the former is `True`, whereas the bool for the latter is `False`.', [(124, 149, True), (154, 178, False)])
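One way to obtain data of this shape is to strip the double asterisks while remembering where each pair sat, and then to compare those positions against the spans of the math mode strings. The sketch below illustrates that idea under stated assumptions: `notation_data_sketch` and `MATH_MODE` are hypothetical names, the double asterisks are assumed to be balanced, and escaped dollar signs are not handled. It is not the library's implementation.

```python
import re

# Simplified pattern for LaTeX math mode strings: $$...$$ or $...$.
MATH_MODE = re.compile(r"\$\$.+?\$\$|\$.+?\$", re.DOTALL)


def notation_data_sketch(
        with_double_asts: str) -> tuple[str, list[tuple[int, int, bool]]]:
    """Strip double asterisks and report each math mode span with a flag."""
    pieces = with_double_asts.split("**")
    no_double_asts = "".join(pieces)
    # Offsets in the stripped text at which a `**` used to sit, in order.
    positions = []
    offset = 0
    for piece in pieces[:-1]:
        offset += len(piece)
        positions.append(offset)
    # Consecutive offsets form (opening, closing) pairs.
    marked_spans = set(zip(positions[0::2], positions[1::2]))
    data = []
    for match in MATH_MODE.finditer(no_double_asts):
        start, end = match.span()
        # A span counts as marked only if one asterisk pair wraps it exactly.
        data.append((start, end, (start, end) in marked_spans))
    return no_double_asts, data


stripped, data = notation_data_sketch(
    r"**bold text** and **$\operatorname{Gal}(L/K)$** and $\mathbb{Z}/2\mathbb{Z}$.")
for start, end, is_notation in data:
    print(stripped[start:end], is_notation)
```

Bold text that is not itself a math mode string contributes no data point, and a math mode string nested inside longer bold text is reported with a False flag because no asterisk pair wraps it exactly.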
notation_data_from_note
notation_data_from_note (note:trouver.markdown.obsidian.vault.VaultNote, vault:os.PathLike)
*Obtain notation data from a note.
Note that the returned list might not be in any particular order.
Returns
- list[tuple[str, str, bool]]
  - Each tuple consists of
    - The name of note,
    - The processed str of note with only a single double asterisk surrounded LaTeX text. Note that the processed str merges display math mode text into single lines, cf. process_standard_information_note.
    - A bool that is True if the LaTeX text contains notation.*
We first set up an example:
test_vault = _test_directory() / 'test_vault_6'
vn = VaultNote(test_vault, name='reference_with_tag_labels_Definition 2')
print(vn.text())
---
cssclass: clean-embeds
aliases: []
tags: [_meta/literature_note, _meta/definition, _meta/notation]
---
# Ring of integers modulo $n$[^1]
Let $n \geq 1$ be an integer. The **ring of integers modulo $n$**, denoted by **$\mathbb{Z}/n\mathbb{Z}$**, is, informally, the ring whose elements are represented by the integers with the understanding that $0$ and $n$ are equal.
More precisely, $\mathbb{Z}/n\mathbb{Z}$ has the elements $0,1,\ldots,n-1$.
...
# See Also
- [[reference_with_tag_labels_Exercise 1|reference_with_tag_labels_Z_nZ_is_a_ring]]
# Meta
## References
## Citations and Footnotes
[^1]: Kim, Definition 2
sample_output = notation_data_from_note(vn, test_vault)
total_count_for_is_notation = 0
for name, with_one_double_asts, is_notation in sample_output:
    test_eq(name, vn.name)
    test_eq(with_one_double_asts.count('**'), 2)
    if is_notation:
        total_count_for_is_notation += 1
test_eq(total_count_for_is_notation, 1)
sample_output
C:\Users\hyunj\Documents\Development\Python\trouver\trouver\helper\html.py:81: MarkupResemblesLocatorWarning: The input looks more like a filename than markup. You may want to open this file and pass the filehandle into Beautiful Soup.
  parsed_soup = BeautifulSoup(text, 'html.parser')
[('reference_with_tag_labels_Definition 2',
'Let $n \\geq 1$ be an integer. The ring of integers modulo $n$, denoted by **$\\mathbb{Z}/n\\mathbb{Z}$**, is, informally, the ring whose elements are represented by the integers with the understanding that $0$ and $n$ are equal.\n\nMore precisely, $\\mathbb{Z}/n\\mathbb{Z}$ has the elements $0,1,\\ldots,n-1$.\n\n...\n',
True),
('reference_with_tag_labels_Definition 2',
'Let **$n \\geq 1$** be an integer. The ring of integers modulo $n$, denoted by $\\mathbb{Z}/n\\mathbb{Z}$, is, informally, the ring whose elements are represented by the integers with the understanding that $0$ and $n$ are equal.\n\nMore precisely, $\\mathbb{Z}/n\\mathbb{Z}$ has the elements $0,1,\\ldots,n-1$.\n\n...\n',
False),
('reference_with_tag_labels_Definition 2',
'Let $n \\geq 1$ be an integer. The ring of integers modulo **$n$**, denoted by $\\mathbb{Z}/n\\mathbb{Z}$, is, informally, the ring whose elements are represented by the integers with the understanding that $0$ and $n$ are equal.\n\nMore precisely, $\\mathbb{Z}/n\\mathbb{Z}$ has the elements $0,1,\\ldots,n-1$.\n\n...\n',
False),
('reference_with_tag_labels_Definition 2',
'Let $n \\geq 1$ be an integer. The ring of integers modulo $n$, denoted by $\\mathbb{Z}/n\\mathbb{Z}$, is, informally, the ring whose elements are represented by the integers with the understanding that **$0$** and $n$ are equal.\n\nMore precisely, $\\mathbb{Z}/n\\mathbb{Z}$ has the elements $0,1,\\ldots,n-1$.\n\n...\n',
False),
('reference_with_tag_labels_Definition 2',
'Let $n \\geq 1$ be an integer. The ring of integers modulo $n$, denoted by $\\mathbb{Z}/n\\mathbb{Z}$, is, informally, the ring whose elements are represented by the integers with the understanding that $0$ and **$n$** are equal.\n\nMore precisely, $\\mathbb{Z}/n\\mathbb{Z}$ has the elements $0,1,\\ldots,n-1$.\n\n...\n',
False),
('reference_with_tag_labels_Definition 2',
'Let $n \\geq 1$ be an integer. The ring of integers modulo $n$, denoted by $\\mathbb{Z}/n\\mathbb{Z}$, is, informally, the ring whose elements are represented by the integers with the understanding that $0$ and $n$ are equal.\n\nMore precisely, **$\\mathbb{Z}/n\\mathbb{Z}$** has the elements $0,1,\\ldots,n-1$.\n\n...\n',
False),
('reference_with_tag_labels_Definition 2',
'Let $n \\geq 1$ be an integer. The ring of integers modulo $n$, denoted by $\\mathbb{Z}/n\\mathbb{Z}$, is, informally, the ring whose elements are represented by the integers with the understanding that $0$ and $n$ are equal.\n\nMore precisely, $\\mathbb{Z}/n\\mathbb{Z}$ has the elements **$0,1,\\ldots,n-1$**.\n\n...\n',
False)]
Make database of notation data
append_notation_data_to_database
append_notation_data_to_database (vault:os.PathLike, file:os.PathLike, notes:list[trouver.markdown.obsidian.vault.VaultNote], backup:bool=True)
| | Type | Default | Details |
|---|---|---|---|
| vault | PathLike | | The vault from which the data is drawn |
| file | PathLike | | The path to a CSV file |
| notes | list | | The notes to add to the database |
| backup | bool | True | If True, makes a copy of file in the same directory and with the same name, except with an added extension of .bak. |
| Returns | None | | |
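The backup-then-append behavior described in the table can be sketched with the standard library alone. This is an illustration under assumptions, not the library's implementation: `append_rows` and the column names in `FIELDS` are hypothetical, and the rows are taken to have the (name, text, bool) shape that notation_data_from_note returns.

```python
import csv
import shutil
from pathlib import Path

# Hypothetical column names; the actual database layout is not documented here.
FIELDS = ["note_name", "text", "is_notation"]


def append_rows(file: Path, rows: list[tuple[str, str, bool]],
                backup: bool = True) -> None:
    """Append rows to a CSV file, optionally copying it to a .bak file first."""
    file = Path(file)
    exists = file.exists()
    if backup and exists:
        # Same directory, same name, with an added .bak extension.
        shutil.copy2(file, file.with_name(file.name + ".bak"))
    with open(file, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if not exists:
            # Write the header only when the file is first created.
            writer.writerow(FIELDS)
        writer.writerows(rows)
```

Making the copy before opening the file for appending means the .bak file always holds the state of the database prior to the current call.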
# TODO: example
Use ML categorization model to find and mark notations in notes
automatically_mark_notations
automatically_mark_notations (vn:trouver.markdown.obsidian.vault.VaultNote, learn:fastai.text.learner.TextLearner, create_notation_notes:bool=False, reference_name:str='')
*Predict and mark where notations occur in a note, and optionally create a notation note, and add the notation note to the See Also section of the note.
Assumes that no double asterisks are already in the contents of vn.
This function removes links, headings, footnotes, etc. from the original note and merges multi-line display math mode LaTeX text into single lines. Use with caution.*
| | Type | Default | Details |
|---|---|---|---|
| vn | VaultNote | | The information note in which to mark notations. |
| learn | TextLearner | | The ML model which predicts where notation notes should occur. This is a classifier which takes as input a str with double asterisks surrounding LaTeX text. The model outputs whether or not the single double asterisk pair surrounds a LaTeX text with notation. |
| create_notation_notes | bool | False | If True, creates the notation notes for the predicted notations and links them to the ‘See Also’ sections of the information notes. |
| reference_name | str | | The name of the reference that vn belongs to; this is only relevant when create_notation_notes=True so that the created notation notes have file names starting with the reference name. |
| Returns | None | | |
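Once the classifier has flagged some math mode strings, the corresponding spans still have to be marked in the text. A minimal sketch of that marking step is given below, assuming the flagged spans are non-overlapping (start, end) index pairs into the text; `mark_spans` is a hypothetical helper and not part of the library.

```python
def mark_spans(text: str, spans: list[tuple[int, int]]) -> str:
    """Surround each text[start:end] in `spans` with double asterisks."""
    # Work from the rightmost span to the leftmost one, so that inserting
    # asterisks never shifts the indices of spans still to be processed.
    for start, end in sorted(spans, reverse=True):
        text = f"{text[:start]}**{text[start:end]}**{text[end:]}"
    return text


print(mark_spans("a $x$ b $y$ c", [(2, 5), (8, 11)]))
```

Processing the spans in ascending order instead would require adding four characters to every later index after each insertion.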
# TODO: Test
with tempfile.TemporaryDirectory(prefix='tmp_dir_', dir=os.getcwd()) as tmp_dir:
    tmp_dir = Path(tmp_dir)
    temp_vault = tmp_dir / 'test_vault_6'
    shutil.copytree('_tests/test_vault_6', temp_vault)
    note = VaultNote(temp_vault, name='number_theory_reference_1_Definition 15')
    with mock.patch('__main__.TextLearner') as mock_textlearner_class:
        mock_textlearner = mock_textlearner_class.return_value
        mock_textlearner.predict.side_effect = [
            ('False', Tensor([0]), Tensor([1, 0])),
            ('True', Tensor([0]), Tensor([0, 1])),
            ('False', Tensor([0]), Tensor([1, 0])),
            ('False', Tensor([0]), Tensor([1, 0])),
        ]
        automatically_mark_notations(note, mock_textlearner)
        print('The following is the note after the double asterisks are added, '
              'assuming that the ML model predictions are as above:')
        print(note.text())
        assert r'**$\operatorname{Gal}(L/K)$**' in note.text()
The following is the note after the double asterisks are added, assuming that the ML model predictions are as above:
---
cssclass: clean-embeds
aliases: []
tags: [_meta/literature_note, _meta/definition, _meta/notation]
---
# Topic[^1]
%%This is an example file to which `automatcally_mark_notations` will be applied.%%
Let $L/K$ be a Galois field extension. Its Galois group **$\operatorname{Gal}(L/K)$** is defined as the group of automorphisms of $L$ fixing $K$ pointwise.
# See Also
# Meta
## References
## Citations and Footnotes
[^1]: Kim,
# TODO: test 'w' after implementing `overwrite.`
# TODO: test 'a' after implementing `overwrite.`
# TODO: test `None` after implementing `overwrite.`