from unittest import mock
import shutil
import tempfile
from fastcore.test import *
from pathvalidate import validate_filename
from torch import Tensor
from trouver.helper import _test_directory
markdown.obsidian.personal.machine_learning.notation_identification
Get notation data
Given information notes with notations marked with double asterisks **, we extract the data on these double asterisks and organize it for machine learning.
Ultimately, we would like to have an ML model that can find the locations where notations are newly introduced in a note. The approach here is to train a categorization model which takes as input a text with a single double asterisk pair surrounding a LaTeX math mode string and outputs whether the LaTeX math mode string contains a notation. We then apply the categorization model to the LaTeX math mode strings one by one to find all of those containing notations.
add_one_double_asts_to_line
add_one_double_asts_to_line (line:str, start:int, end:int)
Return line with exactly one pair of double asterisks ** surrounding a piece of text.
Used in _definition_data_from_line
| | Type | Details |
|---|---|---|
| line | str | The text to which to add the double asterisks ** |
| start | int | The first double asterisks are added in between line[start-1] and line[start]. |
| end | int | The second double asterisks are added in between line[end-1] and line[end]. |
| Returns | str | The str obtained from line by surrounding the substring line[start:end] with double asterisks. |
test_eq(add_one_double_asts_to_line("I will add just one double ast pair.", 2, 6), 'I **will** add just one double ast pair.')
notation_data_from_text
notation_data_from_text (with_double_asts:str)
Extracts data on the locations of notations in a text with double asterisks.
Used in notation_data_from_note
Returns
- tuple[str, list[tuple[int, int, bool]]]
  - The str is the str no_double_asts, which is the same as with_double_asts, except with the double asterisks removed.
  - Each tuple in the list represents a data point for a LaTeX math-mode string in no_double_asts and consists of
    - The indices start, end, where the data point considers whether or not the LaTeX math-mode substring no_double_asts[start:end] is surrounded by double asterisks (and hence is supposed to introduce a notation).
    - A bool which is True if the data point represents a str with double asterisks surrounding a notation and False otherwise.
| | Type | Details |
|---|---|---|
| with_double_asts | str | May or may not have double asterisks to signify definitions and notations |
| Returns | tuple | |
sample_output = notation_data_from_text(
    r'**here is a double ast text**. It is not a LaTeX math mode string,'
    r'so it will not be included as a data point.'
    r'On the other hand, **$\operatorname{Gal}(L/K)$** and $\mathbb{Z}/2\mathbb{Z}$'
    r'are both included LaTeX math mode strings and are included as data points.'
    r'The bool for the former is `True`, whereas the bool for the latter is `False`.')
assert '**' not in sample_output[0]
start, end, is_notation = sample_output[1][0]
test_eq(sample_output[0][start:end], r'$\operatorname{Gal}(L/K)$')
start, end, is_notation = sample_output[1][1]
test_eq(sample_output[0][start:end], r'$\mathbb{Z}/2\mathbb{Z}$')
print(sample_output)
('here is a double ast text. It is not a LaTeX math mode string,so it will not be included as a data point.On the other hand, $\\operatorname{Gal}(L/K)$ and $\\mathbb{Z}/2\\mathbb{Z}$are both included LaTeX math mode strings and are included as data points.The bool for the former is `True`, whereas the bool for the latter is `False`.', [(124, 149, True), (154, 178, False)])
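The extraction described above can be sketched roughly as follows. This is only an illustration under simplifying assumptions, not the implementation in trouver: the helper name `notation_data_sketch` and the regular expression are hypothetical, math mode is assumed to be delimited by `$...$` or `$$...$$`, and a math string counts as a notation only when a double asterisk pair touches it directly on both sides.

```python
import re

# Hypothetical tokenizer: a double asterisk, display math, or inline math.
TOKEN = re.compile(r"\*\*|\$\$.+?\$\$|\$.+?\$", re.DOTALL)


def notation_data_sketch(with_double_asts: str) -> tuple[str, list[tuple[int, int, bool]]]:
    """Strip double asterisks and record (start, end, is_notation) per math string."""
    tokens = list(TOKEN.finditer(with_double_asts))
    pieces: list[str] = []  # chunks of the text with double asterisks removed
    data: list[tuple[int, int, bool]] = []
    pos = 0      # read position in the original text
    out_len = 0  # length of the stripped text built so far
    for i, tok in enumerate(tokens):
        plain = with_double_asts[pos:tok.start()]
        pieces.append(plain)
        out_len += len(plain)
        pos = tok.end()
        if tok.group() == "**":
            continue  # drop the asterisks themselves
        # The math string is marked only if asterisks touch it on both sides.
        before = i > 0 and tokens[i - 1].group() == "**" and tokens[i - 1].end() == tok.start()
        after = (i + 1 < len(tokens) and tokens[i + 1].group() == "**"
                 and tokens[i + 1].start() == tok.end())
        data.append((out_len, out_len + len(tok.group()), before and after))
        pieces.append(tok.group())
        out_len += len(tok.group())
    pieces.append(with_double_asts[pos:])
    return "".join(pieces), data


stripped, data = notation_data_sketch("Let **$G$** act on $X$. This is **bold**.")
print(stripped)
print(data)
```

Note that indices refer to the text after the asterisks are removed, which is why the sketch tracks the output length separately from the read position.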
notation_data_from_note
notation_data_from_note (note:trouver.markdown.obsidian.vault.VaultNote, vault:os.PathLike)
Obtain notation data from a note.
Note that the lists of str might not be in any particular order.
Returns
- list[tuple[str, str, bool]]
  - Each tuple consists of
    - The name of note,
    - The processed str of note with only a single LaTeX text surrounded by double asterisks. Note that the processed str merges display math mode text into single lines, cf. process_standard_information_note.
    - A bool that is True if the LaTeX text contains notation.
We first set up an example:
test_vault = _test_directory() / 'test_vault_6'
vn = VaultNote(test_vault, name='reference_with_tag_labels_Definition 2')
print(vn.text())
---
cssclass: clean-embeds
aliases: []
tags: [_meta/literature_note, _meta/definition, _meta/notation]
---
# Ring of integers modulo $n$[^1]
Let $n \geq 1$ be an integer. The **ring of integers modulo $n$**, denoted by **$\mathbb{Z}/n\mathbb{Z}$**, is, informally, the ring whose elements are represented by the integers with the understanding that $0$ and $n$ are equal.
More precisely, $\mathbb{Z}/n\mathbb{Z}$ has the elements $0,1,\ldots,n-1$.
...
# See Also
- [[reference_with_tag_labels_Exercise 1|reference_with_tag_labels_Z_nZ_is_a_ring]]
# Meta
## References
## Citations and Footnotes
[^1]: Kim, Definition 2
sample_output = notation_data_from_note(vn, test_vault)
total_count_for_is_notation = 0
for name, with_one_double_asts, is_notation in sample_output:
    test_eq(name, vn.name)
    test_eq(with_one_double_asts.count('**'), 2)
    if is_notation:
        total_count_for_is_notation += 1
test_eq(total_count_for_is_notation, 1)
sample_output
[('reference_with_tag_labels_Definition 2',
'Let $n \\geq 1$ be an integer. The ring of integers modulo $n$, denoted by **$\\mathbb{Z}/n\\mathbb{Z}$**, is, informally, the ring whose elements are represented by the integers with the understanding that $0$ and $n$ are equal.\n\nMore precisely, $\\mathbb{Z}/n\\mathbb{Z}$ has the elements $0,1,\\ldots,n-1$.\n\n...\n',
True),
('reference_with_tag_labels_Definition 2',
'Let **$n \\geq 1$** be an integer. The ring of integers modulo $n$, denoted by $\\mathbb{Z}/n\\mathbb{Z}$, is, informally, the ring whose elements are represented by the integers with the understanding that $0$ and $n$ are equal.\n\nMore precisely, $\\mathbb{Z}/n\\mathbb{Z}$ has the elements $0,1,\\ldots,n-1$.\n\n...\n',
False),
('reference_with_tag_labels_Definition 2',
'Let $n \\geq 1$ be an integer. The ring of integers modulo **$n$**, denoted by $\\mathbb{Z}/n\\mathbb{Z}$, is, informally, the ring whose elements are represented by the integers with the understanding that $0$ and $n$ are equal.\n\nMore precisely, $\\mathbb{Z}/n\\mathbb{Z}$ has the elements $0,1,\\ldots,n-1$.\n\n...\n',
False),
('reference_with_tag_labels_Definition 2',
'Let $n \\geq 1$ be an integer. The ring of integers modulo $n$, denoted by $\\mathbb{Z}/n\\mathbb{Z}$, is, informally, the ring whose elements are represented by the integers with the understanding that **$0$** and $n$ are equal.\n\nMore precisely, $\\mathbb{Z}/n\\mathbb{Z}$ has the elements $0,1,\\ldots,n-1$.\n\n...\n',
False),
('reference_with_tag_labels_Definition 2',
'Let $n \\geq 1$ be an integer. The ring of integers modulo $n$, denoted by $\\mathbb{Z}/n\\mathbb{Z}$, is, informally, the ring whose elements are represented by the integers with the understanding that $0$ and **$n$** are equal.\n\nMore precisely, $\\mathbb{Z}/n\\mathbb{Z}$ has the elements $0,1,\\ldots,n-1$.\n\n...\n',
False),
('reference_with_tag_labels_Definition 2',
'Let $n \\geq 1$ be an integer. The ring of integers modulo $n$, denoted by $\\mathbb{Z}/n\\mathbb{Z}$, is, informally, the ring whose elements are represented by the integers with the understanding that $0$ and $n$ are equal.\n\nMore precisely, **$\\mathbb{Z}/n\\mathbb{Z}$** has the elements $0,1,\\ldots,n-1$.\n\n...\n',
False),
('reference_with_tag_labels_Definition 2',
'Let $n \\geq 1$ be an integer. The ring of integers modulo $n$, denoted by $\\mathbb{Z}/n\\mathbb{Z}$, is, informally, the ring whose elements are represented by the integers with the understanding that $0$ and $n$ are equal.\n\nMore precisely, $\\mathbb{Z}/n\\mathbb{Z}$ has the elements **$0,1,\\ldots,n-1$**.\n\n...\n',
False)]
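The output above has one entry per LaTeX math mode string, each with exactly one double asterisk pair. As a rough sketch of how such entries could be built from span data like that returned by notation_data_from_text, consider the following. The helper name `examples_from_spans` is hypothetical, and the sketch skips the note processing (removing headings, links, and footnotes) that the actual function performs.

```python
def examples_from_spans(
    name: str,
    no_double_asts: str,
    spans: list[tuple[int, int, bool]],
) -> list[tuple[str, str, bool]]:
    """Build one (name, marked_text, is_notation) entry per math string span."""
    examples = []
    for start, end, is_notation in spans:
        # Surround only this one span with double asterisks.
        marked = (f"{no_double_asts[:start]}**{no_double_asts[start:end]}**"
                  f"{no_double_asts[end:]}")
        examples.append((name, marked, is_notation))
    return examples


examples = examples_from_spans(
    "toy_note", "Let $G$ act on $X$.", [(4, 7, True), (15, 18, False)])
for example in examples:
    print(example)
```

Each entry can then serve as one labeled training example for the categorization model.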
Make database of notation data
append_notation_data_to_database
append_notation_data_to_database (vault:os.PathLike, file:os.PathLike, notes:list[trouver.markdown.obsidian.vault.VaultNote], backup:bool=True)
| | Type | Default | Details |
|---|---|---|---|
| vault | PathLike | | The vault from which the data is drawn |
| file | PathLike | | The path to a CSV file |
| notes | list | | The notes to add to the database |
| backup | bool | True | If True, makes a copy of file in the same directory and with the same name, except with an added extension of .bak. |
| Returns | None | | |
# TODO: example
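In the absence of an example, the following sketch illustrates the general pattern of appending rows to a CSV file after making a .bak copy. It is an assumption-laden illustration, not the behavior of append_notation_data_to_database itself: the helper name `append_rows_with_backup` and the column names are hypothetical, and the actual CSV schema is not documented here.

```python
import csv
import shutil
import tempfile
from pathlib import Path


def append_rows_with_backup(file, rows, backup: bool = True) -> None:
    """Append rows to a CSV file, first copying it to `<name>.bak` if it exists."""
    file = Path(file)
    if backup and file.exists():
        # Same directory, same name, with an added `.bak` extension.
        shutil.copy2(file, file.with_name(file.name + ".bak"))
    write_header = not file.exists() or file.stat().st_size == 0
    with open(file, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if write_header:
            # Hypothetical column names, chosen for this sketch only.
            writer.writerow(["note_name", "text", "is_notation"])
        writer.writerows(rows)


with tempfile.TemporaryDirectory() as tmp:
    db = Path(tmp) / "notation_data.csv"
    append_rows_with_backup(db, [("note_1", "Let **$G$** act on $X$.", True)])
    append_rows_with_backup(db, [("note_1", "Let $G$ act on **$X$**.", False)])
    backup_exists = db.with_name(db.name + ".bak").exists()
    with open(db, newline="", encoding="utf-8") as f:
        stored = list(csv.reader(f))
```

The backup is made before any write, so a failed append leaves the previous state recoverable.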
Use ML categorization model to find and mark notations in notes
automatically_mark_notations
automatically_mark_notations (vn:trouver.markdown.obsidian.vault.VaultNote, learn:fastai.text.learner.TextLearner, create_notation_notes:bool=False, reference_name:str='')
Predict and mark where notations occur in a note, and optionally create a notation note and add it to the See Also section of the note.
Assumes that no double asterisks are already in the contents of vn.
This function removes links, headings, footnotes, etc. from the original note and merges multi-line display math mode LaTeX text into single lines. Use with caution.
| | Type | Default | Details |
|---|---|---|---|
| vn | VaultNote | | The information note in which to mark notations. |
| learn | TextLearner | | The ML model which predicts where notations occur. This is a classifier which takes as input a str with a single double asterisk pair surrounding LaTeX text. The model outputs whether or not that pair surrounds a LaTeX text with notation. |
| create_notation_notes | bool | False | If True, creates the notation notes for the predicted notations and links them in the ‘See Also’ sections of the information notes. |
| reference_name | str | '' | The name of the reference that vn belongs to; this is only relevant when create_notation_notes=True so that the created notation notes have file names starting with the reference name. |
| Returns | None | | |
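The one-by-one prediction described above might look roughly like the sketch below. It is illustrative only: `mark_predicted_notations` and `toy_predict` are hypothetical names, math strings are found with a simple regular expression, and the classifier is replaced by a plain callable whose return value has the predicted label as its first entry (the same shape as the mocked predictions in the example that follows).

```python
import re
from typing import Callable

# Hypothetical pattern for display and inline math mode strings.
MATH = re.compile(r"\$\$.+?\$\$|\$.+?\$", re.DOTALL)


def mark_predicted_notations(text: str, predict: Callable[[str], tuple]) -> str:
    """Mark every math string that `predict` classifies as a notation."""
    chosen = []
    for match in MATH.finditer(text):
        start, end = match.span()
        # Candidate input: the text with only this math string marked.
        candidate = f"{text[:start]}**{text[start:end]}**{text[end:]}"
        label = predict(candidate)[0]
        if str(label) == "True":
            chosen.append((start, end))
    # Insert asterisks from right to left so earlier indices stay valid.
    for start, end in reversed(chosen):
        text = f"{text[:start]}**{text[start:end]}**{text[end:]}"
    return text


def toy_predict(candidate: str) -> tuple:
    """Stand-in classifier: only the Galois group counts as a notation."""
    is_notation = r"**$\operatorname{Gal}(L/K)$**" in candidate
    return (str(is_notation), None, None)


note_text = (
    r"Let $L/K$ be a Galois field extension. Its Galois group "
    r"$\operatorname{Gal}(L/K)$ is defined as the group of automorphisms "
    r"of $L$ fixing $K$ pointwise."
)
marked = mark_predicted_notations(note_text, toy_predict)
print(marked)
```

Predictions are made on the unmarked text first and the asterisks are inserted afterwards, so each candidate the classifier sees contains exactly one double asterisk pair.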
# TODO: Test
with tempfile.TemporaryDirectory(prefix='tmp_dir_', dir=os.getcwd()) as tmp_dir:
    tmp_dir = Path(tmp_dir)
    temp_vault = tmp_dir / 'test_vault_6'
    shutil.copytree('_tests/test_vault_6', temp_vault)

    note = VaultNote(temp_vault, name='number_theory_reference_1_Definition 15')

    with mock.patch('__main__.TextLearner') as mock_textlearner_class:
        mock_textlearner = mock_textlearner_class.return_value
        mock_textlearner.predict.side_effect = [
            ('False', Tensor([0]), Tensor([1, 0])),
            ('True', Tensor([0]), Tensor([0, 1])),
            ('False', Tensor([0]), Tensor([1, 0])),
            ('False', Tensor([0]), Tensor([1, 0])),
        ]
        automatically_mark_notations(note, mock_textlearner)
        print('The following is the note after the double asterisks are added, '
              'assuming that the ML model predictions are as above:')
        print(note.text())
        assert r'**$\operatorname{Gal}(L/K)$**' in note.text()
The following is the note after the double asterisks are added, assuming that the ML model predictions are as above:
---
cssclass: clean-embeds
aliases: []
tags: [_meta/literature_note, _meta/definition, _meta/notation]
---
# Topic[^1]
%%This is an example file to which `automatcally_mark_notations` will be applied.%%
Let $L/K$ be a Galois field extension. Its Galois group **$\operatorname{Gal}(L/K)$** is defined as the group of automorphisms of $L$ fixing $K$ pointwise.
# See Also
# Meta
## References
## Citations and Footnotes
[^1]: Kim,
# TODO: test 'w' after implementing `overwrite.`
# TODO: test 'a' after implementing `overwrite.`
# TODO: test `None` after implementing `overwrite.`