markdown.obsidian.personal.machine_learning.definition_identification

Functions for finding definitions

Gather ML data from information notes



definitions_in_text

 definitions_in_text (text:str)

Return the list of str with the definitions in the text.

# TODO: examples
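
Until real examples are written, the following is a minimal sketch of the idea rather than the library's implementation. It assumes, purely for illustration, that a definition is marked in the note text by wrapping it in double asterisks (`**...**`); the helper name `bold_spans_in_text` is hypothetical and not part of `trouver`.

```python
import re

# Hypothetical helper, not part of trouver: assumes definitions are
# marked with double asterisks, e.g. "A **group** is a set ...".
_BOLD = re.compile(r"\*\*(.+?)\*\*", flags=re.DOTALL)


def bold_spans_in_text(text: str) -> list[str]:
    """Return the substrings wrapped in double asterisks, in order."""
    return [match.group(1) for match in _BOLD.finditer(text)]


sample = (
    "A **group** is a set with an associative operation; "
    "an **abelian group** is a commutative one."
)
print(bold_spans_in_text(sample))  # ['group', 'abelian group']
```

The non-greedy `.+?` keeps each match confined to a single marked span, so two definitions on the same line are returned separately.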


definition_identification_data_from_note

 definition_identification_data_from_note (note:trouver.markdown.obsidian.vault.VaultNote,
                                           vault:os.PathLike)

Obtain definition identification data from the information note.

| | Type | Details |
|---|---|---|
| note | VaultNote | |
| vault | PathLike | |
| Returns | typing.Optional[dict[str, str]] | The keys to the dict are "Note name", "Raw text", "Definitions". However, None is returned if note does not exist. |

# TODO: examples
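
As a rough illustration of the return contract (a dict with the three keys, or `None` when the note is missing), here is a sketch that works on a plain file path instead of a `VaultNote`. The function name `record_from_file`, the use of the file stem as the note name, the double-asterisk marking, and storing the definitions as a newline-joined string are all assumptions made for the example; the real function may encode these differently.

```python
import re
import tempfile
from pathlib import Path
from typing import Optional


def record_from_file(path: Path) -> Optional[dict[str, str]]:
    """Hypothetical stand-in for the note-based function."""
    if not path.exists():
        return None  # mirrors "None is returned if note does not exist"
    raw_text = path.read_text(encoding="utf-8")
    # Assumption: definitions are the double-asterisk spans in the text.
    definitions = re.findall(r"\*\*(.+?)\*\*", raw_text)
    return {
        "Note name": path.stem,
        "Raw text": raw_text,
        "Definitions": "\n".join(definitions),
    }


with tempfile.TemporaryDirectory() as tmp:
    note_path = Path(tmp) / "groups.md"
    note_path.write_text("A **group** is a set with an operation.", encoding="utf-8")
    record = record_from_file(note_path)
    missing = record_from_file(Path(tmp) / "no_such_note.md")
```

Here `record` holds the three keys for the existing file, while `missing` is `None`.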


gather_definition_identification_data

 gather_definition_identification_data (vault:os.PathLike,
                                        notes:list[trouver.markdown.obsidian.vault.VaultNote])

Return a pandas.DataFrame encapsulating the data of definition identifications.

See definition_identification_data_from_note, the function used to draw the definition identification data from each note.

This function is mainly used in append_to_definition_identification_database.

| | Type | Details |
|---|---|---|
| vault | PathLike | |
| notes | list | |
| Returns | DataFrame | |

# TODO: examples
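
The gathering step can be pictured as calling the per-note function on each note, dropping the `None` results, and tabulating the rest. The sketch below uses plain dicts in place of notes and stops at a list of rows so that it runs with the standard library alone; `gather_rows`, `COLUMNS`, and the sample records are hypothetical and only meant to show the shape of the data.

```python
from typing import Optional

# Column names taken from the keys documented for the per-note dict.
COLUMNS = ["Note name", "Raw text", "Definitions"]


def gather_rows(records: list[Optional[dict[str, str]]]) -> list[dict[str, str]]:
    """Keep the non-None records, restricted to the expected columns."""
    return [
        {column: record[column] for column in COLUMNS}
        for record in records
        if record is not None
    ]


records = [
    {"Note name": "groups", "Raw text": "A **group** is ...", "Definitions": "group"},
    None,  # stands in for a note that does not exist
    {"Note name": "rings", "Raw text": "A **ring** is ...", "Definitions": "ring"},
]
rows = gather_rows(records)
```

Passing `rows` to `pandas.DataFrame(rows)` would then give a frame with one row per existing note and one column per key.
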
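
import os
from os import PathLike
from pathlib import Path

import pandas as pd
from trouver.markdown.obsidian.vault import VaultNote


def append_to_definition_identification_database(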
        vault: PathLike, # The vault from which the data is drawn
        file: PathLike, # The path to a CSV file
        notes: list[VaultNote], # The notation notes to consider adding to the database
        backup: bool = True # If `True`, makes a copy of `file` in the same directory and with the same name, except with an added extension of `.bak`.
        ) -> None:
    """
    Either create a `csv` file containing data for definition
    identification or append to an existing `csv` file.

    The columns of the database file are as follows:

    - `Time added` - The time when the row was added.
    - `Time modified` - The time when the labels of the row were last modified.
    - `Notation note name` - The name of the note from which the data for the row
      was derived.
    - `Notation` - The notation which is being summarized.
    - `Latex in original` - The entry of the `latex_in_original` field of the
      note if available, cf. `make_a_notation_note`.
    - `Summary` - The summary of the notation.
    - `Main note name` - The name of the main note of the
      notation note.
    - `Processed main note contents` - The processed contents of the
      main note.

    All timestamps are in UTC time and specify time to minutes
    (i.e. no seconds/microseconds).
    
    TODO: implement updating rows and rewrite the next paragraph to
    accurately reflect the implementation. I would like the 'Notation', 'Latex in original',
    'Summary', 'processed main note contents' to be the "pivot_cols"

    If a "new" note has the same processed content as a pre-existing
    note and anything is different about the "new" note, then update
    the row of the existing note. In particular, the following are updated:
    - Time modified (set to current time)
    - Notation (overwritten)
    - Latex in original (overwritten)
    - Summary (overwritten)
    - Main note name (overwritten)
    - Processed main note contents (overwritten)
    
    This function assumes that, if the CSV file exists, the processed
    contents of its rows are all distinct.
    """
    if not notes:
        return
    file = Path(file)
    ddf = pd.read_csv(file) if os.path.exists(file) else None
    new_df = gather_definition_identification_data(vault, notes)
    if new_df.empty:
        return
    cols = [
        
    ]
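
The create-or-append behavior described in the docstring above, together with the `.bak` backup and the minute-precision UTC timestamps, can be sketched with the standard library alone. This is not the function's actual implementation: `append_rows`, `utc_minute_stamp`, the timestamp format, and the sample columns are assumptions for illustration, and the row-updating behavior listed as a TODO is not attempted.

```python
import csv
import shutil
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def utc_minute_stamp() -> str:
    """Current UTC time truncated to the minute (the format is an assumption)."""
    now = datetime.now(timezone.utc).replace(second=0, microsecond=0)
    return now.strftime("%Y-%m-%d %H:%M")


def append_rows(file: Path, rows: list[dict[str, str]], backup: bool = True) -> None:
    """Create `file` with a header, or append `rows` to it if it exists."""
    if not rows:
        return
    fieldnames = ["Time added", *rows[0].keys()]
    file_exists = file.exists()
    if file_exists and backup:
        # Same directory, same name, with `.bak` added: data.csv -> data.csv.bak
        shutil.copy(file, file.with_name(file.name + ".bak"))
    with file.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        if not file_exists:
            writer.writeheader()  # only a brand-new file needs a header
        stamp = utc_minute_stamp()
        for row in rows:
            writer.writerow({"Time added": stamp, **row})


with tempfile.TemporaryDirectory() as tmp:
    database = Path(tmp) / "definitions.csv"
    append_rows(database, [{"Note name": "groups", "Definitions": "group"}])
    append_rows(database, [{"Note name": "rings", "Definitions": "ring"}])
    with database.open(newline="", encoding="utf-8") as handle:
        stored = list(csv.DictReader(handle))
    backup_exists = (Path(tmp) / "definitions.csv.bak").exists()
```

The first call creates the file, so no backup is made; the second call finds an existing file, copies it to `definitions.csv.bak`, and then appends without repeating the header.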