helper.html

Helper functions for dealing with HTML tags

from bs4 import BeautifulSoup
from fastcore.test import *

Consolidating special characters that are changed with the `str` function of `bs4.element.Tag` objects


html_tag_str

 html_tag_str (html_tag:bs4.element.Tag)

Return the full string of html_tag, accounting for special characters that bs4 changes

When using the __str__ function of bs4.element.Tag objects, special characters such as <, > and & are changed into &lt;, &gt; and &amp;, respectively. The html_tag_str function changes these characters back.

soup = BeautifulSoup('', 'html.parser')
tag = soup.new_tag('span')
tag.string = '&hi<'
test_eq(html_tag_str(tag), '<span>&hi<</span>')
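
The escaping and unescaping round trip can be illustrated with the standard library alone. The following is a minimal sketch that uses `html.escape` and `html.unescape` as stand-ins; it is not the implementation of html_tag_str, and it only shows the kind of entity substitution that needs to be reversed.

```python
import html

raw = '&hi<'

# Escaping turns the special characters into HTML entities,
# similar to what str() on a bs4 tag does to the tag's text.
escaped = html.escape(raw, quote=False)

# Unescaping turns the entities back into the original characters.
restored = html.unescape(escaped)

print(escaped)   # &amp;hi&lt;
print(restored)  # &hi<
```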

Handling less than `<` symbols in latex math mode strings

BeautifulSoup’s html.parser parses less than < symbols without a following space as the beginning of an HTML tag, even when the symbol < is used within a LaTeX math mode string. To get around this, we detect when this happens and add a space after these symbols.


find_lt_symbols_without_space_in_math_mode

 find_lt_symbols_without_space_in_math_mode (text:str)

Return the indices in text of the math mode less than < symbols that are not followed by a space.

	Type	Details
text	str
Returns	list	The index of

In the following example, there are a few math mode strings with less than < symbols. Some of these symbols are followed by spaces and others are not.

text = r"""
here is a math mode $a<b$. Here is another $a< b$.
Here is an in-line one:

$$ asdf <cbba$$

Here is another:

$$
asdf < basdf
$$
"""
output = find_lt_symbols_without_space_in_math_mode(text)
print(output)
test_eq(len(output), 2)
test_eq(text[output[0] + 1], 'b')
test_eq(text[output[1] + 1], 'c')

[23, 85]

text_2 = r"""
<b>Now there is an HTML tag</b>. But it shouldn't be detected
because the tag is not within math mode text.
But this inequality is: $a <d$
"""
output = find_lt_symbols_without_space_in_math_mode(text_2)
print(output)
test_eq(len(output), 1)
test_eq(text_2[output[0] + 1], 'd')

[136]
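
One way such a detection could work is sketched below. This is only an illustration under simplifying assumptions, not the library's implementation: the name `lt_without_space_indices` and the `MATH_MODE` pattern are hypothetical, math mode is taken to be anything between `$...$` or `$$...$$`, and escaped dollar signs are not handled.

```python
import re

# Hypothetical pattern: display math ($$...$$) is tried before inline math ($...$).
MATH_MODE = re.compile(r"\$\$.*?\$\$|\$.*?\$", re.DOTALL)


def lt_without_space_indices(text: str) -> list[int]:
    """Return indices of `<` inside math mode that are not followed by a space."""
    indices = []
    for match in MATH_MODE.finditer(text):
        # Stop one character early so that text[i + 1] stays inside the match.
        for i in range(match.start(), match.end() - 1):
            if text[i] == "<" and text[i + 1] != " ":
                indices.append(i)
    return indices


sample = "tag <b>x</b>, inline $a<b$, spaced $a< b$, display $$ c <d $$"
found = lt_without_space_indices(sample)
print([sample[i + 1] for i in found])  # ['b', 'd']
```

The `<b>` tag outside of math mode is skipped because only the matched math mode spans are scanned.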


add_space_to_lt_symbols_without_space

 add_space_to_lt_symbols_without_space (text:str)

Add space after less than < symbols if the symbol is not followed by a space.

Let us again use text from the example for the find_lt_symbols_without_space_in_math_mode function:

print(add_space_to_lt_symbols_without_space(text))
assert not find_lt_symbols_without_space_in_math_mode(add_space_to_lt_symbols_without_space(text))


here is a math mode $a< b$. Here is another $a< b$.
Here is an in-line one:

$$ asdf < cbba$$

Here is another:

$$
asdf < basdf
$$
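
The insertion step can be sketched with a nested regular expression substitution. As before, this is a hedged illustration rather than the actual implementation: `add_space_after_lt` and `MATH_MODE` are hypothetical names, and math mode is assumed to be delimited by unescaped `$...$` or `$$...$$`.

```python
import re

# Hypothetical pattern: display math ($$...$$) is tried before inline math ($...$).
MATH_MODE = re.compile(r"\$\$.*?\$\$|\$.*?\$", re.DOTALL)


def add_space_after_lt(text: str) -> str:
    """Insert a space after each math mode `<` that is not followed by one."""
    def fix_span(match: re.Match) -> str:
        # The negative lookahead skips `<` symbols already followed by a space.
        return re.sub(r"<(?! )", "< ", match.group(0))
    return MATH_MODE.sub(fix_span, text)


fixed = add_space_after_lt("$a<b$ and $a< b$ and <b>x</b>")
print(fixed)  # $a< b$ and $a< b$ and <b>x</b>
```

Substituting span by span avoids the index bookkeeping that would be needed if spaces were inserted one index at a time.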

Removing HTML tags in a text and obtaining the data of the tags.

markup = '<b>Hello</b>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.b
new_str = soup.new_string(' World')
tag.append(new_str)

new_str

' World'


remove_html_tags_in_text

 remove_html_tags_in_text (text:str,
                           replace_with_attributes:Union[str,list[str],NoneType]=None,
                           definitely_replace:bool=False,
                           seed:int=None)

Remove the HTML tags in text.

HTML tags are assumed not to be nested.

	Type	Default	Details
text	str		The text in which to remove the HTML tags.
replace_with_attributes	Union	None	Attribute(s) within the HTML tags which should be used to replace the text of the tags. If `None`, then the texts are not replaced with the attributes. If multiple attributes are specified, then only one attribute is used to replace the text for each HTML tag (independently at random of other replacements). Each attribute’s text has an equal chance of being selected for replacement. Repeats are ignored.
definitely_replace	bool	False	If `True` and if a given HTML tag has an attribute specified in `replace_with_attributes`, then the text for that tag will definitely be replaced by the text of one of the attributes. Otherwise, the original text and each attribute’s text have an equal chance of being selected.
seed	int	None	Random seed
Returns	tuple		The text `removed` without HTML tags and a list whose elements consist of the removed HTML tags and the starting and ending indices of the text corresponding to the removed tags within `removed`.

The remove_html_tags_in_text function removes HTML tags, preserving the underlying text by default.

html = 'Let $K$ be a field. An <b definition="Abelian variety over a field">Abelian variety over $K$</b> is a variety that'
text_without_html_tags, removed_tags = remove_html_tags_in_text(html)
print(text_without_html_tags)

test_eq(text_without_html_tags, 'Let $K$ be a field. An Abelian variety over $K$ is a variety that')

Let $K$ be a field. An Abelian variety over $K$ is a variety that

removed_tags[0][0].attrs

{'definition': 'Abelian variety over a field'}

In the following example, there is a less than < symbol that is definitely not the opening of an HTML tag. The example verifies that the symbol is not replaced with the entity &lt;, which is what bs4.BeautifulSoup’s html.parser would do.

text = 'Hello, this has a less than symbol: $a< b$'
text, html_tags = remove_html_tags_in_text(text)
assert not html_tags
assert '< ' in text
assert '&lt;' not in text

The same applies to the greater than > symbol and the & symbol.

text = 'Hello, this has a greater than symbol: $a>b$'
text, html_tags = remove_html_tags_in_text(text)
assert not html_tags
assert '>' in text
assert '&gt;' not in text

text = r'Hello $$ f &= 3 \\ g &= 5'
text, html_tags = remove_html_tags_in_text(text)
assert not html_tags
assert '&' in text
assert '&amp;' not in text


The remove_html_tags_in_text function additionally returns a list with information about the tags that are removed. Each item in this list is a tuple (tag, start, end), where tag is the tag that has been removed, and start and end are the indices within the string output text_without_html_tags of the function at which the text replacing the tag can be found.

In the first example above (continued below), exactly one tag is removed.

print(removed_tags)
removed_tag, start, end = removed_tags[0]
print(text_without_html_tags[start:end])

test_eq(text_without_html_tags[start:end], 'Abelian variety over $K$')

[(<b definition="Abelian variety over a field">Abelian variety over $K$</b>, 23, 47)]
Abelian variety over $K$
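
The bookkeeping behind the (tag, start, end) tuples can be sketched as follows. This is an illustration under loud assumptions, not the actual implementation: `strip_flat_tags` is a hypothetical name, tags are matched with a regular expression instead of bs4, the removed tag is recorded as a plain string, tags are assumed not to be nested, and attribute values are assumed not to contain `>`.

```python
import re

# Hypothetical pattern for a non-nested tag: <name ...>inner text</name>
TAG = re.compile(r"<(\w+)\b[^>]*>(.*?)</\1>", re.DOTALL)


def strip_flat_tags(text: str) -> tuple[str, list[tuple[str, int, int]]]:
    """Remove non-nested tags; report where each tag's text lands in the output."""
    pieces, spans = [], []
    cursor = 0    # position in the input text
    out_len = 0   # length of the output built so far
    for match in TAG.finditer(text):
        before = text[cursor:match.start()]
        inner = match.group(2)
        start = out_len + len(before)
        end = start + len(inner)
        pieces.extend([before, inner])
        spans.append((match.group(0), start, end))
        out_len = end
        cursor = match.end()
    pieces.append(text[cursor:])
    return "".join(pieces), spans


html_text = ('Let $K$ be a field. An <b definition="Abelian variety over a field">'
             'Abelian variety over $K$</b> is a variety that')
stripped, spans = strip_flat_tags(html_text)
_, start, end = spans[0]
print(stripped[start:end])  # Abelian variety over $K$
```

The indices are measured in the output string, so they remain valid after the tags have been removed.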

The remove_html_tags_in_text function can be used to replace the underlying text of HTML tags with specified attribute values.

In the example below, the text has a tag which contains a typo attribute. Passing 'typo' to the replace_with_attributes parameter and passing True to the definitely_replace parameter guarantees that the value of the typo attribute is used instead of the text of the tag.

html = r'The following tag fixes a typo and simultaneously keeps around the data of that typo: <span typo="$\operatorname{Gul}(K)$">$\operatorname{Gal}(K)$</span>'
text_without_html_tags, removed_tags = remove_html_tags_in_text(html, replace_with_attributes='typo', definitely_replace=True)
print(text_without_html_tags)

test_eq(text_without_html_tags, 'The following tag fixes a typo and simultaneously keeps around the data of that typo: $\\operatorname{Gul}(K)$')

removed_tag, start, end = removed_tags[0]
test_eq(text_without_html_tags[start:end], '$\\operatorname{Gul}(K)$')

The following tag fixes a typo and simultaneously keeps around the data of that typo: $\operatorname{Gul}(K)$

If the definitely_replace parameter is False (which it is by default), then the original text might be preserved or it might be replaced.

html = r'<span typo="$\operatorname{Gul}(K)$">$\operatorname{Gal}(K)$</span>'
possible_outputs = [
    r'$\operatorname{Gal}(K)$',
    r'$\operatorname{Gul}(K)$'
]
output, _ = remove_html_tags_in_text(html, replace_with_attributes='typo', definitely_replace=False)
assert output in possible_outputs
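
The selection rule described in the parameter table can be sketched in isolation. The function below is a hypothetical helper, not part of the library: it assumes a tag's attributes are available as a dict and picks uniformly among the matching attribute values, adding the original text as a candidate unless definitely_replace is True.

```python
import random
from typing import Optional, Union


def choose_replacement(
        original: str,
        attrs: dict[str, str],
        replace_with: Union[str, list[str], None] = None,
        definitely_replace: bool = False,
        seed: Optional[int] = None) -> str:
    """Pick the text that should stand in for a removed tag (hypothetical helper)."""
    if replace_with is None:
        return original
    if isinstance(replace_with, str):
        replace_with = [replace_with]
    # dict.fromkeys drops repeated attribute names while keeping their order.
    names = [name for name in dict.fromkeys(replace_with) if name in attrs]
    options = [attrs[name] for name in names]
    if not options:
        return original
    if not definitely_replace:
        # The original text competes with the attribute values on equal terms.
        options.append(original)
    return random.Random(seed).choice(options)


attrs = {'typo': r'$\operatorname{Gul}(K)$'}
original = r'$\operatorname{Gal}(K)$'
forced = choose_replacement(original, attrs, 'typo', definitely_replace=True)
maybe = choose_replacement(original, attrs, 'typo', seed=0)
```

Here `forced` is always the value of the typo attribute, whereas `maybe` is either that value or the original text.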

Adding HTML tag data

On the other hand, we may also need to add HTML tag data to a text.


add_HTML_tag_data_to_raw_text

 add_HTML_tag_data_to_raw_text (text:str,
                                tags_and_locations:list[tuple[bs4.element.Tag,int,int]])

Add specified HTML tags to the specified locations/ranges in text.

See the add_HTML_tag_data_to_text function for adding HTML tag data to text that may or may not already have HTML tags.

	Type	Details
text	str	The text onto which to add HTML tags. This is assumed to contain no HTML tags.
tags_and_locations	list	Each tuple consists of the tag object to add as well as the start and end indices of the range within `text` that the tag replaces. The ranges specified by the tuples are assumed to not overlap with one another.
Returns	str	The modification of `text` in which the tags are added at the specified locations; the ranges in `text` are replaced.

text = "Now this will have an HTML tag. This will also have an HTML tag too!"
tags_and_locations = [
    (BeautifulSoup('<span some_attr="hi">this</span>', 'html.parser'), 4,8),
    (BeautifulSoup('<div some_attr="hi">This</div>', 'html.parser'), 32,36)
]
output = add_HTML_tag_data_to_raw_text(text, tags_and_locations)
print(output)
test_eq(output, 'Now <span some_attr="hi">this</span> will have an HTML tag. <div some_attr="hi">This</div> will also have an HTML tag too!')

Now <span some_attr="hi">this</span> will have an HTML tag. <div some_attr="hi">This</div> will also have an HTML tag too!

Now let us look at the same example, with the order in tags_and_locations reversed.

text = "Now this will have an HTML tag. This will also have an HTML tag too!"
tags_and_locations = [
    (BeautifulSoup('<div some_attr="hi">This</div>', 'html.parser'), 32,36),
    (BeautifulSoup('<span some_attr="hi">this</span>', 'html.parser'), 4,8)
]
output = add_HTML_tag_data_to_raw_text(text, tags_and_locations)
print(output)
test_eq(output, 'Now <span some_attr="hi">this</span> will have an HTML tag. <div some_attr="hi">This</div> will also have an HTML tag too!')

Now <span some_attr="hi">this</span> will have an HTML tag. <div some_attr="hi">This</div> will also have an HTML tag too!
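
One way to make the result independent of the order of tags_and_locations is to splice the tags in from right to left, so that earlier indices are not shifted by later insertions. The sketch below illustrates this idea under the assumption that tags are given as plain markup strings; `wrap_ranges` is a hypothetical name, and this is not necessarily how add_HTML_tag_data_to_raw_text is implemented.

```python
def wrap_ranges(text: str, tags_and_locations: list[tuple[str, int, int]]) -> str:
    """Replace each non-overlapping range [start, end) of `text` with its markup."""
    # Process the rightmost range first so that the remaining indices stay valid.
    ordered = sorted(tags_and_locations, key=lambda item: item[1], reverse=True)
    for markup, start, end in ordered:
        text = text[:start] + markup + text[end:]
    return text


raw_text = "Now this will have an HTML tag. This will also have an HTML tag too!"
locations = [
    ('<div some_attr="hi">This</div>', 32, 36),
    ('<span some_attr="hi">this</span>', 4, 8),
]
result = wrap_ranges(raw_text, locations)
print(result)
```

Reversing the list `locations` gives the same result, since the ranges are sorted before splicing.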
