Return the full string of html_tag, accounting for special characters that bs4 changes
When using the __str__ function of bs4.element.Tag objects, special characters such as <, > and & change into &lt;, &gt; and &amp;, etc. The html_tag_str function changes these characters back.
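The reversal described above can be illustrated with the standard library alone. This is only a sketch of the idea: the `escaped` string is a made-up example of an escaped tag string, and whether html_tag_str relies on `html.unescape` internally is an assumption, not something stated here.

```python
import html

# Hypothetical example of a tag string whose special characters
# were escaped into character references.
escaped = '<b>$a &lt; b$ &amp; $c &gt; d$</b>'

# html.unescape converts character references back to literal characters.
restored = html.unescape(escaped)
print(restored)
```

Note that unescaping converts every character reference in the string, including any that the original text contained on purpose.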
Handling less than < symbols in LaTeX math mode strings
BeautifulSoup’s html.parser parses less than < symbols without a following space as the beginning of an HTML tag, even when the symbol < is used within a LaTeX math mode string. To get around this, we detect when this happens and add a space after these symbols.
Return the indices in text with math mode less than < symbols without a space that follows.
| | Type | Details |
|---|---|---|
| text | str | |
| Returns | list | The index of |
In the following example, there are a few math mode strings with less than < symbols. Some of these symbols are followed by spaces and others are not.
text = r"""
here is a math mode $a<b$. Here is another $a< b$.

Here is an in-line one:
$$ asdf <cbba$$
Here is another:
$$asdf < basdf$$
"""
output = find_lt_symbols_without_space_in_math_mode(text)
print(output)
test_eq(len(output), 2)
test_eq(text[output[0] + 1], 'b')
test_eq(text[output[1] + 1], 'c')
[23, 85]
text_2 = r"""
<b>Now there is an HTML tag</b>. But it shouldn't be detected
because the tag is not within math mode text.
But this inequality is: $a <d$
"""
output = find_lt_symbols_without_space_in_math_mode(text_2)
print(output)
test_eq(len(output), 1)
test_eq(text_2[output[0] + 1], 'd')
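The detect-then-pad approach described in this section can be sketched with a regular expression. This is a simplified illustration, not the implementation of find_lt_symbols_without_space_in_math_mode: the names `MATH_MODE`, `lt_without_space_indices` and `add_space_after_lt` are hypothetical, and the pattern assumes math mode is delimited only by `$...$` or `$$...$$` with no escaped dollar signs.

```python
import re

# Assumption: math mode is delimited by $$...$$ or $...$ only.
# The $$ alternative comes first so display math is matched as one span.
MATH_MODE = re.compile(r'\$\$.+?\$\$|\$.+?\$', re.DOTALL)

def lt_without_space_indices(text):
    """Indices of '<' inside math mode that are not followed by whitespace."""
    indices = []
    for match in MATH_MODE.finditer(text):
        # Stop one character early so text[i + 1] is always in range.
        for i in range(match.start(), match.end() - 1):
            if text[i] == '<' and not text[i + 1].isspace():
                indices.append(i)
    return indices

def add_space_after_lt(text, indices):
    """Insert a space after each index, right to left so indices stay valid."""
    for i in sorted(indices, reverse=True):
        text = text[:i + 1] + ' ' + text[i + 1:]
    return text

sample = 'Math $a<b$ and $a< b$, but <b>bold</b> stays. $$x <y$$'
found = lt_without_space_indices(sample)
fixed = add_space_after_lt(sample, found)
print(fixed)
```

Inserting from right to left means earlier insertions never shift the positions of indices that are still to be processed.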
Attribute(s) within the HTML tags which should be used to replace the text of the tags. If None, then the texts are not replaced with the attributes. If multiple attributes are specified, then only one attribute is used to replace the text for each HTML tag (independently at random of other replacements). Each attribute’s text has an equal chance of being selected for replacement. Repeats are ignored.
definitely_replace
bool
False
If True and if a given HTML tag has an attribute specified in replace_with_attributes, then the text for that tag will definitely be replaced by the text of one of the attributes. Otherwise, the original text and each attribute’s text have an equal chance of being selected.
seed
int
None
Random seed
Returns
tuple
The text with the HTML tags removed, and a list whose elements consist of the removed HTML tags and the starting and ending indices, within the returned text, of the text corresponding to each removed tag.
The remove_html_tags_in_text function removes HTML tags, preserving the underlying text by default.
html = 'Let $K$ be a field. An <b definition="Abelian variety over a field">Abelian variety over $K$</b> is a variety that'
text_without_html_tags, removed_tags = remove_html_tags_in_text(html)
print(text_without_html_tags)
test_eq(text_without_html_tags, 'Let $K$ be a field. An Abelian variety over $K$ is a variety that')
Let $K$ be a field. An Abelian variety over $K$ is a variety that
removed_tags[0][0].attrs
{'definition': 'Abelian variety over a field'}
In the following example, there is a less than < symbol, which is definitely not the opening of an HTML tag. The following verifies that the placeholder &lt; is not used to replace the less than symbol, which is what bs4.BeautifulSoup’s html.parser does.
text = 'Hello, this has a less than symbol: $a< b$'
text, html_tags = remove_html_tags_in_text(text)
assert not html_tags
assert '< ' in text
assert 'lt' not in html_tags
The same applies to the greater than > symbol and the ampersand & symbol.
text = 'Hello, this has a greater than symbol: $a>b$'
text, html_tags = remove_html_tags_in_text(text)
assert not html_tags
assert '>' in text
assert 'gt' not in html_tags

text = r'Hello $$ f &= 3 \\ g &= 5'
text, html_tags = remove_html_tags_in_text(text)
assert not html_tags
assert '&' in text
assert '&' not in html_tags
c:\Users\hyunj\Documents\Development\Python\trouver_py310_venv\lib\site-packages\bs4\__init__.py:435: MarkupResemblesLocatorWarning: The input looks more like a filename than markup. You may want to open this file and pass the filehandle into Beautiful Soup.
warnings.warn(
The remove_html_tags_in_text function additionally returns a list with information about the tags that are removed. Each item in this list is a tuple (tag, start, end), where tag is the tag that has been removed, and start and end are the indices within the string output text_without_html_tags of the function at which the text replacing the tag can be found.
In the example above (continued below), there is exactly one tag that is removed.
print(removed_tags)
removed_tag, start, end = removed_tags[0]
print(text_without_html_tags[start:end])
test_eq(text_without_html_tags[start:end], 'Abelian variety over $K$')
[(<b definition="Abelian variety over a field">Abelian variety over $K$</b>, 23, 47)]
Abelian variety over $K$
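To make the (tag, start, end) bookkeeping concrete, here is a minimal standard-library sketch that strips simple tags and records where each tag's text lands in the output. It is not how remove_html_tags_in_text is implemented: `SIMPLE_TAG` and `strip_simple_tags` are hypothetical names, the tag is recorded as a plain string rather than a bs4 object, and the pattern assumes non-nested tag pairs and no stray < symbols.

```python
import re

# Assumption: tags are simple, non-nested pairs such as <b attr="...">inner</b>.
SIMPLE_TAG = re.compile(r'<(\w+)([^<>]*)>(.*?)</\1>', re.DOTALL)

def strip_simple_tags(text):
    """Return (text without tags, [(tag_string, start, end), ...])."""
    pieces, removed = [], []
    cursor = 0    # position in the input text
    out_len = 0   # length of the output built so far
    for match in SIMPLE_TAG.finditer(text):
        before = text[cursor:match.start()]
        inner = match.group(3)
        pieces += [before, inner]
        # start and end index into the OUTPUT string, not the input.
        start = out_len + len(before)
        end = start + len(inner)
        removed.append((match.group(0), start, end))
        out_len = end
        cursor = match.end()
    pieces.append(text[cursor:])
    return ''.join(pieces), removed

html = ('Let $K$ be a field. An <b definition="Abelian variety over a field">'
        'Abelian variety over $K$</b> is a variety that')
plain, removed = strip_simple_tags(html)
tag, start, end = removed[0]
print(plain[start:end])
```

Because the indices refer to the output string, slicing the returned text with `start` and `end` recovers exactly the text that replaced the tag.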
The remove_html_tags_in_text function can be used to replace the underlying text of HTML tags with specified attribute values.
In the below example, the text has a tag which contains a typo attribute. Passing 'typo' to the replace_with_attributes parameter and passing True to the definitely_replace parameter guarantees that the value of the typo attribute is used instead of the text of the tag.
html = r'The following tag fixes a typo and simultaneously keeps around the data of that typo: <span typo="$\operatorname{Gul}(K)$">$\operatorname{Gal}(K)$</span>'
text_without_html_tags, removed_tags = remove_html_tags_in_text(
    html, replace_with_attributes='typo', definitely_replace=True)
print(text_without_html_tags)
test_eq(text_without_html_tags, 'The following tag fixes a typo and simultaneously keeps around the data of that typo: $\\operatorname{Gul}(K)$')
removed_tag, start, end = removed_tags[0]
test_eq(text_without_html_tags[start:end], '$\\operatorname{Gul}(K)$')
The following tag fixes a typo and simultaneously keeps around the data of that typo: $\operatorname{Gul}(K)$
If the definitely_replace parameter is False (which it is by default), then the original text might be preserved or it might be replaced.
html = r'<span typo="$\operatorname{Gul}(K)$">$\operatorname{Gal}(K)$</span>'
possible_outputs = [
    r'$\operatorname{Gal}(K)$',
    r'$\operatorname{Gul}(K)$']
output, _ = remove_html_tags_in_text(
    html, replace_with_attributes='typo', definitely_replace=False)
assert output in possible_outputs
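The selection rule for a single tag, as described for the replace_with_attributes and definitely_replace parameters, might look roughly like the following. This is a sketch under assumptions: `choose_tag_text` is a hypothetical helper, `attrs` stands in for a tag's attribute dictionary, and the library's actual handling of the seed across several tags is not shown in this section.

```python
import random

def choose_tag_text(original, attrs, replace_with_attributes,
                    definitely_replace=False, seed=None):
    """Pick the text that stands in for one removed tag (hypothetical helper)."""
    if isinstance(replace_with_attributes, str):
        replace_with_attributes = [replace_with_attributes]
    # dict.fromkeys drops repeated attribute names while keeping their order.
    names = dict.fromkeys(replace_with_attributes or [])
    candidates = [attrs[name] for name in names if name in attrs]
    if not candidates:
        # Nothing to replace with: keep the tag's own text.
        return original
    if not definitely_replace:
        # The original text competes with equal probability.
        candidates.append(original)
    return random.Random(seed).choice(candidates)

original = r'$\operatorname{Gal}(K)$'
attrs = {'typo': r'$\operatorname{Gul}(K)$'}
forced = choose_tag_text(original, attrs, 'typo', definitely_replace=True)
maybe = choose_tag_text(original, attrs, ['typo', 'typo'], seed=0)
print(forced)
```

When several tags are processed, sharing one `random.Random` instance across them (rather than reseeding per tag, as this sketch does) would keep the choices independent while still being reproducible.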
Adding HTML tag data
On the other hand, we may also need to add HTML tag data to a text.
Add specified HTML tags to the specified locations/ranges in text.
See the add_HTML_tag_data_to_text function for adding HTML tag data to text that may or may not already have HTML tags.
| | Type | Details |
|---|---|---|
| text | str | The text onto which to add HTML tags. This is assumed to contain no HTML tags. |
| tags_and_locations | list | Each tuple consists of the tag object to add and the start and end indices of the range within text that the tag replaces. The ranges specified by the tuples are assumed not to overlap with one another. |
| Returns | str | The modification of text in which the tags are added at the specified locations; the ranges in text are replaced. |
text = "Now this will have an HTML tag. This will also have an HTML tag too!"
tags_and_locations = [
    (BeautifulSoup('<span some_attr="hi">this</span>', 'html.parser'), 4, 8),
    (BeautifulSoup('<div some_attr="hi">This</div>', 'html.parser'), 32, 36)
]
output = add_HTML_tag_data_to_raw_text(text, tags_and_locations)
print(output)
test_eq(output, 'Now <span some_attr="hi">this</span> will have an HTML tag. <div some_attr="hi">This</div> will also have an HTML tag too!')
Now <span some_attr="hi">this</span> will have an HTML tag. <div some_attr="hi">This</div> will also have an HTML tag too!
Now let us look at the same example, with the order in tags_and_locations reversed.
text = "Now this will have an HTML tag. This will also have an HTML tag too!"
tags_and_locations = [
    (BeautifulSoup('<div some_attr="hi">This</div>', 'html.parser'), 32, 36),
    (BeautifulSoup('<span some_attr="hi">this</span>', 'html.parser'), 4, 8)
]
output = add_HTML_tag_data_to_raw_text(text, tags_and_locations)
print(output)
test_eq(output, 'Now <span some_attr="hi">this</span> will have an HTML tag. <div some_attr="hi">This</div> will also have an HTML tag too!')
Now <span some_attr="hi">this</span> will have an HTML tag. <div some_attr="hi">This</div> will also have an HTML tag too!
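One way to get a result that does not depend on the order of tags_and_locations is to splice the ranges from right to left. The sketch below illustrates that idea only; `splice_tags` is a hypothetical name, the tags are plain strings rather than BeautifulSoup objects, and it is an assumption that add_HTML_tag_data_to_raw_text works this way.

```python
def splice_tags(text, tags_and_locations):
    """Replace each [start, end) range of text with its tag string."""
    # Work from the rightmost range to the leftmost, so that replacing one
    # range never shifts the indices of the ranges still to be processed.
    ordered = sorted(tags_and_locations, key=lambda item: item[1], reverse=True)
    for tag_str, start, end in ordered:
        text = text[:start] + tag_str + text[end:]
    return text

text = "Now this will have an HTML tag. This will also have an HTML tag too!"
tags_and_locations = [
    ('<div some_attr="hi">This</div>', 32, 36),
    ('<span some_attr="hi">this</span>', 4, 8),
]
output = splice_tags(text, tags_and_locations)
print(output)
```

Sorting first is what makes both orderings of the input list produce the same string, provided the ranges do not overlap.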