fix: preserve text nested inside <br> by html.parser (#244) #264

Open
gaoflow wants to merge 1 commit into matthewwithanm:develop from gaoflow:fix/br-nested-text-loss
gaoflow commented Jun 25, 2026

Summary

Fixes #244.

Python's html.parser has an edge case: text following <br /> (written with a space before the slash) is parsed as child content of that <br> element instead of as a sibling text node when an earlier sibling is a bare <br> tag.

from bs4 import BeautifulSoup
soup = BeautifulSoup('Hello<br>cruel<br />world', 'html.parser')
list(soup.children)
# ['Hello', <br/>, 'cruel', <br>world</br>]   # 'world' is nested inside <br>!
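
The two spellings can be told apart at the parser level. The sketch below uses only the standard library's `HTMLParser` to log which callback each tag triggers: a bare `<br>` arrives through `handle_starttag`, while `<br />` arrives through `handle_startendtag`. The `EventLogger` class and `log_events` helper are hypothetical names introduced for illustration, and the sketch does not reproduce how BeautifulSoup's tree builder turns these events into the nested structure shown above.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records each parser callback as a tuple."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Called for tags written without a trailing slash, e.g. <br>
        self.events.append(("start", tag))

    def handle_startendtag(self, tag, attrs):
        # Called for XHTML-style self-closing tags, e.g. <br />
        self.events.append(("startend", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def log_events(html):
    """Feed the markup to the logger and return the recorded events."""
    parser = EventLogger()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return parser.events


events = log_events("Hello<br>cruel<br />world")
for event in events:
    print(event)
```

Because the two tags reach the consumer through different callbacks, a tree builder layered on top can end up treating them differently, which is consistent with the mixed `<br>` / `<br />` pattern being the trigger.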

convert_br received the nested text via its text parameter but silently discarded it, causing characters to disappear from the output:

from markdownify import markdownify as md
md('Hello<br>cruel<br />world')
# Before: 'Hello  \ncruel  '   # 'world' is lost
# After:  'Hello  \ncruel  \nworld'

Fix

Append text to the returned line-break marker in all branches of convert_br. When <br> is used correctly as a void element, text is always an empty string, so the change is backward-compatible. The _inline branch is handled similarly for consistency.
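
A minimal sketch of the change described above, written as a standalone function rather than the actual method. The name `convert_br_sketch`, its signature, and the `newline_style` values are assumptions made for illustration; the real `convert_br` lives on markdownify's converter class and may differ in detail.

```python
# Assumed option values; treat these as placeholders for the library's own.
SPACES = "spaces"
BACKSLASH = "backslash"


def convert_br_sketch(text, parent_tags=frozenset(), newline_style=SPACES):
    """Return the line-break marker followed by any text nested in <br>.

    Hypothetical stand-in for convert_br. `text` is the already-converted
    child content of the <br> element, which is "" for a well-formed void tag.
    """
    if "_inline" in parent_tags:
        # Inline contexts collapse the break to a single space.
        return " " + text
    if newline_style.lower() == BACKSLASH:
        # Backslash style: a backslash followed by a newline.
        return "\\\n" + text
    # Default style: two trailing spaces followed by a newline.
    return "  \n" + text


# A void <br> carries no text, so the marker is unchanged.
print(repr(convert_br_sketch("")))
# Text that the parser nested inside <br> is appended after the marker.
print(repr(convert_br_sketch("world")))
```

Appending unconditionally keeps every branch to a single expression and avoids a separate check for empty `text`, since concatenating an empty string leaves the marker as it was.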

Test

Added two assertions to test_br covering the mixed <br> / <br /> pattern that triggers the html.parser nesting behaviour.

This pull request was prepared with the assistance of AI, under my direction and review.

Python's html.parser can wrap text following <br /> (with a space) inside
the <br> element as child content when the <br> is preceded by a sibling
<br> without a slash.  convert_br received the nested text via its `text`
parameter but discarded it, silently dropping characters from the output.

Append `text` to the returned line-break marker so that any accidentally
nested content is always preserved.

Closes matthewwithanm#244

Development

Successfully merging this pull request may close these issues.

<br /> causes text to be lost