HTML 2.0.0

WARNING

This is documentation for legacy versions. For the most current version, see the current Markwon documentation.

Starting with version 2.0.0, Markwon brings the whole HTML parsing/rendering stack on-site. The main reason for this is the special definitions of HTML nodes in the commonmark spec, more specifically: inline and block. These two are a bit different from the native HTML understanding. Well, they are completely different and share only their names with HTML-inline and HTML-block elements. This leads to situations where, for example, an <i> tag is considered a block when it's used like this:

<i>
Hello from italics tag
</i>

A bit of background


This issue brought attention to the differences between HTML and commonmark implementations.

Let's modify the code snippet above a bit:

<i>
Hello from italics tag

</i>

We have just added a blank line before the closing </i> tag. And this changes everything, as now, according to the commonmark dingus, we have 2 HtmlBlocks: one before the blank line (containing the opening <i> tag and the text content) and one after (containing nothing but the closing </i> tag).

If we modify the code snippet a bit more:

<i>
Hello from italics tag

</i><b>bold</b>

We will have 1 HtmlBlock (the first one from the previous snippet) and a bunch of HtmlInlines:

  • HtmlInline (</i>)
  • HtmlInline (<b>)
  • Text (bold)
  • HtmlInline (</b>)

Those little differences render Html.fromHtml (which was used in 1.x.x versions) useless. And actually they render most HTML parser implementations useless, as most of them do not allow processing HTML fragments in a raw fashion without fixing the content on-the-fly.

Both the TagSoup and Jsoup HTML parsers (which were considered for this project) are built to deal with malformed HTML code (all HTML code? 😶). So, when supplied with an <i>italic fragment, they will turn it into <i>italic</i>. And that's a good thing, but consider these fragments for the sake of markdown:

  • <i>italic
  • <b>bold italic
  • </b></i>

We will get:

  • <i>italic </i>
  • <b>bold italic</b>

* Or to be precise: <html><head></head><body><i>italic </i></body></html> & <html><head></head><body><b>bold italic</b></body></html>

Which will be rendered in a final document:

expected: "italic" in italics, followed by "bold italic" in bold italics
actual: "italic" in italics, followed by "bold italic" in bold only

This might seem like a minor problem, but add more tags to a document, introduce some deeply nested structures, spice up opening and closing tags by adding markdown markup between them, and finally write malformed HTML code 😆!
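To make the effect concrete, here is a minimal sketch of a parser that fixes every fragment in isolation. It is a toy written for this page, not TagSoup, Jsoup or Markwon code: the FragmentFixerSketch class, its fix method and the attribute-free tag pattern are all assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy stand-in for a "fixing" HTML parser (hypothetical, for illustration only).
final class FragmentFixerSketch {

    // Matches simple tags such as <i>, </i>, <b>; attributes are not supported.
    private static final Pattern TAG = Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9]*)>");

    // Balances a single fragment without any knowledge of its neighbours.
    static String fix(String fragment) {
        final StringBuilder out = new StringBuilder();
        final Deque<String> open = new ArrayDeque<>();
        final Matcher matcher = TAG.matcher(fragment);
        int last = 0;
        while (matcher.find()) {
            // copy the text between the previous tag and this one
            out.append(fragment, last, matcher.start());
            last = matcher.end();
            final String name = matcher.group(2);
            if (matcher.group(1).isEmpty()) {
                // opening tag: keep it and remember that it needs a close
                open.push(name);
                out.append(matcher.group());
            } else if (name.equals(open.peek())) {
                // closing tag that matches the innermost open tag
                open.pop();
                out.append(matcher.group());
            }
            // any other closing tag has nothing to close here and is dropped
        }
        out.append(fragment, last, fragment.length());
        // close whatever is still open at the end of the fragment
        while (!open.isEmpty()) {
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }
}
```

Fixing <i>italic and <b>bold italic separately yields two sibling elements, and a fragment that consists only of closing tags is dropped entirely, so the nesting of bold inside italic is lost. The same input passed as one piece would come back unchanged.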

There is no such problem on the frontend, at which the commonmark specification is mostly aimed, as the frontend runs in a web-browser environment. After all, parsed markdown will become HTML tags (the most common usage), and the web browser will know how to render the final result.

We, on the other hand, do not possess the HTML heritage (thank 🤖!), but still want to display some HTML to style the resulting markdown a bit. That's why Markwon incorporated its own HTML parsing logic. It is based on the Jsoup project and makes use of the Tokeniser class, which allows tokenising input HTML. All other code that doesn't serve this purpose was removed. It's safe to use in projects that already have a jsoup dependency, as Markwon repackaged the jsoup source classes.
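The idea of tokenising without fixing can be sketched as follows. This is not the repackaged jsoup Tokeniser and does not reproduce its API; RawTokeniserSketch, tokenise and the simplified tag pattern are hypothetical names and assumptions used only to show the principle: every tag is reported exactly as it appears, and nothing is added or removed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical raw tokeniser: reports what is in the input, repairs nothing.
final class RawTokeniserSketch {

    // Simplified tag pattern (assumption): a name followed by anything up to '>'.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9-]*)[^<>]*>");

    static List<String> tokenise(String html) {
        final List<String> tokens = new ArrayList<>();
        final Matcher matcher = TAG.matcher(html);
        int last = 0;
        while (matcher.find()) {
            if (matcher.start() > last) {
                // text between two tags
                tokens.add("TEXT(" + html.substring(last, matcher.start()) + ")");
            }
            final String kind = matcher.group(1).isEmpty() ? "START" : "END";
            tokens.add(kind + "(" + matcher.group(2) + ")");
            last = matcher.end();
        }
        if (last < html.length()) {
            // trailing text after the last tag
            tokens.add("TEXT(" + html.substring(last) + ")");
        }
        return tokens;
    }
}
```

Because the tokens of separate fragments can simply be appended to one stream, an opening tag from one fragment can still be matched with a closing tag from a later one.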

Parser

There are no additional steps to configure HTML parsing. It's enabled by default. If you wish to exclude it, please see the Exclude HTML parsing section below.

The key class here is MarkwonHtmlParser, which is defined in the markwon-html-parser-api module. markwon-html-parser-api is a simple module that defines the HTML parsing contract and does not provide an implementation.

To change which implementation Markwon uses, SpannableConfiguration can be used:

// htmlParser is any MarkwonHtmlParser instance
SpannableConfiguration.builder(context)
        .htmlParser(htmlParser)
        .build();

markwon-html-parser-impl, on the other hand, provides a MarkwonHtmlParser implementation called MarkwonHtmlParserImpl. It can be created like this:

// with the default empty tag replacement
final MarkwonHtmlParser htmlParser = MarkwonHtmlParserImpl.create();
// or with an explicit HtmlEmptyTagReplacement instance
final MarkwonHtmlParser customHtmlParser = MarkwonHtmlParserImpl.create(HtmlEmptyTagReplacement.create());

Empty tag replacement

In order to append text content for self-closing, void or just empty HTML tags, HtmlEmptyTagReplacement can be used. As we cannot set a Span on empty content, we must represent an empty tag with text during the parsing stage (if we want it to be represented at all).

Consider this:

  • <img src="me-sad.JPG">
  • <br />
  • <who-am-i></who-am-i>

By default (HtmlEmptyTagReplacement.create()), the img and br tags are handled: img is replaced with the value of its alt attribute if it is present and with \uFFFC if it is not, and br inserts a new line.
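The default rules described above can be written down as a small standalone function. This is a sketch of the described behaviour, not the source of HtmlEmptyTagReplacement: the class and method names are hypothetical, the tag is modelled as a plain name plus an attribute map, and treating an empty alt value like a missing one is an assumption.

```java
import java.util.Map;

// Hypothetical model of the default empty tag replacement rules.
final class EmptyTagReplacementSketch {

    // Object replacement character, used when an image has no alt text.
    static final String IMG_PLACEHOLDER = "\uFFFC";

    // Returns the text that represents an empty tag, or null if it is not represented.
    static String replace(String tagName, Map<String, String> attributes) {
        switch (tagName) {
            case "br": {
                return "\n";
            }
            case "img": {
                final String alt = attributes.get("alt");
                // assumption: an empty alt is treated the same as a missing one
                return (alt == null || alt.isEmpty()) ? IMG_PLACEHOLDER : alt;
            }
            default: {
                // unknown empty tags such as <who-am-i></who-am-i> produce no text
                return null;
            }
        }
    }
}
```

With these rules, the three tags listed above produce the placeholder character, a new line and nothing at all, respectively.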

Non-closed tags

It's possible that your HTML contains non-closed tags. By default Markwon will ignore them, but if you wish to get a bit closer to a web-browser experience, you can allow this behaviour:

SpannableConfiguration.builder(context)
        .htmlAllowNonClosedTags(true)
        .build();

Note

If there is (for example) an <i> tag at the start of a document that is never closed, and Markwon is configured to not ignore non-closed tags (.htmlAllowNonClosedTags(true)), the whole document will be rendered in italics.
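The difference between the two settings can be illustrated with a standalone sketch that turns tags into text ranges. It does not use Markwon classes; NonClosedTagsSketch, spans and the allowNonClosedTags parameter are hypothetical names chosen to mirror the option above, and only attribute-free tags are recognised.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical model: computes "name[start,end)" ranges over the tag-free text.
final class NonClosedTagsSketch {

    private static final Pattern TAG = Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9]*)>");

    static List<String> spans(String html, boolean allowNonClosedTags) {
        final List<String> spans = new ArrayList<>();
        final Deque<String> openNames = new ArrayDeque<>();
        final Deque<Integer> openStarts = new ArrayDeque<>();
        final Matcher matcher = TAG.matcher(html);
        int textLength = 0; // length of the text seen so far, tags excluded
        int last = 0;
        while (matcher.find()) {
            textLength += matcher.start() - last;
            last = matcher.end();
            final String name = matcher.group(2);
            if (matcher.group(1).isEmpty()) {
                openNames.push(name);
                openStarts.push(textLength);
            } else if (name.equals(openNames.peek())) {
                openNames.pop();
                spans.add(name + "[" + openStarts.pop() + "," + textLength + ")");
            }
        }
        textLength += html.length() - last;
        // tags that were never closed
        while (!openNames.isEmpty()) {
            final String name = openNames.pop();
            final int start = openStarts.pop();
            if (allowNonClosedTags) {
                // browser-like: the tag applies until the end of the document
                spans.add(name + "[" + start + "," + textLength + ")");
            }
            // otherwise the non-closed tag is ignored
        }
        return spans;
    }
}
```

For the input <i>Hello <b>world</b>! the sketch reports only the bold range when non-closed tags are ignored, and additionally an italic range covering the whole text when they are allowed.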

Implementation note

MarkwonHtmlParserImpl does not create a unified HTML node tree. Instead it creates 2 collections: inline tags and block tags. Inline tags are represented as a flat List, and block tags are structured as a tree. This helps to achieve browser-like behaviour, where an open inline tag is applied to all content (even inside blocks) until its closing tag. All tags that are not inline are considered to be block ones.
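A rough model of the two collections might look like this. It is a sketch under assumptions, not the MarkwonHtmlParserImpl source: the TwoCollectionsSketch and Tag types are hypothetical, the set of inline tag names is a small sample picked for the example, and attributes are not parsed.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical model: a flat list of inline tags plus a tree of block tags.
final class TwoCollectionsSketch {

    // Sample of inline tag names (assumption); everything else counts as block.
    static final Set<String> INLINE =
            new HashSet<>(Arrays.asList("i", "em", "b", "strong", "u", "s", "a"));

    static final class Tag {
        final String name;
        final int start;
        int end = -1; // stays -1 while the tag is not closed
        final List<Tag> children = new ArrayList<>(); // filled for block tags only

        Tag(String name, int start) {
            this.name = name;
            this.start = start;
        }

        @Override
        public String toString() {
            return name + "[" + start + "," + end + ")"
                    + (children.isEmpty() ? "" : children.toString());
        }
    }

    final List<Tag> inlineTags = new ArrayList<>();
    final List<Tag> blockRoots = new ArrayList<>();

    static TwoCollectionsSketch parse(String html) {
        final TwoCollectionsSketch result = new TwoCollectionsSketch();
        final Deque<Tag> openInline = new ArrayDeque<>();
        final Deque<Tag> openBlock = new ArrayDeque<>();
        final Matcher m = Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9]*)>").matcher(html);
        int text = 0; // position in the tag-free text
        int last = 0;
        while (m.find()) {
            text += m.start() - last;
            last = m.end();
            final String name = m.group(2);
            final boolean inline = INLINE.contains(name);
            final Deque<Tag> open = inline ? openInline : openBlock;
            if (m.group(1).isEmpty()) {
                final Tag tag = new Tag(name, text);
                if (inline) {
                    result.inlineTags.add(tag);          // flat list
                } else if (openBlock.isEmpty()) {
                    result.blockRoots.add(tag);          // root of a block tree
                } else {
                    openBlock.peek().children.add(tag);  // nested block
                }
                open.push(tag);
            } else if (!open.isEmpty() && open.peek().name.equals(name)) {
                open.pop().end = text;
            }
        }
        return result;
    }
}
```

Because inline tags are tracked independently of the block tree, an <i> that is opened before a block and closed after it simply gets a range that covers the block's text.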

Renderer

Unlike MarkwonHtmlParser, MarkwonHtmlRenderer comes with Markwon by default.

The default implementation can be obtained like this:

MarkwonHtmlRenderer.create();

The default instance handles these tags:

  • emphasis
    • i
    • em
    • cite
    • dfn
  • strong emphasis
    • b
    • strong
  • sup (super script)
  • sub (sub script)
  • underline
    • u
    • ins
  • strike through
    • del
    • s
    • strike
  • a (link)
  • ul (unordered list)
  • ol (ordered list)
  • img (image)
  • blockquote (block quote)
  • h{1-6} (heading)

If you wish to extend the default handling (or override existing handlers), the #builderWithDefaults factory method can be used:

MarkwonHtmlRenderer.builderWithDefaults();

For a completely clean, configurable instance, the #builder method can be used:

MarkwonHtmlRenderer.builder();
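The difference between the two factory methods can be modelled with a tiny registry. This is a standalone sketch and not MarkwonHtmlRenderer itself: HandlerRegistrySketch, its handler and render methods, and the use of plain string functions instead of real tag handlers are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical registry that mimics "builder" versus "builderWithDefaults".
final class HandlerRegistrySketch {

    private final Map<String, Function<String, String>> handlers;

    private HandlerRegistrySketch(Map<String, Function<String, String>> handlers) {
        this.handlers = handlers;
    }

    // Completely clean builder: no tag is handled until one is registered.
    static Builder builder() {
        return new Builder();
    }

    // Builder pre-populated with defaults that can be extended or overridden.
    static Builder builderWithDefaults() {
        return new Builder()
                .handler("i", text -> "*" + text + "*")
                .handler("b", text -> "**" + text + "**");
    }

    // Applies the registered handler, or returns the text untouched.
    String render(String tagName, String text) {
        final Function<String, String> handler = handlers.get(tagName);
        return handler == null ? text : handler.apply(text);
    }

    static final class Builder {

        private final Map<String, Function<String, String>> handlers = new HashMap<>();

        // Registering the same tag name again replaces the previous handler.
        Builder handler(String tagName, Function<String, String> handler) {
            handlers.put(tagName, handler);
            return this;
        }

        HandlerRegistrySketch build() {
            return new HandlerRegistrySketch(new HashMap<>(handlers));
        }
    }
}
```

In this model, a registry built from the clean builder leaves an <i> tag unstyled, while one built with defaults styles it unless a later registration for the same tag name overrides the default.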

Custom tag handler

To configure MarkwonHtmlRenderer to handle tags differently, or to create a new tag handler, TagHandler can be used:

public abstract class TagHandler {

    public abstract void handle(
            @NonNull SpannableConfiguration configuration,
            @NonNull SpannableBuilder builder,
            @NonNull HtmlTag tag
    );
}

For the simplest inline tag handler, SimpleTagHandler can be used:

public abstract class SimpleTagHandler extends TagHandler {

    @Nullable
    public abstract Object getSpans(@NonNull SpannableConfiguration configuration, @NonNull HtmlTag tag);
}

For example, EmphasisHandler:

public class EmphasisHandler extends SimpleTagHandler {
    @Nullable
    @Override
    public Object getSpans(@NonNull SpannableConfiguration configuration, @NonNull HtmlTag tag) {
        return configuration.factory().emphasis();
    }
}

If you wish to handle a block HTML node (for example <ul><li>First<li>Second</ul>), refer to the ListHandler source code.

WARNING

The most important thing when implementing a custom TagHandler is to know what type of HtmlTag we are dealing with. There are 2: inline and block. An inline tag cannot contain children. A block tag can contain children, and they most likely should also be visited and handled by the registered TagHandler (if any) accordingly. See TagHandler#visitChildren(configuration, builder, child).
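Why children need to be visited can be shown with a small recursive sketch. It does not use HtmlTag or TagHandler; the Block type, the handle method and the textual bullet output are hypothetical stand-ins for a block tag tree and a list handler.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical block tag tree and a handler that visits children recursively.
final class VisitChildrenSketch {

    static final class Block {
        final String name;
        final String text;
        final List<Block> children = new ArrayList<>();

        Block(String name, String text) {
            this.name = name;
            this.text = text;
        }

        Block add(Block child) {
            children.add(child);
            return this;
        }
    }

    // Emits one line per list item, indented by its nesting depth.
    static void handle(Block block, int depth, List<String> out) {
        final boolean isItem = "li".equals(block.name);
        if (isItem) {
            final StringBuilder line = new StringBuilder();
            for (int i = 0; i < depth; i++) {
                line.append("  ");
            }
            out.add(line.append("- ").append(block.text).toString());
        }
        // a block tag can contain children, so they have to be visited as well;
        // without this loop nested lists would silently disappear
        for (Block child : block.children) {
            handle(child, isItem ? depth + 1 : depth, out);
        }
    }
}
```

A handler that stopped at the top-level ul would never see the items at all, which is the situation the warning above is about.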

CSS inline style parser

When implementing your own TagHandler, you might want to inspect the inline CSS styles of an HTML element. Markwon provides a utility parser for that purpose:

final CssInlineStyleParser inlineStyleParser = CssInlineStyleParser.create();
for (CssProperty property: inlineStyleParser.parse("width: 100%; height: 100%;")) {
    // [0] = CssProperty({width=100%}),
    // [1] = CssProperty({height=100%})
}
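For readers curious about what such a parser has to do, here is a minimal standalone sketch. It is not the CssInlineStyleParser implementation: InlineStyleSketch and its parse method are hypothetical, and the naive split on semicolons and colons is an assumption that would break on values containing those characters (for example url(...) values or quoted strings).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, naive inline style parser: "key: value;" pairs only.
final class InlineStyleSketch {

    // Returns the declarations in the order they appear in the style attribute.
    static Map<String, String> parse(String style) {
        final Map<String, String> properties = new LinkedHashMap<>();
        for (String declaration : style.split(";")) {
            final int colon = declaration.indexOf(':');
            if (colon < 0) {
                // not a "key: value" declaration, skip it
                continue;
            }
            final String key = declaration.substring(0, colon).trim();
            final String value = declaration.substring(colon + 1).trim();
            if (!key.isEmpty() && !value.isEmpty()) {
                properties.put(key, value);
            }
        }
        return properties;
    }
}
```

Called with the style string from the snippet above, the sketch yields a width entry and a height entry, both with the value 100%.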

Exclude HTML parsing

If you wish to exclude HTML parsing altogether, you can manually exclude the markwon-html-parser-impl artifact from your project's compile classpath. This can be beneficial if you know that the markdown input won't contain HTML and/or you wish to ignore it. Excluding HTML parsing can speed up Markwon parsing and will decrease the final size of the Markwon dependency by around 100 KB.

dependencies {
    implementation("ru.noties:markwon:${markwonVersion}") {
        exclude module: 'markwon-html-parser-impl'
    }
}

Excluding markwon-html-parser-impl this way will result in the MarkwonHtmlParser#noOp implementation being used. No further steps are required.

Note

Excluding markwon-html-parser-impl won't remove all the content between HTML tags. It will if commonmark decides that a specific fragment is an HtmlBlock, but it won't if the fragment is considered an HtmlInline, as an HtmlInline does not contain content (just a tag definition).

Last Updated: 6/17/2019, 2:08:33 PM