2.0.0
HTML
WARNING
This is documentation for legacy versions. For the most current version click here.
Starting with version 2.0.0, Markwon brings the whole HTML parsing/rendering stack on-site. The main reason for this is the special definitions of HTML nodes in the commonmark spec, more specifically inline and block. These two are a bit different from the native HTML understanding. Well, they are completely different and share only their names with HTML-inline and HTML-block elements. This leads to situations where, for example, an <i> tag is considered a block when it's used like this:
<i>
Hello from italics tag
</i>
A bit of background
This issue brought attention to the differences between HTML and commonmark implementations.
Let's modify the code snippet above a bit:
<i>
Hello from italics tag

</i>
We have just added a new-line before the closing </i> tag. This changes everything, as now, according to the commonmark dingus, we have 2 HtmlBlocks: one before the new-line (containing the opening <i> tag and the text content) and one after (containing only the closing </i> tag).
If we modify the code snippet a bit again:
<i>
Hello from italics tag

</i><b>bold</b>
We will have 1 HtmlBlock (the first one from the previous snippet) and a bunch of HtmlInlines:
- HtmlInline (</i>)
- HtmlInline (<b>)
- Text (bold)
- HtmlInline (</b>)
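The three snippets above can be reproduced with a rough sketch. It assumes only one rule from the commonmark spec: an HTML block opened by a line that holds a single tag ends at the next blank line. The class name, the simplified tag pattern (no attributes) and the "Paragraph" label are illustrative assumptions, not part of Markwon or commonmark-java.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class HtmlBlockSketch {

    // Simplified: a line that holds exactly one opening or closing tag, no attributes.
    private static final Pattern LONE_TAG =
            Pattern.compile("^ {0,3}</?[A-Za-z][A-Za-z0-9-]*\\s*/?>\\s*$");

    /** Splits the input at blank lines and labels each chunk by its first line. */
    static List<String> classify(String markdown) {
        List<String> result = new ArrayList<>();
        for (String chunk : markdown.split("\\n\\s*\\n")) {
            if (chunk.trim().isEmpty()) {
                continue;
            }
            String firstLine = chunk.split("\\n", 2)[0];
            result.add(LONE_TAG.matcher(firstLine).matches() ? "HtmlBlock" : "Paragraph");
        }
        return result;
    }

    public static void main(String[] args) {
        // no blank line: everything stays in one block
        System.out.println(classify("<i>\nHello from italics tag\n</i>"));
        // blank line before the closing tag: two blocks
        System.out.println(classify("<i>\nHello from italics tag\n\n</i>"));
        // closing tag shares its line with other content: a paragraph of inlines
        System.out.println(classify("<i>\nHello from italics tag\n\n</i><b>bold</b>"));
    }
}
```

The last case is labelled a paragraph because the line starting with </i> contains more than one tag, so its tags end up as HtmlInline nodes.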
Those little differences render Html.fromHtml (which was used in 1.x.x versions) useless. In fact, they render most HTML parser implementations useless, as most of them do not allow processing HTML fragments in a raw fashion without fixing content on the fly.
Both TagSoup and Jsoup HTML parsers (that were considered for this project) are built to deal with malicious HTML code (all HTML code? 😶). So, when supplied with an <i>italic fragment, they will make it <i>italic</i>.
And it's a good thing, but consider these fragments for the sake of markdown:
<i>italic
<b>bold italic
</b></i>
We will get:
<i>italic </i>
<b>bold italic</b>
* Or to be precise: <html><head></head><body><i>italic </i></body></html> & <html><head></head><body><b>bold italic</b></body></html>
Which will be rendered in the final document as:
expected | actual |
---|---|
"italic" in italics, "bold italic" in bold italics | "italic" in italics, "bold italic" in bold only |
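The difference in the table can be sketched with a toy tokenizer that records which tags are open when each piece of text is seen. With state shared across fragments it gives the expected result; when every fragment starts from a clean state (which is roughly what happens when a parser closes each fragment on its own), it gives the actual one. All names here are hypothetical and the tag pattern is deliberately minimal (no attributes).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FragmentStateSketch {

    // group 1: "/" for a closing tag, group 2: tag name, group 3: plain text
    private static final Pattern TOKEN = Pattern.compile("<(/?)([a-zA-Z]+)>|([^<]+)");

    /** Tokenizes one fragment, recording each text run with the tags open at that moment. */
    static void process(String fragment, Deque<String> open, List<String> out) {
        Matcher m = TOKEN.matcher(fragment);
        while (m.find()) {
            if (m.group(3) != null) {
                String text = m.group(3).trim();
                if (!text.isEmpty()) {
                    out.add(text + " " + open);
                }
            } else if (m.group(1).isEmpty()) {
                open.addLast(m.group(2));
            } else {
                open.removeLastOccurrence(m.group(2));
            }
        }
    }

    static List<String> render(List<String> fragments, boolean sharedState) {
        List<String> out = new ArrayList<>();
        Deque<String> open = new ArrayDeque<>();
        for (String fragment : fragments) {
            if (!sharedState) {
                // each fragment is treated as a complete document of its own
                open = new ArrayDeque<>();
            }
            process(fragment, open, out);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> fragments = List.of("<i>italic ", "<b>bold italic");
        System.out.println(render(fragments, true));  // [italic [i], bold italic [i, b]]
        System.out.println(render(fragments, false)); // [italic [i], bold italic [b]]
    }
}
```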
This might seem like a minor problem, but add more tags to a document, introduce some deeply nested structures, spice opening and closing tags up by adding markdown markup between them, and finally write malicious HTML code 😆!
There is no such problem on the frontend, at which the commonmark specification is mostly aimed, as the frontend runs in a web-browser environment. After all, parsed markdown will become HTML tags (the most common usage), and the web browser will know how to render the final result.
We, on the other hand, do not possess HTML heritage (thank 🤖!), but still want to display some HTML to style the resulting markdown a bit. That's why Markwon incorporates its own HTML parsing logic. It is based on the Jsoup project and makes use of the Tokeniser class, which allows tokenising input HTML. All other code that doesn't serve this purpose was removed. It's safe to use in projects that already have a jsoup dependency, as Markwon repackaged the jsoup source classes (which can be found here).
Parser
There are no additional steps to configure HTML parsing. It's enabled by default. If you wish to exclude it, please follow the exclude section below.
The key class here is MarkwonHtmlParser, which is defined in the markwon-html-parser-api module.
markwon-html-parser-api is a simple module that defines the HTML parsing contract and does not provide an implementation.
To change which implementation Markwon should use, SpannableConfiguration can be used:
// htmlParser is any MarkwonHtmlParser implementation
SpannableConfiguration.builder(context)
        .htmlParser(htmlParser)
        .build();
markwon-html-parser-impl, on the other hand, provides a MarkwonHtmlParser implementation. It's called MarkwonHtmlParserImpl and can be created like this:
final MarkwonHtmlParser htmlParser = MarkwonHtmlParserImpl.create();
// or, with an explicit empty tag replacement
final MarkwonHtmlParser htmlParser = MarkwonHtmlParserImpl.create(HtmlEmptyTagReplacement.create());
Empty tag replacement
In order to append text content for self-closing, void or just empty HTML tags, HtmlEmptyTagReplacement can be used. As we cannot set a Span on empty content, we must represent an empty tag with text during the parsing stage (if we want it to be represented).
Consider this:
<img src="me-sad.JPG">
<br />
<who-am-i></who-am-i>
The default implementation (HtmlEmptyTagReplacement.create()) handles img and br tags. img is replaced with the value of its alt attribute if it is present, and with \uFFFC if it is not. br inserts a new line.
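The default rules described above can be sketched in plain Java as follows. This is not the Markwon class itself: the class name, the method signature and the attribute map are assumptions made to keep the example self-contained.

```java
import java.util.Collections;
import java.util.Map;

public class EmptyTagReplacementSketch {

    // Unicode OBJECT REPLACEMENT CHARACTER, a placeholder that a span can be attached to
    static final String IMG_REPLACEMENT = "\uFFFC";

    /** Returns the text that stands in for an empty tag, or null if the tag is skipped. */
    static String replace(String tagName, Map<String, String> attributes) {
        switch (tagName) {
            case "br":
                return "\n";
            case "img":
                String alt = attributes.get("alt");
                return (alt == null || alt.isEmpty()) ? IMG_REPLACEMENT : alt;
            default:
                // unknown empty tags (for example <who-am-i>) get no text representation
                return null;
        }
    }

    public static void main(String[] args) {
        Map<String, String> none = Collections.emptyMap();
        System.out.println(replace("img", Collections.singletonMap("alt", "me, sad")));
        System.out.println(IMG_REPLACEMENT.equals(replace("img", none)));
        System.out.println(replace("who-am-i", none));
    }
}
```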
Non-closed tags
It's possible that your HTML contains non-closed tags. By default Markwon will ignore them, but if you wish to get a bit closer to a web-browser experience, you can allow this behaviour:
SpannableConfiguration.builder(context)
        .htmlAllowNonClosedTags(true)
        .build();
Note
If there is (for example) an <i> tag at the start of a document, it's not closed, and Markwon is configured to not ignore non-closed tags (.htmlAllowNonClosedTags(true)), it will render the whole document in italics.
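The effect of this flag can be illustrated with a toy model that turns tags into ranges over the plain text. Tags still open at the end of input are either dropped or extended to the end of the document. The class, the method and the range notation are hypothetical, and the tag pattern ignores attributes.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NonClosedTagsSketch {

    private static final Pattern TOKEN = Pattern.compile("<(/?)([a-zA-Z]+)>|([^<]+)");

    private static final class Open {
        final String tag;
        final int start;

        Open(String tag, int start) {
            this.tag = tag;
            this.start = start;
        }
    }

    /** Returns ranges like "b[6,11)" over the text with tags stripped. */
    static List<String> spans(String html, boolean allowNonClosed) {
        StringBuilder text = new StringBuilder();
        Deque<Open> open = new ArrayDeque<>();
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            if (m.group(3) != null) {
                text.append(m.group(3));
            } else if (m.group(1).isEmpty()) {
                open.push(new Open(m.group(2), text.length()));
            } else {
                // push() adds to the head, so iteration starts at the newest open tag
                Iterator<Open> it = open.iterator();
                while (it.hasNext()) {
                    Open o = it.next();
                    if (o.tag.equals(m.group(2))) {
                        out.add(o.tag + "[" + o.start + "," + text.length() + ")");
                        it.remove();
                        break;
                    }
                }
            }
        }
        if (allowNonClosed) {
            // a non-closed tag covers everything from its start to the end of the document
            for (Open o : open) {
                out.add(o.tag + "[" + o.start + "," + text.length() + ")");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(spans("<i>Hello <b>world</b>", false)); // [b[6,11)]
        System.out.println(spans("<i>Hello <b>world</b>", true));  // [b[6,11), i[0,11)]
    }
}
```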
Implementation note
MarkwonHtmlParserImpl does not create a unified HTML node. Instead it creates 2 collections: inline tags and block tags. Inline tags are represented as a flat List (reference), and block tags are structured in a tree. This helps to achieve browser-like behaviour, where an open inline tag is applied to all content (even inside blocks) until its closing tag. All tags that are not inline are considered to be block ones.
Renderer
Unlike MarkwonHtmlParser, Markwon comes with a MarkwonHtmlRenderer by default.
The default implementation can be obtained like this:
MarkwonHtmlRenderer.create();
The default instance handles these tags:
- emphasis
  - i
  - em
  - cite
  - dfn
- strong emphasis
  - b
  - strong
- sup (super script)
- sub (sub script)
- underline
  - u
  - ins
- strike through
  - del
  - s
  - strike
- a (link)
- ul (unordered list)
- ol (ordered list)
- img (image)
- blockquote (block quote)
- h{1-6} (heading)
If you wish to extend the default handling (or override existing handlers), the #builderWithDefaults factory method can be used:
MarkwonHtmlRenderer.builderWithDefaults();
For a completely clean configurable instance, the #builder method can be used:
MarkwonHtmlRenderer.builder();
Custom tag handler
To configure MarkwonHtmlRenderer to handle tags differently, or to create a new tag handler, TagHandler can be used:
public abstract class TagHandler {
    public abstract void handle(
            @NonNull SpannableConfiguration configuration,
            @NonNull SpannableBuilder builder,
            @NonNull HtmlTag tag
    );
}
For the simplest inline tag handler, SimpleTagHandler can be used:
public abstract class SimpleTagHandler extends TagHandler {
    @Nullable
    public abstract Object getSpans(@NonNull SpannableConfiguration configuration, @NonNull HtmlTag tag);
}
For example, EmphasisHandler:
public class EmphasisHandler extends SimpleTagHandler {
    @Nullable
    @Override
    public Object getSpans(@NonNull SpannableConfiguration configuration, @NonNull HtmlTag tag) {
        return configuration.factory().emphasis();
    }
}
If you wish to handle a block HTML node (for example <ul><li>First<li>Second</ul>), refer to the ListHandler source code.
WARNING
The most important thing when implementing a custom TagHandler is to know what type of HtmlTag we are dealing with. There are 2: inline & block. An inline tag cannot contain children. A block tag can contain children, and they most likely should also be visited and handled by a registered TagHandler (if any) accordingly. See TagHandler#visitChildren(configuration, builder, child);
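The idea of visiting children can be sketched with toy types. Block, Handler and the registry map below are hypothetical stand-ins for HtmlTag, TagHandler and the renderer's handler lookup. Recursing into children of a tag that has no registered handler is an assumption of this sketch, not a statement about Markwon's own visitChildren.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class VisitChildrenSketch {

    /** Toy stand-in for a block tag: a name plus child blocks. */
    static final class Block {
        final String name;
        final List<Block> children;

        Block(String name, Block... children) {
            this.name = name;
            this.children = Arrays.asList(children);
        }
    }

    /** Toy stand-in for a tag handler. */
    interface Handler {
        void handle(Map<String, Handler> registry, List<String> log, Block tag);
    }

    /** Dispatches each child to its handler; children without one are descended into. */
    static void visitChildren(Map<String, Handler> registry, List<String> log, Block parent) {
        for (Block child : parent.children) {
            Handler handler = registry.get(child.name);
            if (handler != null) {
                handler.handle(registry, log, child);
            } else {
                visitChildren(registry, log, child);
            }
        }
    }

    static List<String> demo() {
        Map<String, Handler> registry = new HashMap<>();
        // a block handler does its own work, then lets its children be handled too
        registry.put("ul", (reg, log, tag) -> {
            log.add("ul");
            visitChildren(reg, log, tag);
        });
        registry.put("b", (reg, log, tag) -> log.add("b"));

        Block root = new Block("root", new Block("ul", new Block("li", new Block("b"))));
        List<String> visited = new ArrayList<>();
        visitChildren(registry, visited, root);
        return visited;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [ul, b]
    }
}
```

If the "ul" handler did not call visitChildren, the nested "b" tag would never reach its handler, which is the pitfall the warning describes.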
CSS inline style parser
When implementing your own TagHandler you might want to inspect the inline CSS styles of an HTML element. Markwon provides a utility parser for that purpose:
final CssInlineStyleParser inlineStyleParser = CssInlineStyleParser.create();
for (CssProperty property: inlineStyleParser.parse("width: 100%; height: 100%;")) {
    // [0] = CssProperty({width=100%}),
    // [1] = CssProperty({height=100%})
}
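To show what kind of key/value pairs come out of an inline style string, here is a minimal stand-alone sketch. It is not the CssInlineStyleParser implementation: it simply splits on ; and on the first :, so values that contain a semicolon (for example inside url(...)) are not handled. The class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class InlineStyleSketch {

    /** Parses "key: value; key: value;" into an insertion-ordered map. */
    static Map<String, String> parse(String style) {
        Map<String, String> properties = new LinkedHashMap<>();
        for (String declaration : style.split(";")) {
            int colon = declaration.indexOf(':');
            if (colon <= 0) {
                continue; // no key before the colon, or no colon at all
            }
            String key = declaration.substring(0, colon).trim();
            String value = declaration.substring(colon + 1).trim();
            if (!key.isEmpty() && !value.isEmpty()) {
                properties.put(key, value);
            }
        }
        return properties;
    }

    public static void main(String[] args) {
        System.out.println(parse("width: 100%; height: 100%;")); // {width=100%, height=100%}
    }
}
```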
Exclude HTML parsing
If you wish to exclude HTML parsing altogether, you can manually exclude the markwon-html-parser-impl artifact from your project's compile classpath. This can be beneficial if you know that the markdown input won't contain HTML and/or you wish to ignore it. Excluding HTML parsing can speed up Markwon parsing and will decrease the final size of the Markwon dependency by around 100 KB.
dependencies {
    implementation("ru.noties:markwon:${markwonVersion}") {
        exclude module: 'markwon-html-parser-impl'
    }
}
Excluding markwon-html-parser-impl this way will result in the MarkwonHtmlParser#noOp implementation being used. No further steps are required.
Note
Excluding markwon-html-parser-impl won't remove all the content between HTML tags. It will if commonmark decides that a specific fragment is an HtmlBlock, but it won't if the fragment is considered an HtmlInline, as HtmlInline does not contain content (just a tag definition).