Parsing HTML to Contentful rich text
A while ago Contentful released a new feature called Rich Text, a new field type that lets you create rich text content. It lets you format your text in all the ways you're used to: making text bold, adding headers, inserting lists, quotes, code and so on. This isn't all that exciting on its own, as it's been available through the markdown field type for a long time. However, rich text fields are quite different from traditional markdown or HTML editors. A rich text field stores the content as a typed JSON structure, turning your paragraph into something like this:
{
  "data": {},
  "content": [
    {
      "data": {},
      "marks": [],
      "value": "This is a paragraph",
      "nodeType": "text"
    }
  ],
  "nodeType": "paragraph"
}
This typed structure (well, as typed as vanilla JavaScript can be anyway) allows for greater flexibility and programmability than, for example, markdown. It does come at the cost of considerable bloat, though. Compare the following two paragraphs in markdown, HTML and rich text.
Markdown
This is a paragraph.
So is this!
HTML
<p>This is a paragraph.</p>
<p>So is this!</p>
Rich text
{
  "data": {},
  "content": [
    {
      "data": {},
      "marks": [],
      "value": "This is a paragraph.",
      "nodeType": "text"
    }
  ],
  "nodeType": "paragraph"
},
{
  "data": {},
  "content": [
    {
      "data": {},
      "marks": [],
      "value": "So is this!",
      "nodeType": "text"
    }
  ],
  "nodeType": "paragraph"
}
That's a lot of JSON for two small paragraphs but, as I alluded to above, it comes with significant benefits as well. Flexibility: you can create new custom node types with minimal changes to the structure. Programmability: since the structure is very rigid, it's much easier to build a program that reads it and produces some output than it is to write a markdown or HTML parser.
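To make the programmability point a bit more concrete, here is a minimal sketch in F# of a function that walks a rich-text-like tree and collects its plain text. The `RichNode` type and the `toPlainText` function are simplified stand-ins invented for this illustration; they are not the actual Contentful model or any SDK's API.

```fsharp
// Hypothetical, simplified model of a rich text tree (not the real SDK types).
type RichNode =
    | Text of value: string
    | Element of nodeType: string * content: RichNode list

// Recursively collect the text of a node and all of its children.
// Paragraphs get a trailing newline so they stay separated in the output.
let rec toPlainText (node: RichNode) : string =
    match node with
    | Text value -> value
    | Element ("paragraph", content) ->
        (content |> List.map toPlainText |> String.concat "") + "\n"
    | Element (_, content) ->
        content |> List.map toPlainText |> String.concat ""

// The two paragraphs from the comparison above, in the simplified model.
let paragraphs =
    [ Element ("paragraph", [ Text "This is a paragraph." ])
      Element ("paragraph", [ Text "So is this!" ]) ]

let document = Element ("document", paragraphs)

printf "%s" (toPlainText document)
```

Because every node has the same predictable shape, the whole traversal is one recursive pattern match, with no tokenizing or error recovery needed.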
I played around quite a lot with the rich text structure while building support for it into the .NET SDK, and I'll admit to being quite skeptical at first. I felt like everything rich text could do, markdown could do better, but during implementation I changed my mind. A JSON structure like rich text made writing a parser a breeze: I just created a few classes and a custom deserializer, and suddenly I had the entire content structure in strongly typed C#.
As adoption of the rich text field has grown significantly, even though the feature itself is still in beta, so has the number of questions about it in the Contentful Community Slack. One thing that pops up from time to time is how to get your existing content, markdown or HTML, into a rich text field. For markdown there's already an npm package provided by Contentful, but I haven't been able to find anything for HTML. I figured it'd be a fun thing to build with F#, as I've never really built anything publicly with it before.
I started out by figuring out the different types I needed and came up with the following.
type Sys = { id:string; ``type``:string; linkType:string }
type TargetData = { sys:Sys }
type ReferenceData = { target:TargetData }
type LinkData = { uri:string }
type ContentfulData =
    | LinkData of LinkData
    | ReferenceData of ReferenceData
    | Unit of Unit
type Node = { nodeType:string; data: ContentfulData; content: List<ContentfulNodes> }
and TextMark = { ``type``:string; }
and TextNode = { nodeType:string; marks: List<TextMark>; data: ContentfulData; value:string; }
and ContentfulNodes =
    | Node of Node
    | TextNode of TextNode
    | TextMark of TextMark
I then started parsing the HTML using the HtmlParser in the FSharp.Data package.
let rec private getChildElements (elems:list<HtmlNode>) =
    let content = elems |> List.map (fun i ->
        match i.Name() with
        | "p" -> Node { nodeType = "paragraph"; data = Unit (); content = (i.Elements() |> getChildElements ) }
        | "" -> TextNode { nodeType = "text"; data = Unit (); marks = []; value = i.InnerText()}
        | _ -> Node { nodeType = "unknown"; data = Unit (); content = [] }
    )
    content
This function takes a list of HtmlNodes and matches on the name of each node. If it's a <p> tag we turn it into a rich text paragraph. If it doesn't have a name we turn it into a text node. I then filled out the other supported types and added a little function for parsing marks, which in rich text lingo are attributes that decorate text: bold, italic, underline and so on.
let private getMarks (elem:HtmlNode) =
    // Collect marks from the element itself and all of its descendants.
    elem :: (elem.Descendants() |> Seq.toList)
    |> List.map (fun i ->
        match i.Name() with
        | "strong" | "b" -> { ``type`` = "bold" }
        | "u" -> { ``type`` = "underline" }
        | "i" -> { ``type`` = "italic" }
        | "code" -> { ``type`` = "code" }
        | _ -> { ``type`` = "unsupported" })
    |> List.filter (fun m -> m.``type`` <> "unsupported")

let rec private getChildElements (elems:list<HtmlNode>) =
    let content = elems |> List.map (fun i ->
        match i.Name() with
        | "p" -> Node { nodeType = "paragraph"; data = Unit (); content = (i.Elements() |> getChildElements ) }
        | "ul" -> Node { nodeType = "unordered-list"; data = Unit (); content = (i.Elements() |> getChildElements ) }
        | "ol" -> Node { nodeType = "ordered-list"; data = Unit (); content = (i.Elements() |> getChildElements ) }
        | "li" -> Node { nodeType = "list-item"; data = Unit (); content = (i.Elements() |> getChildElements ) }
        | "h1" -> Node { nodeType = "heading-1"; data = Unit (); content = (i.Elements() |> getChildElements ) }
        | "h2" -> Node { nodeType = "heading-2"; data = Unit (); content = (i.Elements() |> getChildElements ) }
        | "h3" -> Node { nodeType = "heading-3"; data = Unit (); content = (i.Elements() |> getChildElements ) }
        | "h4" -> Node { nodeType = "heading-4"; data = Unit (); content = (i.Elements() |> getChildElements ) }
        | "h5" -> Node { nodeType = "heading-5"; data = Unit (); content = (i.Elements() |> getChildElements ) }
        | "h6" -> Node { nodeType = "heading-6"; data = Unit (); content = (i.Elements() |> getChildElements ) }
        | "blockquote" -> Node { nodeType = "blockquote"; data = Unit (); content = (i.Elements() |> getChildElements ) }
        | "hr" -> Node { nodeType = "hr"; data = Unit (); content = [] }
        | "a" -> Node { nodeType = "hyperlink"; data = LinkData { uri = i.AttributeValue("href") }; content = (i.Elements() |> getChildElements )}
        | "strong" | "b" | "i" | "u" | "code" -> TextNode { nodeType = "text"; data = Unit (); marks = getMarks i; value = i.InnerText()}
        | "" -> TextNode { nodeType = "text"; data = Unit (); marks = []; value = i.InnerText()}
        | _ -> Node { nodeType = "unknown"; data = Unit (); content = [] }
    )
    content
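To see what getMarks yields for nested inline tags, the sketch below mirrors its logic against a tiny stand-in tree type so that it runs without FSharp.Data. The `Tag` type, the `descendants` helper and the `marksFor` name are assumptions made for this illustration; only the tag-to-mark mapping is taken from the code above, and it uses `List.choose` in place of the map-then-filter step.

```fsharp
// Stand-in for FSharp.Data's HtmlNode (hypothetical, for illustration only).
type Tag = { name: string; children: Tag list }

// All nodes below a tag, depth first, similar in spirit to Descendants().
let rec descendants (tag: Tag) : Tag list =
    tag.children |> List.collect (fun c -> c :: descendants c)

// Same tag-to-mark mapping as getMarks: the element itself plus every descendant.
let marksFor (tag: Tag) : string list =
    (tag :: descendants tag)
    |> List.choose (fun t ->
        match t.name with
        | "strong" | "b" -> Some "bold"
        | "u" -> Some "underline"
        | "i" -> Some "italic"
        | "code" -> Some "code"
        | _ -> None)

// Models <strong>bold <i>and italic</i></strong>; text nodes have an empty name.
let text = { name = ""; children = [] }

let nested =
    { name = "strong"
      children = [ text; { name = "i"; children = [ text ] } ] }

printfn "%A" (marksFor nested) // prints ["bold"; "italic"]
```

Because the inline-tag branch above takes InnerText() of the outer element as the value, this would appear to produce a single text node whose entire text carries both marks, rather than separate bold and bold-italic runs.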
One glaring omission here is <img> tags, which I haven't handled yet as I haven't fully decided on the strategy. Should the package help you create an asset, or should it let you define your own callback for custom tags you want to handle? I'll ponder this for the next version.
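One shape the callback idea could take is sketched below. Everything in it is hypothetical: the `CustomHandler` type, the simplified `RichNode` model and the `convertTag` function are invented for this illustration and are not part of the package.

```fsharp
// Hypothetical simplified output model (not the package's types).
type RichNode =
    | Text of string
    | Element of nodeType: string * content: RichNode list

// A caller-supplied handler: given a tag name and its attributes,
// return Some node to take over the conversion, or None to decline.
type CustomHandler = string -> Map<string, string> -> RichNode option

// Try the custom handler first and fall back to an "unknown" node.
let convertTag (handler: CustomHandler) (name: string) (attrs: Map<string, string>) : RichNode =
    match handler name attrs with
    | Some node -> node
    | None -> Element ("unknown", [])

// Example handler that turns <img> into a placeholder text node with its src.
let imgHandler : CustomHandler =
    fun name attrs ->
        match name, Map.tryFind "src" attrs with
        | "img", Some src -> Some (Text (sprintf "[image: %s]" src))
        | _ -> None

let result = convertTag imgHandler "img" (Map.ofList [ "src", "cat.png" ])
printfn "%A" result
```

Returning an option keeps the fallback in one place: tags the handler declines still end up as unknown nodes, just as in the current code.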
I then created the public functions that take in the actual HTML string.
let parseHtml html =
    let res = HtmlDocument.Parse html
    let content = res.Elements() |> getChildElements
    Seq.toList content

let htmlToJson html =
    let structure = parseHtml html
    let doc = { nodeType = "document"; data = Unit (); content = structure }
    JsonConvert.SerializeObject doc
Note how I wrap the entire HTML in a rich text document node in the htmlToJson function, as this is a required root element in a Contentful rich text structure.
This package is now available as Contentful.RichTextParser on NuGet. I had plans to use the package in an Azure Function to make it available for other languages as well, but currently the support for custom NuGet packages in Azure Functions is abysmal, so that will have to wait. Anyway, below is an example of how the package can be used in conjunction with the .NET SDK to parse HTML and pass it back in a rich text field to Contentful.
var html = "<p>Hello</p><p>Another with <strong>bold maybe?</strong></p>";
var json = Parser.htmlToJson(html);
var obj = new
{
    Rich = JObject.Parse(json)
};
var newEntry = await _client.CreateEntryForLocale(obj, "rich-thing-html", "content-type-id");
I hope you find it useful.