Fixing the Markdown Parser on the Utopian V2 Frontend

in #utopian-io, 6 years ago (edited)

Repository

https://github.com/utopian-io/v2.utopian.io

Bug Fixes

  • What was the issue(s)?

Markdown Parser Bugs

The issue addressed in this set of pull requests concerned the Markdown parser and how it parsed certain types of data: specifically images, styled links, and blocks that contained both italic and bold text. While working to fix these problems, I encountered a few more cases where 'mention' links and general links were not properly parsed in a larger document.

The image above shows an example of this. The parser is not properly parsing the two primary links in this block of text. This text appeared very far down in the body of the contribution and everything after it was also plain text.

The image here shows another example of a slightly different issue, where the parser isn't automatically finding the Steemit usernames. All of the usernames in this block of text should be linked to the users' profile pages on the Utopian frontend. However, in this case, we have a big blob of plain text. The intended behavior is very similar to what you would find on Busy or Steemit; for example, @ausername creates a link to a profile even if there isn't a user registered under that name.
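To make the expected behavior concrete, here is a minimal sketch of mention linking. It is not the code from the Utopian repository: the function name `linkMentions`, the `/@name` profile path, and the username rules (3 to 16 characters, lowercase letters, digits, dots and dashes) are all assumptions made for illustration.

```javascript
// Hypothetical mention matcher. The leading group keeps email addresses and
// URLs such as "bob@example.com" from being treated as mentions.
const mention = /(^|[^\w@\/])@([a-z][a-z0-9.-]{1,14}[a-z0-9])(?![a-z0-9-])/g;

// Replace every @username with a link to an assumed "/@username" profile path.
// No lookup is made, so unregistered names are linked as well.
function linkMentions(text) {
  return text.replace(mention, '$1<a href="/@$2">@$2</a>');
}

console.log(linkMentions('thanks @tensor and @espoem!'));
console.log(linkMentions('mail bob@example.com'));
```

The second call leaves the text untouched, because the character before the @ sign is part of a word.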

The regular expression that the parser used to match images was capturing trailing whitespace, which caused multiple images to be spliced together into a single HTML img element. This had some fairly large implications when it came to dealing with images inside of complex markdown structures like tables and embedded links. When you write a markdown table, each row has to be a block of markdown that is not interrupted by any line break (\n) or carriage return (\r) characters.

![](image/example1) |  ![](image/example2) | ![](image/example3) 

Images inside of the tables would be formatted like the above block. Below is an example of the RegEx capturing three different markdown image blocks which were embedded in a table.

The output of this through the parser would look something like this:

< img src="image/example1,image/example2,image/example3"  / > 

The resulting image would only show the first image in this src attribute. This was preceded by a newline character and then a pipe character with another mangled img element below it. If a user put nine images into a three-by-three table, the parser would display only the three leftmost images and would not generate the table.

  • What was the solution?

Cleaning Up the Parser Bugs

Working through these problems brought me to the conclusion that these issues were all related. The way that the whitespace was being dealt with caused all of these problems to occur. Also, due to the way that the parser and sanitizer work together, there was a smattering of </b> closing tags throughout some of the contributions, which further complicated things.

My first instinct was to try to tackle the image/table problem. I approached this by looking at the generated HTML elements and by testing markdown patterns and strings which contained multiple nested elements and images. This was done by simply adding small lines to the code which could then call the main parse() function.

In these first two commits you can see that the parser naturally surrounded any image elements with two newline characters on both sides. The image elements were transformed to add the proxy and to help prevent potential Cross Site Scripting (XSS) attacks. An image that looked like this: ![image description](image/url) would be converted to something more like this \n\n![](proxy/image/url)\n\n. Removing the wrapping newline characters ensured that the main JavaScript Markdown-it library would work as intended.
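The effect of the wrapping newlines can be sketched in a few lines. This is a simplified illustration rather than the parser's actual code: the `PROXY` prefix, the `proxyImages` function, and the image matcher are assumptions, and the real parser builds its expression from a more specific image URL pattern.

```javascript
// Hypothetical proxy prefix; the real proxy URL is not given in this post.
const PROXY = 'proxy/';

// Simplified matcher for ![alt](url) blocks.
const imageBlock = /!\[[^\]]*\]\(([^)\s]+)\)/g;

// Rewrite each image to its proxied form, dropping the alt text.
// `wrapInNewlines` mimics the old behavior of surrounding images with "\n\n".
function proxyImages(markdown, wrapInNewlines) {
  return markdown.replace(imageBlock, (_, url) => {
    const proxied = '![](' + PROXY + url + ')';
    return wrapInNewlines ? '\n\n' + proxied + '\n\n' : proxied;
  });
}

const tableRow = '![a](image/example1) | ![b](image/example2)';
const wrapped = proxyImages(tableRow, true);
const unwrapped = proxyImages(tableRow, false);

console.log(JSON.stringify(wrapped));   // the row is now broken across lines
console.log(JSON.stringify(unwrapped)); // the row stays on a single line
```

With the wrappers in place, the table row is split by line breaks and can no longer be read as a table row; without them, the row survives intact.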

In the second commit, I worked on the RegEx which was meant to capture any markdown image block or naked image URL and replace it with a proxied version of that element inside of an <a> HTML tag. The original RegEx looked like this: /(?:!\\[(?:.+)?\\]\\()?(' + image + ')(?:\\))?/gi. The concatenated image variable contains more specific elements to match on image URL patterns.

This RegEx was causing the errors because of one of its small non-capturing groups: (?:.+)?. If you paste this RegEx into a testing site like RegExr, you will notice that the group matches any character except line breaks, and that it does so greedily, so it can run past the first closing square bracket on a line. As shown above, a markdown table row cannot contain an unescaped line break character, and the wrappers were adding four to each image. This was resolved by changing the above non-capturing group to this negated character set: [^\]]*.

The new RegEx works by matching any character that isn't a closing square bracket, zero or more times. This includes line breaks, whitespace and other special characters. The parser removes all of the text inside of the square brackets for security reasons. With this new negated character set, the parser now also removes any line breaks and special characters inside of the square brackets. This single commit contributes to solving all of the image problems for the parser, including the table issues.
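The difference between the two groups can be checked with a small script. This is a sketch under one assumption: the `image` string below is a simplified stand-in for the real concatenated image pattern, which is more specific.

```javascript
// Simplified stand-in for the concatenated `image` variable (an assumption).
const image = 'image\\/\\w+';

// Original group (?:.+)? versus the negated character set [^\]]*.
const greedy = new RegExp('(?:!\\[(?:.+)?\\]\\()?(' + image + ')(?:\\))?', 'gi');
const negated = new RegExp('(?:!\\[[^\\]]*\\]\\()?(' + image + ')(?:\\))?', 'gi');

// Collect the full text of every match.
const wholeMatches = (re, text) => [...text.matchAll(re)].map((m) => m[0]);

// A table row holding three images on one line.
const row = '![](image/example1) | ![](image/example2) | ![](image/example3)';
const greedyRow = wholeMatches(greedy, row);
const negatedRow = wholeMatches(negated, row);
console.log(greedyRow.length, negatedRow.length); // 1 3

// Alt text that contains a line break.
const multiline = '![first line\nsecond line](image/example1)';
const negatedMultiline = wholeMatches(negated, multiline);
console.log(JSON.stringify(negatedMultiline));
```

On the table row, the greedy group swallows everything up to the last image, so the whole row becomes one match, while the negated set stops at the first closing bracket and yields three separate matches. The negated set also matches across the line break inside the alt text, so the whole image block is captured and can be replaced.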

After cleaning up the whitespace problems in the parser, I noticed that the styled text issue was a result of incorrect formatting. In these documents, there were many closing </b> tags and they were surrounded by newline characters. The next major commits, here and here, were made to deal with these problems specifically.

The parser is supposed to take the Markdown String and output an HTML document based on that String. To keep the document structure between the HTML and Markdown, we have a normalization module. This helps maintain the format of the contributions across all browsers because they each contain different specifications for line breaks.

There are times when we have both Markdown and HTML inside of the string that is being parsed. In these transition periods, we use the normalizer module to ensure that all of the Markdown gets properly parsed. Each HTML element was preceded by a single newline character but to make sure that the markdown is parsed, I needed it to be on its own line inside of this transition string. By adding a second newline character, I was able to ensure that this is the case without affecting the formation of the documents.
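As a rough illustration of the extra newline, the sketch below turns a single newline in front of an HTML tag into two. The function name `separateHtmlBlocks` and the expression are assumptions made for this example; the actual normalizer module in the repository may be structured differently.

```javascript
// Hypothetical helper: ensure an HTML tag that follows markdown text is
// separated from it by a blank line, so the markdown sits on its own line.
function separateHtmlBlocks(src) {
  // A lone "\n" directly before an opening or closing tag becomes "\n\n".
  // Text that already has a blank line before the tag is left unchanged.
  return src.replace(/([^\n])\n(?=<\/?[a-z])/gi, '$1\n\n');
}

const mixed = '**bold text**\n<div>html</div>';
console.log(JSON.stringify(separateHtmlBlocks(mixed)));
```

Running the function a second time on its own output changes nothing, which keeps the step safe to apply more than once.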

Next, I wrote a small RegEx to clean up all of the closing </b> tags. The markdown parser doesn't use the <b> tag; it uses a <strong> tag instead. Since none of these documents should have any <b> tags inside of them, I went ahead and wrote a RegEx to remove all of the closing </b> tags. The RegEx used in this case looks like this: /^(<\/ ?b>)+/gim. This RegEx scans the document for any </b> tags which appear at the beginning of a line or of the string. This works for any </b> tag in the document because it is applied to small snippets of the document string through a simple replace call.

state.src = state.src.replace(boldExpression, (_) => '')

The block of code above takes the boldExpression RegEx and applies it to sections of the string until it has replaced all of the </b> tags with empty strings. Doing this resolves the rest of the issues. This includes fixing the mentions parser, the styled links, and the bold and italic text, and it maintains the structure of more complex nested markdown elements.
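The behavior of this expression on a single snippet can be sketched as follows. The helper name `stripLeadingBoldClosers` and the sample text are made up for illustration; only the RegEx itself comes from the commit described above.

```javascript
// The expression from the post: one or more </b> (or </ b>) tags at the
// start of any line, thanks to the "m" flag.
const boldExpression = /^(<\/ ?b>)+/gim;

// Hypothetical helper wrapping the replace call.
function stripLeadingBoldClosers(src) {
  return src.replace(boldExpression, '');
}

const snippet = '</b>\n**bold** and *italic*\n</b></ b>more text\nkeep</b>';
console.log(JSON.stringify(stripLeadingBoldClosers(snippet)));
```

Within a single call, only tags at the start of a line are removed, so the trailing tag on the last line of this sample is left alone. That is why the expression has to be applied to small snippets of the document rather than to the document as a whole.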

After confirming these fixes, I made a final PR to clean up the code and add a small bit of functionality to a different issue that was open.

The pull requests referenced in this contribution are here and here

Sources:
The images in this contribution come from these utopian contributions: here and here.

GitHub Account

https://github.com/tensor-programming

Hi,

I was dealing with some regex earlier today. I needed to extract the links to images from a markdown text. Probably I can use your regex expression to solve all my problems.

Anyway. Well written post. I definitely should have a look at utopian.

Kind regards,

The Secret Service

That's cool. Feel free to use any of the code you find in the repo. It's open source after all.

Cheers,

Tensor

Thank you for your contribution. I really liked the way you explained the problem and the fix that follows. Also, it's good to see you are working on the Utopian frontend, kudos.

Your contribution has been evaluated according to Utopian policies and guidelines, as well as a predefined set of questions pertaining to the category.

[utopian-moderator]

Thank you very much for going through my contribution and reviewing it. I know it was a huge block of text, but I figured that explaining the process I went through would make it more interesting.

Thank you @tensor. You have been of great help with improving the frontend.

Thanks @espoem,

It's been an interesting experience and I've really enjoyed working with this team.

Many of the projects I've taken on in the past few years have been filled with very predictable developments; so much so that it has driven me partially into complacency (this is partially why I tend to switch technologies so quickly).

With this project, I've had to work around many of the conventions that were already in place when I joined the team. Also, due to a lack of time and information, I've been pushed to look at the project from a different perspective than I normally would. Doing things this way has really helped shake some of the conceit that has built up over the years.

It's also a bonus that all of the developers on this project are top notch (unlike some other projects that will remain unnamed here).

I look forward to continuing to contribute to the project in these next few months and I really want to see this thing launch (and succeed).

Hey @tensor
Thanks for contributing on Utopian.
We’re already looking forward to your next contribution!