wiki:HtmlParsing

Version 6 (modified by jukka, 12 years ago)

When displaying a large body of text that may or may not contain HTML tags, the following must be done:

  • Allow all tags inserted by Kupu.
  • Parse TeX code, written as \( ..code.. \) or \begin{equation} ... \end{equation}, with the TeX parser and replace it with image tags.
  • Shorten links as described in Hans's comment on ticket #1352.
  • Parse bracket links of the form [objid linkname] and replace them with a href links.
  • If text is not inside any other tag, TeX code, or bracket link, convert its line breaks to br tags.
  • If a word is longer than 50 characters, chop it, unless it is inside double ("") or single (' ') quotes.
  • Allow the following tags: whitelist=??,??. Remove everything else.
  • Do our own attribute filtering: ???

What else? Please fill in the whitelist.
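
Two of the steps above, bracket links and long-word chopping, can be sketched on their own. This is only a rough illustration under assumptions that are not in these notes: the URL scheme in OBJECT_URL is hypothetical, both function names are made up, and the exception for quoted text is not handled.

```python
import re
from html import escape

# Hypothetical URL scheme: these notes do not say how an objid maps to a URL.
OBJECT_URL = "/object/%s"

# [objid linkname]: objid is the first token, linkname is the rest.
BRACKET_LINK = re.compile(r"\[(?P<objid>[^\s\]]+)\s+(?P<name>[^\]]+)\]")


def replace_bracket_links(text):
    """Turn [objid linkname] into an a href link."""
    def to_anchor(match):
        url = OBJECT_URL % match.group("objid")
        return '<a href="%s">%s</a>' % (escape(url), escape(match.group("name")))
    return BRACKET_LINK.sub(to_anchor, text)


def chop_long_words(text, limit=50):
    """Insert a space after every `limit` characters of an unbroken word.

    Quoted text is NOT exempted here; that part of the rule is left out.
    """
    long_run = re.compile(r"(\S{%d})(?=\S)" % limit)
    return long_run.sub(r"\1 ", text)
```

A word of exactly 50 characters is left alone, because the lookahead only fires when more non-whitespace text follows.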

Currently I do this in a single pass that picks out all the 'interesting' parts and hands each one to a function according to its type. (?P<type>regex) is Python's named-group notation: the group name becomes a key in the dictionary returned by the match object's groupdict().
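
The single-pass idea can be sketched roughly as follows. This is not the actual code: the pattern is reduced to three alternatives, the handler names and their placeholder output are invented for illustration, and match.lastgroup is used to find out which named alternative matched.

```python
import re

# Reduced pattern for illustration; only three alternatives are kept.
pattern = re.compile(r"""
    (?P<bracket>\[.*?\])        # [objid linkname]
    |(?P<tex>\\\(.*?\\\))       # \( ... \)
    |(?P<linebreak>\r\n?|\n)    # line breaks
    """, re.VERBOSE)


def handle_bracket(match):
    # Placeholder output; a real handler would build a link.
    return "<BRACKET:%s>" % match.group("bracket")[1:-1]


def handle_tex(match):
    # Placeholder output; a real handler would call the TeX parser.
    return "<TEX:%s>" % match.group("tex")[2:-2]


def handle_linebreak(match):
    return "<br>"


# Hypothetical dispatch table: group name -> handler function.
HANDLERS = {
    "bracket": handle_bracket,
    "tex": handle_tex,
    "linebreak": handle_linebreak,
}


def dispatch(match):
    # match.lastgroup is the name of the top-level alternative that matched.
    return HANDLERS[match.lastgroup](match)


def convert(text):
    # One run over the text; everything between matches is copied unchanged.
    return pattern.sub(dispatch, text)
```

For example, convert("see [12 doc]\nend") hands the bracket to handle_bracket and the newline to handle_linebreak, and leaves "see " and "end" untouched.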

The patterns and the whitelist of allowed tags are below. Feel free to add to the whitelist.

import re

pattern = re.compile(r"""
    (?P<html_open><[a-z].*?>) # opening html tags, those that begin with '<x', where x is a letter
    |(?P<html_close></.*?>) # closing html tags, those that begin with '</'
    |(?P<bracket>\[.*?\]) # everything that is put inside brackets
    |(?P<tex>\\\(.*?\\\)) # tex should be written inside \( ... \)
    # note: '.' does not match newlines here, so an equation must be on one line unless re.DOTALL is added
    |(?P<tex_equation>\\begin\{(?P<tex_tag>.*?)\}(?P<tex_string>.*?)\\end\{(?P=tex_tag)\}) # detect \begin{smthing}...\end{smthing}; (?P=tex_tag) requires the same name in \end
    |(?P<linebreak>\r\n?|\n) # detect linebreaks; \r\n counts as one
    |(?P<awordtoolong>\S{40}) # detect runs of 40 non-whitespace chars, i.e. words of 40+ chars
    |(?P<endfile>\Z) # detect end of a string, so open tags can be closed
    """, re.IGNORECASE | re.VERBOSE)

whitelist=re.compile(r"""
    <(
    p
    |a
    |br
    |b
    |i
    |embed
    |li
    |ul
    |ol
    |table
    |tr
    |th
    |td
    )(\s.*?)?/?> # tag name, then optionally whitespace and attributes, then '>' or '/>'
    """, re.IGNORECASE | re.VERBOSE)
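
One way the whitelist could be applied is sketched below. It rests on assumptions: the whitelist is read as "an allowed tag name followed by '>' or by whitespace and attributes", only a few tag names are listed, strip_disallowed and any_open_tag are hypothetical names, and closing tags and attribute filtering are left out entirely.

```python
import re

# Assumed reading of the whitelist, shortened to a few tag names.
whitelist = re.compile(r"<(p|a|br|b|i|li|ul|ol)(\s[^>]*)?/?>", re.IGNORECASE)

# Any opening tag: '<', a letter, then everything up to the next '>'.
any_open_tag = re.compile(r"<[a-z][^>]*>", re.IGNORECASE)


def strip_disallowed(text):
    """Drop opening tags that are not on the whitelist, keep everything else."""
    def keep_or_drop(match):
        tag = match.group(0)
        # fullmatch so that '<body>' is not accepted just because it starts with '<b'
        return tag if whitelist.fullmatch(tag) else ""
    return any_open_tag.sub(keep_or_drop, text)
```

Matching the whole tag matters here: with a prefix match, a short name such as b would also let through any tag that merely begins with that letter.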