wiki:HtmlParsing

Version 7 (modified by jukka, 12 years ago)

--

When displaying a large body of text, which may or may not contain HTML tags, the following things must be done:

  • Allow all tags inserted by Kupu.
  • Parse TeX code, \( ..code.. \) or \begin{equation} ... \end{equation}, with the TeX parser and replace it with image tags.
  • Shorten links as mentioned in Hans's comment on ticket #1352.
  • Parse bracket links, [objid linkname], and replace them with a href elements.
  • If the text is not inside any other tags, TeX code or bracket links, convert line breaks to br tags.
  • If there are words longer than 50 characters, chop them, unless they are inside double or single quotes.
  • Allow the following tags: whitelist=??,??; remove everything else.
  • Do our own attribute filtering: ???
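
Two of the steps above could be sketched roughly like this. This is only an illustration: the function names, the URL scheme /object/<objid> and the idea of "chopping" a word by inserting a space are assumptions, not decisions made on this page, and the exception for quoted words is not handled.

```python
import html
import re

# Assumed syntax: "[objid linkname]", where objid contains no whitespace
# and linkname is optional.
BRACKET_LINK = re.compile(r"\[(?P<objid>[^\s\]]+)(?:\s+(?P<name>[^\]]+))?\]")


def replace_bracket_links(text, url_template="/object/{objid}"):
    """Replace [objid linkname] with an <a href> element.

    The URL template is hypothetical; adapt it to the real object URLs.
    """
    def to_anchor(match):
        objid = match.group("objid")
        name = match.group("name") or objid
        href = url_template.format(objid=html.escape(objid, quote=True))
        return '<a href="{0}">{1}</a>'.format(href, html.escape(name))
    return BRACKET_LINK.sub(to_anchor, text)


def chop_long_words(text, limit=50):
    """Insert a space after every `limit` characters of an unbroken word."""
    long_run = re.compile(r"(\S{%d})(?=\S)" % limit)
    return long_run.sub(r"\1 ", text)
```

For example, replace_bracket_links("see [42 my page]") wraps "my page" in a link to /object/42, and chop_long_words breaks any run of more than 50 non-space characters.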

What else? Please fill in the whitelist.
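
The whitelist step could also be done with a plain set of tag names instead of one large regular expression. A minimal sketch, assuming the tag names currently listed on this page; filter_tags, TAG and ALLOWED_TAGS are made-up names, and attribute filtering is still left open here:

```python
import re

# Tag names taken from the whitelist on this page; extend as needed.
ALLOWED_TAGS = {"p", "a", "br", "b", "i", "h3", "img", "embed",
                "li", "ul", "ol", "table", "tr", "th", "td"}

# Matches opening and closing tags and captures the tag name.
TAG = re.compile(r"</?([a-z][a-z0-9]*)\b[^<>]*>", re.IGNORECASE)


def filter_tags(text):
    """Keep whitelisted tags and remove all other tags (attributes untouched)."""
    def keep_or_drop(match):
        if match.group(1).lower() in ALLOWED_TAGS:
            return match.group(0)
        return ""
    return TAG.sub(keep_or_drop, text)
```

Only the tags themselves are removed, so the text between a removed opening and closing tag stays in the output.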

Currently I'm doing this in a single pass that picks out all the 'interesting' parts and hands each one to a function according to its type. (?P<type>regex) is Python's notation for a named group: the name becomes a key in the match object's groupdict().
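
The dispatching idea can be sketched with a reduced pattern. The handler functions and their output are placeholders for illustration only; match.lastgroup and groupdict() are standard parts of Python's re module.

```python
import re

# Reduced, illustrative pattern with only two alternatives.
pattern = re.compile(r"""
    (?P<bracket>\[.*?\])         # anything inside brackets
    |(?P<linebreak>\r\n|\r|\n)   # a line break
    """, re.VERBOSE)


def handle_bracket(match):
    # Placeholder: a real handler would build an <a href> element.
    return "<BRACKET:%s>" % match.group("bracket")[1:-1]


def handle_linebreak(match):
    return "<br/>\n"


HANDLERS = {"bracket": handle_bracket, "linebreak": handle_linebreak}


def dispatch(match):
    # lastgroup is the name of the alternative that matched.
    return HANDLERS[match.lastgroup](match)


def process(text):
    """Run the pattern once over the text, replacing each match by type."""
    return pattern.sub(dispatch, text)
```

Here pattern.search("[1 x]").groupdict() gives a dictionary with a key for every named group, and only the group that matched has a value other than None.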

The patterns and the whitelist of allowed tags are below. Feel free to add to the whitelist.

import re

ppattern=re.compile(r"""
    (?P<html_open><[a-z].*?>) # opening html tags, those that begin with '<x', where x is a letter
    |(?P<html_close></.*?>) # closing html tags, those that begin with '</'
    |(?P<url>(?<!["'])http://[^\s<>"']+) # http://something, where http is not preceded by " or '
    |(?P<bracket>\[.*?\]) # everything that is put inside brackets
    |(?P<tex>\\\(.*?\\\)) # tex should be written inside \( ... \)
    |(?P<tex_equation>\\begin\{(?P<tex_tag>.*?)\}(?P<tex_string>.*?)\\end\{(?P=tex_tag)\}) # detect \begin{smthing}...\end{smthing}; (?P=tex_tag) refers back to the opening name
    |(?P<linebreak>(?<![>\ ])\ *(?:\r\n|\r|\n)) # detect linebreaks, unless they come after a closed tag, e.g. not in '<br/>  \n'
    |(?P<awordtoolong>[^ \t\n\r\f\v<>]{41}) # detect >40 char words
    |(?P<endfile>\Z) # detect end of a string, so open tags can be closed
    """, re.IGNORECASE | re.VERBOSE)

# whitelist is for html-tags only
whitelist=re.compile(r"""
    <(
    p
    |a
    |br
    |b
    |i
    |h3
    |img
    |embed
    |li
    |ul
    |ol
    |table
    |tr
    |th
    |td
    )(?:/?>|\s.*?>) # tag name, then '>', '/>' or whitespace and attributes up to '>'
    """, re.IGNORECASE | re.VERBOSE)