Version 5 (modified by jukka, 12 years ago)

When displaying a large body of text, which may or may not contain HTML tags, the following things must be done:

• Allow all tags inserted by Kupu.
• Parse TeX code ($$..code..$$ or $$...$$) with the TeX parser and replace it with image tags.
• Shorten links as described in Hans's comment on ticket #1352.
• If the text is not inside any other tags, TeX code or bracket links, convert line breaks to <br> tags.
• If there are words longer than 50 characters, chop them, unless they are inside double ("...") or single ('...') quotes.
• Allow the following tags: whitelist=??,??; remove everything else.
• Do our own attribute filtering: ???

What else? Please fill in the whitelist.
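
As an illustration of the word-chopping rule in the list, here is a minimal sketch. It assumes that "chop" means splitting an overlong word into pieces separated by a space, and that a quoted run is anything between a pair of matching " or ' characters; both are assumptions, and chop_long_words is a hypothetical name. Note that apostrophes in ordinary prose (don't, it's) would be treated as quote marks by this simple version.

```python
import re

def chop_long_words(text, limit=50):
    """Split words longer than `limit` characters, leaving quoted runs alone."""
    # Quoted runs are tried first so that they are consumed untouched;
    # otherwise a run of more than `limit` non-space characters is matched.
    pattern = re.compile(
        r"""(?P<quoted>"[^"]*"|'[^']*')|(?P<long>\S{%d,})""" % (limit + 1))

    def fix(m):
        if m.lastgroup == "quoted":
            return m.group(0)  # keep quoted text as it is
        word = m.group(0)
        # cut the word into pieces of at most `limit` characters
        return " ".join(word[i:i + limit] for i in range(0, len(word), limit))

    return pattern.sub(fix, text)
```

Whether the pieces should be joined by a space, a soft hyphen or something else is left open here.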

Currently I'm doing this in a single pass that picks out all the 'interesting' parts and hands each one to a function according to its type. (?P<type>regex) is Python's named-group notation: the group name becomes a key in the match object's groupdict().
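
The single-pass dispatch could look roughly like the sketch below. It uses a reduced, hypothetical pattern with only three named groups, and the handler functions and their output are placeholders (a real TeX handler would call the TeX parser). The standard match attribute lastgroup gives the name of the alternative that matched, which is used as the key into a handler table.

```python
import re

# Reduced, hypothetical pattern: only three of the named alternatives.
pattern = re.compile(r"""
    (?P<tex>\$\$.*?\$\$)       # TeX between $$...$$
    |(?P<bracket>\[.*?\])      # bracket links
    |(?P<linebreak>\r\n?|\n)   # line breaks
""", re.VERBOSE)

def handle_tex(m):
    # placeholder: stands in for a call to the TeX parser
    return '<img alt="%s" />' % m.group("tex").strip("$")

def handle_bracket(m):
    # placeholder: wraps the bracket content in a bare link tag
    return "<a>%s</a>" % m.group("bracket")[1:-1]

def handle_linebreak(m):
    return "<br />"

handlers = {
    "tex": handle_tex,
    "bracket": handle_bracket,
    "linebreak": handle_linebreak,
}

def render(text):
    # m.lastgroup is the name of the alternative that matched
    return pattern.sub(lambda m: handlers[m.lastgroup](m), text)
```

For example, render("x $$a+b$$\ny [link]") passes the TeX part, the line break and the bracket link to their own handlers and leaves the rest of the text untouched.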

The patterns and the whitelist of allowed tags are below. Feel free to add to the whitelist.

import re

pattern = re.compile(r"""
(?P<html_open><[a-z].*?>) # opening html tags, those that begin with '<x', where x is a letter
|(?P<html_close></.*?>) # closing html tags, those that begin with '</'
|(?P<bracket>\[.*?\]) # everything that is put inside brackets
|(?P<tex>\$\$.*?\$\$) # tex should be written inside $$...$$
|(?P<tex_equation>\\begin\{(?P<tex_tag>.*?)\}(?P<tex_string>.*?)\\end\{(?P=tex_tag)\}) # detect \begin{smthing}...\end{smthing}; (?P=tex_tag) must repeat the opening name
|(?P<linebreak>\r\n?|\n) # detect linebreaks, treating \r\n as a single break
|(?P<awordtoolong>\S{40}) # detect runs of 40 non-space characters
|(?P<endfile>\Z) # detect end of a string, so open tags can be closed
""", re.IGNORECASE | re.VERBOSE)

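whitelist = re.compile(r"""
<(/?            # optional slash, so closing tags match too
(?:p            # the tag names are grouped so that the slash and
|a              # the attribute part apply to every name
|br
|b
|i
|embed
|li
|ul
|ol
|table
|tr
|th
|td)
(?:\s.*?)?      # optional attributes after whitespace
)>
""", re.IGNORECASE | re.VERBOSE)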