When displaying a large body of text, which may or may not contain HTML tags, the following things must be done:
- Allow all tags inserted by Kupu.
- Parse TeX code: \( ..code.. \) or \begin{equation} ... \end{equation} with the TeX parser, and replace it with image tags.
- Links should be shortened, as mentioned in Hans's comment on ticket #1352.
- Parse bracket links: [objid linkname], and replace them with a-hrefs.
- If the text is not inside any other tags, TeX code or bracket links, convert line breaks to br tags.
- If there are words longer than 50 chars, chop them, unless they are inside double or single quotes.
- Allow the following tags: whitelist=[??,??], and remove everything else.
- Do our own attribute filtering: ???
What else? Please fill in whitelist.
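As a rough sketch of the bracket-link step in the list above, the following shows one way [objid linkname] could be turned into an a-href. The function name `bracket_to_link` and the URL prefix `/content/` are assumptions made for illustration only; the real link target depends on how object ids map to URLs in the site.

```python
import re
from html import escape

# [objid linkname]: the first token is the object id, the rest is the link text.
BRACKET = re.compile(r"\[(?P<objid>[^\s\]]+)\s+(?P<name>[^\]]+)\]")

def bracket_to_link(text, base="/content/"):
    """Replace [objid linkname] with an a-href.

    `base` is a hypothetical URL prefix, not taken from the real site.
    """
    def repl(m):
        href = escape(base + m.group("objid"), quote=True)
        return '<a href="%s">%s</a>' % (href, escape(m.group("name")))
    return BRACKET.sub(repl, text)
```

Brackets that contain only one token, such as [note], are left untouched by this sketch, because the pattern requires both an id and a name.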
Currently I'm doing this in a single pass that picks out all the 'interesting' parts and hands each one to a function according to its type. (?P<type>regex) is Python-specific notation for a named group: the match object can then be queried by that name, for example through its groupdict().
The patterns and the whitelist of allowed tags are below. Feel free to add to the whitelist.
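A minimal sketch of that single-pass dispatch, assuming a cut-down pattern with only two alternatives. The handler names and the `HANDLERS` table are hypothetical, and the href is just the raw object id. The point is only that `m.lastgroup` gives the name of the alternative that matched, so it can be used to pick the function.

```python
import re

# Simplified pattern: each top-level alternative is one named group.
PATTERN = re.compile(r"""
 (?P<bracket>\[.*?\])   # [objid linkname]
|(?P<linebreak>\r?\n)   # a line break
""", re.VERBOSE)

def handle_bracket(m):
    # Split "objid linkname" at the first space.
    objid, _, name = m.group("bracket")[1:-1].partition(" ")
    return '<a href="%s">%s</a>' % (objid, name or objid)

def handle_linebreak(m):
    return "<br/>\n"

# Hypothetical dispatch table: group name -> handler function.
HANDLERS = {"bracket": handle_bracket, "linebreak": handle_linebreak}

def render(text):
    # m.lastgroup is the name of the named group that matched.
    return PATTERN.sub(lambda m: HANDLERS[m.lastgroup](m), text)
```

Text that matches none of the alternatives passes through unchanged, so only the 'interesting' parts reach a handler.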
Hans: I don't want to mess up the code, so I will add my comments here:
- If we want to embed videos from Youtube we should also allow <object> and <param> elements. These two elements are also required for Slideshare, Schooltube and Internet Archive.
- If we want to embed podcasts from Ourmedia we should allow some JavaScript. Exactly this JavaScript should be enough: <script language="JavaScript" src="http://ourmedia.org/players/1pixelout/audio-player.js"></script>
- If we want to embed maps from Google Maps we should allow the <iframe> element.
- You can see some examples of embedding here; add any additional environments that may be valuable for us: http://lemill.net/content/embedding-external-content-to-lemill
import re

ppattern=re.compile(r"""
(?P<html_open><[a-z].*?>) # opening html tags, those that begin with '<x', where x is a letter
|(?P<html_close></.*?>) # closing html tags, those that begin with '</'
|(?P<url>(?<!"|')http://[^\s<>"']+) # http://something up to the next whitespace, quote or angle bracket, where http is not preceded by " or '
|(?P<bracket>\[.*?\]) # everything that is put inside brackets
|(?P<tex>\\\(.*?\\\)) # tex should be written inside \( ... \)
|(?P<tex_equation>\\begin\{(?P<tex_tag>.*?)\}(?P<tex_string>.*?)\\end\{(?P=tex_tag)\}) # detect \begin{smthing}...\end{smthing}
|(?P<linebreak>(?<!>)\ *?\n|\r) # detect linebreaks, unless they're after closed tag, f.ex !'<br/> \n'
|(?P<awordtoolong>[^ \t\n\r\f\v<>]{41}) # detect >40 char words,
|(?P<endfile>\Z) # detect end of a string, so open tags can be closed
""", re.IGNORECASE | re.VERBOSE)
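To illustrate how the tex_equation alternative is meant to pair \begin{...} with the matching \end{...}, here is a standalone sketch using a named backreference, `(?P=tex_tag)`. The function name `find_tex_env` is hypothetical, and `re.DOTALL` is an assumption added so that an equation may span several lines.

```python
import re

# (?P=tex_tag) matches the same text that the tex_tag group captured.
TEX_ENV = re.compile(
    r"\\begin\{(?P<tex_tag>.*?)\}(?P<tex_string>.*?)\\end\{(?P=tex_tag)\}",
    re.DOTALL,  # assumption: let '.' cross line breaks inside the environment
)

def find_tex_env(text):
    """Return (environment name, body) of the first TeX environment, or None."""
    m = TEX_ENV.search(text)
    if m is None:
        return None
    return m.group("tex_tag"), m.group("tex_string")
```

A \begin{a} that is closed by \end{b} is not matched, which is the reason for using the backreference rather than a second free-form group.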
# whitelist is for html-tags only
whitelist=re.compile(r"""
<(
p
|a
|br
|b
|i
|h3
|img
|embed
|li
|ul
|ol
|table
|tr
|th
|td
)(\s.*?)?/?>
""", re.IGNORECASE | re.VERBOSE)
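One possible way to apply such a whitelist, sketched under the assumption that disallowed tags are simply dropped while the text between them is kept. The names `ALLOWED`, `TAG` and `strip_disallowed` are made up for this example, and attribute filtering is not covered here.

```python
import re

# Tag names taken from the whitelist above.
ALLOWED = ("p", "a", "br", "b", "i", "h3", "img", "embed",
           "li", "ul", "ol", "table", "tr", "th", "td")

# Any opening or closing tag; group 1 is the tag name.
TAG = re.compile(r"</?([a-z][a-z0-9]*)\b[^>]*>", re.IGNORECASE)

def strip_disallowed(text):
    """Remove tags whose name is not in ALLOWED; the enclosed text stays."""
    def repl(m):
        return m.group(0) if m.group(1).lower() in ALLOWED else ""
    return TAG.sub(repl, text)
```

Comparing the captured tag name against a tuple avoids prefix problems such as <p> versus <param>, and extending the whitelist (for example with object, param or iframe) only means adding names to `ALLOWED`.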
