Last modified on 09/04/07 08:34:08
When displaying a large body of text, which may or may not contain HTML tags, the following things must be done:
- Allow all tags inserted by Kupu.
- Parse TeX code, \( ..code.. \) or \begin{equation} ... \end{equation}, with the TeX parser and replace it with image tags.
- Shorten links as described in Hans's comment on ticket #1352.
- Parse bracket links, [objid linkname], and replace them with a-href:s.
- If the text is not inside any other tags, TeX code or bracket links, convert line breaks to br:s.
- If there are words longer than 50 characters, chop them, unless they are inside double or single quotes.
- Allow the following tags: whitelist=??,??, and remove everything else.
- Do our own attribute filtering: ???
What else? Please fill in the whitelist.
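Two of the steps in the list above, bracket links and chopping long words, can be sketched in isolation as follows. This is only an illustration, not the code in use: the function names, the "/objects/" URL prefix and the choice to "chop" by inserting a space every 50 characters are all assumptions.

```python
import re
from html import escape

# [objid linkname]; the link name is optional in this sketch
BRACKET = re.compile(r"\[(\S+)(?:\s+([^\]]+))?\]")


def bracket_to_link(text, base_url="/objects/"):
    """Replace [objid linkname] with an <a href> element.

    base_url is a made-up prefix; the real object URL scheme may differ.
    """
    def repl(match):
        objid = match.group(1)
        name = match.group(2) or objid
        return '<a href="%s%s">%s</a>' % (
            base_url, escape(objid, quote=True), escape(name))
    return BRACKET.sub(repl, text)


def chop_long_words(text, limit=50):
    """Break words longer than `limit` characters, skipping quoted spans."""
    # Group 1: a quoted span, returned untouched.
    # Group 2: a run of more than `limit` non-space, non-tag, non-quote chars.
    pattern = re.compile(
        r"""("[^"]*"|'[^']*')|([^\s<>"']{%d,})""" % (limit + 1))

    def repl(match):
        if match.group(1):
            return match.group(1)
        word = match.group(2)
        # Assumption: "chop" means inserting a space every `limit` characters.
        return " ".join(
            word[i:i + limit] for i in range(0, len(word), limit))
    return pattern.sub(repl, text)
```

For example, bracket_to_link("see [abc123 My page]") gives a link to /objects/abc123 labelled "My page", and chop_long_words leaves a quoted 60-character string alone while splitting an unquoted one.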
Currently I'm doing this in a single pass that picks out all the 'interesting' parts and hands each one to a function according to its type. (?P<type>regex) is Python-specific notation for a named group: the name becomes a key in the match object's groupdict().
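The single-pass dispatch described above could look roughly like this. It is a reduced sketch with only two group types, not the real pattern; the handler names and the "/tex-image" endpoint are hypothetical stand-ins for the actual TeX parser. It relies on Match.lastgroup, which gives the name of the named group that matched.

```python
import re
from html import escape
from urllib.parse import quote

# Reduced token pattern: each alternative is a named group, the name is the type.
TOKENS = re.compile(r"""
    (?P<tex>\\\(.*?\\\))          # inline tex: \( ... \)
    |(?P<linebreak>\r\n|\r|\n)    # any style of line break
    """, re.VERBOSE)


def handle_tex(match):
    # Strip the leading \( and trailing \) to get the bare tex code.
    code = match.group("tex")[2:-2].strip()
    # "/tex-image" is a made-up endpoint standing in for the tex-parser.
    return '<img src="/tex-image?code=%s" alt="%s"/>' % (
        quote(code), escape(code, quote=True))


def handle_linebreak(match):
    return "<br/>\n"


# One handler per group name.
HANDLERS = {"tex": handle_tex, "linebreak": handle_linebreak}


def render(text):
    """Replace every token by the output of the handler for its type."""
    return TOKENS.sub(lambda m: HANDLERS[m.lastgroup](m), text)
```

Adding a new kind of token then means adding one named alternative to the pattern and one entry to HANDLERS, without touching render().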
The patterns and the whitelist of allowed tags are below. Feel free to add to the whitelist.
Hans: I don't want to mess up the code, therefore I will add my comments here:
- If we want to embed videos from YouTube we should also allow the <object> and <param> elements. These two elements are also required for Slideshare, Schooltube and Internet Archive.
- If we want to embed podcasts from Ourmedia we should allow some JavaScript. Exactly this script should be enough: <script language="JavaScript" src="http://ourmedia.org/players/1pixelout/audio-player.js"></script>
- If we want to embed maps from Google Maps we should allow the <iframe> element.
- You can see some examples of embedding here; add additional environments that may be valuable for us: http://lemill.net/content/embedding-external-content-to-lemill
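One way Hans's suggestions could be checked is sketched below: the embed elements are allowed by name, and <script> only when its src is exactly the Ourmedia player. The names EMBED_TAGS, ALLOWED_SCRIPT_SRC and is_allowed_embed_tag are invented for this example, and it does no attribute filtering, so an allowed <iframe> or <object> could still point anywhere.

```python
import re

# Elements named in the comments above, plus <embed> from the whitelist.
EMBED_TAGS = {"object", "param", "embed", "iframe"}
ALLOWED_SCRIPT_SRC = "http://ourmedia.org/players/1pixelout/audio-player.js"

TAG_NAME = re.compile(r"</?\s*([a-z][a-z0-9]*)", re.IGNORECASE)
SRC_ATTR = re.compile(r"""\bsrc\s*=\s*["']([^"']*)["']""", re.IGNORECASE)


def is_allowed_embed_tag(tag):
    """Return True if `tag` (one '<...>' string) is an acceptable embed tag."""
    match = TAG_NAME.match(tag)
    if not match:
        return False
    name = match.group(1).lower()
    if name in EMBED_TAGS:
        return True
    if name == "script":
        if tag.startswith("</"):
            # A closing </script> carries no payload of its own.
            return True
        src = SRC_ATTR.search(tag)
        # Only the one known player script is let through.
        return bool(src) and src.group(1) == ALLOWED_SCRIPT_SRC
    return False
```

Matching the script src exactly, instead of allowing <script> in general, keeps inline JavaScript and scripts from other hosts out.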
import re

ppattern = re.compile(r"""
    (?P<html_open><[a-z].*?>)                # opening html tags, those that begin with '<x', where x is a letter
    |(?P<html_close></.*?>)                  # closing html tags, those that begin with '</'
    |(?P<url>(?<!"|')http://[^\s<>"']+)      # http://something, where http is not preceded with " or '
    |(?P<bracket>\[.*?\])                    # everything that is put inside brackets
    |(?P<tex>\\\(.*?\\\))                    # tex should be written inside \( ... \)
    |(?P<tex_equation>\\begin\{(?P<tex_tag>.*?)\}(?P<tex_string>.*?)\\end\{(?P=tex_tag)\})  # detect \begin{smthing}...\end{smthing}
    |(?P<linebreak>(?<!>)\ *?\n|\r)          # detect linebreaks, unless they're after closed tag, e.g. '<br/> \n'
    |(?P<awordtoolong>[^ \t\n\r\f\v<>]{41})  # detect >40 char words
    |(?P<endfile>\Z)                         # detect end of a string, so open tags can be closed
    """, re.IGNORECASE | re.VERBOSE)

# whitelist is for html-tags only
whitelist = re.compile(r"""
    <(
        p
        |a
        |br
        |b
        |i
        |h3
        |img
        |embed
        |li
        |ul
        |ol
        |table
        |tr
        |th
        |td
    )(\s[^>]*)?/?>    # tag name, then optional attributes and optional '/'
    """, re.IGNORECASE | re.VERBOSE)
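The endfile group exists so that tags still open at the end of the text can be closed. A minimal sketch of that idea, independent of the pattern above, is to keep a stack of open tag names and append the missing closing tags. The names TAG, VOID and close_open_tags are made up for this example, and the set of void elements is an assumption.

```python
import re

# Any tag: optional leading '/', the tag name, attributes, optional trailing '/'.
TAG = re.compile(r"<(/?)([a-z][a-z0-9]*)[^>]*?(/?)>", re.IGNORECASE)
# Elements that never get a closing tag (assumed set for this sketch).
VOID = {"br", "img", "embed", "param"}


def close_open_tags(html):
    """Append closing tags for every element still open at the end of `html`."""
    stack = []
    for match in TAG.finditer(html):
        closing, name, self_closing = match.groups()
        name = name.lower()
        if name in VOID or self_closing:
            continue
        if closing:
            if name in stack:
                # Pop up to and including the matching opening tag.
                while stack.pop() != name:
                    pass
        else:
            stack.append(name)
    # Close the innermost element first.
    return html + "".join("</%s>" % name for name in reversed(stack))
```

For instance, close_open_tags("<p>hi <b>there") returns "<p>hi <b>there</b></p>", while already balanced input comes back unchanged.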