Changes between Version 4 and Version 5 of HtmlParsing


Timestamp:
08/30/07 16:40:50 (12 years ago)
Author:
jukka
Comment:

--

  • HtmlParsing

    v4 v5  
    1111 
    1212What else? Please fill in whitelist.  
    13 It would be best to do this in as few passes as possible: perhaps a single pass that looks at all of the start tags and end tags, TeX and bracket links included, and then deals with them recursively. 
     13 
     14Currently I'm doing this in a single pass that picks out all the 'interesting' parts and dispatches each one to a function according to its type. (?P<type>regex) is Python's named-group notation: the group name becomes a key in the match object's group dictionary (groupdict()).  
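
The single-pass dispatch described above could look roughly like the sketch below. It is an illustration only, not the code actually in use: the pattern is a cut-down stand-in for the full one, and the names `TOKEN`, `HANDLERS`, `dispatch`, `convert` and the handler functions are all made up for the example. It relies on `match.lastgroup`, which reports the name of the alternative that matched as long as the named groups are not nested.

```python
import re

# Cut-down stand-in for the real pattern: one named group per token type.
TOKEN = re.compile(r"""
    (?P<html_open><[a-z].*?>)     # opening tag
    |(?P<html_close></.*?>)       # closing tag
    |(?P<bracket>\[.*?\])         # bracket link
    |(?P<linebreak>\r\n|\n|\r)    # line break
    """, re.IGNORECASE | re.VERBOSE)


def handle_html(match):
    # Hypothetical handler: pass tags through unchanged.
    return match.group(0)


def handle_bracket(match):
    # Hypothetical handler: wrap the bracket contents in an anchor.
    return "<a>%s</a>" % match.group(0)[1:-1]


def handle_linebreak(match):
    # Hypothetical handler: turn a line break into <br />.
    return "<br />\n"


# Group name -> handler function.
HANDLERS = {
    "html_open": handle_html,
    "html_close": handle_html,
    "bracket": handle_bracket,
    "linebreak": handle_linebreak,
}


def dispatch(match):
    # lastgroup is the name of the named group that matched.
    return HANDLERS[match.lastgroup](match)


def convert(text):
    # One pass over the text; each token is replaced by its handler's result.
    return TOKEN.sub(dispatch, text)
```

For instance, `convert("see [Page]\n<b>x</b>")` gives `"see <a>Page</a><br />\n<b>x</b>"`. With nested named groups, such as tex_tag inside tex_equation, `lastgroup` still names the outer group, because that is the one that closes last.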
     15 
     16The pattern and the whitelist of allowed tags are below. Feel free to add to the whitelist. 
     17 
{{{
import re

pattern = re.compile(r"""
    (?P<html_open><[a-z].*?>)     # opening HTML tags: those that begin with '<x', where x is a letter
    |(?P<html_close></.*?>)       # closing HTML tags: those that begin with '</'
    |(?P<bracket>\[.*?\])         # anything inside square brackets
    |(?P<tex>\\\(.*?\\\))         # inline TeX, written inside \( ... \)
    |(?P<tex_equation>\\begin\{(?P<tex_tag>.*?)\}(?P<tex_string>.*?)\\end\{(?P=tex_tag)\})  # \begin{something}...\end{something}
    |(?P<linebreak>\n|\r)         # line breaks
    |(?P<awordtoolong>\S{40})     # runs of 40 non-whitespace characters (overlong words)
    |(?P<endfile>\Z)              # end of the string, so open tags can be closed
    """, re.IGNORECASE | re.VERBOSE)

whitelist = re.compile(r"""
    <(/?                          # optional slash for a closing tag
    (?:p
    |a
    |br
    |b
    |i
    |embed
    |li
    |ul
    |ol
    |table
    |tr
    |th
    |td)
    (?:\s.*?)?)>                  # optional attributes after the tag name
    """, re.IGNORECASE | re.VERBOSE)
}}}
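
One way the whitelist might be applied is sketched below: every tag-like token is checked against the allowed names, and anything not on the list is escaped rather than dropped. This is an assumption about the intended use, not the page's actual implementation; `ALLOWED`, `WHITELIST`, `TAG`, `filter_tag` and `sanitize` are names invented for the example, and the whitelist is written in a compact form built from the same tag names.

```python
import html
import re

# Tag names taken from the whitelist above.
ALLOWED = ("p", "a", "br", "b", "i", "embed", "li", "ul", "ol",
           "table", "tr", "th", "td")

# Assumed compact form: optional '/', one allowed name,
# optional attributes, optional self-closing slash.
WHITELIST = re.compile(
    r"</?(?:%s)(?:\s[^>]*)?/?>" % "|".join(ALLOWED), re.IGNORECASE)

# Anything that looks like an opening or closing tag.
TAG = re.compile(r"</?[a-z][^>]*>", re.IGNORECASE)


def filter_tag(match):
    tag = match.group(0)
    if WHITELIST.fullmatch(tag):
        return tag            # allowed: keep as written
    return html.escape(tag)   # not allowed: show it as literal text


def sanitize(text):
    return TAG.sub(filter_tag, text)
```

For example, `sanitize("<b>ok</b><script>x</script>")` returns `"<b>ok</b>&lt;script&gt;x&lt;/script&gt;"`. Note that attributes on allowed tags pass through unchecked, just as they do with the whitelist pattern itself, so something like an `onclick` attribute would need separate handling.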