Changes between Version 6 and Version 7 of HtmlParsing


Timestamp: 09/03/07 11:43:21
Author: jukka
Comment: --

  • HtmlParsing

    v6  v7
    18  18    {{{
    19  19
    20      - pattern=re.compile(r"""
        20  + ppattern=re.compile(r"""
    21  21        (?P<html_open><[a-z].*?>) # opening html tags, those that begin with '<x', where x is a letter
    22  22        |(?P<html_close></.*?>) # closing html tags, those that begin with '</'
        23  +     |(?P<url>(?<!"|')http://\S*?) # http://something, where http is not preceded with " or '
    23  24        |(?P<bracket>\[.*?\]) # everything that is put inside brackets
    24  25        |(?P<tex>\\\(.*?\\\)) # tex should be written inside \( ... \)
    25  26        |(?P<tex_equation>\\begin\{(?P<tex_tag>.*?)\}(?P<tex_string>.*?)\\end\{(?=P<tex_tag>)\}) # detect \begin{smthing}...\end{smthing}
    26      -     |(?P<linebreak>\n|\r) # detect linebreaks
    27      -     |(?P<awordtoolong>\S{40}) # detect >40 char words,
        27  +     |(?P<linebreak>(?<!>)\ *?\n|\r) # detect linebreaks, unless they're after closed tag, f.ex !'<br/>  \n'
        28  +     |(?P<awordtoolong>[^ \t\n\r\f\v<>]{41}) # detect >40 char words,
    28  29        |(?P<endfile>\Z) # detect end of a string, so open tags can be closed
    29  30        """, re.IGNORECASE | re.VERBOSE)
    30  31
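
Note on the pattern hunk above: the v7 alternatives can be tried in isolation. The sketch below is an illustration under stated assumptions, not the page's own code. The names v7_subset, first_match, tex_as_written and tex_backref are hypothetical, and the subset keeps only the three alternatives that v7 adds or rewrites. It suggests three things worth checking against the full pattern: the lazy \S*? at the end of the url branch stops right after http://, the (?<!>) lookbehind suppresses a line break only when > is the character immediately before the match, and (?=P<tex_tag>) in the unchanged tex_equation branch is a lookahead for literal text, where the named backreference (?P=tex_tag) was presumably intended.

```python
import re

# Hypothetical subset of the v7 pattern: only the alternatives v7 adds or rewrites.
v7_subset = re.compile(r"""
    (?P<url>(?<!"|')http://\S*?)
    |(?P<linebreak>(?<!>)\ *?\n|\r)
    |(?P<awordtoolong>[^ \t\n\r\f\v<>]{41})
    """, re.IGNORECASE | re.VERBOSE)


def first_match(text):
    """Return (group name, matched text) for the first match, or None."""
    m = v7_subset.search(text)
    return (m.lastgroup, m.group()) if m else None


# url branch: nothing follows the lazy \S*?, so it matches zero characters.
print(first_match('see http://example.com'))          # ('url', 'http://')
# A quote directly before http blocks the url branch.
print(first_match('<a href="http://example.com">'))   # None

# linebreak branch: suppressed only when '>' is immediately before the match.
print(first_match('<br/>\n'))                         # None
print(first_match('<br/> \n'))                        # ('linebreak', '\n')
print(first_match('text  \n'))                        # ('linebreak', '  \n')

# awordtoolong branch: v7 needs 41 characters, where v6 matched at 40.
print(first_match('a' * 40))                          # None
print(first_match('a' * 41) is not None)              # True

# tex_equation: the branch as recorded versus a named backreference
# (the backreference form is an assumption about the intent).
tex_as_written = re.compile(
    r"\\begin\{(?P<tex_tag>.*?)\}(?P<tex_string>.*?)\\end\{(?=P<tex_tag>)\}")
tex_backref = re.compile(
    r"\\begin\{(?P<tex_tag>.*?)\}(?P<tex_string>.*?)\\end\{(?P=tex_tag)\}")

sample = r"\begin{eq}x\end{eq}"
print(tex_as_written.search(sample))                  # None
print(tex_backref.search(sample).group('tex_string')) # x
```

If the whole URL is wanted in the url group, a greedy \S+ would be one option, and moving the optional spaces in front of the lookbehind would not help because lookbehinds are tested where the match starts. Both remarks are suggestions, not changes recorded in this diff.
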
        32  + # whitelist is for html-tags only
    31  33    whitelist=re.compile(r"""
    32  34        <(

    36  38        |b
    37  39        |i
        40  +     |h3
        41  +     |img
    38  42        |embed
    39  43        |li