Changeset 2037


Timestamp: 09/19/07 02:04:47 (12 years ago)
Author: jukka
Message:

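Fixed html-tag parsing that caused errors in cataloguing: the html_open pattern no longer matches '&lt;' or a bare '&gt', and tag is checked before it is matched against the whitelist and restricted patterns.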

File: 1 edited

  • trunk/LeMillTool.py

--- trunk/LeMillTool.py (r2030)
+++ trunk/LeMillTool.py (r2037)
@@ -54,4 +54,4 @@
 pattern=re.compile(r"""
-    (?P<html_open>(<|&lt;)(?P<html_tag>[a-z!][^\s>]*).*?>|(&gt)) # opening html tags, those that begin with '<x', where x is a letter
+    (?P<html_open><(?P<html_tag>[a-z!][^\s>]*).*?>) # opening html tags, those that begin with '<x', where x is a letter
     |(?P<html_close></.*?>) # closing html tags, those that begin with '</'
     |(?P<url>(?<!"|')http://\S*) # http://something, where http is not preceded with " or '
@@ -155,6 +155,9 @@
             full_tag=match.group('html_open')
             tag=match.group('html_tag')
-            tag_match=re.match(whitelist,tag)
-            if tag_match:
+            if tag:
+                tag_match=re.match(whitelist,tag)
+            else:
+                print full_tag
+            if tag and tag_match:
                 tag=tag_match.group()
                 if len(tag)+5 < len(full_tag): # if tag is very short it can't have attributes so don't bother searching
@@ -166,6 +169,7 @@
                 #print 'accepted:%s###' % tag
                 return full_tag
-            tag_match=re.match(restricted, tag)
-            if tag_match:
+            if tag:
+                tag_match=re.match(restricted, tag)
+            if tag and tag_match:
                 if self.isGoodEmbed(full_tag):
                     if not full_tag.endswith('/>'): # also deals with self-closing tags like <br/>
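
The following is a minimal sketch of what this change appears to address, not code from LeMillTool.py. It assumes the cataloguing errors came from the r2030 html_open group matching a bare '&gt', which leaves html_tag as None so that re.match(whitelist, tag) raises TypeError. OLD_OPEN and NEW_OPEN reproduce only the html_open branch of the pattern; WHITELIST and whitelisted_tag are hypothetical stand-ins, since the real whitelist is defined outside this changeset.

```python
import re

# Only the html_open branch is reproduced here; the full pattern in
# LeMillTool.py has further alternatives (html_close, url, ...).
OLD_OPEN = re.compile(r"(?P<html_open>(<|&lt;)(?P<html_tag>[a-z!][^\s>]*).*?>|(&gt))")
NEW_OPEN = re.compile(r"(?P<html_open><(?P<html_tag>[a-z!][^\s>]*).*?>)")

# Hypothetical whitelist, for illustration only.
WHITELIST = r"(?:b|i|p|br)$"


def whitelisted_tag(match, whitelist=WHITELIST):
    """Return the whitelisted tag name for a match, or None.

    Mirrors the guard added in r2037: the tag is only passed to
    re.match when the html_tag group actually took part in the match.
    """
    tag = match.group('html_tag')
    if not tag:
        return None
    tag_match = re.match(whitelist, tag)
    return tag_match.group() if tag_match else None


# With the r2030 pattern, a bare '&gt' satisfies html_open through its
# second alternative, so html_tag did not participate and is None.
old_match = OLD_OPEN.search("a &gt b")
assert old_match.group('html_open') == '&gt'
assert old_match.group('html_tag') is None

# Unguarded, this is the call that fails: re.match rejects None.
try:
    re.match(WHITELIST, old_match.group('html_tag'))
except TypeError:
    pass  # the assumed source of the cataloguing errors

# The guard turns that case into a plain "no tag" result.
assert whitelisted_tag(old_match) is None

# With the r2037 pattern the same text no longer matches at all,
# while ordinary tags still yield a tag name.
assert NEW_OPEN.search("a &gt b") is None
assert whitelisted_tag(NEW_OPEN.search("<b>bold</b>")) == 'b'
```

The regex change and the None check overlap: once html_open can no longer match without html_tag, the check is a second line of defence rather than the primary fix.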