Ticket #1509 (closed defect: fixed)

Opened 12 years ago

Last modified 12 years ago

Whitelist of allowed HTML tags doesn't work in Kupu

Reported by: hans
Owned by: jukka
Priority: blocker
Milestone:
Component: generic
Version:
Keywords:
Cc:
Time planned:
Time remaining:
Time spent:

Description (last modified by hans)

I understand that we have a whitelist of allowed HTML tags in Kupu. According to the UI, the following tags are allowed: <h2>, <p>, <br />, <pre>, <ul>, <ol>, <li>, <a>, <i>, <b> (some tags that are also allowed, such as <sub>, <sup>, <table>, <td> and <tr>, are missing from that list).

I don't understand how users are able to input their own JavaScript, CSS and form elements... I don't want LeMill to look like MySpace :)

http://lemill.net/content/fluxogramas/view

http://lemill.org/trac/attachment/ticket/1509/fluxogramas.png?format=raw

Attachments

fluxogramas.png (94.5 KB) - added by hans 12 years ago.

Change History

Changed 12 years ago by hans

comment:1 Changed 12 years ago by hans

  • Description modified

comment:2 Changed 12 years ago by tarmo

Then again, we need stuff like Google videos, which needs the embed tag, etc.

comment:3 Changed 12 years ago by hans

We need a working whitelist of allowed tags. I am not able to find that part of the code quickly, but I understood from Vahur that it is messed up and we have a short blacklist instead of a whitelist.

Of course we should allow embedding videos from YouTube and Google Video, presentations from SlideShare, etc.

Can somebody post here the list of HTML elements and attributes that are actually allowed at the moment?

comment:4 Changed 12 years ago by pjotr

Kupu gets set up in ConfigurationMethods.py, starting from line 422.

At the moment we have this blacklist:

TAGS_BLACKLIST = ['center', 'span', 'tt', 'big', 'small', 'u', 's', 'strike', 'basefont', 'font', 'div', 'img', 'h1', 'em', 'strong']

INVALID_ATTRIBUTES = ['dir', 'lang', 'valign', 'halign', 'border', 'frame', 'rules', 'cellspacing', 'cellpadding', 'bgcolor', 'style']

And this line here:

'attributes':'dir,lang,valign,halign,border,frame,rules,cellspacing,cellpadding,bgcolor,width',
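
For comparison, a whitelist version of that filter could look roughly like the sketch below. This is a minimal illustration in modern Python using only the standard library, not LeMill's actual code; the tag list comes from the ticket description, and the names sanitize and WhitelistSanitizer are made up for this example:

import html
from html.parser import HTMLParser

# Tags from the UI list in the description; href is the only attribute kept.
ALLOWED_TAGS = {'h2', 'p', 'br', 'pre', 'ul', 'ol', 'li', 'a', 'i', 'b'}
ALLOWED_ATTRS = {'a': {'href'}}
VOID_TAGS = {'br'}

class WhitelistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            return  # drop the tag itself; its text content still comes through
        kept = [(k, v) for k, v in attrs if k in ALLOWED_ATTRS.get(tag, set())]
        attr_text = ''.join(' %s="%s"' % (k, html.escape(v or '')) for k, v in kept)
        closer = ' /' if tag in VOID_TAGS else ''
        self.out.append('<%s%s%s>' % (tag, attr_text, closer))

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS and tag not in VOID_TAGS:
            self.out.append('</%s>' % tag)

    def handle_data(self, data):
        # re-escape text so decoded entities can't smuggle tags back in
        self.out.append(html.escape(data, quote=False))

def sanitize(markup):
    parser = WhitelistSanitizer()
    parser.feed(markup)
    parser.close()
    return ''.join(parser.out)

# onclick and script never survive, because nothing outside the whitelist
# is ever written back out (a production version would also drop the text
# content inside removed script/style elements):
print(sanitize('<p onclick="evil()">hi <script>alert(1)</script></p>'))
# -> <p>hi alert(1)</p>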

comment:5 Changed 12 years ago by vahur

hmm.. why are div and span blacklisted and not script or style? And why not b, when strong is on the list? Anyway, I think blacklisting doesn't work efficiently, because we can't go through all the possible combinations a user might think of to harm the site. Given what most browsers let JavaScript do, it can also become an issue of user privacy. Can we really go through all the materials in LeMill, regularly, and make sure there are no nasty popups? The case we have here shows clearly how well blacklisting works (read: it doesn't). In the case of embedding stuff from somewhere (Google, YouTube, whatever), it is easy to make sure that the content actually comes from that site.
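
That last point, accepting an embed only if its source URL demonstrably points at a trusted provider, could be enforced roughly as follows. This is only a sketch: the host list and the function name embed_src_allowed are illustrative, not part of LeMill:

from urllib.parse import urlparse

# Illustrative provider list; the real one would hold whatever LeMill decides to trust.
TRUSTED_EMBED_HOSTS = {'www.youtube.com', 'video.google.com', 'www.slideshare.net'}

def embed_src_allowed(src):
    parsed = urlparse(src)
    # require an absolute http(s) URL whose host is on the allowlist
    if parsed.scheme not in ('http', 'https'):
        return False
    return parsed.hostname in TRUSTED_EMBED_HOSTS

print(embed_src_allowed('http://www.youtube.com/v/abc123'))    # True
print(embed_src_allowed('javascript:alert(1)'))                # False
print(embed_src_allowed('http://evil.example.com/player.swf')) # False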

comment:6 Changed 12 years ago by jukka

  • Owner changed from anonymous to jukka
  • Status changed from new to assigned

I'm working on adding TeX support (#1507), and there I need to make sure the intended math formulas are not broken by <br/> tags or other HTML-ification. Combine this blacklist problem, br tags breaking TeX formulas, and our earlier problems with long lines, and I'd say we had better refactor our HTML parsing. Let's have a wiki page to collect all the adjustments that are needed when displaying longer pieces of text: HtmlParsing

comment:7 Changed 12 years ago by jukka

I'm thinking.. should we have two variables for every larger text field (descriptions, body texts, life stories, comments)? One contains the editable 'raw' text. This is always exactly as you pasted it; we won't remove any JavaScript or anything else, it is the pure source. The other is the cleaned-up text that has gone through parse_text and can be displayed without any additional filtering. Every time the raw text is edited, we generate a new cleaned text.

What we did previously, and what Kupu is still doing, is to try to fix some things by rewriting that raw text, removing tags and adding tags. Then, at the display stage, it runs everything through a filter that makes no lasting changes to the actual text, only to the string that is about to be displayed.

Currently I've removed all of those search-and-replace-in-raw-text thingies (though Kupu still does that; I'm trying to stop it). All of this filtering should be done only when displaying stuff. That isn't any more expensive than it was before, as the current filtering is much more efficient. Still, it is expensive, and the result will always be the same <i>unless</i> someone has edited the page since the last visit. Because of this we should store the cleaned text in a variable and just fetch it, instead of regenerating it for every view.

The problem is that this would be quite a big, site-wide change, and there are a few buts. One is requesting a TeX image that isn't there anymore: only the parse_text phase can actually check whether it exists. Well, that is a minor thing. From an efficiency point of view I think the savings would be huge.
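
As a sketch, the two-variable scheme could look like this. parse_text is the filter named in this ticket; everything else (the Document class and its method names) is hypothetical:

def parse_text(text):
    # stand-in for LeMill's real whitelist filter (see comments 4 and 5)
    return text

class Document:
    def __init__(self, raw=''):
        self.raw_text = ''      # exactly what the user pasted, untouched
        self.cleaned_text = ''  # filtered copy, safe to display as-is
        self.set_raw(raw)

    def set_raw(self, text):
        # the expensive filtering runs once per edit...
        self.raw_text = text
        self.cleaned_text = parse_text(text)

    def render(self):
        # ...so every view is a plain attribute read, no re-filtering
        return self.cleaned_text

This keeps the edit view lossless (raw_text is never rewritten) while display stays cheap, which matches the caching argument above.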

comment:8 Changed 12 years ago by jukka

This should work now. Everything that gets saved and displayed has gone through parse_text, which filters out everything except a few allowed tags and attributes.

comment:9 Changed 12 years ago by jukka

  • Status changed from assigned to closed
  • Resolution set to fixed