Ticket #1853 (new story)

Opened 11 years ago

Last modified 11 years ago

A question for all developers: Should we store all binary data as external files?

Reported by: jukka Owned by: anonymous
Priority: blocker Milestone:
Component: generic Version:
Keywords: Cc:
Time planned: Time remaining:
Time spent:

Description

Working with the Georgian LeMill, I've noticed that having over 20000 pieces is a really bad thing. Externalising the slides that are used only once will get rid of about a quarter of them, but the database will still grow to be unmaintainable soon. When binary data is saved with an object, the object is as large as that data, and when an image is kept in several sizes, the object is as big as all of those combined. And whenever the object is changed, the whole object is rewritten.

One solution would be for all pieces to store their binary data in external file folders, each named with a UID that corresponds to the piece's UID. Each folder could hold the original uploaded content and, for images, the differently sized versions. Our DB would be a lot smaller, but there would be a huge external folder to back up.

I don't know if this would affect speed in actual use, but upgrading would be a lot easier, as an archetype update for any object that has binary content currently requires reading, resizing and rewriting all of that data.

I think this would be a major step towards making LeMill more scalable. Otherwise we'll soon be dealing with 20+ gigabyte Data.fs files, and with files that size nothing is easy anymore.

I'm asking whether anyone has a good reason why this wouldn't help, and what problems we should expect with a filesystem folder holding 10000+ subfolders (should we split it into subdirectories?). I'd also like to know whether developers have time in November and December to help me do this, or whether there are more important features.

Change History

comment:1 Changed 11 years ago by tarmo

Here are comments from theuni:

Large files in the ZODB using strings or the existing PData-based
approaches (like OFS.File) are really bad.

If you ever migrate to a current version (2.11) you can start using ZODB
blobs which are files that can live in the ZODB but don't have the
negative side effects.

comment:2 Changed 11 years ago by pjotr

I'm not an expert and cannot predict the outcome, but it might be a good idea to save all these things in the local FS. I don't really have any experience with that, so I don't know how fast it would be; maybe using some other database that is more suitable for this purpose would be even better.

We have some promises that we have to keep under a contract with Tigerleap Foundation, but I think that anything that would make LeMill as a whole work better and quicker is a good solution, and I'll try to help as much as I can.

As for the comment by theuni: I guess it would be quite a nice solution, but we are stuck with Plone 2.5 at the moment. I guess that even moving on to 3.x would be a problem, considering the amount of time and resources we have. (That is just an example, as we are surely not moving to new Plone versions; it would be more sane to dump Plone as a whole and move on to some better hunting grounds!) So going for the latest and greatest might be a bit problematic.

comment:3 Changed 11 years ago by hans

I also think that it is important to reduce the size of Data.fs, and keeping binary files in the file system will help with that.

comment:4 Changed 11 years ago by vahur

Why not... how does a filesystem handle 20k+ files/folders in one folder? I feel like breaking them up by sections would be a smart thing to do.
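One way to break the folders up, as a minimal sketch: use the first characters of each piece's UID as intermediate directory names. The function name `sharded_dir`, the root path and the `levels`/`width` parameters are hypothetical, and the sketch assumes UIDs are strings of hex-like characters that are reasonably evenly distributed.

```python
import os


def sharded_dir(root, uid, levels=1, width=2):
    """Return the folder for one piece, e.g. root/0a/0a1b2c3d.

    `levels` and `width` are illustrative knobs: each level takes
    the next `width` characters of the UID as a directory name.
    """
    if len(uid) < levels * width:
        raise ValueError("UID too short to shard: %r" % uid)
    # Slice the UID prefix into one directory name per level.
    parts = [uid[i * width:(i + 1) * width] for i in range(levels)]
    # Keep the full UID as the last component so the folder is
    # still recognisable on its own.
    return os.path.join(root, *(parts + [uid]))
```

With two hex characters and one level there are 256 buckets, so 20000 pieces would average about 78 folders per bucket, provided the UIDs really are evenly spread.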

If LeMill is hosted behind Apache or something else that can serve files, then these images could be served directly (no request to Zope!).

+1

comment:5 Changed 11 years ago by jukka

I've found that Archetypes has a nice abstraction layer where, for each field, it can be determined where the value is stored. These storages need to implement only a few methods: set, get and unset. The basic storage is the following:

In Archetypes/Storage/__init__.py:

class AttributeStorage(Storage):
    """Stores data as an attribute of the instance. This is the most
    commonly used storage"""

    __implements__ = IStorage

    security = ClassSecurityInfo()

    security.declarePrivate('get')
    def get(self, name, instance, **kwargs):
        if not shasattr(instance, name):
            raise AttributeError(name)
        return getattr(instance, name)

    security.declarePrivate('set')
    def set(self, name, instance, value, **kwargs):
        # Remove acquisition wrappers
        value = aq_base(value)
        setattr(aq_base(instance), name, value)
        instance._p_changed = 1

    security.declarePrivate('unset')
    def unset(self, name, instance, **kwargs):
        try:
            delattr(aq_base(instance), name)
        except AttributeError:
            pass
        instance._p_changed = 1

We can easily create a LocalStorage (or ExternalStorage; we have to choose which point of view to take), which uses the given instance as a hint for where to find the file in the filesystem and gets/sets files there. The alternative sizes for images will probably need some fixing, as will calculating audio lengths and other code that takes Files as objects and tries to do tricks with them. But in effect, this could make saving to the filesystem quite a clean and simple process.
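A rough sketch of what such a storage could look like, mirroring the get/set/unset signatures of AttributeStorage above. It is written as a standalone class so that it runs without Zope. The class name `FileSystemStorageSketch`, the `root` argument and the one-file-per-field layout are assumptions for illustration, as is the idea that the instance exposes its UID through a `UID()` method and that values arrive as raw bytes. A real version would subclass Storage, keep the security declarations, and would need adapting to the Python version that Plone 2.5 runs on.

```python
import os


class FileSystemStorageSketch(object):
    """Hypothetical storage: one file per field under root/<UID>/."""

    def __init__(self, root):
        self.root = root

    def _path(self, name, instance):
        # The instance is used only as a hint for locating the file.
        return os.path.join(self.root, instance.UID(), name)

    def get(self, name, instance, **kwargs):
        path = self._path(name, instance)
        if not os.path.isfile(path):
            # Same error AttributeStorage raises for a missing value.
            raise AttributeError(name)
        with open(path, "rb") as f:
            return f.read()

    def set(self, name, instance, value, **kwargs):
        path = self._path(name, instance)
        folder = os.path.dirname(path)
        if not os.path.isdir(folder):
            os.makedirs(folder)
        with open(path, "wb") as f:
            f.write(value)

    def unset(self, name, instance, **kwargs):
        # Like AttributeStorage.unset, ignore a value that is not there.
        try:
            os.remove(self._path(name, instance))
        except OSError:
            pass
```

One thing to keep in mind with this approach: file writes like these are not part of the ZODB transaction, so an aborted transaction would not undo them unless extra handling is added.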
