Memory, Man

We’ve had ‘custom memory allocators’ on our upcoming features list for a while. Last year a student did some work on this during the Summer of Code, but the system ended up being a little too ambitious with its use of templates and got a bit too costly in terms of the template instantiation requirements. Unfortunately the student never returned, so I picked up the baton recently, and I felt it was worth writing about some of the things I’ve done, since most of the C++ allocator discussions on the net are pretty shallow and only deal with the simple cases. This is quite a big and fairly technical discussion, so I’ve placed the bulk of it after the jump to spare those who don’t care about this sort of thing! :)

Why Custom Allocators?

Two words: performance and flexibility. The default C/C++ runtime allocation routines are very general purpose and not really designed for high performance applications, and can suffer from fragmentation particularly when dealing with lots of small allocations. In addition, for some applications people want to be able to allocate memory from particular pools based on the type of allocation being made – think embedded systems and consoles for example which can have non-homogenous memory architectures.

The Feature List

I had a list of bullet points I wanted to tick off as part of this work, and I’m glad to say I think I’ve covered them all:

  • Completely replaceable memory allocators for all heap allocations, for all classes, structures, primitive types and containers
  • Categorised memory usage for all allocations; category can influence the allocator used
  • Allocators compiled statically for performance, without requiring a user to change anything other than a single header file, but without dumbing things down to a single global type of allocator
  • Tracking of memory leaks, with file & line number information
  • At least one example of using a custom allocator instead of the default malloc/free
  • Support for aligned memory allocation; both platform-specific SIMD alignment and user defined alignments

The simple approach

If you search the net for information about C++ custom allocators, you’ll run into generally 3 examples:

  1. Implementing custom new/delete operators in a class
  2. Overriding the global new/delete functions, possibly with macros to add debug functionality
  3. Implementing allocator classes to replace std::allocator in STL containers

Now, don’t get me wrong, there’s nothing wrong with these techniques; indeed I’ve used 1 and 3 in my implementation (2 is evil, for reasons I’ll discuss later). However, alone the base implementation doesn’t scale well or allow the level of flexibility I was looking for.

My approach

  1. Allocation policy: this is a strategy for allocating memory; this is the basic fundamental building block of memory management. Implements (potentially aligned) allocation and deallocation. Any class at all can be an allocation policy so long as it implements the allocBytes and deallocBytes methods.
  2. AllocatedObject<AllocPolicy>: a base class for all classes with a virtual function table which need to use custom allocators. Defines custom new/delete operators which route the allocation via the templated AllocPolicy
  3. STLAllocator<T, AllocPolicy>: an STL-compatible allocator object which delegates requests for memory for STL containers of type T to the configured AllocPolicy class.
  4. Alloc/dealloc macros: in order to add file & line number information in debug mode to allocations to assist tracking of leaks, macros are normally used (since functions and classes would hide this information). Some people redefine ‘new’ and ‘delete’, and indeed that’s what our old leak tracker did (adapted from Paul Nettle’s work), but in practice this is a bad idea – it works if you only ever have to deal with your own codebase, but in a heterogenous environment with many libraries, it becomes hopelessly entangled far too quickly – either other libraries will try to do the same and clash, or there will be a mixed use of macro’d or unmacro’d new/deletes because of header inclusion order, etc. Instead, we use macros such as OGRE_NEW, OGRE_DELETE, OGRE_MALLOC etc. The downside is that we have to update our code to use them instead of it being automatic, but it’s far outweighed by predictability and it also resolves some knotty problems with mixed types too.

No huge surprises there I hope. The nice thing about this is that you can quite easily implement a single class which implements allocBytes and deallocBytes and it can immediately serve most memory management contexts. However, we’re only just getting started. :)

Category Support

All allocations are done by specifying a memory category. At the base level, you’re expected to define classes that resolve the following policies, which are referred to in the OGRE code:

template <MemoryCategory C> class CategorisedAllocPolicy;
template <MemoryCategory C, size_t align = 0> class CategorisedAlignAllocPolicy;

MemoryCategory is just an enum listing various usages, such as general purpose, geometry, animation, etc. CategorisedAllocPolicy is used to allocate unaligned memory in a given category, and CategorisedAlignAllocPolicy is used to allocate aligned memory – the default parameter of ’0′ for alignment actually means ‘use the default SIMD alignment for this platform’. The simplest possible example that we provide is this:

template <MemoryCategory Cat> class CategorisedAllocPolicy : public StdAllocPolicy{};
template <MemoryCategory Cat, size_t align = 0> class CategorisedAlignAllocPolicy : public StdAlignedAllocPolicy<align>{};

This ignores the category and generates an implementation for all categories using the allocators StdAllocPolicy and StdAlignedAllocPolicy – both use the default system allocators. You can use full or partial template specialisation to provide different allocation policy implementations for different categories of memory, of course, something like this:

template <> class CategorisedAllocPolicy<MEMCATEGORY_SCENE_OBJECTS> : public YourSceneObjectAllocPolicy{};
template <size_t align> class CategorisedAlignAllocPolicy<MEMCATEGORY_SCENE_OBJECTS, align> : public YourSceneObjectAllocPolicy<align>{};

I’ll be providing one or two other custom allocation policies as alternative examples.

The challenge of dealing with primitive types and non-virtual or external classes

Ok, so the cases that most of the examples around the net tend to deal with are defining custom new/delete operators in your classes, and implementing a std::allocator compatible allocator for STL containers. We’ve done that, I won’t repeat the details here. But that’s only half the story.

A typical application has to allocate far more than that. It needs to allocate primitive types, such as ints and bools, potentially arrays of them. It also needs to deal with classes which have been made deliberately non-virtual, so are not appropriate to be conveniently derived from a class which provides standard allocator-aware new/delete operators (like our AllocatedObject). A good example of this in the Ogre source is Vector3 – it’s a class, but it’s deliberately non-virtual, because a virtual fuunction table adds a pointer overhead to every instance, and while small, when you’re dealing with classes that are essentially small value types, that one pointer could increase the size of each instance by a great deal – with Vector3 it would be a 33% increase for example. When you consider arrays of these types, this can soon add up and start adding extra cache misses you don’t need.

Sure, you could individually define custom new/delete operators for these small classes to avoid having to inherit the common implementation (and please, don’t suggest inheriting without a virtual destructor to get the implementations but with no virtual function table, that way leads to undefined behaviour according to the C++ standard), but not only is this code duplication, it won’t work for real primitive types, and also not for classes from external libraries. A more general purpose approach is needed.

At this point most people resort to global new/delete operators – sometimes wrapped in macros (hopefully not called new/delete of course). This works to a degree, again until you start using multiple libraries. As soon as more than one library / app has tried to globally override new/delete, all bets are off. Cue all sorts of linker clashes between the two, and that’s even if you managed to disentangle references to new/delete in header files to avoid cross-pollination of definitions, particularly if you used custom parameters to the new/delete operators as you generally need to do if you want leak tracking or variable allocators. Basically, it’s a nightmare and given that Ogre is supposed to be a good citizen in a multi-library setup, it just wasn’t a practical option.

So, I chose to do everything via safely-named macros. As well as the regular OGRE_NEW and OGRE_DELETE which provide debug functionality for full classes but otherwise just delegate back to new/delete, I also defined OGRE_NEW_T / OGRE_DELETE_T to handle allocator-aware memory management (and debugging) for primitives and classes we can’t or don’t want to add new/delete operators to. They require slightly more verbose code, but they’re unintrusive:

Vector3* v = OGRE_NEW_T(Vector3, MEMCATEGORY_GEOMETRY)(Vector3::UNIT_Y);
OGRE_DELETE_T(v, Vector3, MEMCATEGORY_GEOMETRY);

Notice how I can use constructor args as usual; this would not be the case if I was using a malloc analog. What’s happening inside the macro is that it’s calling placement new ie memory is allocated via a custom allocator and then the object constructed inside it. I have to tell it the category, as opposed to that automatically being known by AllocatedObject-derived classes allocated via OGRE_NEW. I also have to tell it the same information when I destroy it with OGRE_DELETE_T – the category so that the same allocator can be used to free the memory (if there were different allocators for different categories, this is important), and the type so that the destructor can be called – because objects allocated with placement new have to have their destructor called manually. Note there’s a small hack here – if you know for sure that there’s no destructor, you can use OGRE_FREE instead which doesn’t call the destructor and doesn’t require the type again – this might be a good optimisation particularly for arrays (OGRE_NEW_ARRAY_T/OGRE_DELETE_ARRAY_T).

For primitive types with no required constructor or destructor, there is OGRE_ALLOC_T and OGRE_FREE:

long* pLong = OGRE_ALLOC_T(long, 10, MEMCATEGORY_GENERAL);
OGRE_FREE(pLong, MEMCATEGORY_GENERAL);

As mentioned above if you have a single primitive type you need to use a constructor on, you can actually mix OGRE_NEW_T and OGRE_FREE because you want the constructor, but there is no destructor called ‘long’:

long* pLong = OGRE_NEW_T(long, MEMCATEGORY_GENERAL)(0);
OGRE_FREE(pLong
, MEMCATEGORY_GENERAL);

It might seem complex to have these different ways of allocating memory, but it’s entirely necessary to provide the full breadth of flexibility to be able to handle all types of categorised allocation without clash-inducing global new/delete operators. It’s important to note every one of these types of allocations functions identically in terms of memory leak tracing, debug line number information, and use of the allocator policies. That’s just not something you can achieve with just class-level new/delete operators and STL allocators without globally redefining new/delete (with all the inherent problems that causes).

SharedPtr throws a spanner in the works

This is all very well, but now that there’s several ways to allocate memory, via regular new, placement new or just via a raw byte allocation, if we then give this memory to a SharedPtr, it needs to know how to free it afterwards. SharedPtr now has an optional enum param to tell it whether to use OGRE_DELETE (regular delete operator), OGRE_DELETE_T (manually call destructor and free), or OGRE_FREE (just free memory) – it defaults to the first one for backwards compatibility.

The one limitation here is that memory allocated any other way than OGRE_NEW has to have been allocated via the MEMCATEGORY_GENERAL category. The simple reason for this is that alloc policies are bound statically, for speed – if we had to look up an allocation policy dynamically every time an allocation / deallocation was done it would slow things down. Thus, for classes which don’t have their own alloc policy bound (because they have their own new/delete operators via AllocatedObject), a single general one has to be compiled in to SharedPtr’s single destruction routine. Code which manually deallocates can use whatever it likes of course, since that code will statically resolve the allocator independently.

Conclusion

Phew – this has been quite in-depth I know but that’s the nature of it. I’ve spent a long time on these allocation routines, way longer than I expected, because I’ve been trying out loads of different approaches and figuring out what I thought was the best tradeoff of flexibility, speed and elegance. I’m sure it’s still not perfect, but I think it’s a lot more practical than the glib versions that tend to pop up when you look into custom allocators – the sorts that work perfectly fine if you have simple needs, you use all virtual classes, and you only have to worry about a single application source, but fall apart as soon as you have to worry about dealing with a wider variety of types, playing nicely with many different libraries, and need a way to allow people to customise allocation for different categories of memory use easily.

I still have quite a lot of code to go through, changing the explicit new/deletes to the correct macros and assigning categories where they’re missing, but I’m getting there!

  • http://research.edm.uhasselt.be/lvanacken/blog Panter

    Very interesting post! Thanks for taking the time to share this.

    I am wondering by the way, what amount of speed gain do you expect? Or right not it is still restricted pure to flexibility?

  • http://www.stevestreeting.com Steve

    Mostly it’s about flexibility but for applications that do a lot of construction / destruction it should lead to better performance without needing to use pools of objects of a specific type. Also the fragmentation associated with memory can be lower. I haven’t got any specific benchmarks yet, but as an example look at the figures over at the nedmalloc page, this is one allocator option I expect to include in the final version.

  • syedhs

    Hey Steve, I really have to say this. You are a programmer with many hats with each of them done in great depth and excellence. I have to envy you on this, and I really do!

    Meanwhile, I will definitely make use of this posting as Custom Allocators implementation are pretty weirdsome to me – I know the concept but at the moment I just cant make myself understand heavily templatized codes. Perhaps I am just scared of something I don’t know (yet, hopefully) ;)

  • zeroskill

    Hurray for dropping global new/delete overriding! I remember running into this exact issue several times when working on libraries intended to interface with Ogre. There was no good fix for making it work outside of just turning that feature off. I do like the workaround you took of simply creating new alloc/dealloc macros to be used instead (OGRE_NEW/OGRE_DELETE). Another fine way is to expose the create/destroy within static class methods. If you wanted a new Foo, you would call Foo::new(), and to free it simply Foo::free(). This allows each class to implement it’s own new/delete within the class header, making it very easy to ensure that all Foo allocations are always performed using the same model, and changing that only requires changing a few lines in a header to make sweeping code changes.

    The initial implementation would look something like:
    static inline Foo Foo::new(){
    return OGRE_NEW Foo();
    }
    //Would need an allocator function for each constructor signature

    It’s a bit of work up front, but class specific memory pools could be a neat concept to play with.

    On a side note, I hope that function definition is right. I honestly haven’t touched C++ in over a year now. Been doing a lot of Java work lately.

  • http://www.stevestreeting.com Steve

    For info you can already do class-specific memory pools for those classes that are fully virtual (and thus have their own new/delete). I’ve grouped them into the same category-specific pools that are applied for primitive types too but if you wanted you have the option of making it class-specific just by changing the typedefs in the single memory config header.

    The major problem with what you’re suggesting is that it makes leak tracking impossible, since every single allocation would be in Foo::new. By calling OGRE_NEW directly you get the opportunity to add file & line number information from the source of the allocation.

  • zeroskill

    True, as soon as you make a function call you’re no longer in the originating code, and the macro definitions for your file and current line number are no longer accurate. Personally, I never used the built in leak detection for anything useful. Aside from the whole old memory manager causing issues with the project I was working on, even during the sections where it would work I found that external profilers did as good of a job, and the good ones didn’t require a recompile. Nettle did have some good examples of how his memory manager saved him from hours of debugging, though in the grand scheme I wonder how much time that would really save across a whole project. You either fight the memory manager constantly for the duration of the project, or spend an extra few hours here and there debugging specific issues.

    I can appreciate that not everyone can afford GlowCode, or LTProf (or any of the other far more expensive tools), and some systems don’t necessarily have an abundance of development tools available to them. But, I never liked the idea of a library performing extra duties outside of the original scope. If it’s a sound library, I’d like it to handle sound functionality and that’s all.

    But, this is starting to get outside the original conversation scope. I’m glad that the memory manager is getting an overhaul. It’s definitely needed one for a while. As always, keep up the great work!

  • http://www.stevestreeting.com Steve

    Well, this memory leak checker is entirely for our own use – external apps won’t use it unless they specifically use our macros, so we’re not going outside our mandate, we’re just checking our own consistency. Personally I’ve found external tools like Valgrind and Rational Purify to be highly misleading at times, reporting leaks that I’ve proven don’t physically exist – not sure why, maybe it’s because they don’t know the full context or maybe it’s because I didn’t configure them right. Nevertheless I’ve always found my own internal leak detectors to be more reliable, which is why I’ve never forked out for any of the expensive ones. It’s an approach that lots of other projects use too (wxWidgets, Qt, DirectX). The other advantage of course is that you don’t have to periodically remember to run some external tool, a leak will show up just in normal debug tests.