Critical section performance costs

· by Steve

I’ve been doing some profiling on critical sections, to see just how much time they take to acquire and release when there is no contention. On my 3GHz P4, over 200 million successful locks, the average time was 0.000073055 milliseconds, or about 73 nanoseconds. That sounds pretty quick, but it’s actually a fair amount of time given that a single CPU cycle lasts about a third of a nanosecond - a little over 200 cycles, then. It’s also important to realise that most of the thread-safe software libraries you use incur this overhead too - so for example, whenever you access a C++ stream object when using a multithreaded runtime (the only kind available for VC8), you will be paying a critical section cost. This is precisely why I don’t want to make every call in Ogre thread-safe: most people don’t need it, and background loading is just one common subset. To put it in perspective, if you allocated 1% of a 60Hz refresh time (about 167 microseconds), you could acquire and release 2283 uncontended locks - quite a few, but the more methods you apply this to, the worse it gets, so some considered restraint is in order.
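For the curious, this kind of measurement is simple enough to reproduce. Here’s a minimal sketch of that sort of timing loop using the Win32 API (CRITICAL_SECTION plus QueryPerformanceCounter) - illustrative only, not the exact harness I used:

```cpp
// Minimal sketch: time uncontended EnterCriticalSection/LeaveCriticalSection.
// Single-threaded, so every acquisition is guaranteed uncontended.
#include <windows.h>
#include <stdio.h>

int main()
{
    CRITICAL_SECTION cs;
    InitializeCriticalSection(&cs);

    const LONGLONG iterations = 200000000; // 200 million, as above
    LARGE_INTEGER freq, start, end;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&start);

    for (LONGLONG i = 0; i < iterations; ++i)
    {
        EnterCriticalSection(&cs); // never contended: only one thread
        LeaveCriticalSection(&cs);
    }

    QueryPerformanceCounter(&end);
    double seconds = double(end.QuadPart - start.QuadPart) / double(freq.QuadPart);
    printf("average per lock/unlock pair: %.1f ns\n",
           seconds * 1e9 / double(iterations));

    DeleteCriticalSection(&cs);
    return 0;
}
```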

I’ve been thinking of various options - one is not to use threads at all, but just to use non-blocking I/O calls and a temporary stream cache area. This is what platforms with no hardware threading and no OS to pre-empt (e.g. the PS2) use to stream data off the CD in the background. It’s a nice, simple approach and pretty portable, but it can’t necessarily support all source archive formats, and non-blocking file I/O was never standardised in C++ (a surprise to me, since I’m used to it being available in Java), so a pure C solution is probably the only option. A bit messy.
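Just to illustrate the shape of it on one platform (Win32 overlapped I/O here, purely as an example - a console like the PS2 would use its own low-level API): you kick off a read into a staging buffer and then poll for completion each frame, never blocking:

```cpp
// Sketch: fire-and-poll file read using Win32 overlapped I/O.
// Platform-specific illustration only; error handling omitted.
#include <windows.h>

HANDLE file;           // opened with FILE_FLAG_OVERLAPPED
OVERLAPPED overlapped; // tracks the in-flight read
char staging[64 * 1024];

void startRead()
{
    file = CreateFileA("data.pak", GENERIC_READ, FILE_SHARE_READ, NULL,
                       OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);
    ZeroMemory(&overlapped, sizeof(overlapped));
    // Returns immediately (ERROR_IO_PENDING); the read proceeds in the background
    ReadFile(file, staging, sizeof(staging), NULL, &overlapped);
}

bool pollRead() // call once per frame; true once the data has arrived
{
    DWORD bytesRead;
    // Last parameter FALSE = don't block if the read is still in flight
    if (GetOverlappedResult(file, &overlapped, &bytesRead, FALSE))
        return true; // staging[] now holds bytesRead bytes
    return false;    // ERROR_IO_INCOMPLETE: come back next frame
}
```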

Another is the ‘Falagard option’ (he’s lending his name to a lot of techniques these days ;)) - i.e. just have one additional thread which loads data into a memory cache ‘staging’ area on a per-resource basis, while regular resource loading carries on in the main thread using this cached data. If I used this I would modify it a bit so that a partially loaded resource is handled safely in a ‘come back later’ kind of way instead of forcing immediate loading - see the sketch below. This has the advantage that locking behaviour is much more constrained than in the next option, and it’s fairly simple and universally applicable, but it doesn’t mitigate all of the loading cost (cache-to-object creation time is still in the main thread) and can’t be used for manual loaders.
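Here’s a rough sketch of the shape of that staging cache - the names and structure are mine for illustration, not Falagard’s or actual Ogre code. One loader thread fills per-resource buffers under a single lock, and the main thread polls in ‘come back later’ fashion:

```cpp
// Sketch of the staging-cache idea; names are illustrative, not Ogre's API.
// cacheLock must be set up via InitializeCriticalSection() at startup.
#include <windows.h>
#include <map>
#include <string>
#include <vector>

struct StagedData
{
    StagedData() : complete(false) {}
    std::vector<char> bytes;
    bool complete; // false while the loader thread is still filling it
};

CRITICAL_SECTION cacheLock; // the one lock both threads share
std::map<std::string, StagedData> stagingCache;

std::vector<char> readFileSomehow(const std::string& name); // hypothetical blocking reader

// Loader thread: read raw file data, then publish it to the cache.
void loaderThreadStage(const std::string& name)
{
    std::vector<char> bytes = readFileSomehow(name); // blocking I/O is fine here

    EnterCriticalSection(&cacheLock);
    StagedData& staged = stagingCache[name];
    staged.bytes.swap(bytes);
    staged.complete = true;
    LeaveCriticalSection(&cacheLock);
}

// Main thread: returns false if the data isn't staged yet ('come back later'),
// otherwise hands back the raw bytes for normal main-thread object creation.
bool tryGetStagedData(const std::string& name, std::vector<char>& out)
{
    bool ready = false;
    EnterCriticalSection(&cacheLock);
    std::map<std::string, StagedData>::iterator it = stagingCache.find(name);
    if (it != stagingCache.end() && it->second.complete)
    {
        out.swap(it->second.bytes);
        stagingCache.erase(it);
        ready = true;
    }
    LeaveCriticalSection(&cacheLock);
    return ready;
}
```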

The last option is the one I’ve been pursuing thus far: trying to create resources end-to-end in the background, universally supporting everything from manual loaders to regular resource files. However, the more I investigate this, the more dependencies, and the more locking situations, I find - fully loaded resources will create other objects (e.g. hardware buffers) which then also need thread protection. Having discovered the cost of critical sections, I’m wondering whether the intermediate solution might be the most pragmatic.