glMapBuffer, how I mock thee

And lo, it came to pass that several users of the green-headed abomination known as Ogre did throw up their hands and bemoan the most uncanny and unjust chasm that existed between the lands of D3D and OpenGL. For some of these poor souls, their torrid creations did speed – nay, rocket – on the one, whilst something of a laggard they became on the other. Much disquiet there was among the gathered hordes, and it did trouble those whose craft it was to deduce the matter’s origin. Toil day and night they did, across continents too vast for the eye to encompass – searching; searching.

In far-flung Asia did the first clue arise, and where the sun rose, so did the foul spectre of the cause – glMapBuffer.

Damn, I can’t keep that up any more. Anyway – the fact of the matter was that GL was underperforming in some demos and we didn’t know why. GL is typically a pain in the ass to performance test compared to D3D, since there are just so many fewer decent tools, at least for free. gDEBugger is one option, of course, which I may consider in the future. But anyway, the cause, as originally discovered by genva, was glMapBuffer – as a GL function, it sucks.

We’ve used hardware buffers for years – we were there when VBOs were brand new in the spec, and we lived through all kinds of really crappy driver issues, some of which were performance related. One of these was about the only serious nVidia GL driver bug I think I’ve encountered – the same cannot, of course, be said for ATI’s GL drivers. Anyway, moving on … glMapBuffer is equivalent in purpose to D3D’s buffer locking methods, and that’s why we used it. The options are a little less flexible than D3D’s though, and specifying usage and locking modes is considerably less refined. However, we found an approach that was good enough – or so we thought.
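For reference, the map-based path we’d been using boils down to something like this (a sketch of the general pattern using GL 1.5 core names, not Ogre’s actual code – vbo, srcData and dataSize are placeholders):

    // Typical map-based dynamic update: GL_WRITE_ONLY is the nearest
    // analogue to a D3D write lock on a dynamic buffer.
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    void* dest = glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
    if (dest)
    {
        memcpy(dest, srcData, dataSize);  // write the new vertex data
        glUnmapBuffer(GL_ARRAY_BUFFER);
    }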

It turns out, though, that GL performs far more syncing when using glMapBuffer than you would expect. Even if we tell it that we don’t want to read the buffer back, and that we want to discard the whole buffer contents so a stall should not be necessary, it still seems to sync up something somewhere. So much so that just using glBufferSubData instead of performing a write-lock (ie writing your buffer data to a temporary main memory area and then uploading it in bulk) makes a huge performance difference where dynamic buffers are being used. We’d only used glBufferSubData when bulk-uploading from main memory, but it’s actually better to use it all the time. Grr.
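Concretely, the faster path is just this (again a sketch with the same placeholder names as above):

    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    // Replace the buffer contents in one bulk copy: no mapping involved,
    // so no chance for the driver to stall synchronising with the GPU.
    glBufferSubData(GL_ARRAY_BUFFER, 0, dataSize, srcData);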

What I did was generalise things so that we now use a pool of ‘scratch’ memory when locking GL buffers. Provided there’s memory available (and there almost always will be, since it’s only used for active locks), locked buffers now grab a chunk of memory from this pool instead of calling glMapBuffer, and at unlock, if the mode was a writable one, the data is uploaded again in bulk via glBufferSubData. Personally I think GL should be able to figure this out from the usage and lock modes just like D3D can, but clearly it can’t. No amount of fiddling with these modes and trying to follow suggested GL lock strategies helped, meaning that glMapBuffer is a pretty damn useless piece of API. The result of using glBufferSubData instead is up to a 50% improvement on demos using dynamic buffers, bringing the performance back up to D3D level.
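In outline, the lock/unlock pair now behaves something like the sketch below (the class and helper names – GLHardwareBuffer, allocateScratch, freeScratch, toGLAccess – are hypothetical stand-ins for illustration, not Ogre’s real identifiers):

    void* GLHardwareBuffer::lock(size_t offset, size_t length, LockMode mode)
    {
        mLockMode = mode;
        mScratch  = allocateScratch(length);  // chunk from the shared pool
        if (!mScratch)
        {
            // Pool exhausted (rare, since it only covers active locks):
            // fall back to plain old glMapBuffer.
            glBindBuffer(GL_ARRAY_BUFFER, mBufferId);
            return glMapBuffer(GL_ARRAY_BUFFER, toGLAccess(mode));
        }
        // A read lock would also need a read-back into the scratch area
        // here (e.g. via glGetBufferSubData) before returning it.
        mLockedOffset = offset;
        mLockedLength = length;
        return mScratch;
    }

    void GLHardwareBuffer::unlock()
    {
        if (!mScratch)
        {
            glBindBuffer(GL_ARRAY_BUFFER, mBufferId);
            glUnmapBuffer(GL_ARRAY_BUFFER);
            return;
        }
        if (mLockMode != LOCK_READ_ONLY)  // writable lock: upload in bulk
        {
            glBindBuffer(GL_ARRAY_BUFFER, mBufferId);
            glBufferSubData(GL_ARRAY_BUFFER, mLockedOffset,
                            mLockedLength, mScratch);
        }
        freeScratch(mScratch);  // hand the chunk back to the pool
        mScratch = 0;
    }

The key point is that in the common case the application-visible ‘lock’ never touches glMapBuffer at all; the GPU-side update is deferred to a single glBufferSubData at unlock.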

I love the design of the GL API (most of the time – some of those extensions are pretty crazy), but jeez, there needs to be better documentation about this kind of thing. The D3D documentation at least tells you which strategies are most efficient in which situations and why (most of the time). GL’s extension specs are probably the most annoyingly written documentation you can come across – I can only imagine that at some point someone said "Hey, I know – let’s make our advanced API references read like the minutes of a meeting! How awesome will that be!", and for some reason no-one shot him on the spot. Missed opportunity. A collection of incremental, narrow-focussed technical discussion docs is woefully inadequate when it comes to giving you a holistic overview of an API, including what’s good and what’s not, and it leads to an awful lot of really annoying trial-and-error iteration. I hope the Khronos Group sorts this out; the GL docs are still a hell of a tangled mess, and apart from driver quality (nVidia being one of the few that seem to consistently get it right) they’re one of the few things stopping GL from being a serious challenger to D3D again. GL’s lovely for learning the basics, but get beyond that and it can be such a pain in the arse sometimes.

  • http://zeux.info/ Arseny Kapoulkine

    IIRC, if you want a D3DLOCK_DISCARD-like behaviour, you have to explicitly discard your buffer (calling glBufferData or glBufferSubData with NULL as data, I forgot which) and then call map.

    Though I agree, this is a very weak point in OpenGL – it’s strange how some parts of OpenGL are so much closer to hardware than in Direct3D and some are so far from it.

  • tau

    Great read, Steve, and kudos to you guys for sorting that out. That’s serious.

    I also noticed something strange while testing the new 1.4.0RC1-2 demos: they work fine on my dev machine in D3D mode, but every single demo fails to run under GL. I have updated drivers (ATI X800 GTO); all the Dagon demos still work fine in GL.

  • http://www.stevestreeting.com Steve

    @Arseny: Yep, that’s what they say, but it doesn’t work. I went through all the papers and tips articles I could find and tried all combinations of glBufferData with NULL pointers, all the access modes. Nada.

    @tau: Odd, because nfz is on ATI and RC2 provided a workaround for the main ATI GL driver bug. There are further ATI GL optimisations in CVS since then, but the demos should work in RC2 – they have for others. But then, I gave up on ATI’s GL support some time ago ;)

  • Pingback: SteveStreeting.com » Buffer access modes in DirectX10