glMapBuffer vs glBufferSubData, the return

Well, I had a bit of a rant about glMapBuffer yesterday on the blog, and I was lucky enough that an acquaintance at nVidia read it and sent me some tips from one of their GL driver gurus. Splendid :)

Having pored over that, I found I’d already tried most of what he mentioned, but he made specific reference to the size of the updates being significant. This I hadn’t really experimented with, so I decided to revisit the whole situation again today (apologies to my wife Marie again here, I know it’s the weekend but this really can’t wait ;)).

And you know, size really does matter. :) I played with a lot of different scenarios and came to the conclusion that when locking and updating less than about 32k of buffer data (on my 6800), glBufferSubData trumps glMapBuffer by a significant margin. The closer you get to that threshold, though, the less difference it makes, and when you start getting well above it, glMapBuffer starts to pull away again. With very, very large updates, glMapBuffer wins outright (provided you discard the current contents of the buffer – in GL this means calling glBufferData with a null pointer).

So, I’ve changed my implementation to use chunks of a scratch buffer and glBufferSubData when the locked area is smaller than 32k, and glMapBuffer otherwise. The result is that all of the demos are now faster in GL, and I’d discovered, after making that previous post, that a couple of demos had lost a small amount of speed by avoiding glMapBuffer – these were cases where rather large buffers were being locked in one go.
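For anyone who wants the gist of it in code, here’s a rough sketch of that switch (not the actual Ogre implementation – the function name, the GLEW include, the GL_DYNAMIC_DRAW usage hint and the exact constant are just assumptions for illustration):

    /* Minimal sketch of the size-threshold upload strategy described above.
       Assumes a GL context is current and glewInit() has already been called. */
    #include <GL/glew.h>   /* used here only for access to the VBO entry points */
    #include <string.h>

    #define SMALL_UPDATE_THRESHOLD (32 * 1024)  /* the ~32k cutoff from the tests above */

    /* Write 'size' bytes from 'data' into the VBO at 'offset'.
       'bufferSize' is the total size of the buffer, needed when we
       re-specify (discard) it before mapping. */
    void updateVertexBuffer(GLuint vbo, GLintptr offset, GLsizeiptr size,
                            GLsizeiptr bufferSize, const void* data)
    {
        glBindBuffer(GL_ARRAY_BUFFER, vbo);

        if (size < SMALL_UPDATE_THRESHOLD)
        {
            /* Small lock: let the driver copy from our scratch memory. */
            glBufferSubData(GL_ARRAY_BUFFER, offset, size, data);
        }
        else
        {
            /* Large lock: discard the old contents first (glBufferData with a
               NULL pointer) so the driver needn't sync, then map and write.
               Note the discard leaves the rest of the buffer undefined, so this
               path only suits locks that cover, or can afford to lose, the
               whole buffer. */
            glBufferData(GL_ARRAY_BUFFER, bufferSize, NULL, GL_DYNAMIC_DRAW);
            void* dest = glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
            if (dest)
            {
                memcpy((char*)dest + offset, data, size);
                glUnmapBuffer(GL_ARRAY_BUFFER);
            }
        }
    }

In the real code the small-update path copies via chunks of a pre-allocated scratch buffer rather than whatever pointer the caller happens to hold, but the branching is the same idea.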

It’s still a bit odd – I personally would have thought that using glMapBuffer in write-only mode, having previously discarded the buffer by calling glBufferData(…, NULL), would be directly equivalent to using glBufferSubData. The usage mode is surely the same, i.e. you don’t want to read the current contents of the buffer and are happy to discard them, freeing the buffer from needing to be synced. But clearly not – most strange.

There’s a little doubt at the back of my mind that this 32k threshold is rather unscientific, and I’m wondering if it might vary per card – I’m hoping to get some guidance from the guru on that :) Still, it’s good that I’ve found a route which seems to perform better in all cases now.

Big thanks go to Kevin at nVidia for taking the time to read my blog and get some info from their driver team – it was much appreciated.


  • Lee Sandberg

    Hi Steve and all

    Yes, I think this was a good catch, and making sure different cards (or perhaps different drivers) are handled the right way is important.

    Ogre doesn’t have any way to handle different driver/card-specific implementations, does it?

    Like a standard way?

  • Steve (http://www.stevestreeting.com)

    Not at the moment – our goal really is to try to make that unnecessary for the user, as it’s nasty to have to be concerned about that at the application / content creator level. However, it might happen as part of the Summer of Code ‘hardware emulation’ project; it’s being discussed right now.