Adventures in conversionland

· by Steve · Read in about 12 min · (2409 Words)

As you know, I’ve been reviewing DVCSs lately. I’m taking my time running real use cases on them, and deliberately not leaping feet-first into whatever looks best or most popular on the surface, because I don’t particularly want to discover unexpected problems down the track. It’s consuming a lot more time than I expected. I’m writing up my findings and may publish the full results later if I can find the time to clean them up and format them better, but for the moment I thought I’d share some experiences from converting a relatively large, long-lived, multi-branch repository (OGRE) from Subversion to Git and Mercurial, because that’s what I’ve been wrestling with for the last few days. I discovered a bunch of issues during this process that did not occur when starting from scratch or converting more trivial repositories, so I thought it might help others to talk about it.

Source Subversion repository specifications

Revisions: 9215 (as of today)

Branches: 9 permanent, 22 temporary / experimental

Size: 375 MB

Also of note is that the source repository is still at Subversion 1.3 - this is because Sourceforge was stuck on this version for a long time and we haven’t upgraded the repository since they started supporting newer versions. We never bothered because it requires locking out the repository while you download the whole thing to a local machine, upgrade it and re-upload it, which is a hassle, especially when you have things to do. In practice the server-side version hasn’t been a major issue since you can still use newer clients with it and svnmerge operates regardless.

General Approach

I rsync the OGRE repository down to a local Linux server several times a week, so that was the source of all my conversions, eliminating most of the network time. I tried to convert the repositories using Windows clients in the first instance, because it was easier to get the latest versions of the tools there (my Linux server is on Ubuntu 8.04 LTS and even with hardy-backports available it’s not as up to date, and because this is an important server I stick to the official versions for simplicity). There is a 1Gb network connection between the machines, so the link itself should not be a bottleneck.

The principle is that I want to preserve all history, all branches, and all tags. In practice I may actually prune off some branches later on, so that the clone process is quicker, but the base principle is that it should be a lossless conversion in the first instance. Definitely no top-skimming of the trunk like some conversion articles advocate - we have stable branches that must be maintained and regularly have work that we want to keep in experimental branches. In particular, post conversion it must be possible to continue committing to and merging from stable branches.

Git Conversion Experience

I’d previously converted some other, small and fairly simple Subversion repositories using git-svn (fewer than 500 revisions, and 2-3 branches) and it worked fine. However, when trying it against the considerably more complex OGRE repository I hit problems very quickly. On Windows, using msysGit 1.6.4, the process failed after 1900 revisions, just after doing the automatic repository tidy (git gc). The error message was simply ‘fatal error running git-svn’, even though it had been running exactly that command for the last 1900 revisions. Thinking there might be an msysGit issue here, I switched to the Linux server (git 1.5.4) and tried the same thing. This time it fell over at revision 176 with absolutely no error message. In both cases the repository left behind was corrupt, so I could not resume the process.

The other thing I noticed was how long the process took on Windows. 1900 revisions took 5 hours (!) and thus I wasn’t in a hurry to retry the process there. On Linux the process was much faster, as far as it got. It’s worth noting that this is not caused by running across 2 machines - not only do I have a very adequate 1Gb link, Mercurial managed significantly faster conversions using the same topology. msysGit’s git-svn conversion is simply incredibly slow.

At this point I decided to try upgrading the Subversion repository, just in case git-svn hadn’t been tested with older repository versions. My Linux server had svn 1.5 on it, so I upgraded the OGRE repository to that locally and re-ran the git-svn process on the Linux machine (as I say, I wasn’t keen on repeating the glacially slow msysGit conversion). Sure enough, this time all 9200-odd revisions converted fine, in only about 1 hour 40 minutes, or about 15 times faster than doing it on Windows.
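For anyone who wants to check the "about 15 times" figure, here is the arithmetic as a small Python sketch. It only uses numbers quoted in this post (1900 revisions in 5 hours on Windows, 9215 revisions in 1 hour 40 minutes on Linux); the helper name is just for illustration, and the Windows rate is an average over a run that never finished, so treat the result as a rough comparison.

```python
# Rough throughput comparison using the figures quoted in this post.
# revisions_per_hour is an illustrative helper, not part of any tool.

def revisions_per_hour(revisions: int, hours: float) -> float:
    """Average number of revisions converted per hour."""
    return revisions / hours


# msysGit run on Windows: 1900 revisions in 5 hours before it failed.
windows_rate = revisions_per_hour(1900, 5.0)

# git-svn run on Linux: all 9215 revisions in 1 hour 40 minutes.
linux_rate = revisions_per_hour(9215, 1 + 40 / 60)

speedup = linux_rate / windows_rate

# Extrapolated time for a full conversion at the Windows rate.
windows_full_run_hours = 9215 / windows_rate

print(f"Windows: {windows_rate:.0f} revisions/hour")
print(f"Linux:   {linux_rate:.0f} revisions/hour")
print(f"Speed-up: {speedup:.1f}x")
print(f"Full run at the Windows rate: {windows_full_run_hours:.1f} hours")
```

At the Windows rate, a complete conversion would have taken roughly a full day, which is why I wasn’t keen to repeat it.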

So, I may have had a few problems, and being forced to upgrade the repository before converting was a bit of a pain, but at least it worked and was fast (on Linux anyway). After that, I started cloning the repository both on Linux and Windows and tried performing some standard operations.

The first thing that surprised me was that when cloning the converted repository, I could only see the ‘master’ branch on the remote machine. It’s common practice for Git not to create any local branches other than master on clone, but usually you can do ‘git branch -a’ to see all the remote branches that are available, which show up as something like ‘origin/v1-6’, and you can then check them out to local branches. However, no branches other than ‘origin/master’ showed up, even though I knew they’d been converted. It turns out that git-svn converts all branches except master into remote branches in the converted repository, referencing the original Subversion URL, so it is very much like having cloned from another Git repository. That sort of makes sense, but in the context of a full conversion to a repository that is destined to become the upstream master, it isn’t that useful. In practice, once the git-svn conversion is complete, you need to git checkout each of the branches that you care about in your converted repository, thus creating local branches in that repository which subsequent cloners will be able to check out themselves.
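Doing that by hand for a couple of dozen branches gets tedious, so here is a minimal Python sketch of how the commands could be generated. It assumes the remote branch names are whatever ‘git branch -r’ prints in the converted repository, that the Subversion trunk shows up as ‘trunk’ and tags under ‘tags/’, and that the refs live directly under refs/remotes/; all of that depends on how git-svn was invoked, so check your own ref names first. The function name and the sample branch names are made up for the example, and I use ‘git branch’ with a start point rather than a checkout so the working copy is left alone.

```python
from typing import Iterable, List


def local_branch_commands(remote_branches: Iterable[str],
                          skip: Iterable[str] = ("trunk",)) -> List[List[str]]:
    """Build one 'git branch <name> refs/remotes/<name>' command per branch.

    remote_branches: names as listed by 'git branch -r' (assumed layout).
    skip: names to ignore, e.g. the trunk, which already backs 'master'.
    """
    skipped = set(skip)
    commands = []
    for raw in remote_branches:
        name = raw.strip()
        # Ignore blank lines, the trunk, and anything git-svn filed under tags/.
        if not name or name in skipped or name.startswith("tags/"):
            continue
        commands.append(["git", "branch", name, f"refs/remotes/{name}"])
    return commands


if __name__ == "__main__":
    # Hypothetical output of 'git branch -r' in the converted repository.
    listed = ["  trunk", "  v1-4", "  v1-6", "  tags/v1-6-0"]
    for cmd in local_branch_commands(listed):
        print(" ".join(cmd))
```

Each command could then be run in the converted repository, for example with subprocess.run(cmd, check=True), before anyone clones from it.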

So, once I’d figured this out I started to check out different branches to test if it had worked. At first it seemed to, when checking out the first branch (switching from master to v1-6 in a local clone from the conversion). When I came to try to switch back to master however, Git complained that I had modified files in my working directory. WTF? I’d only just checked out the clean copy of the v1-6 branch. But sure enough, git status told me I had 5 modified files. Diffing them showed no changes, and “git reset --hard” returned with no error, but git status still showed these files as modified. Bizarre. A git checkout -f still let me switch, but again after completion a set of other files showed up as modified. Switching back and forth (with -f) a few times revealed that the list of modified files after checkout was different each time. Again worried that this was a Windows thing, I tried checking out on my Linux machine instead (so at that stage the entire process, conversion to checkout, was done on Linux). But no, the same problem occurred: a random selection of 5-7 modified files on clean checkout.

This has raised some serious concerns about using Git for me. Firstly the flaky conversion which requires a bunch of extra steps just to get it to work at all, then the post-conversion bizarre behaviour of thinking files are modified when they’re not. I had none of these problems with smaller repositories, created from scratch or converted, which up until now I’d been using for testing (and Git had been winning me over in fact since it had been working well). But the bottom line is that this process needs to work reliably for the OGRE repository. If it doesn’t, it’s pretty much untenable.

Mercurial Conversion Experience

I started off with the in-built ‘hg convert’ process. It all went smoothly and took about 8 hours, and the resulting repository was mostly fine. However, the default behaviour is to process the revisions in an order which “produces the fewest jumps between branches in the commit log”. In practice, I found that this meant the revision log when reviewing multiple branches was badly jumbled and difficult to use; the ‘--datesort’ option resolved this but increased the conversion time to just under 10 hours (still faster than msysGit but a lot slower than git on Linux).

The guys from BitBucket suggested that I try hgsubversion instead. (I’d been talking to them to see if they would offer free unlimited hosting for OGRE, since we wouldn’t fit in the default 150MB limit; they were super-friendly and offered not only that but lots of advice.) I was initially put off by the hgsubversion website suggesting it wasn’t fit for production use (they’ve removed this statement now), but BitBucket told me that was a little out of date, and in fact the Python project is using it for their conversion, which is obviously of major size. So I gave it a shot and got some good help from the hgsubversion guys, and the results were great: 1 hour 40 minutes from the Windows end (coincidentally the same time Git managed on the local Linux machine), and the log view was properly ordered right off the bat.

The one remaining issue I had (and this is true of git-svn too) is that all of the branches are open-ended on conversion; that is, no record is made of merges that have been done between branches. That means you would have problems continuing a branch and then merging it, because Mercurial would think it has to merge everything from the point the branch was taken. Neither svnmerge nor svn:mergeinfo properties are taken into account.

One way to resolve this is to manually create a merge point to close off the branches. The easiest way to do this is:

  • Grab the default tip
  • Open a command line and define a temporary environment variable “HGMERGE=internal:local”. This means that you want to keep the local files and throw away the other source when doing a merge, which is important for our dummy merge
  • hg merge -y
  • commit - only the .hgtags file should be modified, the rest of the commit is merely metadata alteration to close off the source_branch

Once you’ve done that, your branch is joined back to the trunk and you can carry on as before, any new commits to that branch will merge across cleanly. The only downside of this is that the merge is strictly at the wrong point - if you view the history in the trunk it won’t be technically accurate and you’ll need to use your commit messages as the real guide to the actual merges before the conversion.
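The steps above can be sketched as a short Python helper that builds the command sequence. This is a sketch under assumptions rather than a tested recipe: I’m assuming "grab the default tip" means ‘hg update default’, that the branch being closed off is passed explicitly to ‘hg merge’, and the branch name ‘v1-6’ and the commit message are placeholders. The helper only builds the commands and the environment; nothing is executed unless you pass them to subprocess yourself.

```python
import shlex
from typing import Dict, List, Tuple


def dummy_merge_plan(source_branch: str,
                     target: str = "default") -> Tuple[Dict[str, str], List[List[str]]]:
    """Return (extra environment, commands) for a dummy merge of source_branch.

    HGMERGE=internal:local tells Mercurial to keep the local version of any
    file that needs merging, which is what the dummy merge relies on.
    """
    env = {"HGMERGE": "internal:local"}
    commands = [
        ["hg", "update", target],              # assumed meaning of "grab the default tip"
        ["hg", "merge", "-y", source_branch],  # non-interactive merge of the branch
        ["hg", "commit", "-m", f"Dummy merge to close off {source_branch}"],
    ]
    return env, commands


if __name__ == "__main__":
    env, commands = dummy_merge_plan("v1-6")  # 'v1-6' is a placeholder branch name
    prefix = " ".join(f"{key}={value}" for key, value in env.items())
    for cmd in commands:
        print(prefix, " ".join(shlex.quote(part) for part in cmd))
    # To actually run the plan inside a repository, something like:
    #   subprocess.run(cmd, env={**os.environ, **env}, check=True)
```

Keeping the environment variable scoped to these commands avoids leaving HGMERGE set for real merges later on.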

A better way to do this would be to record the merges during the conversion, that is, for merge commits in Subversion to have 2 parents. So far, none of the conversion tools read svnmerge or svn:mergeinfo metadata to implement this, but the standard ‘hg convert’ has an option called ‘--splicemap’ where you can specify merge points to be applied during the conversion. Unfortunately I’ve tried to use this twice so far, and both times it hasn’t worked (it just silently did nothing). The documentation for --splicemap is not great, so it could be that I got the URLs wrong. But anyhow, following 2 failed attempts (20 hours! because this was the standard hg convert with --datesort) I decided I’d try to get a similar bit of functionality working in hgsubversion instead, since that’s much faster (1hr 40m a pop). Right now I’m hacking away on it to try to make this work; so far it doesn’t, but I’ll let you know if I eventually succeed. One of the benefits of Mercurial is that it’s all in Python so it’s very easy to modify, compared to Git, which runs all kinds of random scripts and executables, including sh and perl, so it’s much more tangled to dig into.
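For reference, this is roughly what I understand a splicemap to look like, written as a small Python sketch that generates the lines. Treat the details as assumptions to verify against ‘hg help convert’: as I read it, each line is a key, a space, then one or two comma-separated parents, and for a Subversion source the revision IDs take the form svn:UUID/path@rev (the same form that ends up in .hg/shamap after a conversion). The UUID, paths and revision numbers below are invented for the example, and the helper names are my own.

```python
from typing import Sequence


def svn_rev_id(uuid: str, path: str, rev: int) -> str:
    """Build a source revision ID in the assumed 'svn:UUID/path@rev' form."""
    return f"svn:{uuid}{path}@{rev}"


def splicemap_line(child: str, parents: Sequence[str]) -> str:
    """Format one splicemap entry: 'child parent1[,parent2]'."""
    if not 1 <= len(parents) <= 2:
        raise ValueError("a splicemap entry takes one or two parents")
    return f"{child} {','.join(parents)}"


if __name__ == "__main__":
    # Hypothetical repository UUID and revisions, for illustration only.
    uuid = "00000000-0000-0000-0000-000000000000"
    merge_commit = svn_rev_id(uuid, "/trunk", 9000)       # the merge on trunk
    trunk_parent = svn_rev_id(uuid, "/trunk", 8999)       # its normal parent
    branch_parent = svn_rev_id(uuid, "/branches/v1-6", 8990)  # merged-in branch tip
    print(splicemap_line(merge_commit, [trunk_parent, branch_parent]))
```

A quick sanity check would be to run a plain conversion first and compare the keys in .hg/shamap with the IDs this produces before committing to another 10 hour run.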

Conclusions, so far

I started my DVCS evaluation very pro-Mercurial and very anti-Git. While working through my detailed use cases, a process which I’ve not quite completed yet, Git has grown on me a great deal, and I discovered a few things about Mercurial which I found a bit limiting at first, but which are mitigated via extensions - Rebase, Queues and Transplant particularly. My recent experience with more complicated, full-scale and imported repositories has once again gone in Mercurial’s favour though, and I saw a nastier side of Git - when it goes wrong, it’s a lot more difficult to figure out why. In contrast when I’ve had my Mercurial conversion crash - and I stress this only happened due to my own screw up, once because it ran overnight when my rsync kicked in and changed the repository under its feet, and a few times when I’ve been experimenting with hacking the Python to get the merges done - the reason has always been clear; a nice Python trace, and the repository was always intact anyhow - in the case of the core hg convert the conversion even restarted from where it left off once I’d fixed it.

If I were to graph my relative opinion of the two over the period I’ve been doing this so far, it would look something like this:

[Figure: graph of my relative opinion of Git and Mercurial over the evaluation period]

Git totally came up from behind and I was really starting to dig it, until it started freaking out on me with the conversion and I started to try to diagnose why and found it mostly unhelpful. Again I stress I'm not done with my tests yet, but I'm perhaps 75% of the way through now and the conversion problems I've had with Git in the last few days don't look good. Bazaar, I'm afraid, is no longer likely to be part of the evaluation - it takes a long time to do these evaluations properly rather than just trivially, and our survey has indicated that it is the least commonly used among our community by a very large margin, so I'm focussing on the ones more users are likely to already be comfortable with.

The evaluation process continues…