SVN vs. Mercurial vs. Git For Managing Your Home Directory

For several years I've kept the bulk of my home directory in a revision control system. This allows me to synchronize my files across the two machines I use commonly, keep a backup on my home NAS box, and have complete revision history of files.

There's a price, however: the SCM keeps metadata on my machines, and this can add up. Plus there's the time needed to commit files. When it became clear I needed to switch away from Subversion because it doesn't cooperate with iWork files, I decided to look into alternatives.

Mercurial and Git appeared to be the best solutions, but there's quite the holy war going on between the two. Git's confusing, Mercurial is slow, etc. I decided to run some of my own tests and let the data speak for itself.

Update 2008.04.25: Adding results for Bazaar.

Home Directories vs. Source Code

Keep in mind that managing a home directory is different than managing source code. I consider source code management an entirely different problem: company processes, branching/merging, platform compatibility, etc. are just as important as commit time and repository size. A home directory, on the other hand, is all about me. I sync my machines but that's about it, and I rarely need to branch.

The distribution of files is different, too. Source code tends to have many small, easily compressible text files. Home directories tend toward fewer and larger files — in my case, my photo library and design projects are a real problem. When I'm working on a large Photoshop file and committing several revisions, I need the SCM's binary storage/differencing engine to handle that efficiently.

Last, change tracking is a bit more lenient with home directories. I may shuffle some stuff around, and I don't need to explain the changes to anyone else. I'd like to tell the SCM "just make the current version look like this." Some GUI interfaces do this well, as does Mercurial's "addremove" command.
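To make the "just make the current version look like this" idea concrete, here is a minimal Python sketch of the bookkeeping involved: compare the set of tracked paths against what is actually on disk, then schedule the difference. This is only a conceptual illustration, not how Mercurial implements addremove; the function name plan_addremove and the list of metadata directory names are assumptions made for the example.

```python
import os

# Metadata directories to skip while scanning (illustrative list, an assumption).
SCM_DIRS = {".hg", ".git", ".svn", ".bzr"}

def plan_addremove(tracked, root):
    """Return (to_add, to_remove) so the tracked set matches the files under root."""
    on_disk = set()
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune SCM metadata directories in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if d not in SCM_DIRS]
        for name in filenames:
            full = os.path.join(dirpath, name)
            on_disk.add(os.path.relpath(full, root))
    tracked = set(tracked)
    to_add = sorted(on_disk - tracked)      # present on disk, not yet tracked
    to_remove = sorted(tracked - on_disk)   # tracked, but no longer on disk
    return to_add, to_remove
```

The point of the sketch is that no per-file explanation is needed from the user: whatever is on disk becomes the new state.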

Testing Add Time and Repository Growth

My first test is adding piles of files and watching repository growth. I used three sets of files for this test, with the intent of covering large binary files down to smaller text files:

  • Digital Negative (DNG) files: 501MB for 134 files. Median file size 3.4MB, mean 3.8MB.

  • JPEG files: 500MB for 1301 files. Median file size 228KB, mean 405KB.

  • Pile of PDFs, source code, office docs: 500MB for 6069 files. Median file size 4KB, mean 24KB.

I added the files one set at a time. In all configurations the repository is on local disk. The full test protocol and output are at the bottom of this page for the curious.
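Summary figures like the file count, median, and mean quoted for each file set can be computed with a short script. The following is a minimal Python sketch under the assumption that sizes are plain file sizes in bytes (not disk blocks); the function name file_set_stats is made up for this example and is not part of the original test protocol.

```python
import os
import statistics

def file_set_stats(root):
    """Summarize the regular files under root: count, total, median, and mean size."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # skip symlinks so broken links do not raise
            sizes.append(os.path.getsize(path))
    if not sizes:
        raise ValueError("no files found under %r" % root)
    return {
        "count": len(sizes),
        "total_bytes": sum(sizes),
        "median_bytes": statistics.median(sizes),
        "mean_bytes": statistics.mean(sizes),
    }
```

A large gap between median and mean, as in the document set, indicates a few big files among many small ones.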

Time Required To Add + Commit

The purpose of this test is simply to look at time required to add each data set to the repository.

SCM Tool     DNG files   JPEG files   Document files   Repack (Git only)
Subversion   2m 30s      4m 54s       20m 13s          -
Mercurial    1m 33s      1m 54s       1m 59s           -
Git          1m 6s       1m 30s       1m 29s           9m 0s
Bazaar       1m 25s      1m 38s       1m 35s           -

You can see Mercurial and Git are noticeably faster than Subversion, and scale much better for large quantities of small files. I've seen arguments that Git is faster than Mercurial; this data indicates that it's faster at adds but not hugely so. If you count repacking Git's repository, however, the argument goes the other way, with Mercurial the clear leader.

Update 2008.04.25: Bazaar looks very good here, too.
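Timings of this kind can be collected with a small wall-clock harness. The sketch below is one way to do it in Python and is not the protocol actually used for the numbers above; the helper names time_commands and format_elapsed are assumptions, and the Mercurial commands in the usage comment are just one plausible add-and-commit sequence.

```python
import subprocess
import time

def time_commands(commands, cwd=None):
    """Run each command (an argv list) in order; return total wall-clock seconds."""
    start = time.perf_counter()
    for argv in commands:
        # check=True stops the run if any step fails, so a failed commit is not timed as a success.
        subprocess.run(argv, cwd=cwd, check=True, stdout=subprocess.DEVNULL)
    return time.perf_counter() - start

def format_elapsed(seconds):
    """Format seconds in the 'Xm Ys' style used in the tables."""
    minutes, secs = divmod(int(round(seconds)), 60)
    return "%dm %ds" % (minutes, secs)

# Hypothetical usage, assuming an existing Mercurial repository at repo_path:
# elapsed = time_commands(
#     [["hg", "add"], ["hg", "commit", "-m", "add DNG files"]],
#     cwd=repo_path,
# )
# print(format_elapsed(elapsed))
```

Wall-clock time is used rather than CPU time because, for large binary files, much of the cost is disk I/O.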

Repository Expansion for 500MB Add

The purpose of this test is to see how much the repository grows as each file set is added. In the case of Subversion, note that the working copy is always 2x the size of the working files; all files are duplicated in the .svn directory. The numbers below show the repository size only. So, for each 500MB added, the working copy grows an additional 500MB and the repository grows by the amount shown below.

In the case of Git, I show the incremental size at each add (as expected), and after the last add I also did a git gc to repack the repository.

SCM Tool     DNG files   JPEG files   Document files   Total size (after repack w/ Git)
Subversion   475MB       479MB        366MB            1321MB
Mercurial    470MB       480MB        389MB            1340MB
Git          470MB       475MB        231MB            1158MB
Bazaar       469MB       474MB        369MB            1312MB

There's little difference between these SCMs in how efficiently they store already-compressed images in the repository. Git is noticeably more efficient with the small document-type files.
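Growth figures like these come from measuring the repository directory before and after each add. A minimal Python sketch of that measurement follows. It sums apparent file sizes, which can differ slightly from what a tool like du reports in disk blocks; the post does not say which was used, and the function names tree_size_bytes and growth_mb are assumptions for the example.

```python
import os

def tree_size_bytes(root):
    """Total size in bytes of all regular files under root (symlinks skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total

def growth_mb(before_bytes, after_bytes):
    """Repository growth between two snapshots, in MB (1 MB = 1024 * 1024 bytes)."""
    return (after_bytes - before_bytes) / (1024 * 1024)

# Hypothetical usage: point at the metadata directory (.hg, .git, .bzr) or at
# the server-side Subversion repository, snapshot, add a file set, snapshot again.
# before = tree_size_bytes(repo_dir)
# ... add and commit ...
# print(growth_mb(before, tree_size_bytes(repo_dir)))
```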

Testing File Modification Time and Repository Growth

For my use, I'm not as concerned about making changes to lots of small files. My problem is with large image files. If I'm working on a big Photoshop file and want to commit changes often, I want those changes to take minimal space in my local repository.

To test this, I created a reasonably large Photoshop file (starting size 56MB) with several layers, then made several rounds of edits (ending size 71MB), committing the changes between each edit. The same sets of edits were applied for each SCM.

Time Required to Commit Modified File

SCM Tool     Initial Add   First Change   Second Change   Third Change   Repack
Subversion   17s           13s            22s             13s            -
Mercurial    16s           7s             8s              8s             -
Git          6s            8s             10s             11s            19s
Bazaar       9s            7s             8s              8s             -

All SCMs posted respectable numbers here, with Mercurial out in front on committing changes.

Update 2008.04.25: Bazaar does a great job here, too.

Repository Growth With Modified File

The purpose of this test is to look at how much the repository grows each time the Photoshop file is modified. Note that with Subversion the working copy is fixed at 2x the size of the working files; all version history is stored in the repository. This can be a substantial advantage when your repository is on a separate server because you don't have to worry about your local copy growing out of control with many revisions. These Subversion numbers show growth of the repository, not the local copy. Also note that I forgot to take a data point, leading to two "not available" entries in the table below.

SCM Tool     Initial Add   First Change   Second Change   Third Change   Total size (after repack w/ Git)
Subversion   47.5MB        N/A            N/A             11.4MB         105.5MB
Mercurial    44.3MB        18.3MB         18.4MB          10.9MB         91.9MB
Git          44.2MB        44.1MB         53.4MB          54.9MB         55.2MB
Bazaar       44.3MB        18.3MB         18.4MB          10.9MB         91.9MB

Git turns in some very interesting numbers here. The repository grows significantly with each change (topping out at 197MB before repack) but packs down very tight on repack. The ending repository size is significantly smaller than the ending Photoshop file size (55MB repo, 71MB image).
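The 197MB figure is simply the sum of Git's per-commit growth from the table. As a quick check of that arithmetic, here is a short Python sketch using only numbers copied from the table and the paragraph above; the variable names are made up for the example.

```python
# Per-commit repository growth in MB, copied from the table above.
growth_mb = {
    "Mercurial": [44.3, 18.3, 18.4, 10.9],
    "Git": [44.2, 44.1, 53.4, 54.9],
}
GIT_PACKED_MB = 55.2   # Git repository size after repack
FINAL_PSD_MB = 71.0    # ending size of the Photoshop file

def total_growth(tool):
    """Sum the per-commit growth for one tool, rounded to one decimal place."""
    return round(sum(growth_mb[tool]), 1)

git_unpacked = total_growth("Git")                       # 196.6, the "197MB before repack"
hg_total = total_growth("Mercurial")                     # 91.9, matches the table total
reclaimed = round(git_unpacked - GIT_PACKED_MB, 1)       # 141.4 MB freed by repacking
packed_ratio = round(GIT_PACKED_MB / FINAL_PSD_MB, 2)    # 0.78 of the final file size

print(git_unpacked, hg_total, reclaimed, packed_ratio)
```

So before repacking, Git's repository is roughly twice the size of Mercurial's for the same history; after repacking it is a little over half.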

Update 2008.04.25: Curiously, the Bazaar results track Mercurial's almost exactly. Are they using the same repository format? I did a few web searches but couldn't turn up an answer.

Conclusions

First, keep in mind that this is testing SCM systems for the purpose of managing a home directory, and the data used in the test is representative of my home directory. Your mileage will vary. I have specifically not focused on managing source code because the bulk of my source code is managed separately with my company's chosen SCM (mostly Subversion).

Looking at these numbers, Subversion finished worse than I expected. The working copy is always 2x the size of the files being managed, which can be a blessing for large binary files with many revisions but a curse for everything else. The repository growth is reasonable; I'm not as concerned about that. Speed-wise, adding files is slow to terrible. Updating large binary files is reasonable.

Git and Mercurial both turn in good numbers but make an interesting trade-off between speed and repository size. Mercurial is fast with both adds and modifications, and keeps repository growth under control at the same time. Git is also fast, but its repository grows very quickly with modified files until you repack — and those repacks can be very slow. But the packed repository is much smaller than Mercurial's.

So there's really no "Git rules, Mercurial sucks" argument or vice-versa. It's more a question of workflow and priorities. In my opinion, Mercurial is easier to set up and use day-to-day. Its "addremove" command, in particular, is a great time-saver. But Git can really squash its repository down with repacking, to a size much smaller than Mercurial's.

Honestly, I was expecting the numbers to reveal a clear-cut answer to which tool I should use. They didn't. So, my recommendation is to look at your workflow, evaluate how Git or Mercurial would fit into it, and pick based on that. Or flip a coin, whichever.

Update 2008.04.24: I've been using Mercurial for several months to manage my home directory and I'm quite happy with it. I may switch to Git, however, for its more compact repository — I haven't decided if it's worth the trouble.

Update 2008.04.25: I added results for Bazaar due to popular request. Its results track Mercurial very closely, so for the basic use I've tested here, I don't see a compelling reason to use one vs. the other.

Test Environment

  • MacBook Pro dual 2.2 GHz, 2GB RAM
  • Mac OS X 10.5.2
  • Mercurial version 0.9.5
  • Subversion version 1.4.4 (r25188)
  • Git version 1.5.4.5
  • Bazaar version 1.3

Test Data

Comments

Which one did you eventually choose? You left this reader hanging to see who won your coin toss.


Thanks for the detailed description of your tests though. Even though I don't do anything like this (source control my home directory) it was an interesting read.


Those captchas are really hard to read/hear.

This is a very interesting post. Let me assume you have more than one machine that would access your decentralized HOME directory. Would you like to share some experience about how you handle the merge/branch/resolve on different machines?

I also heard that ZFS supports versioning at the file system level, though I could be wrong. If you are dedicated to one machine, have you considered using ZFS for the version control?

Kun Xi, thanks for your comments. ZFS is fantastic (I use it at my day job) but it's not well-supported under Mac OS X yet. For example, I can't boot my laptop from ZFS.

As you've pointed out, part of the point of using a SCM is sharing between machines. With Subversion I set up a svn server on my NAS box, and this was very convenient since I could sync my laptop and desktop to the NAS. I haven't yet set this up for Mercurial or Git. In the meantime with Mercurial I just use "hg serve" on one and "hg pull" on the other, then switch. Works great, and it's fast.

I don't know if it would be better or worse, but I'd be interested to see the results if you ran the same tests with Bazaar.

For what it's worth, there exist two tools that remove the 2x working copy overhead of subversion: scord and fsvs. (Affiliation note: I'm the author of scord.)
http://scord.sourceforge.net/
http://fsvs.tigris.org/

Also, note that the per-working copy storage overheads of git and mercurial are similar to svn. git and mercurial store a (compressed) duplicate of the repository at each working copy.

Hi Josh; interesting post. What does a normal session using these version control systems look like for home directory situations? For instance, say you import a new roll of photos from your camera? Or move some stuff around? Or, for instance, delete something?

Did you consider Unison - file synchronization?

Perhaps it doesn't deal well with resource forks on macs.

Do you have any comments on ease of learning and ease of use for each of the three SCMs?

Have you ever tried Changes.app (since you're on a mac)? http://changesapp.com/

What would it take to get you to add bazaar (bzr) to your tests?

Can you schedule the Git repacks to happen later, thereby getting both the extra speed of Git and its size advantage?

You should totally switch to Windows so that you can add Visual Source Safe to your benchmark :)

Why not just pray to The Lord to take care of everything for you?

If you use a *nix, have you tried rdiff-backup ( http://www.nongnu.org/rdiff-backup/features.html )? Reverse diff via rsync works well for both keeping a mirror and keeping backups, and there's only one folder of metadata. It's worked really well for me, although I'll probably move to btrfs when it's ready...

I use Unison, and that seems to work great.

Why just $HOME -
why not any path -
even / ?!

Unison is great.

I would third the Unison recommendation.

Do you know about FSVS (fsvs.tigris.org)?
Uses a subversion repository, but is much faster - see http://fsvs.tigris.org/svn-diff.html for details.

If you're on Mac, why not use Time machine?

I suggest Perforce. It's free for up to two users and performs really well with multi-GB binary assets. It's basically the de facto standard when it comes to SCM in the game development industry.

Isn't using "hg commit -A" equivalent to using addremove? You can put a "commit = -A" line under the "[defaults]" part of your "~/.hgrc" file and it forces that option on every commit. I'm not quite sure how this will handle file renames though (i.e., if it will work like "hg addremove -s").

Thanks for all the comments. Re Unison: I have used this quite a bit, and it's good for sync, but it doesn't do any revision control. I thought this was okay for some types of files (e.g. my photos), but now I'd prefer to keep those revision-controlled as well.

Re Time Machine: also, works great for what it does (revision control) but it doesn't address the problem of sync. Now one can use a combo of Unison for sync and Time Machine on one Mac for the revision control. I'm actually still doing that for files under ~/Pictures but that means I've got two revision control systems to manage, and I want to ditch one. I've been using SCMs for so long that the SCM path just "feels like" the better option.

Re FSVS: I was most excited when Chris posted this, it looks perfect. Unfortunately it won't build on Mac out of the box. I haven't had time to dig into it further.

Ted: thanks for the commit -A tip!

There's also svk, which sits on top of svn and makes it more git/bzr like.

I wouldn't mind seeing perforce too. (didn't think of it when I suggested bazaar. But since chetan mentioned it).

Hi Josh,

thanks for the shootout. I'm using Subversion, Mercurial and Git and have the following thoughts for you:

First, it would be interesting to look at file copying, renaming, or moving to a different directory as well. I think Subversion will do this as a lightweight copy, while Mercurial keeps the old history under the old name and copies at least the latest revision to the new location. The repository sizes will thus noticeably increase for some systems when using large files.

Secondly, the use model should be considered. You talked about Photoshopping. After going through several steps in a session, do you really need the interim steps any longer? If not, then the Git feature for easy branching without need for cloning comes in handy. When your session is finished you can completely remove the branch from the repo and take only the final version. This should free up the space occupied by the branch.
As far as I know Subversion can't do this at all, and Mercurial requires a full Repo Clone as you cannot remove a named branch from the repo. Such a clone takes a lot of additional space and also might be difficult to place, when you are really tracking the whole home directory.

Personally, I currently prefer the Mercurial command set over the one from Git and I dislike the need for Git's repacks, but Git's advantages both in terms of space and handling of inline branches are clearly there.

Regards

Guido

Guido, you've got some very good points there. Renaming in Mercurial is indeed wasteful on disk. I've reorganized some folders with large files and the commit increased my repository size considerably -- I wasn't expecting that. I just tried the same in Git and it's nearly a zero-cost operation, as it should be. That's a compelling advantage for Git over Mercurial.

I hadn't thought about the use model of creating a branch for intermediate versions and throwing those away later. I didn't know Git allowed that, I'll definitely need to try it since that sounds like a great plan. Thanks for the suggestion!

Would it be possible to explain the setup that you're personally using? I'd like to understand how you're using it, as I would like to have a process to sync my computers and to have some versioning, but I can't see exactly how it should be done.
Anyway, it looks like a good shoot-out.

This is an interesting issue, but why didn't you just use rsync? Was there any particular reason you wanted revision management?

@Fizz:

Also, note that the per-working copy storage overheads of git and mercurial are similar to svn. git and mercurial store a (compressed) duplicate of the repository at each working copy.


This is not true at least for git: git uses hard links to the object store, so will take almost no extra space for real branches.

@Chris: I did investigate using rsync, but there's two issues: first, rsync is much better at doing one-way sync than two-way. It's hard to do 2-way sync (with deleting obsolete files) without significant risk of deleting wanted files. Unison is much better about this. Even with Unison, however, I find myself wanting the safety net of revision history.

The only flaw with the SCM approach is that (for better or worse) it's impossible to get rid of old stuff. Use case: I've got old video projects that are huge, and once I'm done, I back them up to DVD, and I don't want to keep that stuff around forever in my SCM repository. For these things I currently don't check them in, and I sync them by hand when needed. Based on Guido's comment, though, I may be able to manage this type of project using a branch that I can throw away later.

I would like to make the comment that Bazaar does not work with really big files (i.e. 80 MB and up). See https://bugs.launchpad.net/bugs/109114.

I think the size increase for Mercurial may be related to the way it handles renames. It treats renames as copy-and-remove.

http://hgbook.red-bean.com/hgbookch5.html

If you rename in the filesystem, I suppose you have to tell hg that you’ve renamed the file so it knows what happened. In this case, it seems that you’d need to use “hg rename --after” according to the Mercurial Book.

Also, the docs explicitly state that Mercurial tracks only files and not directories. As far as it is concerned, a directory doesn’t exist until there’s a tracked file in it. I suppose that if you rename a directory, all of its contents would appear very differently to hg and you’d have to account for that to prevent revlog bloat.

However, I haven’t tried any of that myself. My needs are more simple right now, except for cross-platform Win/Mac support.

Does running 'bzr pack' make any difference to the bzr repository growth? My understanding is that bzr attempts to combine the advantages of git's pack format (small) with those of Mercurial's revlog format (pretty small with no need to repack to stay small), by using the former when repacking (and branching?) but the latter for changes since the last bzr pack.

Packs essentially compress all history across the whole repo, whereas revlogs perform history compression at the per-file level. The former can do a better job (since it has more to work with), but requires examining all history in order to do it and is therefore too slow to perform for every commit. The latter only requires examining the history of files that are being modified, and hence can be fast.

Tom, I ran bzr pack and didn't have good luck with it. In fact, just running pack will *grow* the repository! It keeps obsolete packs in the repo, for what reason I don't know. After manually removing .bzr/obsolete_packs I find that there's a reduction in repository size but only by a negligible amount.

What I like about git is that you can easily edit the history. So you can commit everything without thinking and sort out later what you won't need. That's a good thing for those big files.

I'm rather late to the party, but FWIW:

Bazaar keeps obsolete packs around in case something goes wrong with the new ones, like your computer crashes and the file system (NFS?) truncates them or something.

Bazaar does not use Git's pack format or Mercurial's revlog format. It uses its own format. It's similar in idea to Git's packs, but I don't think the implementations share anything.

Whenever you commit/pull/etc., Bazaar simply creates a new pack file that contains the new revision(s). It occasionally also automatically repacks into somewhat fewer files, because having dozens of files would hurt performance.

When you run "bzr pack", it packs everything into one file, and topologically sorts the revisions, but it doesn't do much else in the way of optimizations.

I think this "Distributed Version Control Systems: A Not-So-Quick Guide Through" is a pretty good comparison of distributed SCM:

http://www.infoq.com/articles/dvcs-guide

Great article! Just what I need.

@Pieter: if you have projects that you want to get rid of once they are done, and you use svn, then you can make separate repositories for these projects and include them with svn:externals in your main repo. Once you are done, delete the repo and delete the svn:externals entry.


Right now I'm using svn to manage my ~ (in fact, my / is in svn and I include the ~ as svn:externals). I'm looking for something faster/more efficient (such as git), but I'm not quite sure if it has all the features I need (eg can I have git repo's inside git repo's? does it work with symlinks? - I'll look that up...)
With svn, I like to keep different "parts" of the system separate (in separate repositories or directories in repositories) (/, ~, programming projects, misc scripts, documents, images, ...) and then pull them together with svn:externals. I'll have to figure out how to do something similar with git.

Been thinking about switching my home directory from subversion to git, and found this page very useful!

Just a note about bzr...
It supports several branch formats, each with different properties. It would be helpful to know which you used (run 'bzr info'), and perhaps try different formats to see how they behave. I hear the "1.9" format is nice, but requires bzr 1.9 or newer.

This is a very interesting post, but as far as ease of use, doesn't Time Machine beat everything?

Also, Git is (according to Randal Schwartz) NOT well suited to using to store your home directory, since it's designed for managing a whole bunch of *related* files. I would guess that this is a problem with other DVCSs, but I don't know about the others.

Now that's just Randal's opinion, in one sense, but it seems to make sense. Since it keeps a revision number for an entire project, rather than one for each file, it does seem at least imperfect. If you want to back up one file, rather than your whole repo, it might be rather a pain.

On the other hand Time Machine makes it brain-dead easy. Of course I don't think Time Machine is nearly as efficient as Git (although I haven't tested) so it appears to have trade offs, but for one's home directory it seems to me to be the best option. Any thoughts?

Also, I was under the impression that Git didn't do any of its diff'ing magic on binary files, but you seem to have proven that idea to be false. Thanks.

Hi are you running Mercurial or Git on your NAS (linkstation?)

Hi Bob, I do have Mercurial on there. I generally ran it over SSH, but for big updates I run Mercurial's web server. That's much faster than SSH on a wimpy box like the LinkStation. However, it's also completely insecure, so be sure to shut it off as soon as you're done!

git seems like a very compelling solution for backup, but for me the main deficiency is the fact that it increases the size of the local copy by a factor of 2 as it stores everything in the .git directory too. That makes it impractical to use for backup of large amounts of data, especially in the case of my laptop which has a small drive that is nearly full. I would use rsync, but then that creates problems when I want to sync the data with another laptop.

@Josh, you can work around this by symlinking the .git directory to a bare repository on an external drive. Whenever you have your external drive plugged in, you can use git to check in.

Plan 9 has had fossil/venti since long before git/Hg/etc existed ...

@Josh: The problem with the ſize of the repoſitory is that Mercurial does delta packaging ‘on the fly’, while git gc has to be explicitly ſtarted (which reſults in bigger ‘temporary’ repoſitories, while git produces ſmaller ſizes in total).

The good news for the git uſer: one can eaſily ſolve this by tweaking properties in your .gitconfig file:

For example, ſetting gc.auto to 100 will ſtart gc automatically after 100 new blobs (the ſtandard is 6700, ſince git is intended to manage many ſmall C files and not very big binary files by default, although it can handle both).

Or you can increaſe core.loosecompression, ſo that even unpackaged files will be ſaved with compreſſion (and not uncompreſſed, which is the default).

@Pars Par: But using git garbage collection more often will slow it down, so you lose the slight speed advantage.

@Josh Carter: You could have a look at the Mercurial share extension: http://mercurial.selenic.com/wiki/ShareExtension (distributed with Mercurial)

Mercurial all the way for me. Thanks for the test though...

Hi, I'm wondering if Assembla https://www.assembla.com/plans=carmelad will be a good way to use Git, Mercurial, and Trac. I saw the website while searching and they have a free account, but it's public. Oh well, I guess I have to try it. Good post, btw. Lots of info since I'm new to this.
