Distributed version control with Mercurial (Project Review) -- 22 Mar 2007

Convene, meeting record visibility

poll shows a few in favor of world-visible records; none in favor of confidential records

RESOLUTION: world-visible records

TimBL: Creative Commons Attribution Non-Commercial :)

The Mercurial SCM, presentation by Matt Mackall

Presenter: Matt Mackall, Selenic

<DanC> The Mercurial SCM

TimBL: slides don't seem to work with safari. remote/slidey issue?

<timbl> http://www.w3.org/Talks/Tools/Slidy/slidy.js Line 146: Can't handle non-ascii in string: "flèches"

<RalphS_> The Mercurial SCM (title slide)

<RalphS_> Why a New SCM? (slide 2)

Matt: "your tea is still warm" came from someone at Sun, noticing merges were faster
... given two lines of history; stable branch and unstable branch
... bug fixes often committed to 'stable' but you want them also in 'unstable'
... each time you pull patches across to 'unstable' you want to remember what happened the last time you merged

<Zakim> ericP, you wanted to ask for a use case for a repeated merge

EricP: like cvs -j ?

Matt: right. cvs and svn do not remember the previous merge point
... so you have to revisit all the same conflicts again
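
[Editorial sketch, not part of the presentation.] One way to picture "remembering the previous merge point" is a history graph in which a merge revision records both of its parents. The toy Python below assumes a made-up history (the revision names `s1`, `u1`, `m1`, and so on are hypothetical) and is not Mercurial's implementation. It only shows that once a merge is recorded, the common ancestor used for the next merge moves forward.

```python
# Toy history graph: each revision maps to its list of parents.
# All names are illustrative; this is not Mercurial's internal code.

def ancestors(parents, rev):
    """Return rev plus every revision reachable by following parents."""
    seen = set()
    stack = [rev]
    while stack:
        r = stack.pop()
        if r not in seen:
            seen.add(r)
            stack.extend(parents[r])
    return seen

def merge_bases(parents, a, b):
    """Common ancestors of a and b that are not behind another common ancestor."""
    common = ancestors(parents, a) & ancestors(parents, b)
    return {c for c in common
            if not any(c in ancestors(parents, o) for o in common if o != c)}

history = {
    "root": [],
    "s1": ["root"],      # bug fix on 'stable'
    "u1": ["root"],      # new work on 'unstable'
    "m1": ["u1", "s1"],  # first merge of stable into unstable (two parents)
    "s2": ["s1"],        # second bug fix on 'stable'
    "u2": ["m1"],        # more work on 'unstable'
}

first = merge_bases(history, "u1", "s1")    # base for the first merge
second = merge_bases(history, "u2", "s2")   # base for the repeated merge
print(first, second)
```

Because `m1` records `s1` as a parent, the second merge starts from `s1` rather than `root`, so only the changes in `s2` need to be reconciled.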

<RalphS_> Why a decentralized system? (slide 3)

<karl> offline work++

Matt: centralized systems are burdensome for people who work without write access as they have larger merges to deal with, thus slow
... consider folks like freebsd developers; a few people have write access to the repository, a large number of people have to go through those few for commits
... so benefits of version control are not available to many
... better to have ability to do local commits
... permit off-line commits
... sync whenever you feel you need to do so

<RalphS_> The Mercurial basics (slide 4)

Matt: every hg repository is a working directory ala cvs plus a private store
... explicit checkout not required
... store has complete private copy of entire project history
... no need to communicate with original server to do work on the project
... the copy you have is equivalent to what you got from the original server; it has full revision information

<RalphS_> Making a commit (slide 5)

<DanC> (the first time I used hg on a plane, it felt really revolutionary. http://dig.csail.mit.edu/breadcrumbs/node/96 )

<Zakim> timbl, you wanted to ask Is the store append-only? Optimised for getting latest version?

Tim: does the store keep the history as append-only?
... for what is it optimized? latest version or fetching full history?

Matt: append-only. small delta tacked on to end of file for revision n+1
... uses 'revlog' format that I invented
... roughly like format of MPEG file
... small deltas and periodic full images
... you can seek directly in and read a small chunk

<DanC> revlog design note

Matt: the amount of data you have to read is equivalent to the uncompressed size of the content
... so both append-only and very fast
... for files with very long history the index is separate and also append-only
... for files with histories < 128k, the index is combined in the same file and read with one i/o
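
[Editorial sketch, not part of the presentation.] The "small deltas and periodic full images" idea can be illustrated with a toy append-only log. Everything below is an assumption made for illustration: the class name `ToyRevlog`, the prefix/suffix delta, and the rule for when to store a full image are not taken from the real revlog format.

```python
class ToyRevlog:
    """Append-only list of entries: a full text, or a delta against the previous text."""

    def __init__(self):
        self.entries = []      # ("full", text) or ("delta", (start, end, replacement))
        self.chain_bytes = 0   # delta bytes stored since the last full image

    @staticmethod
    def _delta(old, new):
        # Strip the common prefix and common suffix; keep only the changed middle.
        p = 0
        while p < min(len(old), len(new)) and old[p] == new[p]:
            p += 1
        s = 0
        while (s < min(len(old), len(new)) - p
               and old[len(old) - 1 - s] == new[len(new) - 1 - s]):
            s += 1
        return (p, len(old) - s, new[p:len(new) - s])

    def add(self, text):
        """Append a new revision and return its revision number."""
        if self.entries:
            start, end, repl = self._delta(self.read(len(self.entries) - 1), text)
            # Assumed policy: keep adding deltas while the chain stays no larger
            # than the text itself, otherwise store a fresh full image.
            if self.chain_bytes + len(repl) <= len(text):
                self.entries.append(("delta", (start, end, repl)))
                self.chain_bytes += len(repl)
                return len(self.entries) - 1
        self.entries.append(("full", text))
        self.chain_bytes = 0
        return len(self.entries) - 1

    def read(self, rev):
        """Rebuild revision rev from the nearest earlier full image plus deltas."""
        base = rev
        while self.entries[base][0] != "full":
            base -= 1
        text = self.entries[base][1]
        for _kind, (start, end, repl) in self.entries[base + 1:rev + 1]:
            text = text[:start] + repl + text[end:]
        return text


log = ToyRevlog()
versions = ["hello world\n", "hello brave world\n", "hello brave new world\n"]
revs = [log.add(v) for v in versions]
restored = [log.read(r) for r in revs]
```

Earlier entries are never rewritten, and reading any revision touches one full image plus a bounded run of deltas, which is the property Matt attributes to the format.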

<Zakim> ht, you wanted to check about the nature of revs. . .

Henry: in cvs, every file has its own revision history
... if you've chosen to branch you name the files in that branch
... whereas in svn the world has a revision history and every change is considered a change to the world

<DanC> (this is agendum 4, btw)

Henry: have I framed this correctly? which is hg?

Matt: hg makes atomic project-wide snapshots
... revlog for each file plus a "manifest"
... manifest is a list of all files in a project for a given changeset
... recursive pointers into manifest
... so hg is in the latter camp; a project-wide sense of the state of each file
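
[Editorial sketch, not part of the presentation.] The relationship between changesets, the manifest, and per-file logs might be pictured as follows. The class `ToyRepo`, the helper `node_id`, and the hashing scheme are assumptions for illustration, and the sketch handles linear history only. It is not Mercurial's actual data layout.

```python
import hashlib

def node_id(data, *parents):
    """Hypothetical identifier: a SHA-1 over the parent ids and the content."""
    h = hashlib.sha1()
    for p in sorted(parents):
        h.update(p.encode())
    h.update(data.encode())
    return h.hexdigest()

class ToyRepo:
    def __init__(self):
        self.filelogs = {}    # path -> {file node: content}, one log per file
        self.manifests = {}   # manifest node -> {path: file node}
        self.changesets = []  # (changeset node, manifest node, message)

    def commit(self, files, message):
        """Record one atomic, project-wide snapshot of every file."""
        manifest = {}
        for path, content in files.items():
            fnode = node_id(content)
            self.filelogs.setdefault(path, {})[fnode] = content
            manifest[path] = fnode
        mnode = node_id(repr(sorted(manifest.items())))
        self.manifests[mnode] = manifest
        parent = self.changesets[-1][0] if self.changesets else ""
        cnode = node_id(mnode + message, parent)
        self.changesets.append((cnode, mnode, message))
        return cnode

    def checkout(self, cnode):
        """Follow changeset -> manifest -> file nodes to rebuild the whole tree."""
        for node, mnode, _message in self.changesets:
            if node == cnode:
                manifest = self.manifests[mnode]
                return {p: self.filelogs[p][f] for p, f in manifest.items()}
        raise KeyError(cnode)


repo = ToyRepo()
c1 = repo.commit({"a.txt": "1", "b.txt": "x"}, "first")
c2 = repo.commit({"a.txt": "2", "b.txt": "x"}, "second")
```

Here `b.txt` is unchanged between the two commits, so its file log holds a single entry that both manifests point to, while each changeset still names the state of every file in the project.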

<RalphS_> Revisions, Changesets, Heads, and Tip (slide 6)

<timbl> I wonder whether projects can be nested.... concerned about 'project-wide' scalability... I guess that is agendum #4

<ericP> recursive hash, like rsync?

<RalphS> [I suppose, then, that the act of starting a new project is "collecting tips" :) ]

<RalphS_> Cloning, Merging, and Pulling (slide 7)

[note, slide 7 is animated]

<tlr> so both Alice and Bob need to do a merge, even though Bob had done one already?

<tlr> +

Matt: hg also has 'pushing', opposite of 'pulling'
... pushing requires more privs

Thomas: am I understanding that both Alice and Bob do merges of e,f,g that lead to the same h?

Matt: no, Alice pulls Bob's revision h
... so if she hasn't worked in the meantime she is guaranteed to get Bob's h
... it's not uncommon to regularly merge back and forth
... in general it's very little work to do one of these merges
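
[Editorial sketch, not part of the presentation.] Matt's answer to Thomas can be illustrated by treating a repository as a set of changesets and a pull as copying the missing ones. The graph below is an assumption (slide 7 is not reproduced in these minutes) and the functions `pull` and `heads` are toy stand-ins, not hg's wire protocol.

```python
def pull(local, remote):
    """Copy into local every changeset it lacks; return the ids that were added."""
    missing = {rev: parents for rev, parents in remote.items() if rev not in local}
    local.update(missing)
    return set(missing)

def heads(repo):
    """Changesets that are not the parent of any other changeset."""
    parents = {p for ps in repo.values() for p in ps}
    return {rev for rev in repo if rev not in parents}

# Assumed scenario: Alice and Bob both cloned at 'd'.
alice = {"d": [], "e": ["d"]}               # Alice committed e
bob = {"d": [], "f": ["d"], "g": ["f"]}     # Bob committed f and g

pull(bob, alice)                 # Bob fetches e, so he now has two heads
bob_heads_before_merge = heads(bob)
bob["h"] = ["g", "e"]            # Bob merges the two heads into h

pulled = pull(alice, bob)        # Alice has done no new work in the meantime
alice_heads = heads(alice)
```

Alice receives `f`, `g`, and Bob's merge `h` as they are, and ends up with the single head `h` without performing a merge of her own.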

<Zakim> timbl, you wanted to ask whether pulling needs a hg server running on the pulled site, or just read access (http, file:. ftP:. etc)

Tim: push requires write access. does pull work by reading arbitrary files/ftp/http or do I need an hg server?

Matt: see next slide ...

<RalphS_> Multiple ways to share repositories (slide 8)

<DanC> (an example of the CGI interface: http://homer.w3.org/~connolly/projects/ )

Matt: can put a bunch of revlogs in an http directory but this is slower
... in practice, most people use the hgserve approach
... "bundles" are highly compressed version of the project
... bundles use the hgserve wire format
... tarballs can be pulled from hgserve
... so browsable source tree has a 'pull' button
... can pull raw versions of patch log via Web interface
... any network filesys will work

<RalphS_> Using cheap branches to manage changes (slide 9)

Matt: the idea is that branches are very light-weight; create and destroy them at will

<RalphS_> New features can be added with extensions (slide 10)

Matt: open source projects have found mq to be useful in managing streams of updates as patches
... forest extension manages 'subprojects'; whereas cvs has a tree of repositories, the notion of subproject is more difficult in hg

<RalphS_> For more information (slide 11)

<mpm> http://hgbook.red-bean.com/

<RalphS_> current manual (replaces URI in slides)

Fractal project organization

<Zakim> timbl, you wanted to ask whether you can also break out a subproject

Tim: for a large project that grows and grows, will forest let me split off a subproject?

Matt: yes, if I'm understanding you
... a project is a boundary where you sensibly want to do atomic commits
... given two components in which you want to make simultaneous changes in an atomic commit, these should be in the same 'project'

Tim: organization of projects might not be so tied to atomic commits
... issue of clashes does not arise so often in some styles of work
... so management issue is to permit a single checkout to get everything
... nice thing about svn is that a subdirectory can be declared to be a soft link to a project somewhere else
... svn subproject can use its own access protocol

Matt: that's effectively what 'forest' does
... a forest is a set of projects and their relevant changesets
... I'm not as familiar with forest, as it was written by someone else but that's what people are using it for

<Zakim> DanC, you wanted to ask about splitting out, e.g. a utility tool from a big project

DanC: suppose I started writing a module as part of a project but then decide it should be split out. Is there a straightforward way to do this?

Matt: no, no easy way to delete old changesets or remove files from a tree
... hg histories are more immutable than in other systems
... the obvious way to do this would be to clone a project and delete all the unwanted stuff
... in future we may allow trimming history; e.g. 'commits before rev X are no longer relevant', or 'commits outside this tree are no longer relevant'

EricP: is it easy to write code that hacks the revlogs, like sed on cvs histories?

Matt: yep, people have been successful in hacking pretty low-level things

DanC: there's an API too, don't assume you can just hack files

<ted> not to worry eric, i bet there's a mode or will be shortly

Matt: several other projects have cloned the revlog approach now
... e.g. monotone, bzr
... svn may also be looking at adopting revlog

The topology of update flows in a large development system with overlapping communities with different access rights. Wrt slide 3

Tim: describe some typical topologies
... large numbers of developers, with subgroups who sync with each other but not the 'main' repository
... others hacking with only email access
... what sort of workflows are established?
... I worry about patches that are shared only between a subset of developers who heard about them
... so accidentally you have a patch that lots of people have but never gets into the 'main' repository
... hash across the project but no single place to go for the 'tip'

Matt: I think the optimal model is the linux model
... a central person, Linus, who is the only one with push access to a central repository
... most efficient if everyone only does pull
... with push you wind up with two heads
... second 'pusher' gets a message instructing them to do pull-merge-push
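
[Editorial sketch, not part of the presentation.] The "two heads" situation and the pull-merge-push response could be modelled as below. The exception `WouldCreateNewHead` and the check inside `push` are simplified assumptions for illustration, not the test hg actually performs.

```python
class WouldCreateNewHead(Exception):
    """Raised when a push would leave the target with more heads than before."""

def heads(repo):
    """Changesets that are not the parent of any other changeset."""
    parents = {p for ps in repo.values() for p in ps}
    return {rev for rev in repo if rev not in parents}

def push(local, remote):
    """Send local changesets to remote unless that would add a head there."""
    merged = dict(remote)
    merged.update(local)
    if len(heads(merged)) > len(heads(remote)):
        raise WouldCreateNewHead("pull, merge, then push again")
    remote.update(local)

central = {"a": []}
alice = {"a": [], "b": ["a"]}
bob = {"a": [], "c": ["a"]}

push(alice, central)             # first pusher: central still has one head

try:
    push(bob, central)           # second pusher: would leave heads b and c
    refused = False
except WouldCreateNewHead:
    refused = True

bob.update(central)              # pull
bob["m"] = ["c", "b"]            # merge the two heads
push(bob, central)               # push again, now accepted
```

If everyone instead pulls and only one person pushes, the refusal never occurs, which appears to be the efficiency Matt is describing.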

<tlr> yikes, that means it won't work well with datespace

Matt: if a single person does pulls from 'lieutenants' then a push to the central repository it's more efficient
... scales well; can assign lieutenants for subprojects and they can do pull-merge for their subprojects

Tim: ironic that the technology is Web-like and decentralized but works best with a centralized social process

Matt: in the end you want to wind up with one version
... in fact there are many long-lived independent branches, even in the linux world

Eric: in the W3C case we have 'dev' that is largely synchronous and a 'cvs.w3.org' space that is asynchronous
... we care that the code in dev.w3.org is synchronized
... but the Web space cvs.w3.org largely has pages that do not depend on each other
... in the svn model that requires a single revision number for 100k's of documents ...
... how would you arrange this in hg?

Matt: we currently do not implement 'partial repositories' where you'd ignore a subset of the repository
... there are existing projects with order 100k's of docs
... but you'd probably not want to use hg to do wikipedia

<DanC> (this is what I meant by "Fractal project organization" again, fyi)

Matt: I've had a back-burner idea of doing something rcs-like, managing a single file but keeping O(n)
... basically breaking out revlog but with a single file for management
... for a 100k document case you'd want to break this into smaller projects

EricP: perhaps year-month

<ted> 1.75M resources in www.w3.org webspace at present

Matt: you don't want to divide by time, you want to divide across the tree

DanC: the W3C naming scheme lets you use year-month as a default naming scheme, files can still change

<Zakim> timbl2, you wanted to ask whether one could turn off the project-wideness

Tim: I can imagine someone wanting to establish a new project at a given point in the www.w3.org tree
... 'this is a new software system' or 'this is a tutorial'
... making a self-consistent subsystem
... but such a subsystem is unusual
... e.g. these meeting minutes do not need to be synchronized with anything else
... initial description of hg sounds similar to cvs
... how big a change would it be to remove the project-wide idea?

Matt: pretty large change. easier to work with something like forest
... or start with project with many subprojects and mark subprojects as 'ignore for commit'
... this part of the problem space is very different from cvs and a little tricky for people to adapt to if they've structured their projects for the way cvs works
... these hurdles can be surpassed but you do have to think about things a little differently

DanC: thinking more rcs-like?

Matt: yes, if everything is really independent then you probably want something like rcs
... wikipedia only thinks about changing single files

Eric: on a human level we frequently manage file dependencies
... if you could say '{...} are independent by default, {...} are mutually dependent, ...' that might be useful

Matt: that's blue-sky from where we are today
... if we wanted to support wikipedia with what we currently have we understand the changes but we haven't worked on it

<ht> I really don't see how the structuring into projects makes any sense for our datespace

comparisons with Darcs

<Zakim> timbl1, you wanted to ask about comparisons with Darcs

Matt: I understand that Darcs is (a) orders of magnitude slower and (b) magical
... has nice properties for cherry-picking in a way people seem to like but that I believe are problematic

<tlr> http://darcs.net/DarcsWiki

Matt: cherry-picking is ...
... given 2 branches where you want to bring in single changes without regard to surrounding history, Darcs lets you do this
... what I understand from bzr folk is that the way Darcs handles patches internally it's possible to reorder patches such that older versions of the project cannot be reconstructed
... if what I've heard is correct and we're understanding this correctly, this is a serious failing in a version control system
... so (c) Darcs may have some theoretical and practical serious issues

Tim: I'm surprised to hear that Darcs may have this issue; I thought it was append-only

DanC: no, what was revolutionary about Darcs was that patches commute

Ted: when groups do [lots of branching], it has tended to become a big headache
... but the lieutenant model who are responsible for merging upward ...
... would be nice to get all revision history along with [the lieutenant's merge]
... capture all the history in the main repository

DanC: yeah, but consider 'why hg' initially
... using cvs as the 'main truth' causes you to lose interesting properties of the whole system

Matt: complicated merges in cvs require lots of wizardry
... cvs admin-type person has to understand nitty gritty details of cvs
... much less the case with hg; once you know the basics of push, pull, merge there's not a lot more
... you avoid big cvs merge problem
... people tend to do their own merges without a lot of help

hg hosting, large projects, user support

<ted> branching and merging in cvs with our wgs has often been a headache when introduced, the central repo and sticking with a main branch has simplified life for most

<ted> merging from numerous branches seems like it could get very tedious

<ted> what might work with our environment is if a group wishes to go off with a decentralized repo with a handful of 'lieutenants' to handle collecting and committing to main [cvs] tree that could work

<ted> it would be nice to get the revision history as part of that though

<gerald> (+1 to ted)

<ted> my biggest concern in changing our base revision system is migrating 300+ users to another platform, the user support for these users has cost us considerably

DanC: one of the motivations for this Project Review ...
... main www.w3.org cvs repository (cvs.w3.org) seems not a good match for hg
... a long time ago we split out dev.w3.org so world could read the history
... socially dev.w3.org works like a bunch of independent projects
... in this style of work, hosting 200 hg projects on a server, does this use more disk i/o, etc [than cvs]?

Matt: hg is designed to minimize i/o

<gerald> (I'm more worried about user support and migration costs than CPU/IO which is relatively cheap)

Matt: if you're using a fast cgi approach, like @@ framework
... you can plug it into any of the python cgi interfaces; zope sorts of things

<ted> we would have to hack quite a few tools for hg if it results in mirroring

Matt: if you're doing something that generates a lot of load you should use fast cgi
... so indices are kept in memory
... there's an import of freebsd repository in to hg and they haven't yet gotten to the point where they need fast cgi
... if you get to the point where a single system is too slow you simply clone it and do round-robin load balancing

Tim: and somehow sharing patches?

Matt: you'd probably keep a single push point
... commits have to be single-threaded
... hg backend uses lock-less pull
... multiple readers, one writer
... so no lock contention
... e.g. dev{1,2,3}.w3.org could all pull from a backend server, push.w3.org
... pushes would all go to push.w3.org
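
[Editorial sketch, not part of the presentation.] The single-push-point topology might be summarized as a small router: all pushes go to one host, and pulls rotate across read-only mirrors. The class `RepoRouter` is hypothetical; only the host names come from Matt's example.

```python
import itertools

class RepoRouter:
    """Route pushes to one writer and spread pulls over mirrors round-robin."""

    def __init__(self, push_host, mirrors):
        self.push_host = push_host
        self._mirrors = itertools.cycle(mirrors)

    def host_for(self, operation):
        if operation == "push":
            return self.push_host        # commits stay single-threaded
        if operation == "pull":
            return next(self._mirrors)   # readers do not contend for a lock
        raise ValueError("unknown operation: " + operation)

router = RepoRouter("push.w3.org",
                    ["dev1.w3.org", "dev2.w3.org", "dev3.w3.org"])
pull_hosts = [router.host_for("pull") for _ in range(4)]
push_host = router.host_for("push")
```

Each mirror would itself pull from the push host to stay current; that synchronization step is left out of the sketch.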

DanC: are there any statistics [for dev.w3.org]?

<Zakim> s-mon, you wanted to ask about Matt's perspective on how folks have dealt with migrations.

Simon: there exist tools for importing cvs repositories into hg. how well have these worked in practice?

Matt: small projects get along pretty well.

<s-mon> tools for cvs<>hg conversion - http://www.selenic.com/mercurial/wiki/index.cgi/RepositoryConversion?action=show&redirect=ConvertingRepositories#head-8f6fdc4a130232720c51de0b4417e213898f28ad

Matt: projects that do repository hacking are problematic

<DanC> (my experience is that the migration tools are pretty immature. )

Matt: firefox developers have had more trouble than others
... freebsd folks seem to get along ok, though they've not yet committed to using hg
... the biggest trick with cvs is converting histories to changesets
... requires going through the history and identifying co-occurring changes as changesets

<Zakim> ht, you wanted to raise agendum 9

Matt: the tools to do this can get confused
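
[Editorial sketch, not part of the discussion.] "Identifying co-occurring changes as changesets" is typically a heuristic. The version below assumes per-file commits can be grouped when they share an author and log message and fall within a time window. The names `FileCommit` and `group_into_changesets` and the 300-second window are assumptions, not the behaviour of any particular conversion tool.

```python
from collections import namedtuple

# One per-file commit as a cvs-style history would record it.
FileCommit = namedtuple("FileCommit", "path author message timestamp")

def group_into_changesets(commits, window=300):
    """Group per-file commits with the same author and message made close together."""
    changesets = []
    latest = {}  # (author, message) -> most recent changeset for that key
    for c in sorted(commits, key=lambda c: c.timestamp):
        key = (c.author, c.message)
        current = latest.get(key)
        if (current is not None
                and c.timestamp - current[-1].timestamp <= window
                and all(c.path != prev.path for prev in current)):
            current.append(c)      # same logical change, another file
        else:
            current = [c]          # start a new changeset
            latest[key] = current
            changesets.append(current)
    return changesets

commits = [
    FileCommit("a.c", "alice", "fix typo", 1000),
    FileCommit("b.c", "alice", "fix typo", 1004),
    FileCommit("a.c", "bob", "add test", 1010),
    FileCommit("a.c", "alice", "fix typo", 5000),
]
grouped = [[c.path for c in cs] for cs in group_into_changesets(commits)]
print(grouped)
```

A heuristic like this can misgroup changes, for example when the same log message is reused for unrelated commits or when timestamps are unreliable, which may be one way such tools "get confused".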

Henry: seems to me that there are 3 parts of W3C usage we'd need to consider:
... (1) datespace, for which hg doesn't seem the right model; single-document-orientation more appropriate
... (2) /TR space, for which we now tend to have a subdirectory for a document containing a set of files

<sandro> ohhhhh. Have the pubs from a given WG be a Project.

<DanC> sandro, that raises all the fractal questions: what about WDs shared by the Query and XSL WGs.

Henry: having each TR document be a "project" may be appropriate
... but design goal of making branching and merging fast is borderline irrelevant for /TR
... though sometimes we do have both editor and webmaster simultaneously changing doc at the last minute
... (3) an Area; e.g. within XML, SVG, editors tend to have workspaces
... some editorial teams may find branching and merging relevant
... might have been helpful in cases where MSM and I worked together
... (4) dev.w3.org clearly has projects and branching/merging might well support that community better

<Zakim> plh, you wanted to ask for eclipse extensions

Philippe: I've been very successful getting people to use eclipse for dev.w3.org
... because they're pretty autonomous and eclipse integrates ssh support
... is there any support for eclipse within hg?

<DanC> (googling yields http://www.vectrace.com/mercurialeclipse/ )

Matt: yes, there is an eclipse plugin but I've not used it myself
... there may even be competing eclipse plugins
... but I can't speak from personal experience how useful they are
... see wiki
... 'OtherTools'

<mpm> http://www.selenic.com/mercurial/wiki/index.cgi/OtherTools

<Zakim> ted, you wanted to follow ? of plh's re clients

Ted: biggest job in getting new W3C users has been cvs startup
... what experience can you relay on hg learning curve?

Matt: first big project adopter was xen, a linux hypervisor
... hg started in April 2005
... hypervisor started using hg in June/July 2005

<DanC> "Xen - a free hypervisor for virtualising kernels" among http://www.selenic.com/mercurial/wiki/index.cgi/ProjectsUsingMercurial

Matt: we got 3 or 4 patches from them then they went quiet because they were happy
... I assume they're still happy

<ted> +1 to ht, our user base's tech skill sets varies widely

<DanC> yes, good point, ht (that our user base is very different from most opensource software dev projects - they would all know about SCM systems before, whereas many of ours have never seen one before )

Matt: Sun sends some questions; they have long-lived software processes that were closely adapted to teamware system and were slight mismatch for hg
... for the most part, Sun's questions are not about usage but about obscure bugs
... so I think people adapt to the hg model quickly once they understand push and pull

Ted: hypervisor is a group of geeks
... W3C [document editors] have very varied skill sets
... editor-type people in the mix
... how well do these people adapt?

Matt: not much experience there
... comments are that hg is easier to understand than cvs
... so people adopting version control for the first time have little difficulty
... I've tried to make hg similar to cvs where that was sensible
... but I've been annoyed by the usability of cvs so I've fixed some of that
... hopefully hg is easier to use than cvs

Thomas: in W3C datespace we have a lot of concurrent editing of independent files in a tree that is only Team-editable
... and an occasional subdirectory in which we give Member write access
... these are confined and @@ projects
... it's often painful to manage access rights in these subdirectories, give access to changelogs, etc
... could we adapt hg or something like it into a part of W3C webspace?
... use this as a way to grant Member write access?

DanC: I've run that experiment with the GRDDL test repository
... it sort-of worked
... it's straightforward to export cvs history into hg a little bit at a time

<ht> Test suites are another interesting example -- project concept really doesn't make a lot of sense

DanC: but hg is more expressive than cvs so importing hg history back into cvs loses data
... there are things in the hg history that have no analog in cvs

Thomas: I'd accept loss of history data as a way to experiment with hg

<gerald> +1 to Thomas

Tim: save the revlogs in cvs?

Matt: it's theoretically possible to build a cvs gateway to hg; something that looks like a cvs server but is backed by hg

<tlr> I'm looking at CVS purely as a way to bridge an hg repository into web space

DanC: such a gateway would have to round off various corners

Matt: yes, the gateway would be responsible for that

Henry: would work fine from cvs point-of-view, only problem would be an hg user also making changes

Matt: a cvs client could still checkout and would get an approximation of the log

<ericP> i need to go. thank you very much, Matt and DanC for putting this together

Matt: have an acl extension for managing push permissions for parts of a tree
... even without direct commit access it's still useful to be able to make local commits and use this to publish patches, produce changesets, etc.
... send one of these changesets to someone with push permission
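
[Editorial sketch, not part of the discussion.] Managing push permissions for parts of a tree can be pictured as a check of each changed path against per-directory rules. The rule format, the function `push_allowed`, and the user and path names below are hypothetical. This is not the configuration syntax or logic of the acl extension Matt mentions.

```python
def push_allowed(user, changed_paths, rules):
    """Allow a push only if every changed path is covered by a rule naming the user.

    rules is a list of (path prefix, set of users); an empty prefix matches
    the whole tree.
    """
    for path in changed_paths:
        if not any(path.startswith(prefix) and user in users
                   for prefix, users in rules):
            return False
    return True

rules = [
    ("", {"team"}),                      # Team may push anywhere
    ("2007/grddl-tests/", {"member"}),   # Members only under one subdirectory
]

member_ok = push_allowed("member", ["2007/grddl-tests/t1.html"], rules)
member_denied = push_allowed("member", ["2007/other.html"], rules)
team_ok = push_allowed("team", ["2007/other.html", "2007/grddl-tests/t1.html"], rules)
```

A contributor who fails such a check can still commit locally and hand the resulting changesets to someone whose push would pass it.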

DanC: Thanks, Matt!

[adjourn]

<Zakim> ted, you wanted to ask about access control in distributed environment where people can pull from others

Ted: is there a way to specify 'upstream' what can be pulled from the next generation?
... e.g. when someone without write access still makes changes available to others for pull

Matt: no, nothing like this. something like gpg'd files but that would be inefficient

<timbl> RSS?

<ted> so one could circumvent inadvertently patent policy and member confidentiality

Ted: we have policies about Member confidentiality
... sometimes people don't realize what category a given document falls into
... so inadvertently share a private document

Matt: no provisions in hg for any sort of rights management
