One of the most powerful tools we have as software developers is not a coding pattern, method, framework, or even really code at all. Just as a bank keeps its most valuable assets in a safe, we as developers seek to protect our most valuable asset: the code we create.
Source control (referred to variously as source control management, version control, revision control, and probably a half dozen other terms as well) describes a system we use to store our code, manage changes to that code, and share our code with others. Our choice of a source control system is one of the single most important decisions we can make, and will radically affect how productive we are able to be.
In this article we will examine the rationale behind source control, and get a rundown of the different types of source control systems available, including examples of each still in widespread use today. After that we will discuss how to structure a solution to get the most out of our source control system, with an emphasis on .NET solutions. Lastly we will learn how to integrate a source control system with the software development lifecycle.
The Need for Source Control
Jeff Atwood writes that source control is “fundamental to the practice of modern software development.” That’s why it’s the very first item on The Joel Test, a quick 12-item checklist that can help you judge the quality of a software team.
Without source control, it is nearly impossible for a team to work on a software project together in the same room, let alone in separate locations. Imagine only one developer at a time being able to modify a project, and when they were done, they would have to zip up the project and send the new version to all the other teammates. Maybe two developers could work on the project at once if each developer only worked on one file at a time, and was constantly communicating with the other developers so nobody ever stepped on anybody else’s toes. Now imagine if one of the developers was in another building, or another city, or another continent! This obviously does not scale and does not work.
However, even a single developer working on a project all alone can benefit from a source control system. A source control system gives you the ability to experiment freely, secure in the knowledge that if the experiment is a failure, you can revert to a known good configuration. And if you achieve some measure of success, you can commit that code and have a history of it forever, knowing you will not lose it.
Without source control, you will invariably find yourself in a situation where your current code is a mess, but you can’t figure out how to get back to that state where you had something that actually worked. Don’t let this happen to you! Live by the motto “Check in early, check in often.”
But before we can do that, we need to pick a source control system to use.
Source Control Types
The history of version control systems spans decades, so there are dozens of options available. In this article we will examine the overarching styles of source control and focus mostly on those systems that are still in widespread use today.
Astoundingly, some developers still do not use any source control at all. Either they are working on code all by themselves and don’t see the need for source control, or if they must share code with others they store projects on a file server.
Down this path lies madness. Don’t do it!
First Generation – File Locking
The first generation of real source control systems relied on file locking. Examples include Revision Control System (RCS) and Source Code Control System (SCCS), but neither is in widespread use today. A file lock allows only one user to work on a file at any given time, and these systems track each file individually rather than the project as a whole.
Since file locking source control systems are really not widely used any longer, we won’t dwell on them. However it is interesting to note that there are some wiki engines available that use file locking source control systems to store page revisions, an application to which they are suited quite well.
Second Generation – Centralized
The second generation of source control systems are those that are characterized by a centralized structure. A single central server stores the master copy of an entire source tree, and each user is able to get a local working copy on their development machine. When the developer has finished some code, they will check it in or commit it to the central server, where it will then be available to other developers.
A centralized server offers many advantages, especially over the first generation file locking systems:
- Developers are able to get any revision of the code, either the most recent version or any previous version.
- Most centralized systems offer the ability for multiple developers to edit a file in their local working copy at the same time. Other systems require you to gain an exclusive lock on a file before editing it. Some systems offer both modes.
- The centralized server offers a single source of truth for a codebase, so you always know where the most up-to-date code is located.
- In most centralized systems, specific versions can be tagged so that it is easy to find the status of the code at a specific point in history, such as a released version of software.
However, centralized systems offer their share of disadvantages as well:
- The single centralized system also becomes a single point of failure. The source control system must be backed up religiously or a failed hard drive or corrupted repository could be catastrophic.
- It is often difficult to work together if the system requires exclusive locks to be taken on files.
- When working on the same file simultaneously, committing changes can require a fair amount of manual merging, which can be difficult to do properly.
- Operations that communicate with the central server (commits, compares, etc.) must travel over the network and can therefore be slow.
- It can be difficult to work when disconnected from the central server.
- Because a commit is immediately visible to all, developers may be hesitant to “check in early and often” for fear of breaking the build. The lack of freedom to commit at any incremental success can lead to the same problems as not using source control at all, although admittedly on a much smaller scale.
- Only some centralized systems support the concept of branching.
Now let’s look at a few.
Concurrent Versions System (CVS) is arguably the first centralized source control system. While it is essentially obsolete (the last stable release was in 2008) its use is still surprisingly widespread. It is not a viable candidate for new projects, and was largely supplanted by Subversion.
Apache Subversion (or SVN) was developed to be CVS done right. It is owned by the Apache Software Foundation, released under an open-source license, still actively developed, and very widely used. If it is a centralized source control system that you want, and you want it to be free, then Subversion is a very solid performer.
Subversion fixes a lot of the problems that were inherent in CVS. For example, interrupted commits in CVS could cause a corrupted repository, but in Subversion, commits are fully atomic.
While the tools for using Subversion on most platforms are fairly solid, due to its centralized nature it can still be very hard to perform some operations, especially merges. If you create a separate branch of development for a large feature, and then development continues both in the feature branch and in the trunk for some time, merging the feature branch back into the trunk can prove very difficult.
Visual SourceSafe (VSS) is Microsoft’s first version control system, although it was originally developed by another company that was acquired by Microsoft. It has been largely supplanted by Microsoft Team Foundation Server (TFS); however, it is still widely used, even though just about everyone is in complete agreement that it is truly awful. One must assume that the organizations still using it are doing so because it was a supported Microsoft product and/or it came for free with an MSDN license and they never bothered to upgrade.
Visual SourceSafe is not a viable candidate for any project. Please do not use it anywhere, ever.
Team Foundation Server (TFS) is Microsoft’s current offering for not just source control management (SCM), but for full application lifecycle management (ALM). This means that in addition to just storing your source code, TFS offers work item tracking, project management tools, reporting, and a host of other goodies to help you manage every aspect of the software development process.
The version control aspect of TFS, Team Foundation Version Control (TFVC), is Microsoft’s successor to SourceSafe, and most of the truly dangerous problems from SourceSafe have been fixed. That being said, TFVC is still a centralized system and suffers from all of the associated problems, especially in the area of branching and merging. TFS will allow you to create different branches, but these must be separately mapped to a workspace on your local machine, so it is difficult to switch between branches. Merges are still a pain, with TFS sometimes insisting that you merge a file that did not change, and it is not uncommon for a developer to lose part of a day figuring out what happened to the code they wrote because another developer performed a bad merge.
Where TFS truly shines is in its associated application lifecycle management tools for teams employing Agile methodologies, like work item tracking, project management, testing, etc. However the sad truth is that in many cases, these features go unused! Large enterprises install and require use of TFS because it is the Microsoft-blessed solution, but then teams only use the version control and continue to do all their project management in SharePoint and Excel!
TFS is available both as an on-premises solution (Team Foundation Server) and as a cloud offering (Team Foundation Service). It is very confusing that they both share the same acronym. TFS(ervice) is available for free for teams of up to 5 users.
In 2013, Microsoft announced that TFS will be supporting Git natively, both in the Visual Studio tooling, as well as in TFS the Server and TFS the Service. The Visual Studio tooling was released as an add-on to Visual Studio 2012 and will be incorporated out of the box with Visual Studio 2013. Git hosting is currently available in TFS the Service, and Git will be supported in TFS the Server in Team Foundation Server 2013.
Third Generation – Distributed
The third generation of source control consists of Distributed Version Control Systems (DVCS). This means that there does not have to be a centralized server, although sometimes there still is one. In a DVCS, every developer has the complete history of the code repository, plus a local working copy. This means that when you commit, you don’t commit to a server, you commit locally. Put another way, instead of comparing your local changes to files on a server, the DVCS compares changes in your local working copy to the history in your local repository. This means that commits and merges are super fast!
DVCSs also finally get merges right, by changing how they think about changes. Most centralized systems think of history as a series of versions of each file, whereas a DVCS thinks of history as a series of changesets, each describing what changed between one revision and the next. (This is a simplification: the actual storage formats vary from tool to tool.)
You can think of a DVCS repository history as a stack of transparencies, where each transparent sheet contains only the change made in that revision, and by looking down through the entire stack, you can reconstruct the whole picture.
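The transparency analogy can be sketched in a few lines of Python. This is a toy model for illustration only, not how Git or Mercurial actually store data: each hypothetical changeset is a dict mapping a line number to its new text (or None to remove that line), and `reconstruct` is an illustrative name of our own.

```python
def reconstruct(changesets):
    """Replay changesets oldest-first and return the resulting lines."""
    lines = {}  # line number -> text
    for changeset in changesets:
        for number, text in changeset.items():
            if text is None:
                lines.pop(number, None)  # this revision removed the line
            else:
                lines[number] = text     # this revision added or replaced the line
    return [lines[number] for number in sorted(lines)]


# Hypothetical history: each entry is one "transparent sheet".
history = [
    {1: "def greet():", 2: "    print('hi')"},  # revision 1: initial file
    {2: "    print('hello')"},                   # revision 2: change one line
    {3: "greet()"},                              # revision 3: add a line
]

latest = reconstruct(history)         # look down through the whole stack
original = reconstruct(history[:1])   # look only at the first sheet
```

Looking through the whole stack gives the current file; looking through only the first few sheets gives the file as it was at that revision.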
This makes it easy to treat a merge as a first class citizen, rather than an afterthought. It’s kind of like taking two stacks of transparencies and shuffling them together! A centralized VCS will see that two developers changed the same file and throw a fit. A DVCS will be able to merge both changes together automatically as long as they were in separate parts of the file. Also, instead of two distinct versions without much context, a DVCS merge always involves a base version, and then the two branch versions, so it’s easier to see what was changed and why.
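To make the role of the base version concrete, here is a heavily simplified three-way merge in Python. It assumes all three versions have the same number of lines (edits only, no insertions or deletions), which real merge tools do not require; the function name and the sample data are ours, purely for illustration.

```python
def three_way_merge(base, ours, theirs):
    """Merge two edited copies of `base`, line by line.

    Simplifying assumption: all three versions have the same number of
    lines. Returns the merged lines and the indexes of conflicting lines.
    """
    if not (len(base) == len(ours) == len(theirs)):
        raise ValueError("this toy merge only handles same-length versions")
    merged, conflicts = [], []
    for index, (b, o, t) in enumerate(zip(base, ours, theirs)):
        if o == t:        # both sides agree (or neither changed the line)
            merged.append(o)
        elif o == b:      # only "theirs" changed this line
            merged.append(t)
        elif t == b:      # only "ours" changed this line
            merged.append(o)
        else:             # both changed the same line in different ways
            merged.append(o)
            conflicts.append(index)
    return merged, conflicts


base = ["alpha", "beta", "gamma"]
ours = ["ALPHA", "beta", "gamma"]    # we edited the first line
theirs = ["alpha", "beta", "GAMMA"]  # they edited the last line

merged, conflicts = three_way_merge(base, ours, theirs)
```

Because the two edits touch different lines, the merge succeeds with no conflicts. Without the base version, there would be no way to tell which side had changed each line.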
In a centralized system, developers will commonly not want to check in until their code is perfect for fear of breaking the code for everyone else. In a distributed system, you can commit locally whenever you have anything worth saving. You can even commit an experiment, back up in the history, and try something else, keeping the experiment alive to refer to. Then once a feature is done and you are comfortable inflicting your new code on everyone else, you push all the changes to a shared repository.
There are two main variants of DVCS available today, Git and Mercurial. They both follow the pattern of committing locally, merging changes, then pushing the result. How they differ lies in how they organize the workflow around this process.
Git is a DVCS that was initially developed by Linus Torvalds to assist in the development of the Linux kernel. Because of this heritage, its history is very shell-based, and historically the tooling for Windows has not been great. Of course a great cause for the runaway success of Git has been GitHub, the social coding site that hosts many prominent open source projects.
Git’s design revolves around branches. Branch names are pointers to specific changesets. A changeset that can no longer be reached from the history of any branch is eventually removed from the tree. References for remote branches also exist. So when you push code from one repository to another, the changesets and the pointers to the heads of the branches are exchanged.
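The idea of branches as pointers can be modeled with two small dictionaries. The sketch below is a conceptual model, not Git’s real data structures: the changeset ids, branch names, and the `reachable` helper are all hypothetical.

```python
def reachable(parents, branches):
    """Return every changeset reachable from at least one branch head."""
    seen = set()
    stack = list(branches.values())
    while stack:
        changeset = stack.pop()
        if changeset in seen:
            continue
        seen.add(changeset)
        stack.extend(parents[changeset])  # walk back through the history
    return seen


# Each changeset lists its parent changesets (hypothetical ids).
parents = {
    "c1": [],
    "c2": ["c1"],
    "c3": ["c2"],
    "x1": ["c2"],  # an experiment that branched off after c2
}
# A branch name is nothing more than a pointer to one changeset.
branches = {"master": "c3", "experiment": "x1"}

before = reachable(parents, branches)

del branches["experiment"]  # drop the pointer, not the changeset itself
orphaned = set(parents) - reachable(parents, branches)
```

After the `experiment` pointer is deleted, `x1` is the only changeset that no branch can reach, which makes it a candidate for cleanup.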
This enables a distributed workflow with GitHub (although it is possible outside of GitHub) where you fork a repository, create a new feature branch, and make some changes. You publish those changes to your own public repository (your GitHub fork) and then submit a pull request to the original repository asking that your change be included. This is the backbone of a lot of modern open-source software projects. It also allows another option for customer support. Instead of “We don’t have time to implement that right now”, a library maintainer can instead say “Can you send me a pull request for that?”
Git’s tooling on Windows has historically been poor, but this is getting better. You can use Chocolatey to install TortoiseGit, which will also install all the necessary dependencies for you. You can also install GitHub for Windows (which can be used for more than GitHub repositories), although this tool is a bit of a leaky abstraction and doesn’t make the full power of Git available to you.
Mercurial (also referred to as Hg, the chemical symbol for mercury) is another DVCS but with a slightly different approach.
The Mercurial branching strategy is to clone the entire repository, and then start working in the new location. Changesets can then be pushed and pulled between repositories.
Mercurial also has named branches that may seem similar to Git branches, but are not. Mercurial named branches live forever, so it doesn’t work to use them for experiments.
Mercurial has a command line syntax similar to Git’s, but most people who use Mercurial on Windows will use the TortoiseHg GUI tools, which most beginners will find more user-friendly than the TortoiseGit tools.
Git vs. Mercurial: Which is Better?
Actually, this is the wrong question to ask. Any DVCS is superior to its centralized forebears. Participating in any flamewar about whether Git is better than Hg or Hg is better than Git is a waste of time. The important thing is to use one of them.
Both tools allow pretty seamless branching and merging and work very well. Mercurial’s tools can be a bit easier to learn for a beginner, but Git’s tools are ultimately more powerful for the power-user. In short, Mercurial is more likely to protect you from yourself but at the cost of advanced functionality, where Git gives you a lot of rope but will also let you hang yourself with it.
If you want to work with a project in GitHub, then obviously your choice is made for you. But just because GitHub is ubiquitous doesn’t mean that it is the only project hosting option out there. Bitbucket, CodePlex, and Google Code are all prominent source code hosting platforms that offer Mercurial. In fact, all three of those providers host both Git and Mercurial, but you have to make the decision when you create a project and stick with it.
If you don’t want to have to choose, you might not need to. Fog Creek Software offers a product called Kiln Harmony that enables simultaneous Git/Mercurial hosting. Steve can clone a repository using Git, and Barry can clone the same repository using Mercurial, and anything they commit makes a full round trip. Kiln also comes with FogBugz, an integrated bug tracking and issue management tool that integrates tightly with Kiln’s source control to provide a complete set of ALM tools that offer an alternative to TFS.
Both Git and Mercurial are pluggable when it comes to the external tools they use for diff and merge viewing. These tools are worth checking out to see what works best for you:
- WinMerge – a free graphical diff viewer that can be installed via Chocolatey. It easily beats the native diff viewer shipped with almost every SCM package out there. For some SCM packages, if you install WinMerge first, the SCM will automatically configure itself to use WinMerge when you install the SCM.
- KDiff3 – The standard-issue 3-way merge tool. When doing a manual merge, you need to be able to compare the base version along with the 2 versions being merged together. You can’t do this in a standard left/right viewer. Visual Studio can be configured to use KDiff3 instead of the built-in TFS merge tool when using TFS.
In addition, if you are in an enterprise that uses TFS (with TFS Version Control, not Git) and there is no hope of escape, there is a way you can use Git locally but still integrate with TFS. git-tfs is a 2-way bridge between Git and TFS, which you can install via Chocolatey. With git-tfs, you “clone” a TFS repository into a local Git repository, commit changes locally, and instead of pushing, you run the git-tfs “checkintool” to send those changes back to the centralized TFS repository. This gives you many of the advantages of Git, including magically automatic merges, in an environment where TFS is enforced.
If it isn’t obvious by now, this article wholeheartedly recommends using a DVCS for your source control needs. It’s important, however, to structure your projects with a DVCS in mind.
A DVCS repository contains the entire history of a project, all the way back to the first commit. This is fine for code files; changes to text can be stored compactly as deltas, so things don’t bloat too quickly.
However, this changes fast when you talk about binary files. Without a way to record only the delta, the entire binary file’s contents must be stored as part of the changeset. If one 5 MB file gets changed 10 times, then that’s 50 MB that must be dragged along every time you clone that repository.
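The arithmetic is simple enough to check with a one-line function. This is only a back-of-the-envelope estimate that assumes every revision stores the file’s full contents with no compression; the function name is ours.

```python
def full_copy_cost_mb(file_size_mb, versions):
    """Space used when every version stores the file's entire contents."""
    return file_size_mb * versions


cost = full_copy_cost_mb(5, 10)
print(cost)  # 50, matching the figure above
```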
Here are things you should keep in mind when you structure your solution:
- The results of a build should not be committed to source control. Both Git and Mercurial have an ignore file (.gitignore or .hgignore, located within the root of the repository) where you can list the “bin” and “obj” directories created by Visual Studio.
- Use NuGet Package Restore so that NuGet packages themselves (which are nothing but fancy zip files) are not committed to source control, but will be restored automatically on a build.
- For websites, do not include large media like images, videos, or mp3s directly within your project. Store these static resources on a separate cookie-free domain. Aside from the benefits for your source control repository, this will also help get better website performance.
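As a rough illustration of the first two points, the following Python sketch models the effect of an ignore file on a list of paths. It is not a parser for .gitignore or .hgignore syntax. The directory set and the sample paths are assumptions: “bin” and “obj” come from the text above, while “packages” assumes NuGet restores into a solution-level packages folder.

```python
from pathlib import PurePosixPath

# Directories whose contents are generated and can be rebuilt or restored.
IGNORED_DIRECTORIES = {"bin", "obj", "packages"}


def should_commit(path):
    """Return False if the file lives under any ignored directory."""
    directories = PurePosixPath(path).parts[:-1]  # drop the file name
    return not any(part in IGNORED_DIRECTORIES for part in directories)


# Hypothetical paths from a small .NET solution.
candidates = [
    "src/App/Program.cs",
    "src/App/App.csproj",
    "src/App/bin/Debug/App.exe",
    "src/App/obj/Debug/App.pdb",
    "packages/NUnit.2.6.2/lib/nunit.framework.dll",
]

to_commit = [path for path in candidates if should_commit(path)]
```

Only the source file and the project file survive the filter; everything else can be regenerated by a build or a package restore.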
Software Development Lifecycle
Good version control is critical to project success, but it can be even more powerful if it is well integrated as a part of the software development lifecycle.
No matter the choice of version control, it’s important to use continuous build and integration to help quickly identify problems with the code. Ideally, each code push will result in a full build, automatic execution of a suite of unit tests, and a full non-production code deployment.
TeamCity is an excellent build server package made by the same company as ReSharper, and is free for very small teams. A completely free option is Jenkins CI, which is cross-platform (due to running on Java) and easily configurable from a web interface.
If you are using distributed version control, then every time someone pushes code to a shared repository, hooks can be set up in either TeamCity, Jenkins, or whatever CI solution you use, so that a build is triggered to run. This is far superior to a “nightly” build because developers will get nearly immediate feedback that something is wrong, and can quickly respond to it before other developers pull down the same bad code.
If your project is fairly small, one build task can run the build, perform unit tests, and deploy the project to a test environment. Some projects grow large enough, however, that it may be useful to break this up. For example, if the deployment part of the process is brittle for some reason, you may want to divide that into a separate build server task so that you can rerun the deployment without requiring a repeat of the build and test steps. Most CI server packages will allow the creation of chained tasks, where completion of one task can trigger a second, or the second can be triggered manually.
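The idea of chained tasks can be sketched without reference to any particular CI product. The Python below is a conceptual model only: `run_chain`, the task functions, and the flaky deployment are all hypothetical and do not reflect the configuration format of TeamCity or Jenkins.

```python
def run_chain(tasks, start_at=0):
    """Run tasks in order from `start_at`, stopping at the first failure.

    Returns the names of the tasks that completed successfully.
    """
    completed = []
    for name, task in tasks[start_at:]:
        if not task():
            break  # a failed task prevents later tasks from running
        completed.append(name)
    return completed


attempts = {"deploy": 0}

def build():
    return True

def test():
    return True

def deploy():
    # Simulate a brittle deployment: fails the first time, then succeeds.
    attempts["deploy"] += 1
    return attempts["deploy"] > 1


tasks = [("build", build), ("test", test), ("deploy", deploy)]

first_run = run_chain(tasks)         # deployment fails on this run
retry = run_chain(tasks, start_at=2) # rerun only the deployment task
```

Because deployment is its own task, the retry does not repeat the build and test steps that already passed.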
We don’t write software for no good reason. We have requested features to implement or bugs to fix. These should be stored in an issue tracker of some type. After all, “Do you have a bug database?” is the fourth question on The Joel Test.
Integration between issues and code is a key aspect to maintaining a codebase over the long term. Ideally, checkins should be able to relate to issues, and vice versa. This way, when inspecting the history of code changes you can refer to the issues that the code was trying to deal with, which gives you the essential Why to go with the code’s What. This also provides the source of information to create release notes, by summing up the items that were implemented or fixed in a given set of changesets.
This is where hosted source control systems really begin to shine and differentiate themselves, whether it is the Issues section of a GitHub repository, the FogBugz/Kiln bug-tracking/code-hosting combination, or the ALM features of Team Foundation Server, or probably several other tools. A good tool will allow you to easily link issues via commit messages. For instance, when working with a GitHub repository, you can close Issue #45 by adding “Closes #45” to your commit message.
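To show how commit messages can feed release notes, here is a small Python sketch that pulls issue numbers out of messages. The regular expression only approximates GitHub’s closing keywords and is not GitHub’s actual implementation; the function names and sample messages are hypothetical.

```python
import re

# Approximates keywords such as "Closes #45", "fixes #12", "Resolved #7".
CLOSES_PATTERN = re.compile(
    r"\b(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?)\s+#(\d+)",
    re.IGNORECASE,
)


def closed_issues(commit_message):
    """Return the issue numbers a single commit message claims to close."""
    return [int(number) for number in CLOSES_PATTERN.findall(commit_message)]


def release_notes(commit_messages):
    """Return the sorted, de-duplicated issues closed by a set of commits."""
    issues = set()
    for message in commit_messages:
        issues.update(closed_issues(message))
    return sorted(issues)


messages = [
    "Fix typo in README",
    "Validate email on signup. Closes #45",
    "Handle null input, fixes #12 and closes #45",
]

notes = release_notes(messages)
```

The first message mentions no issue, so the release notes list only issues 12 and 45.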
Whether working on a team of one or a team of fifty, source control is a must for every software developer and one that can have a great effect on the overall success of that team.
In this article we have described the need for source control and outlined a general history of source control tools, including some detailed information on many that are in widespread use today. We wholeheartedly recommend using a 3rd-generation distributed version control system, but the choice between Git and Mercurial is left up to you. We also learned how to structure a solution for efficient storage in source control, and about the importance of integrating your source control with the entire software development lifecycle.
This article is not a definitive guide on how to use Git or Mercurial, only a collection of pointers that will hopefully put you on the right path. If you are new to distributed version control, we strongly encourage you to check out Hg Init: a Mercurial tutorial, which was written by Joel Spolsky, CEO of Fog Creek Software and Stack Exchange. While it is on its face a tutorial for Mercurial specifically (and somewhat of an ad for Fog Creek’s Kiln product as well), the concepts of distributed source control apply equally to Mercurial and Git, even if the syntax differs.