Analytical Code Assurance

Suggested reading

AF Duck book: Version control

NHSE RAP Community of Practice: Introduction to Git

At a glance

Version control (Git) is an essential tool that helps teams to safely collaborate and make their code resilient by tracking changes, enabling branching, and ensuring code is transparent, auditable and reproducible.

This guidance aims to give users an understanding of version control, the benefits of its use, and how to implement it in their work.

What is version control?

Version control is a way to track line-by-line changes to source code over time. Version control keeps an audit trail of changes and allows any point in history to be restored.

Version control supports collaboration on analytical projects by managing different versions of the same files. This removes the need to manage file versions manually through inefficient naming conventions.

As well as tracking the history of source code, version control helps us track changes in accompanying documentation and dependencies, streamlining environment and package management. This helps us understand which versions of the code produced which outputs, the environment those outputs were produced in, and the state of any dependencies.

Version control should be seen as a tool that underpins and promotes high quality analytical code, facilitating other best practices detailed in this playbook.

What is Git?

Git is a distributed version control system that is the tool of choice at the NHSBSA. Git is built on simple and elegant principles that support independent work in an offline, local environment, and “pushing” changes back to an online hosted remote. Git hosting providers such as GitLab, GitHub and Azure DevOps enhance the collaborative experience with features such as code review and issue trackers.

There are 2 main ways to interact with Git:

through the command line interface (Git Bash) - the Git CLI is more difficult to use for beginners but much more flexible and powerful
through IDEs with in-built Git functionality - tools like RStudio and Visual Studio Code help users get started with the basics of creating repositories, linking them to a remote, committing, pushing, and pulling through streamlined GUIs.

Our suggested learning resources provide additional support for beginner and expert Git users.

Why should we use version control?

Version control is a key tool for assuring analytical code. It enables teams to manage change safely, track code lineage, collaborate, and reproduce results with confidence.

Version control makes code:

transparent and auditable by keeping a complete, attributed history of every change for full traceability
reproducible, making sure any past version of the code can be restored
safe for collaboration and parallel work through branching, which prevents clashes and supports enforced peer review
easy to revert to a stable version after problematic changes
resilient by storing work safely in a remote repository, protecting against local failures
easier to share knowledge about, providing clear history and structure to help new team members understand the codebase

How do we use version control?

Teams should adopt tools and processes that fit their circumstances. However, when you adopt version control we recommend following a set approach:

commit often – small, frequent commits with meaningful messages are easier to review and understand for both other team members and your future self
push to the remote regularly – committing to your local repository doesn’t mean that your changes are reflected for everyone else. Remember to push your commits to the remote
use branching – use a branching strategy to organise work and separate changes by collaborators. Your main branch should always be assured, reviewed code that can be trusted
peer review – use pull/merge request functionality from Git hosting providers to review code before merging changes
manage sensitive data – follow best practice when managing credentials, secrets, API keys, and personal data. It is easy to accidentally check these things in to version control and difficult to remove them again

Rule of thumb

Commit after each small task is completed, such as writing a function, writing a common table expression, or developing a visualisation. Push to the remote at least once a day.

Git has a learning curve and it is easy to make mistakes when getting started but don’t be afraid to break things. Git is designed to make fixing things easy and mistakes are usually reversible.

Setting Git up

Git uses a configuration file to set your name and email address that are attached to commits. It is essential that you set your Git config to allow changes and commits to be attributable to you as an author.

Guidance on setting your config and on creating user accounts for Git providers is available on the NHSBSA Confluence.
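As a minimal sketch of what setting the config involves, the commands below set and read back a name and email address in a throwaway repository. The identity shown is a placeholder, and the NHSBSA Confluence guidance remains the reference for the values you should actually use.

```shell
# Work in a throwaway repository so the example leaves your real settings untouched.
repo="$(mktemp -d)"
git init -q "$repo"

# Placeholder identity (hypothetical): replace with your own name and work email.
git -C "$repo" config user.name "Sam Example"
git -C "$repo" config user.email "sam.example@example.com"

# Read the values back to confirm what will be attached to commits in this repository.
git -C "$repo" config user.name
git -C "$repo" config user.email
```

Adding `--global` directly after `git config` applies the same values to every repository for your user account instead of a single repository.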

Commits

A commit records the changes made to the file(s) that you’re working on. The message attached to a commit should clearly communicate what changed and why. Commit messages should be short and informative, and written in the imperative: describe what the commit does to the codebase in the present tense, rather than what you did in the past tense. For example, ‘Add feature’, not ‘Added feature’ or ‘Adds feature’, and ‘Fix bug’, not ‘Fixed bug’ or ‘Fixes bug’:

Add table creation function

Fix bug in identified_patient flag

In some cases you may want to provide more detail. The first line of a commit message can then be treated like the subject of an email, with the rest of the text as the body.

You should commit frequently when you have completed a small, discrete amount of work. For example, you have developed a chart, created a table, or written an output.

Small, frequent commits make understanding the history of a file much easier and reduce the burden for a peer reviewer. Small commits also make it easier to roll back changes after the introduction of a bug or issue – large, monolithic commits make bug hunting and fixing harder.
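The commit guidance above can be sketched in a throwaway repository. The file name, its contents, and the identity are hypothetical and only there to make the example runnable.

```shell
repo="$(mktemp -d)"
git init -q "$repo"
cd "$repo" || exit 1
git config user.name "Sam Example"
git config user.email "sam.example@example.com"

# One small, discrete piece of work: a single new (hypothetical) script.
printf 'create_table <- function(df) df\n' > create_table.R
git add create_table.R

# The first -m is the subject line; the second -m becomes the body of the message.
git commit -q -m "Add table creation function" \
  -m "Wrap the table-building steps in one function so they can be reused."

# Show the subject lines in the history, one commit per line.
git log --format=%s
```

Running `git log` without `--format` shows the full message, including the body, along with the author and date.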

Pushing

Changes that you make to your local version of the files (repository) remain local until you explicitly “push” them to the remote. A remote is typically a repository hosted by one of the big Git providers – GitHub, GitLab, or Azure DevOps – that has been synced to your local copy. If you “clone” a repository from a service this syncing is done automatically. If you have created a local repository first you will have to set the remote manually.

Pushing your changes to the remote makes them available to everyone to “pull” into their local copies. Pushing should be done regularly, typically at least once a day, or more frequently if you are making rapid, large scale changes that others need to progress their work.

We recommend against pushing changes directly to the main branch. Instead, work in individual feature branches and merge them through a pull/merge request.
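As an illustrative sketch, the commands below link a locally created repository to a remote and push a feature branch. A bare repository on disk stands in for the hosted remote, and the branch name, file name, and identity are assumptions made for the example.

```shell
work="$(mktemp -d)"

# A bare repository on disk stands in for the hosted remote (GitHub, GitLab, or Azure DevOps).
git init -q --bare "$work/remote.git"
git init -q "$work/local"
cd "$work/local" || exit 1
git config user.name "Sam Example"
git config user.email "sam.example@example.com"

# Set the remote manually, as is needed when the repository was not cloned.
git remote add origin "$work/remote.git"

# Work on a feature branch rather than committing to the default branch.
git checkout -q -b feature/add-summary-table
printf 'summary table code\n' > summary_table.R
git add summary_table.R
git commit -q -m "Add summary table"

# -u records the remote branch as the upstream, so later pushes need only "git push".
git push -q -u origin feature/add-summary-table

# List the branches that now exist on the remote.
git ls-remote --heads origin
```

With a hosted provider, the path given to `git remote add` would be the repository URL shown by that provider.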

Merge conflicts

If you push changes to the remote and someone else also pushes changes to the same code or file in the same branch you risk overwriting each other’s work, creating complex merge conflicts, and removing auditability of changes.

Branching

Git allows you to “branch” out from an established version of a repository to continue making changes without making them directly in that version. When you create a Git repository by default it will have one branch – main (some historical repos might have the default branch of master). You can then create a new branch and make changes without jeopardising the contents of main before merging at a later point.

Branches can help us explore and experiment with analysis, techniques, and other features before deciding how we integrate them into the codebase in a safe manner, all while allowing others to work from a stable version.

You can create as many branches as you want in Git, including branches of branches of branches of branches…

If you use branches in this way it can quickly become unwieldy, difficult to navigate, and impossible to maintain. We recommend adopting a branching strategy and documenting this for your team or project so everyone knows how to work in the repo.
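A minimal sketch of branching, run in a throwaway repository, shows that work committed on a branch does not touch main until it is merged. The branch name, file names, and identity are hypothetical, and the `-b` option to `git init` assumes a reasonably recent Git version.

```shell
repo="$(mktemp -d)"
git init -q -b main "$repo"
cd "$repo" || exit 1
git config user.name "Sam Example"
git config user.email "sam.example@example.com"

# A first commit on main gives us a stable starting point.
printf 'stable analysis\n' > analysis.R
git add analysis.R
git commit -q -m "Add baseline analysis"

# Branch off main to experiment without changing main itself.
git switch -q -c experiment/new-model
printf 'experimental model\n' > model.R
git add model.R
git commit -q -m "Add experimental model"

# Back on main, the experimental file is absent until the branch is merged.
git switch -q main
ls
```

The experimental work is still safely stored on its own branch and can be picked up again with `git switch experiment/new-model`.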

Branching strategies

There are many branching strategies that can be employed, all very well documented and with their own strengths and weaknesses. The strategy you adopt should be based on the needs of you and your collaborators, and of the project you’re working on. Some popular strategies are:

Git-flow
release branching
trunk-based development

We recommend adopting Git-flow, or a variation thereof, for larger more complex projects or projects with larger teams (more than 3 collaborators).

We recommend adopting trunk-based development for smaller projects or projects with very small teams (1 to 3 collaborators).

Shared branches

We recommend using one branch per collaborator. Author-specific branches with peer-reviewed merges help to prevent overwrites and merge conflicts.

Pull requests and peer review

Git hosting services provide tools to conduct code reviews through pull requests (GitHub/Azure DevOps) or merge requests (GitLab). All changes should be peer reviewed before being merged into the main branch (or other protected branches depending on your branching strategy).

Pull requests allow a team to have a shared conversation during the review of code, including in-line comments, standards to meet before approval, and transparent discussion of any issues identified.

Pull requests are not limited to merges into main, but those are the only pull requests that we deem mandatory.

Code reviews should be conducted in line with our peer review guidance.

Managing sensitive data

During the course of a project it is likely that you will work with sensitive data such as:

credentials
secrets
API keys
personally identifiable data

If any sensitive data is checked into version control, deleting the file and committing the change is not enough to remove it: the data would still be available through the commit history of the repo.

To fully remove the data requires the commit history to be rewritten which is a difficult process – Git is designed to prevent this as much as possible to maintain an audit of all changes.
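The persistence of deleted data can be seen in a throwaway repository. This is a sketch only: the file name and the fake key are hypothetical, and no real secret should ever be used to try this out.

```shell
repo="$(mktemp -d)"
git init -q "$repo"
cd "$repo" || exit 1
git config user.name "Sam Example"
git config user.email "sam.example@example.com"

# A fake credential committed by mistake (hypothetical file name and value).
printf 'API_KEY=not-a-real-key\n' > credentials.txt
git add credentials.txt
git commit -q -m "Add credentials file"

# Deleting the file and committing removes it from the working tree only.
git rm -q credentials.txt
git commit -q -m "Remove credentials file"

# The earlier commit still holds the file's full contents.
git show HEAD~1:credentials.txt
```

Anyone with access to the repository can read the value in the same way, which is why a leaked secret should be treated as compromised even after the file has been deleted.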

As always, prevention is the best form of treatment.

Environment variables

Any secrets or credentials that you require should always be stored in an environment variable file which is not tracked by Git and never directly in code or a file that is checked into version control.

There are many ways to set, use, and reference environment variables depending on the tool, language, or type of project you are working on. However, any environment files should be added to a .gitignore file to prevent them being tracked.
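One possible approach in a shell script is sketched below, assuming a simple environment file of KEY=value lines. The file name `.env` matches the .gitignore example in this guidance, while the variable names and values are hypothetical.

```shell
work="$(mktemp -d)"
cd "$work" || exit 1

# Hypothetical environment file; in a real project it is listed in .gitignore and never committed.
printf 'DB_USER=analyst\nDB_PASSWORD=not-a-real-password\n' > .env

# Export every variable defined while sourcing the file, then turn auto-export off again.
set -a
. ./.env
set +a

# Fail with a clear message if a required variable is missing, rather than hard-coding a value.
: "${DB_PASSWORD:?DB_PASSWORD is not set}"
printf 'Connecting as %s\n' "$DB_USER"
```

Sourcing the file like this only suits plain KEY=value lines. Other languages and tools have their own ways of loading environment files, and the same principle applies to them: the code refers to the variable name, never the value.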

.gitignore

.gitignore files are used to prevent certain files or file types being checked into version control. For example, data should never be checked into version control. Your .gitignore might look like:

.env
*.csv
*.xlsx
*.rds
*.html
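To check that a .gitignore behaves as expected, `git check-ignore` reports which pattern, if any, matches a given path. The sketch below recreates the example file above in a throwaway repository; the data and script file names are hypothetical.

```shell
repo="$(mktemp -d)"
git init -q "$repo"
cd "$repo" || exit 1

# Recreate the example .gitignore and some hypothetical project files.
printf '.env\n*.csv\n*.xlsx\n*.rds\n*.html\n' > .gitignore
touch .env extract.csv analysis.R

# -v prints the matching .gitignore line for each ignored path; analysis.R matches nothing.
git check-ignore -v .env extract.csv analysis.R

# Ignored files are not listed as untracked, so they cannot be added by accident.
git status --short
```

A .gitignore only affects files that are not yet tracked. A file that has already been committed keeps being tracked until it is removed from the repository.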

Advanced practice

After you have been using Git for a while and have become more comfortable with the CLI, you might want to expand your practice and how you use it in your projects.

Semantic versioning

Semantic versioning is a standard that is used across software engineering to denote meaningful version numbers. You will likely have seen them already when downloading software such as R (R-4.5.3 is the latest version as of writing).

The format most commonly followed is:

MAJOR.MINOR.PATCH

This concept can be adapted and adopted for analytical code. When you start your project, start your version at 0.1.0, and increment each number when required. Once initial development of your project is complete, you should increment your major version to 1.0.0.

Major

Increment your major version when you have changes that affect interpretability or reproducibility, such as:

methodological changes – new statistical models or imputation approaches
changes in definitions, classifications, or analytical scope of measures
breaking changes to file formats or schemas
a new data set version that cannot be compared to previous outputs
decommissioning and removal of key indicators

Minor

Increment your minor version when making backwards compatible changes, such as:

adding new analyses, charts, or indicators
extending data sets with new variables without breaking old ones
adding configurable parameters that default to existing behaviour
improving documentation
adding optional features

These are safe changes that do not invalidate or break previous outputs.

Patch

Increment your patch version when there are no output or interpretation changes, such as:

fixing a typo in the code or other minor typographical changes
improving performance
refactoring code without altering logic
updating metadata
fixing a small bug that does not change analytical results
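The increments described above can be sketched as a small shell function. `bump_version` is a hypothetical helper, not part of Git or any standard tool. It follows the semantic versioning convention that lower parts reset to zero when a higher part is incremented, which this guidance does not state explicitly.

```shell
# bump_version LEVEL VERSION prints VERSION with the chosen part incremented.
bump_version() {
  level="$1"
  version="$2"

  # Split MAJOR.MINOR.PATCH using parameter expansion.
  major="${version%%.*}"
  rest="${version#*.}"
  minor="${rest%%.*}"
  patch="${rest#*.}"

  case "$level" in
    major) major=$((major + 1)); minor=0; patch=0 ;;
    minor) minor=$((minor + 1)); patch=0 ;;
    patch) patch=$((patch + 1)) ;;
    *) echo "unknown level: $level" >&2; return 1 ;;
  esac

  printf '%s.%s.%s\n' "$major" "$minor" "$patch"
}

bump_version patch 1.3.0   # 1.3.1
bump_version minor 1.3.1   # 1.4.0
bump_version major 1.4.0   # 2.0.0
```

The same function covers the end of initial development: `bump_version major 0.1.0` prints `1.0.0`.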

Versioning data

A change in the data does not in itself mean that the code version needs to be incremented. The code version should only be incremented if the new data requires new logic to be developed. For example, if variable definitions change within the data you would increment the MAJOR version.

Data should be versioned separately and is not covered in this guidance.

Versioning outputs

Semantic versioning can also be applied to analytical outputs or outputs of analytical code. A simple approach to this is to combine the code and data versions to make a traceable ID. For example:

code version: 1.3.0
data version: 2025-26
output version: 1.3.0-202526

This can be applied to reports, data tables, and other outputs.
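One way to build such an ID in a script is sketched below, using the values from the example above. The variable names are hypothetical, and removing the hyphen from the data version is simply how the example ID appears to be formed.

```shell
code_version="1.3.0"
data_version="2025-26"

# Remove the hyphen from the data version so the combined ID has a single separator.
data_compact="$(printf '%s' "$data_version" | tr -d '-')"

output_version="${code_version}-${data_compact}"
echo "$output_version"   # 1.3.0-202526
```

The resulting string could then be used in output file names or report footers so that each output can be traced back to the code and data that produced it.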

Tagging

Git allows the use of tags to reference specific points in a repo’s history. Tags are a more user-friendly way of identifying specific commits that we want to be able to find easily in the future – commit hashes have no inherent meaning. For example, we might want to tag the point in a repo’s history that was used to produce a particular release, or the point at which the version was incremented because of a new model or a change in methodology.

git tag -a v1.3.0 -m "New patient age analyses and charts"
git push origin v1.3.0

Once a tag has been created, these locations in a project’s history can be easily found and recovered.
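As a sketch of finding and recovering a tagged point, the commands below create the tag in a throwaway repository, make a later change, and then read a file exactly as it was at the tag. The file name, its contents, and the identity are hypothetical.

```shell
repo="$(mktemp -d)"
git init -q "$repo"
cd "$repo" || exit 1
git config user.name "Sam Example"
git config user.email "sam.example@example.com"

# Commit some work and tag it as a release.
printf 'age analysis\n' > age_analysis.R
git add age_analysis.R
git commit -q -m "Add patient age analyses"
git tag -a v1.3.0 -m "New patient age analyses and charts"

# Later work moves the branch on past the tagged release.
printf 'later change\n' >> age_analysis.R
git commit -q -am "Extend age analysis"

# List matching tags, then view the file exactly as it was at the tagged release.
git tag --list 'v1.*'
git show v1.3.0:age_analysis.R
```

To restore the whole working tree to the tagged state rather than a single file, `git switch --detach v1.3.0` checks out the tagged commit without moving any branch.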
