Analytical Code Assurance

Pre-requisite reading

AF Duck book: Modular code

NHSBSA DDaT playbook: Development - Coding

What is modular code?

Complex analysis projects and workflows can often lead to a lot of code being written with some or a lot of that code performing repetitive tasks.
Breaking down your code and workflow into smaller, independent, reusable chunks (modules) can help you maintain your project and share components across a series of analyses.
Each module ideally handles a specific task (aka separation of concerns) and interacts with others through well-defined interfaces (you should be able to use a module from the outside without needing to understand the internal complexity on the inside)
Languages such as R, Python and SQL, and workflow tools like Alteryx provide ways to break up code into smaller logical units, including modules, classes, functions and macros.

Why should I do this?

Writing your code in a modular way provides a range of benefits:

promotes readability by preventing long, complex walls of code that are hard to understand
multiple team members working on same codebase will have less code conflicts
simplifies testing of the code (and its outputs)
smaller modules are easier to maintain or improve in future
speeds up the peer review process
makes sharing and reusing code easier

How do we do it then?

Writing modular code or developing modular workflows can differ depending on the your tool of choice, the needs of the project or piece of work, the intended customer, or the expected users of the code in the future. However, a simple hierarchy of actions to take are:

split complex code and workflows in multiple (ordered) scripts or smaller workflows
use a code notebook to organise your analysis
write re-usable code as functions
organise related functions into modules
organise and document modules as packages
starting from ‘monolithic’ code and refactoring over time is a valid strategy!

Let’s dive into how we can do some of these.

Split your code into multiple scripts

Monolithic code should be avoided as much as possible. A large and complex piece of code that runs into hundreds, if not thousands, of lines is unwieldy, difficult to read, debug, and maintain, and your peer reviewer won’t thank you either! Splitting your monolithic code into ordered scripts can help yourself and others understand the steps you’ve taken and the reasons why.

For example, in R a large project could be split into:

project
-- 01_data_build.R
-- 02_data_clean.R
-- 03_analysis.R
-- 04_visualisation.R

these scripts can now be run in their intended order and splits out the parts of your code into related areas, making peer review easier.

Using scripts doesn’t inherently make your code reproducible or reusable, but are a good first step towards better and higher quality code.

split your code up
group related code together
organise your scripts so you and others know how to run it
make your scripts self contained and runnable from the command line

Don’t

write monolithic code
split up your code too much - 100 scripts with 10 lines of code in are just as hard to digest
write overly complex code where simpler solutions might exist

Using notebooks

Tools such as Jupyter notebooks or R Markdown allow you write code, commentary and visualisations alongside each other. They can be incredibly useful starting points for exploring data, developing an analysis and are great for communicating results.

However, as your analysis matures it can be difficult to align notebooks with other principles for producing high quality code:

notebooks are inherently difficult to review and audit using version control
code cells can be ran out of order, reducing reproducibility
it’s hard to test functions written in the notebook

How do we define success?

Modular code should feel well-organised and easy to navigate.
Repetition will be minimised through the use of small, reusable functions.
The purpose of each part of the code will be clear to the reader, with similar pieces of logic grouped together.
A reviewer should help spot any potentially issues and suggest improvements.

Lookout for: * Finding yourself scrolling-and-scrolling to look for a particular part of the code could indicate you should break it up into smaller modules! * Alternatively, needing to look through many, many files and maintain multiple tabs to understand and make changes might suggest reducing the modularity!