In this step, we turn our data into insights. The work done here can vary: it can include statistical analysis, both descriptive and model-based, as well as more sophisticated machine learning (AI), with the choice depending on the output requirements of the customer. The output can be presented as a report, a dashboard, or a slide deck.
Selection of methods
The analysis first requires an appropriate method that can provide the desired outputs.
- Create a short-list of methods that could be applicable to the problem.
- Consider the pros and cons of each candidate method:
  - are there examples of this method being applied to the problem archetype?
  - will the available data support the method? If not, is it feasible to transform the data into a more suitable form?
  - will the tools available support the method?
- Decide which methods are best suited to the problem:
  - often the best method will be clear
  - sometimes you may need to try a small number of methods before deciding on the overall best one
  - get the opinion of your critical friend, or other team members
Keep notes of your decision-making and method trials for the final write-up of the initiative.
Application of methods
Apply the selected model(s) with RIGOUR; it should be Repeatable, Independent, Grounded in reality, Objective, Uncertainty-managed, and Robust.
- Repeatable: for an analytical process to be considered valid, the same inputs and constraints should reasonably be expected to produce the same outputs. Different analysts may approach the problem differently, potentially producing differing results, but if any one approach is repeated the results should be as expected.
- Independent: produce analysis that is free of prejudice or bias. In doing so, take care to balance the views of all stakeholders and experts appropriately.
- Grounded in reality: Quality analysis takes the commissioner and analyst on a journey as views and perceptions are challenged and connections are made between the analysis and its real consequences. Connecting with reality in this way guards against failing to properly grasp the context of the problem which is being analysed.
- Objective: Effective engagement and suitable challenge reduces potential bias and enables the commissioner and the analyst to be clear about the interpretation of the analytical results.
- Uncertainty-managed: Uncertainties have been identified, managed, and communicated throughout the analytical process.
- Robust: Provide the analytical result in the context of residual uncertainty and limitations to ensure it is used appropriately.
See the internal Data Science Wiki for guides and links to resources on specific methods.
Quality assurance (QA)
Check your analysis thoroughly and often, not as a last step before delivery.
- Documentation is important because:
  - it allows us to transfer knowledge about the analysis
  - it keeps the analysis transparent
  - it provides instructions on how to use any interactive output
- Good analysis structure helps to:
  - avoid errors
  - make it easier to check
  - maintain ongoing workflows
  - re-use or re-purpose it later
- Verify the outputs by:
  - testing your analysis, preferably in a repeatable and automated way (see the sketch after this list)
  - asking for peer review of all aspects, including code, text and decisions made
- Validate your analysis by:
  - checking outputs are reasonable for a wide range of inputs
  - comparing to existing data
  - asking a subject-matter expert to review (the earlier the better)
- Be open about uncertainty caused by data and assumptions and communicate this clearly to stakeholders.
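As an illustration of repeatable, automated checking, the sketch below uses pytest and pandas; the `analysis` module and its `summarise_prescriptions` function are hypothetical stand-ins for whatever produces your output table.

```python
# Minimal sketch of repeatable, automated checks with pytest.
# "analysis" and summarise_prescriptions() are hypothetical placeholders.
import pandas as pd

from analysis import summarise_prescriptions  # hypothetical module


def test_output_has_expected_columns():
    input_df = pd.DataFrame({
        "month": ["2024-01", "2024-01", "2024-02"],
        "items": [10, 5, 7],
    })
    result = summarise_prescriptions(input_df)
    assert list(result.columns) == ["month", "total_items"]


def test_outputs_are_reasonable():
    # Validation-style check: totals should be non-negative and should
    # equal the sum of the input items.
    input_df = pd.DataFrame({
        "month": ["2024-01", "2024-01", "2024-02"],
        "items": [10, 5, 7],
    })
    result = summarise_prescriptions(input_df)
    assert (result["total_items"] >= 0).all()
    assert result["total_items"].sum() == input_df["items"].sum()
```

Running such tests with pytest on every change (or in CI) makes verification a repeatable step rather than a one-off manual check.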
Useful QA resources
Tools used in analysis
Starting from the source tables, we want to create the agreed outputs. We have a range of tools at our disposal; which are best will depend on the problem being tackled.
- R is especially suited to statistical analysis and visualisation, and supports reproducible (Rmarkdown) and interactive (R Shiny) reports.
- Python has the most widespread support for training artificial intelligence models and is more widely integrated with cloud-based platforms.
- Power BI is a possible alternative to R Shiny for creating dashboards, but due to licensing may not be suitable for certain end users.
- Where appropriate, Excel and/or SQL can be used for generating simple summary statistics (see the short example below).
Please see the internal Data Science Wiki for tips on how to use all of these tools.
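As a small illustration of that last point, the sketch below produces simple summary statistics with pandas; the file name and column names are hypothetical placeholders.

```python
# Minimal sketch: simple summary statistics with pandas.
# "source_table.csv" and its column names are hypothetical placeholders.
import pandas as pd

df = pd.read_csv("source_table.csv")

# Descriptive statistics for the numeric columns.
print(df.describe())

# A grouped summary, e.g. total, mean and count of items per region.
print(df.groupby("region")["items"].agg(["sum", "mean", "count"]))
```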
Tips for effective programming
Data will be a factor in all initiatives. Most initiatives will also have some element of coding involved, even when using low-or-no-code tools like Power BI.
- Use GitHub to handle version control, issues and code reviews.
- In Python, use object-oriented programming (OOP) if you can; a minimal sketch appears after this list. Please see the internal Data Science Wiki for information.
- Although R does have support for OOP, it is not as central a paradigm there as it is in Python.
- Be careful of premature optimisation (called out by computer scientist Donald Knuth as ‘the root of all evil’ in software development).
- Start small, such as limiting data to a manageable subset when developing the analysis; this makes it easier to spot issues and allows for faster iteration of code.
- Check the data looks correct at each step of reshaping and transforming it; better still, know what it should be before writing and applying the step (the principle behind test-driven development, TDD); see the sketch after this list.
- Work on one component of output at a time, for example using `shiny` with the modular `golem` framework, as in our shiny app template.
- Frequently run code as you write it, or check the output for low-or-no-code tools, so any issues are encountered early and their source is clearer.
- You may produce multiple smaller data sets to be used in your outputs; the considerations applying to base tables apply equally to these.
- Where appropriate, the final code should follow RAP (Reproducible Analytical Pipeline) principles.
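To make the OOP suggestion above concrete, here is a minimal sketch of wrapping a cleaning step in a small class; the class and column names are illustrative rather than a prescribed design.

```python
# Minimal OOP sketch: encapsulating a cleaning step in a small class.
# The class and column names are illustrative, not a prescribed design.
import pandas as pd


class PrescriptionCleaner:
    """Cleans a raw prescriptions table into an analysis-ready form."""

    def __init__(self, required_columns: list[str]):
        self.required_columns = required_columns

    def clean(self, df: pd.DataFrame) -> pd.DataFrame:
        """Drop rows missing required fields and standardise column names."""
        cleaned = df.dropna(subset=self.required_columns)
        cleaned.columns = [
            c.strip().lower().replace(" ", "_") for c in cleaned.columns
        ]
        return cleaned


cleaner = PrescriptionCleaner(required_columns=["BNF Code", "Items"])
# cleaned = cleaner.clean(raw_df)  # raw_df would be your loaded source table
```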
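Similarly, the "start small" and TDD-style points above might look like the following in practice; the file and column names are hypothetical.

```python
# Minimal sketch: develop on a small subset and state what each step should
# produce before applying it. File and column names are hypothetical.
import pandas as pd

df = pd.read_csv("source_table.csv")

# Start small: limit to a single month while developing the analysis.
dev_df = df[df["month"] == "2024-01"]

# State the expectation first (TDD-style): pivoting should give exactly one
# row per region present in the subset, with non-negative item totals.
wide = dev_df.pivot_table(index="region", values="items", aggfunc="sum")
assert len(wide) == dev_df["region"].nunique(), "expected one row per region"
assert (wide["items"] >= 0).all(), "item totals should not be negative"
```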
Check the quality
As with data preparation, it is important to test and validate your results.
- Where possible, check results against other sources.
  - Other sources could be data from previous initiatives, Management Information dashboards or ePACT2.
Additionally, you must be careful about visualising sensitive results; please contact your critical friend or the Data Science Lead if you are unsure. Results for external audiences or the general public must be subject to Statistical Disclosure Control before release.
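As one common example of a disclosure-control step, small counts can be suppressed before a table is released; the table, column names and threshold below are hypothetical, and any real release should follow the agreed Statistical Disclosure Control guidance.

```python
# Minimal sketch: suppress small counts before release. The table and the
# threshold of 5 are hypothetical; follow the agreed SDC guidance in practice.
import pandas as pd

summary = pd.DataFrame({"region": ["A", "B", "C"], "count": [120, 3, 47]})

THRESHOLD = 5  # hypothetical suppression threshold
summary["count"] = summary["count"].mask(summary["count"] < THRESHOLD)
print(summary)  # the small count for region "B" is now suppressed (NaN)
```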
Before pushing any code to GitHub, it is essential to ensure that the output is cleared and that no sensitive information remains within it.
RAP considerations
- It is worth spending time on making the data preparation easy to repeat.
- Make sure the code is easy to maintain, read and re-run when required.
- Rather than just doing ad hoc testing, write formal tests that can be repeated (see the sketch at the end of this section).
- Working in a RAP-ful way is especially important when the output of an initiative is to be refreshed periodically, such as monthly or annually.
- Comment your code where its purpose is not obvious.
- Maintain a clear project structure with clean scripts.
- Separate data, scripts, and output into their own sub-directories.
- Move custom functions into their own script.
- Document your custom functions.
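To illustrate the last few points, a custom function might sit in its own script with a docstring and a matching formal test; the file layout and names below are hypothetical.

```python
# functions/cleaning.py (hypothetical script holding custom functions)
import pandas as pd


def standardise_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with lower-case, underscore-separated column names.

    Keeping this in its own, documented script makes it easier to read,
    re-use and test.
    """
    out = df.copy()
    out.columns = [c.strip().lower().replace(" ", "_") for c in out.columns]
    return out


# tests/test_cleaning.py (hypothetical formal test, run with pytest)
def test_standardise_columns():
    df = pd.DataFrame({"BNF Code": ["0101"], " Items ": [3]})
    assert list(standardise_columns(df).columns) == ["bnf_code", "items"]
```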