Data preparation
Data preparation can be done with any data manipulation tool such as SQL, R or Python.
- Large datasets with millions of rows may require use of SQL initially.
- SQL can be used via libraries, such as
dplyr
and the database back-enddbplyr
in R. - Once data has been aggregated, switch to your tool of choice for analysis and modelling.
When preparing data there are some considerations that always apply.
Data quality
- Check the Data Quality profiles, if they exist for the data.
- How are values distributed? In particular are there any unbalanced data fields?
- Check summary statistics such as mean, median, mode, minima and maxima.
- How complete is the data? Any null or otherwise unknown values?
- Are any fields strongly correlated?
- Discuss any data quality issues with the critical friend and/or customer.
There is a page on data quality in the internal Data Science Wiki.
Data cleansing
- Handling missing values.
- Using appropriate data types.
- Following the data minimisation principle.
There is a page on data cleansing in the internal Data Science Wiki.
Note keeping
- Save the results of data quality checking.
- Save examples of queries that highlight any issues.
data_summary.Rmd
andEDA figures.xlsx
can be used, found in template initiative Data folder.
Validation
- Where possible, check results using your base tables and other sources match.
- Internal sources include data from previous initiatives, Management Information dashboards or ePACT2.
- External sources include Official Statistics and third-party datasets.
Creating base tables
After checking and preparing the data, you should have enough knowledge to create one or more base tables.
- Decide to keep or discard fields based on what is relevant to the initiative.
- Fix or remove columns that contain data errors or missing entries.
- Create new columns based on the existing ones.
- Change the data types of the columns to a more appropriate format.
- Save the output of your transformations.
- Use any format that makes sense considering the size and how it is used; formats include CSV, SQL, and Rda.
Once the tables have been created they should be sense checked and code reviewed as outlined in the Review section of the playbook.
Sensitive data
- Wherever base tables are saved, it must be in a secure location as per data security policy. This will usually mean in the Data Warehouse, and SharePoint otherwise.
- Sensitive data should be removed or obscured unless it is absolutely necessary to retain it; bear in mind the data governance policies.
- For certain usage of sensitive data, the Caldicott Guardians should be consulted.