How to choose data tools and infrastructure that are flexible, scalable, sustainable and secure.
There are 4 main areas to consider:
- Choose analytical tools that keep pace with user needs.
- Use the cloud for the whole development cycle.
- Use appropriate security when using data in the cloud.
- Choose open data standards for better interoperability.
1. Choose analytical tools that keep pace with user needs
As data analysis and data science evolve, you should choose tools and techniques that can adapt and support best practices. The UK Statistics Authority’s Code of Practice for Statistics provides information on best practices including:
- keeping up to date with innovations that can improve statistics and data
- improving data presentation for users
- testing and releasing new official statistics
Government workers responsible for providing or procuring software for data analysis should choose a loosely coupled modular system. These systems are flexible enough to use with a variety of tools and connect to a range of data sources and architecture.
You should build data architecture in an agile way, to iterate and add value with each change. If you make up-front long-term decisions on one type of software you risk being unable to meet evolving user needs.
Choosing open source languages
Data scientists and analysts often use common open source languages such as Python and R. The benefits of using these languages include:
- good support and open training - which means reduced training costs
- new data analytics methods
- the ability to create your own methods using open source languages
The R and Python communities develop large collections of packages for data analysis. These packages, many of them built specifically for data science, provide extensive analytical functionality.
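As a minimal illustration of the kind of scripted analysis these languages support, the sketch below uses only Python’s standard library; the dataset and column names are invented for the example.

```python
import csv
import io
from statistics import mean, median

# A small invented dataset, standing in for a CSV file of case durations.
RAW = """region,duration_days
North,12
North,18
South,7
South,11
"""

def summarise(csv_text):
    """Return the mean and median duration for each region."""
    rows = csv.DictReader(io.StringIO(csv_text))
    by_region = {}
    for row in rows:
        by_region.setdefault(row["region"], []).append(float(row["duration_days"]))
    return {region: {"mean": mean(vals), "median": median(vals)}
            for region, vals in by_region.items()}

print(summarise(RAW))
```

Because the analysis is a script rather than a series of manual steps, it can be version controlled, reviewed and re-run on new data.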
Choosing tools that work with open technology
Choosing tools which work with open technology supports robust and appropriate data analysis, as set out in Principle 5 of the Data Ethics Framework.
Tools which work with open technology, such as Docker or Apache Spark, give your team the flexibility to meet your users’ needs. Open tools are usually designed to work together and across vendors. Benefits include the ability to:
- script a data pipeline using the best software for each task
- run your code anywhere - using commodity container platforms, platform as a service or a Hadoop cluster
Other benefits include better:
- support for software engineering practices
- capabilities in big data and machine learning
You can achieve better quality assurance in your software development with continuous integration and unit tests. The Reproducible Analytical Pipeline community has guidance about doing quality assurance in analytical pipelines.
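For example, a unit test for a single pipeline step might look like the following in Python; the `clean_ages` function is a hypothetical cleaning step, not taken from the Reproducible Analytical Pipeline guidance.

```python
import unittest

def clean_ages(ages):
    """Hypothetical pipeline step: drop missing or impossible age values."""
    return [a for a in ages if a is not None and 0 <= a <= 120]

class TestCleanAges(unittest.TestCase):
    def test_removes_missing_and_out_of_range(self):
        self.assertEqual(clean_ages([34, None, -1, 250, 67]), [34, 67])

    def test_empty_input(self):
        self.assertEqual(clean_ages([]), [])
```

Tests like these can run locally with `python -m unittest` and again automatically in a continuous integration step on every change.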
If you use spreadsheets and business intelligence software you should be aware they:
- do not often scale well to large datasets or intensive computation
- do not often integrate well into automated pipelines, or support best practices in quality assurance
- often need paid-for licences, making them expensive to deploy in the cloud
Case study - using data science with the Ministry of Justice analytical platform
The Ministry of Justice Analytical Platform supports the latest data analysis tools. This platform allows easy integration of new open source software and leading cloud services into a platform for 300 staff in the data analysis professions.
The platform is a flexible and secure environment, where:
- analysts use a web browser to sign in once and then develop code in tools such as R and Python
- you can access data and create live charts and dashboards that are accessible to end users in a web browser with no special software or licences
- it runs software using standardised containers on an auto-scaled Kubernetes cluster which allows the platform to run any of the latest open source tools, data stores and custom-built data analysis components
- you can add innovative services such as a new graph database or machine learning framework
- you can process datasets of almost unlimited size at low cost
The platform has helped the Ministry of Justice produce several national statistics more reliably and efficiently by using reproducible automated pipelines.
2. Use the cloud for the whole development cycle
In most circumstances, you should store data in the cloud and code in cloud-based version control repositories. You should run live data products in the cloud, as set out in the government’s Cloud First policy, and you should use the cloud throughout the whole development cycle.
Keeping your data in the cloud
You can use cloud services for data analysis work. It’s usually more efficient to use software-as-a-service and only pay for what you use, rather than setting up and running your own cluster for data. With cloud services it’s important to be alert to supplier ‘lock-in’ and always consider the cost of switching to another supplier.
The benefits of storing your data in the cloud are that:
- it scales well to large quantities of data that would not comfortably fit on a user’s machine
- you can take advantage of cloud-scale databases to process complicated queries in a reasonable time-frame
- you can use it for all stages from exploration through to production systems
- it’s simpler to combine different datasets from your organisation
- it’s usually the cheapest option, due to commoditisation and pay-as-you-go pricing, but evaluate this against your own needs
Rather than sharing data files by email, you can use the cloud to share data by sending a link. This is a better practice because it helps you:
- control and monitor access to the data
- maintain connection to the original source data so you can avoid duplication and poor version control
- get reports with live updates
When using data in the cloud make sure that your data is accessible through a stable URL or other endpoint, as this will help you to make reproducible analysis code.
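A sketch of that pattern in Python, assuming a hypothetical stable CSV endpoint; keeping the parsing separate from the download means the parsing logic can be tested offline.

```python
import csv
import io
from urllib.request import urlopen

# Hypothetical stable endpoint -- substitute your organisation's real URL.
DATA_URL = "https://data.example.gov.uk/datasets/cases/latest.csv"

def parse_cases(csv_text):
    """Parse CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def load_cases(url=DATA_URL):
    """Fetch the dataset from its stable URL, so analysis scripts always
    read from the same authoritative source rather than local copies."""
    with urlopen(url) as response:
        return parse_cases(response.read().decode("utf-8"))
```

Anyone re-running the analysis then pulls the same source data, instead of relying on a file that was emailed around.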
Maintaining cloud-based version control
To maintain cloud-based version control and support collaboration, you should:
- use a cloud-hosted repository, such as GitHub, to create pull requests
- peer review code on a regular basis, to make sure you maintain the appropriate quality and keep all stakeholders up to date with any changes
- share code outside your team and organisation
- manage a list of issues
- encourage reviews and invite comment
Reproducibility with the cloud
Cloud-based version control allows you to run automatic tests, which help you to make data analysis ‘reproducible’.
You should aim to make your data analysis reproducible so it’s easy for someone else to work with it. For example, share your code and data so that someone else can run your data model on another computer at a different time and get the same results. This is important because someone can:
- check how your data analysis works
- test your data analysis with different queries
- run the analysis on a different dataset, or build on the analysis
Data analysts can make their data analysis reproducible by:
- writing code that runs their analysis, rather than doing analysis through a series of manual steps, such as manual clicks in a graphical user interface
- using the cloud for storing their data
- setting up continuous integration and automated testing on all users’ platforms
- specifying the library dependencies as well as their version numbers
It’s standard practice to specify dependencies and automate testing throughout software development. Unless you’re doing quick, throw-away experiments you should aim to make all your code reproducible.
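In Python, dependencies and their versions are commonly pinned in a `requirements.txt` file; the packages and version numbers below are illustrative only, not recommendations.

```
# requirements.txt -- illustrative pinned versions, not recommendations
pandas==2.1.4
numpy==1.26.3
matplotlib==3.8.2
```

Running `pip freeze > requirements.txt` records the versions in use, and `pip install -r requirements.txt` recreates the same environment elsewhere. R users can achieve the same with a lockfile tool such as renv.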
Using a cloud development environment
A team using a cloud development environment:
- does not need to install software on each user’s machine that would demand maintenance and updates (a benefit of software as a service)
- can install code libraries across all users’ environments, which makes the code easier to share and reproduce
- is not tied to a particular corporate network, enabling access and collaboration from outside the organisation
Using a cloud environment for data development also means that you:
- often have easy access to other cloud data services and cloud-hosted data due to the platform’s built-in credentials
- can decide which software to install
- can decide who is best placed to install software, such as using a platform team who understand analysts’ needs
- are less likely to see users keeping local copies of the data on their laptop for development, especially if the data is also in the cloud
Teams using a development environment on local machines might risk:
- not having administrator access for security reasons, which will prevent installation of development software
- spending longer installing Python or R and their libraries
- having less access to cloud-hosted data which might cause users to create workarounds, such as using email to circulate data
Sometimes, your data analyst may prefer to use their own custom environment. Where practical, you should aim to be flexible and try to replicate the essential elements of the cloud environment on their local machine.
The cloud environment offers a baseline of libraries, but as soon as you need more libraries you should specify all the dependencies and their version numbers, to make sure work is still reproducible.
3. Use appropriate security when using data in the cloud
The government’s approach to security in the cloud is set out in the Cloud Security Principles from the National Cyber Security Centre (NCSC). Also, in the Risk Management Principles, NCSC states that the commercial public cloud is an acceptable place to put OFFICIAL data.
NCSC considers the cloud to have acceptable security because:
- there is less information on end user devices
- the supplier applies regular upgrades and security patches
- the supplier often has rigorous methods to audit data, and control access and monitoring
Whether you’re procuring SaaS or developing your own solution for a platform of tools and services, you should put in place mitigations such as:
- data encryption
- single sign-on
- two-factor authentication (2FA)
- fine-grained access control
- usage monitoring and alerts
- timely patching
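Fine-grained access control, for instance, can be expressed as explicit role-based checks; the roles and permissions below are invented for illustration, and a real platform would typically delegate this to its identity provider.

```python
# Invented role-to-permission mapping for illustration only.
PERMISSIONS = {
    "analyst": {"read"},
    "data_engineer": {"read", "write"},
    "admin": {"read", "write", "grant"},
}

def can(role, action):
    """Return True if the role is allowed to perform the action."""
    return action in PERMISSIONS.get(role, set())

def require(role, action):
    """Raise if the role lacks the permission -- unknown roles fail closed."""
    if not can(role, action):
        raise PermissionError(f"role {role!r} may not {action!r}")
```

Failing closed by default means a misconfigured or unknown role gets no access, rather than accidental access to real data.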
Other security challenges for data analysts include developing code on a platform with:
- real data
- internet access
When platforms have internet access and hold real data, threat actors or attackers may try to steal or alter the data. Also, there is a greater risk of an accidental real data leak.
You should integrate security controls and monitoring with the data and network flows. This should be proportionate to the risks faced in experimental, collaborative and production environments.
Balance security choices with user needs
Security should protect data, but not stop users from accessing the data they need for their work. The Service Manual has guidance on securing information for government services.
You should build security into a system so it’s as invisible to the user as possible. Adding complicated login procedures, and restricting access to the tools users need, does not make your security better. Restrictive security makes shadow IT more likely, with users avoiding security measures and finding workarounds.
Case study - using Ministry of Justice data in the public cloud
There is a government policy supporting the use of the cloud for personal and sensitive data. Most UK departments have assessed the risks, put in appropriate safeguards and moved sensitive data into the public cloud.
An example of this is from the Ministry of Justice who moved their prisoner data into the public cloud. This data has an OFFICIAL classification and often the ‘SENSITIVE’ handling caveat. It includes information such as health records and the security arrangements for prisoners.
The project team makes sure the appropriate security is used, such as:
- careful isolation between elements using cloud sub-accounts, Virtual Private Clouds (VPCs) and firewall rules
- finely grained user and role permissions
- users logging in with two-factor authentication (2FA)
- being able to quickly revoke or rotate secrets, encryption keys and certificates
- frequent and reliable updates using peer-review and continuous deployment
- extensive audit trails
Hosting the data in the cloud has enabled the Ministry of Justice to perform additional analysis using modern open source tools and scalable computing resources through its Analytical Platform.
It’s possible to achieve this level of security and functionality with a private data centre, but it would be a huge investment in hardware, software and expert staff to design and maintain it. You can reduce these issues by using the public cloud and taking advantage of the continuous investment and developments made by the suppliers.
4. Choose open data standards for better interoperability
An open data standard specifies a way of formatting and storing data. This can make data compatible with a wide range of tools in a predictable fashion, and prevents lock-in to proprietary tools. Open standards allow organisations to:
- share information even when they do not have access to the same tools
- replace their tools and still have access to their data
- make a strategic decision to provide an agile environment that changes with the needs and capabilities of the users
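For example, exporting analysis results as JSON, an open standard, means any JSON-aware tool or language can read them; in Python this needs only the standard library. The result values here are invented for the example.

```python
import json

# Illustrative results -- any JSON-aware tool or language can read this.
results = {"dataset": "case-durations", "rows": 4, "mean_duration_days": 12.0}

encoded = json.dumps(results, indent=2, sort_keys=True)
decoded = json.loads(encoded)  # round-trips without loss for JSON-native types
```

Because the format is an open standard rather than a proprietary file type, replacing the tool that produced the file does not cut off access to the data.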
The Open Standards Board selects open standards for use by government.
Examples of open standards include the:
- standard for publishing job vacancies, which helps job seekers by allowing third-party job search sites to aggregate vacancies
- Open Contracting Data Standard (OCDS), which allows you to compare the procurement practices of different organisations