Robustness, Security and Privacy

COMP63301 Data Engineering Concepts

 

Stian Soiland-Reyes

Robustness

Software Quality in Data Engineering

  • Data-driven software (e.g. an ETL pipeline with data cleaning) can be considered “research software”, and so should follow recommendations for Research Software Quality and consider: Compatibility, FAIRness, Flexibility, Functional suitability, Interaction Capability, Maintainability, Performance efficiency, Reliability, Safety, Security and Sustainability.

The following slides are adapted from RSQKit, licensed under CC-BY 4.0

everse.software/RSQKit/rs_quality

Compatibility

Co-existence

Degree to which a product can perform its required functions efficiently while sharing a common environment and resources with other products, without detrimental impact on any other product.

Interoperability

Degree to which a system, product or component can exchange information with other products and mutually use the information that has been exchanged.

FAIRness

The FAIR principles, adapted for research software, aim to enhance the discoverability, accessibility, interoperability, and reusability of software, thereby maximizing its value and impact in scientific research.

Barker et al. (2022): Introducing the FAIR Principles for research software https://doi.org/10.1038/s41597-022-01710-x

Illustrations from The Turing Way https://doi.org/10.5281/zenodo.13882307

Adaptability: Degree to which a product or system can effectively and efficiently be adapted for or transferred to different hardware, software or other operational or usage environments.

Installability: Degree of effectiveness and efficiency with which a product or system can be successfully installed and/or uninstalled in a specified environment.

Flexibility

Degree to which a product can be adapted to changes in its requirements, contexts of use or system environment

Scalability: Degree to which a product can handle growing or shrinking workloads, or can adapt its capacity to handle variability.

Replaceability: Degree to which a product can replace another specified software product for the same purpose in the same environment.

Functional completeness: Degree to which the set of functions covers all the specified tasks and intended users' objectives.


Functional correctness: Degree to which a product or system provides accurate results when used by intended users.


Functional appropriateness: Degree to which the functions facilitate the accomplishment of specified tasks and objectives.

Functional suitability

Degree to which a product or system provides functions that meet stated and implied needs when used under specified conditions

  1. Recognizable as appropriate
  2. Learnability
  3. Operability
  4. User error protection
  5. User engagement
  6. Inclusivity
  7. User assistance
  8. Self-descriptiveness

Interaction Capability

Degree to which a product or system can be interacted with by specified users to exchange information via the user interface to complete specific tasks in a variety of contexts of use

LCARS by caseorganic, Flickr

Maintainability

Degree of effectiveness and efficiency with which a product or system can be modified to improve it, correct it or adapt it to changes in environment and in requirements

  1. Modularity: split into components, minimizing impact of change
  2. Reusability: can be used as an asset in more than one system
  3. Analysability: impact of change can be effectively assessed, diagnosed, understood
  4. Modifiability: can be modified without degrading existing product quality
  5. Testability: test criteria can be established, tests performed to determine whether those criteria are (still) met.
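Testability can be illustrated with a small data cleaning step and its test criteria. This is a minimal sketch: the function `clean_temperature` and the plausibility range of -90 to 60 degrees Celsius are assumptions made for the example, not part of the lecture material.

```python
def clean_temperature(raw):
    """Parse a raw temperature reading.

    Returns a float, or None for missing, unparsable or implausible values.
    """
    if raw is None:
        return None
    text = str(raw).strip()
    if not text:
        return None
    try:
        value = float(text)
    except ValueError:
        return None
    # Assumed plausibility range for this sketch (degrees Celsius).
    if not -90.0 <= value <= 60.0:
        return None
    return value


def test_clean_temperature():
    # Each assertion is one explicit test criterion.
    assert clean_temperature(" 21.5 ") == 21.5
    assert clean_temperature("") is None
    assert clean_temperature("n/a") is None
    assert clean_temperature("999") is None


test_clean_temperature()
```

Because the criteria are written down as assertions, they can be re-run after every modification (for instance in continuous integration) to check that they are still met.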

Performance Efficiency

Degree to which a product performs its functions within specified time and throughput parameters and is efficient in the use of resources

Resource utilization: amounts and types of resources used by a product or system

 

Resources: CPU, memory, storage, network devices, energy, materials, ...

Time behaviour: Considering response time and throughput rates

Capacity: Consider the maximum limit of usage
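Time behaviour can be measured directly. The sketch below times a processing step and reports throughput in items per second; `measure_throughput` and the example `normalise` step are hypothetical names introduced for illustration only.

```python
import time


def measure_throughput(func, items):
    """Apply func to every item and return (elapsed_seconds, items_per_second)."""
    start = time.perf_counter()
    for item in items:
        func(item)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very coarse clocks.
    rate = len(items) / elapsed if elapsed > 0 else float("inf")
    return elapsed, rate


def normalise(text):
    # Hypothetical cleaning step used as the workload.
    return text.strip().lower()


elapsed, rate = measure_throughput(normalise, ["  Example  "] * 10_000)
print(f"{elapsed:.4f} s, {rate:.0f} items/s")
```

Repeating the measurement with growing input sizes gives a rough indication of capacity, although results depend on the machine and its current load.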

Reliability

Degree to which a system, product or component performs specified functions under specified conditions for a specified period of time

 

Faultlessness: Degree to which a system, product or component performs specified functions without fault under normal operation.

Availability: Degree to which a system, product or component is operational and accessible when required for use.

Fault tolerance: Degree to which a system, product or component operates as intended despite the presence of hardware or software faults.

Recoverability: Degree to which, in the event of an interruption or a failure, a product or system can recover the data directly affected and re-establish the desired state of the system.
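One common way to improve fault tolerance in a pipeline is to retry a step that fails for transient reasons. The following is a minimal sketch assuming exponential backoff; `retry` and `flaky_extract` are hypothetical names, and a real pipeline would usually catch narrower exception types.

```python
import time


def retry(operation, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call operation(); on failure wait and retry, doubling the delay each time."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the fault to the caller
            sleep(base_delay * 2 ** attempt)


# Hypothetical flaky extraction step: fails twice, then succeeds.
calls = {"count": 0}


def flaky_extract():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return ["row1", "row2"]


# The sleep function is replaced here so the example runs instantly.
rows = retry(flaky_extract, sleep=lambda seconds: None)
print(rows, "after", calls["count"], "calls")
```

Retrying only helps with transient faults; recoverability additionally needs the step to be safe to re-run, for example by writing output atomically or resuming from a checkpoint.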

Safety

Degree to which a product, under defined conditions, avoids a state in which human life, health, property, or the environment is endangered

Operational constraint: Degree to which a product or system constrains its operation to within safe parameters or states when encountering operational hazard.

Risk identification: Degree to which a product can identify a course of events or operations that can expose life, property or environment to unacceptable risk.
 

Fail safe: Degree to which a product can automatically place itself in a safe operating mode, or to revert to a safe condition in the event of a failure.

Hazard warning: Degree to which a product or system provides warnings of unacceptable risks to operations or internal controls so that they can react in sufficient time to sustain safe operations.

Safe integration: Degree to which a product can maintain safety during/after integration with one or more components.

Security

Degree to which a product or system defends against attack patterns by malicious actors and protects information and data so that persons or other products or systems have the degree of data access appropriate to their types and levels of authorization

Confidentiality: Degree to which a product or system ensures that data are accessible only to those authorized to have access.

Integrity: Degree to which a system, product or component ensures that the state of its system and data are protected from unauthorized modification or deletion either by malicious action or computer error.

Accountability: Degree to which the actions of an entity can be traced uniquely to the entity.

Non-repudiation: Degree to which actions or events can be proven to have taken place so that the events or actions cannot be repudiated later.

Authenticity: Degree to which the identity of a subject or resource can be proved to be the one claimed.

Resistance: Degree to which the product or system sustains operations while under attack from a malicious actor.
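As an illustration of the integrity characteristic above, a pipeline can record a checksum of a dataset and verify it before later use. This is a sketch only: `sha256_hex` and `verify` are hypothetical helpers, and the data is a made-up CSV fragment.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


def verify(data: bytes, expected_hex: str) -> bool:
    """Check data against a previously recorded digest."""
    return hmac.compare_digest(sha256_hex(data), expected_hex)


original = b"id,value\n1,42\n"
recorded = sha256_hex(original)  # stored alongside or apart from the data

tampered = b"id,value\n1,43\n"
print(verify(original, recorded))  # unchanged data
print(verify(tampered, recorded))  # modified data
```

A plain checksum detects accidental corruption. Against a malicious actor who could also replace the recorded digest, the digest would need to be stored separately or protected, for example with a keyed hash or a digital signature.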

Reproducible data engineering

Recommendations for Reproducible Research suggest using:

  • open source
  • version control (e.g. git)
  • reproducible environments (e.g. Conda)
  • code documentation
  • code review and code quality checks
  • testing & continuous integration

Illustrations from The Turing Way https://doi.org/10.5281/zenodo.13882307

Ten Simple Rules for Reproducible Computational Research

  1. For Every Result, Keep Track of How It Was Produced
  2. Avoid Manual Data Manipulation Steps
  3. Archive the Exact Versions of All External Programs Used
  4. Version Control All Custom Scripts
  5. Record All Intermediate Results, When Possible in Standardized Formats
  6. For Analyses That Include Randomness, Note Underlying Random Seeds
  7. Always Store Raw Data behind Plots
  8. Generate Hierarchical Analysis Output, Allowing Layers of Increasing Detail to Be Inspected
  9. Connect Textual Statements to Underlying Results
  10. Provide Public Access to Scripts, Runs, and Results

Sandve et al. (2013): Ten Simple Rules for Reproducible Computational Research
https://doi.org/10.1371/journal.pcbi.1003285
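Rules 1 and 6 can be sketched in a few lines: fix the random seed and store it with basic provenance next to the result. The function `run_analysis` and the fields of the provenance record are assumptions for this example, not a prescribed format.

```python
import json
import platform
import random


def run_analysis(seed):
    """Run a toy randomised analysis and return (result, provenance record)."""
    rng = random.Random(seed)  # dedicated generator, independent of global state
    sample = [rng.random() for _ in range(5)]
    result = sum(sample) / len(sample)
    provenance = {
        "seed": seed,
        "python_version": platform.python_version(),
        "sample_size": len(sample),
    }
    return result, provenance


result_a, prov = run_analysis(seed=1234)
result_b, _ = run_analysis(seed=1234)

# The same seed reproduces the same result in the same environment.
print(result_a == result_b)
print(json.dumps(prov, indent=2))
```

In practice the provenance record would also capture the versions of external programs and libraries (rule 3) and the version-control revision of the script (rule 4).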

Illustrations from The Turing Way https://doi.org/10.5281/zenodo.13882307

Security

Security

Ensuring data security requires attention to physical security, network security, and the security of computer systems and files to prevent unauthorised access or unwanted changes to data, disclosure, or destruction.

  • Apply stricter security for personal, sensitive, or confidential data.
  • Control physical access to rooms, devices, and printed materials.
  • Encrypt files before storing or sharing, especially when transmitting online.
  • Separate directly identifiable data (e.g. names) from research data and store securely.
  • Keep systems updated with firewalls, antivirus software, and patches.
  • Use password protection and permission controls for digital files and folders.
  • Avoid using general-purpose cloud file-sharing tools for personal data.
  • Only share sensitive data via secure transfer methods approved by your institution.
  • Include non-disclosure agreements for anyone handling confidential data.
  • Document who has access, review permissions regularly, and remove access when no longer required.
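The recommendation to separate directly identifiable data from research data can be sketched as follows. The helper `split_identifiers`, the field names and the records are all hypothetical; the random `study_id` is an assumption about how the two parts might be linked.

```python
import secrets


def split_identifiers(records, identifying_fields):
    """Split records into an identity lookup table and de-identified research rows."""
    identities = {}
    research = []
    for record in records:
        study_id = secrets.token_hex(8)  # random link key, not derived from the person
        identities[study_id] = {f: record[f] for f in identifying_fields}
        row = {k: v for k, v in record.items() if k not in identifying_fields}
        row["study_id"] = study_id
        research.append(row)
    return identities, research


records = [
    {"name": "Alice Example", "email": "alice@example.org", "age_band": "30-39", "score": 7},
    {"name": "Bob Example", "email": "bob@example.org", "age_band": "40-49", "score": 5},
]

identities, research = split_identifiers(records, {"name", "email"})
print(research)  # analysed day to day
# `identities` would be stored separately, with stricter access controls.
```

The remaining fields may still identify someone indirectly (for example a rare combination of attributes), so the research table still needs appropriate protection.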

Data protection

Legislation on data protection in the UK

  • Data Protection Act 2018 (DPA)
  • EU General Data Protection Regulation (GDPR)
    • UK General Data Protection Regulation (UK GDPR)
  • Data (Use and Access) Act 2025 (DUAA)

 

Anyone responsible for using personal data must make sure the information is:

  • used fairly, lawfully and transparently
  • used for specified, explicit purposes
  • used in a way that is adequate, relevant and limited to only what is necessary
  • accurate and, where necessary, kept up to date
  • kept for no longer than is necessary
  • handled in a way that ensures appropriate security, including protection against unlawful or unauthorised processing, access, loss, destruction or damage

 

Individuals have rights in relation to their personal data, with some exceptions. These include the right to:

  • be informed about how their data is being used
  • access personal data
  • have incorrect data updated
  • have data erased
  • stop or restrict the processing of their data
  • data portability (allowing them to get and reuse their data for different services)
  • object to how their data is processed in certain circumstances

     

You also have rights when an organisation is using your personal data for:

  • automated decision-making processes (without human involvement)
  • profiling, for example to predict your behaviour or interests

 

Privacy

‘personal data’ means any information relating to an identified or identifiable natural person (‘data subject’);

an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person

Personal data

 

Personal data that is processed can be more sensitive in nature and therefore requires a higher level of protection if it includes an individual's:

  • race
  • ethnic origin
  • political opinions
  • religious or philosophical beliefs
  • trade union membership
  • genetic data
  • biometric data (where used for identification)
  • health data
  • sex life or orientation
  • criminal convictions and offences

Special categories of personal data

‘pseudonymisation’ means the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and is subject to technical and organisational measures to ensure that the personal data are not attributed to an identified or identifiable natural person

Pseudonymisation

  • Techniques that replace, remove or transform information that identifies people, and keep that information separate
  • Reduce the risks your data processing poses:
    • implement data protection by design;
    • ensure appropriate security; and
    • make better use of personal data (e.g. for research purposes and general analysis).
  • Pseudonymisation differs from anonymisation
  • Data protection legislation still applies!
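One possible pseudonymisation technique is a keyed hash, where the secret key plays the role of the "additional information" that must be kept separately. This is a sketch under that assumption, using HMAC-SHA256; `pseudonymise` is a hypothetical helper and the identifiers are made up.

```python
import hashlib
import hmac
import secrets


def pseudonymise(identifier: str, key: bytes) -> str:
    """Replace an identifier with a keyed hash (HMAC-SHA256)."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


# The key must be stored separately from the pseudonymised data.
key = secrets.token_bytes(32)

p1 = pseudonymise("alice@example.org", key)
p2 = pseudonymise("alice@example.org", key)
p3 = pseudonymise("bob@example.org", key)

print(p1 == p2)  # same person maps to the same pseudonym, so records can be linked
print(p1 == p3)  # different people get different pseudonyms
```

Anyone holding the key can recompute the pseudonym for a known identifier, which is why the result is pseudonymised rather than anonymised and remains personal data.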

Pseudonymisation

COMP63301 Robustness, security and privacy

By Stian Soiland-Reyes

Lecture in COMP63301 Data Engineering Concepts at The University of Manchester.