Processing sensitive data is one of the most complex aspects of research data management, requiring special attention and knowledge. Improper use of sensitive data can have serious consequences for both individuals and organizations. In this section, you will find practical advice and methods for working safely with sensitive data while complying with both ethical and legal requirements.
The main goal is to ensure that your research can continue while protecting individual privacy and organizational confidentiality. When working with sensitive data, it is essential to apply pseudonymization or anonymization to reduce identification risks.
Sensitive data is any information whose disclosure or improper use could threaten the privacy or security of a person, group, or organization, lead to discrimination, or be contrary to the public interest.
Data Type | Description | Examples |
---|---|---|
Personal data | Information that identifies or can identify a specific person | Name, personal code, email address, GPS coordinates, IP address |
Special category personal data | Specially protected data defined by GDPR | Race, sexual orientation, political views, religious beliefs, genetic and health data |
Confidential data | Data whose disclosure creates risks for organizations or society | Financial reports and transaction data, passwords, trade secrets, intellectual property, investigation data, sensitive public administration data, security or dual-use information |
Biological data | Data related to health and biological diversity | Blood test results, DNA sequence, fingerprints, location of endangered or extinct species |
How to identify sensitive data?
- Direct identifiers – directly identify a specific person:
  - first name, last name
  - personal code
  - email address (if it contains a name)
  - location (GPS coordinates, IP address)
- Indirect identifiers – in combination can identify a person:
  - date of birth or age
  - gender, ethnicity
  - postal code, residential address
  - unique socio-economic data
Sometimes it is difficult to determine whether research data is sensitive and how to safely collect, store, process, and analyse it. In such cases, we encourage you to contact UL data curators.
Pseudonymization is a process in which personal data is replaced with artificial identifiers (pseudonyms), such as codes or unique numbers, while keeping the ability to restore the original information if necessary. A separate key file records which participant number corresponds to which participant's name. This key file must be stored separately from the pseudonymized dataset to ensure data protection.
When to use pseudonymization?
- data restoration or supplementation is needed – there are plans to add to the existing data later
- data must remain linked to a specific person – the connection between data and participants has to be maintained
- identification must be possible during the project – the need to identify specific participants may arise while the project is running
Essence of pseudonymization:
- data is still linked to a specific person, but it is protected from direct identification
- personal data protection requirements still apply to pseudonymized data, as there is a risk that persons could be identified using additional information
- suitable for cases where data restoration or supplementation is necessary
Pseudonymization methods:
- Code tables – the simplest method: names and personal codes are replaced with short codes (for example, ID-001, ID-002). A separate table records which code corresponds to which person.
  - advantages: simple to implement, easily understood by any researcher, full control over code format and structure
  - disadvantages: if the code table falls into the wrong hands, the entire pseudonymization system collapses and persons become identifiable
- Hash functions – algorithms that transform any text or data into a fixed-length string of digits and letters.
  - advantages: no separate code table is needed, which lowers the risk of a leak compared with code tables
  - disadvantages: the original data cannot practically be restored from the hash, which is a problem if participants need to be contacted or data supplemented; short or predictable identifiers (such as personal codes) can still be recovered by hashing every possible value, so the hash should be combined with a secret key or salt
- Encryption – original data is transformed into an unrecognizable form using a special key. Unlike hashing, encrypted data can be restored to its original form if the decryption key is available.
  - advantages: better security than code tables, data can be restored if needed, flexible method
  - disadvantages: encryption key security must be ensured: if the key is lost, the data becomes inaccessible; if it falls into the wrong hands, the data is compromised
- Tokenization – original data is replaced with "tokens" while maintaining the original data format and structure. For example, an 11-digit personal code is replaced with another 11-digit code.
  - advantages: maintains data structure and format, easily integrated into existing systems, doesn't interfere with data analysis processes
  - disadvantages: may preserve part of the identifying information (for example, date of birth), which increases reidentification risk
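The code-table and hash-based methods above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a recommended production setup: the field names (`name`, `participant_id`), the sample records, and the helper names `pseudonymize_with_code_table` and `keyed_hash` are assumptions made for this example. The hash variant uses HMAC-SHA256 with a secret key rather than a plain hash, because plain hashes of predictable identifiers can be reversed by trying all candidate values.

```python
import hashlib
import hmac
import secrets


def pseudonymize_with_code_table(records, id_field="name"):
    """Replace id_field with codes such as ID-001; return (records, key_table)."""
    key_table = {}  # original identifier -> code; keep this apart from the data
    pseudonymized = []
    for record in records:
        original = record[id_field]
        if original not in key_table:
            key_table[original] = f"ID-{len(key_table) + 1:03d}"
        # Copy every field except the direct identifier, then add the code.
        cleaned = {k: v for k, v in record.items() if k != id_field}
        cleaned["participant_id"] = key_table[original]
        pseudonymized.append(cleaned)
    return pseudonymized, key_table


def keyed_hash(identifier, secret_key):
    """Return the HMAC-SHA256 of an identifier as a 64-character hex string."""
    return hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical sample data, invented for illustration only.
records = [
    {"name": "Anna Example", "age": 32},
    {"name": "Janis Sample", "age": 45},
    {"name": "Anna Example", "age": 33},
]

pseudonymized, key_table = pseudonymize_with_code_table(records)

# The secret key plays the role of the key file: whoever holds it can
# recompute the pseudonym of a known person, so it must be stored securely.
secret_key = secrets.token_bytes(32)
hashed_ids = [keyed_hash(r["name"], secret_key) for r in records]
```

In both variants the same person always receives the same pseudonym, so repeated measurements can still be linked. The key table and the secret key are the sensitive parts and would be stored apart from the pseudonymized dataset, as described above.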
Anonymization is a process in which data is irreversibly transformed so that a specific person can no longer be identified, even with the help of additional information.
When to use anonymization?
- data will be made publicly available
- no need to maintain connection to a specific person
- no plans to supplement datasets
Anonymization methods:
- Data deletion – the simplest method: complete removal of sensitive data columns or rows from the dataset. For example, deleting names, personal codes, or contact information.
  - advantages: very simple to implement, eliminates identification risk for the deleted data
  - disadvantages: may lose information that is important for analysis, reduces dataset value
- Data generalization – specific values are replaced with broader categories. For example, the precise age "32 years" becomes the age group "30-39 years", or a specific address becomes a city name.
  - advantages: preserves analytically useful information, easily understood and implemented
  - disadvantages: loses precision, may affect the quality of analysis results
- Microaggregation – similar records are grouped, and individual values are replaced with the group average, median, or another statistical indicator.
  - advantages: preserves statistical properties, suitable for quantitative data
  - disadvantages: may mask important individual differences, more complex to implement
- Data shuffling – the values within a column are reordered between records: for example, person A's age is assigned to person B, and person B's education to person C.
  - advantages: preserves the distribution and summary statistics of each variable, links between a record and its original values become unrecognizable
  - disadvantages: loses correlations between variables, which may affect analyses based on relationships between variables
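Generalization, microaggregation, and shuffling as described above can be sketched as follows. This is a simplified illustration under stated assumptions: the function names, the 10-year band width, the group size of 3, and the sample records are all invented for the example, and real microaggregation tools usually group on several variables at once rather than on a single sorted column.

```python
import random
import statistics


def generalize_age(age, width=10):
    """Replace an exact age with a band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def microaggregate(values, group_size=3):
    """Replace each value with the mean of its group of similar values.

    Values are sorted, cut into consecutive groups of group_size, and a
    trailing group that is too small is merged into the previous one.
    The result is returned in the original record order.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    groups = [order[i:i + group_size] for i in range(0, len(order), group_size)]
    if len(groups) > 1 and len(groups[-1]) < group_size:
        last = groups.pop()
        groups[-1].extend(last)
    result = [None] * len(values)
    for group in groups:
        mean = statistics.mean(values[i] for i in group)
        for i in group:
            result[i] = mean
    return result


def shuffle_column(records, field, rng):
    """Reassign the values of one column at random across records."""
    values = [r[field] for r in records]
    rng.shuffle(values)
    return [{**r, field: v} for r, v in zip(records, values)]


# Hypothetical sample data, invented for illustration only.
people = [
    {"age": 32, "education": "MSc"},
    {"age": 47, "education": "BSc"},
    {"age": 58, "education": "PhD"},
]
incomes = [1000, 1200, 1100, 3000, 3200, 3100]

age_bands = [generalize_age(p["age"]) for p in people]
aggregated = microaggregate(incomes)
shuffled = shuffle_column(people, "age", random.Random(0))
```

Shuffling leaves the set of ages in the dataset unchanged but breaks the pairing between age and education, which is exactly the loss of correlation noted in the disadvantages.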
- Data evaluation:
  - identify all sensitive data – go through all fields and determine which contain personal data, confidential information, or other sensitive content
  - determine direct and indirect identifiers – direct (name, personal code) and indirect (age + gender + postal code in combination)
  - assess reidentification risk – how easily someone could recognize a specific person by combining available information
- Method selection:
  - pseudonymization: if you need to maintain a connection with persons for future contact, data supplementation, or longitudinal studies
  - anonymization: if data will be made publicly available, published in a repository, or restoration is not needed
- Security measures:
  - store key tables separately and securely – never store code tables in the same folder or system as pseudonymized data
  - restrict access to only the necessary persons, for example, only the project leader or specially authorized persons
  - document all performed actions – describe in the data management plan (DMP) what method was used, why, and how
- Quality control:
  - check whether anonymization is sufficient – whether it is still possible to identify persons, especially when combining with other data
  - test reidentification risk – try to "guess" persons from the anonymized data
  - consult experts in complex cases – contact UL data curators or data protection specialists
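One way to make the reidentification check above more concrete is to count how many records share each combination of indirect identifiers. This idea is commonly called k-anonymity; the text above does not name it, so treat it here as an illustrative assumption rather than a prescribed method. The function names, the threshold `k=3`, and the sample records are hypothetical.

```python
from collections import Counter


def group_sizes(records, quasi_identifiers):
    """Count the records that share each combination of indirect identifiers."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def smallest_group_size(records, quasi_identifiers):
    """Return the size of the rarest combination (0 for an empty dataset)."""
    return min(group_sizes(records, quasi_identifiers).values(), default=0)


def risky_records(records, quasi_identifiers, k=3):
    """Return the records whose combination is shared by fewer than k records."""
    counts = group_sizes(records, quasi_identifiers)
    return [
        r for r in records
        if counts[tuple(r[q] for q in quasi_identifiers)] < k
    ]


# Hypothetical sample data, invented for illustration only.
dataset = [
    {"age_group": "30-39", "gender": "F", "postal_code": "LV-1050"},
    {"age_group": "30-39", "gender": "F", "postal_code": "LV-1050"},
    {"age_group": "30-39", "gender": "F", "postal_code": "LV-1050"},
    {"age_group": "40-49", "gender": "M", "postal_code": "LV-1010"},
]
fields = ["age_group", "gender", "postal_code"]

k_value = smallest_group_size(dataset, fields)
flagged = risky_records(dataset, fields, k=3)
```

A record that is alone in its group, like the last one in this sample, is a candidate for further generalization or deletion. A passing check of this kind does not prove that data is anonymous, since it only covers the fields you list and says nothing about linkage with external datasets.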