Working with Sensitive Data

Processing sensitive data is one of the most complex aspects of research data management, requiring special attention and knowledge. Improper use of sensitive data can have serious consequences for both individuals and organizations. In this section, you will find practical advice and methods for working safely with sensitive data while complying with both ethical and legal requirements.

The main goal is to ensure that your research can continue while protecting individual privacy and organizational confidentiality. When working with sensitive data, it is essential to apply pseudonymization or anonymization to reduce the risk of identification.

Sensitive data is any information whose disclosure or improper use could threaten the privacy or security of a person, group, or organization, lead to discrimination, or be contrary to the public interest.

Data Type	Description	Examples
Personal data	Information that identifies or can identify a specific person	Name, personal code, email address, GPS coordinates, IP address
Special category personal data	Specially protected data defined by GDPR	Race, sexual orientation, political views, religious beliefs, genetic and health data
Confidential data	Data whose disclosure creates risks for organizations or society	Financial reports and transaction data, passwords, trade secrets, intellectual property, investigation data, sensitive public administration data, security or dual-use information
Biological data	Data related to health and biological diversity	Blood test results, DNA sequence, fingerprints, location of endangered or extinct species

How to identify sensitive data?

Direct identifiers – directly identify a specific person:
- first name, last name
- personal code
- email address (if it contains a name)
- location (GPS coordinates, IP address)
Indirect identifiers – in combination can identify a person:
- date of birth or age
- gender, ethnicity
- postal code, residential address
- unique socio-economic data

Sometimes it is difficult to determine whether research data is sensitive and how to safely collect, store, process, and analyse it. In such cases, we encourage you to contact UL data curators.

Pseudonymization is a process where personal data is replaced with fictitious identifiers (pseudonyms), for example, codes or unique numbers, while maintaining the ability to restore the original information if necessary. In addition, a key file is created that records which participant number corresponds to which participant's name. This key file must be stored separately from the pseudonymized dataset to ensure data protection.

When to use pseudonymization?

- data restoration or supplementation is needed – there are plans to supplement the existing data
- data must stay linked to a specific person – a connection between the data and the participants has to be maintained
- identification must be possible during the project – the need to identify specific participants may arise while the project is running

Essence of pseudonymization:

- data is still linked to a specific person, but it is protected from direct identification
- personal data protection requirements still apply to pseudonymized data, as there is a risk that persons could be identified using additional information
- suitable for cases where data restoration or supplementation is necessary

Pseudonymization methods:

Code tables – simplest method – names and personal codes are replaced with short codes (for example, ID-001, ID-002). Additionally, a separate table is created where information about which code corresponds to which person is stored.
- advantages: simple implementation, easily understood by any researcher, full control over code format and structure
- disadvantages: if the code table falls into the wrong hands, the entire pseudonymization system collapses and persons become identifiable
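The code-table approach can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the function name, the `participant_id` field, and the sample records are all hypothetical. Note that sequential codes reveal the order in which participants first appear in the data.

```python
def pseudonymize_with_code_table(records, id_field="name", prefix="ID-"):
    """Replace the identifying field with short sequential codes.

    Returns (pseudonymized_records, key_table). The key table maps each
    code back to the original value and must be stored separately.
    """
    code_for = {}  # original value -> assigned code
    pseudonymized = []
    for record in records:
        original = record[id_field]
        if original not in code_for:
            code_for[original] = f"{prefix}{len(code_for) + 1:03d}"
        # Copy every field except the identifying one.
        cleaned = {k: v for k, v in record.items() if k != id_field}
        cleaned["participant_id"] = code_for[original]
        pseudonymized.append(cleaned)
    key_table = {code: original for original, code in code_for.items()}
    return pseudonymized, key_table


# Hypothetical sample data; the names are invented.
records = [
    {"name": "Anna Example", "score": 12},
    {"name": "Janis Sample", "score": 15},
    {"name": "Anna Example", "score": 14},
]
data, key_table = pseudonymize_with_code_table(records)
```

Here `data` contains only codes and scores, while `key_table` is the part that has to be kept in a different, access-restricted location.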
Hash functions – special algorithms that transform any text or data into a fixed-length string of digits and letters.
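- advantages: no separate table with codes is needed, which lowers the risk of a data leak
- disadvantages: almost impossible to restore the original data if participants need to be contacted or data supplemented; hashes of short or predictable values (for example, personal codes) can be reversed by hashing every possible input unless a secret key or salt is used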
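As a rough sketch of hash-based pseudonymization, the example below uses a keyed hash (HMAC-SHA256 from the Python standard library) rather than a plain hash. The keyed variant is an assumption of this example, chosen because plain hashes of short identifiers can be guessed by trying all inputs. The function name, the key, and the sample identifier are hypothetical.

```python
import hashlib
import hmac


def pseudonym_from_hash(identifier: str, secret_key: bytes, length: int = 16) -> str:
    """Derive a fixed-length pseudonym from an identifier with HMAC-SHA256.

    The same identifier and key always give the same pseudonym, so records
    from different files can still be matched without a code table.
    """
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    # Truncating shortens the pseudonym; a longer prefix lowers collision risk.
    return digest.hexdigest()[:length]


# Hypothetical key: in practice it should be long, random, and stored
# separately from the data, like a key file.
SECRET_KEY = b"replace-with-a-long-random-secret"

# Invented identifier used only for illustration.
pseudonym = pseudonym_from_hash("participant-0001", SECRET_KEY)
```

Anyone holding the secret key can recompute pseudonyms for known identifiers, so the key needs the same protection as a code table.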
Encryption – original data is transformed into an unrecognizable form using a special key. Unlike hashed data, encrypted data can be restored to its original form if the decryption key is available.
- advantages: better security than code tables, data can be restored if needed, flexible method
- disadvantages: encryption key security must be ensured – if it's lost, data becomes inaccessible; if it falls into wrong hands, data is compromised
Tokenization – original data is replaced with "tokens" while maintaining the original data format and structure. For example, an 11-digit personal code is replaced with another 11-digit code.
- advantages: maintains data structure and format, easily integrated into existing systems, doesn't interfere with data analysis processes
- disadvantages: may preserve part of identifying information (for example, date of birth), which increases reidentification risk
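Format-preserving tokenization might look roughly like the sketch below, which replaces every digit with a random digit and keeps the length and any separators. The `tokenize` function and the in-memory `vault` dictionary are hypothetical; a real token vault would be stored securely and separately. Because all digits are randomized here, no part of the original code (such as a date of birth) survives in the token.

```python
import secrets

DIGITS = "0123456789"


def tokenize(value: str, vault: dict) -> str:
    """Return a random token with the same length and layout as value.

    The vault maps original values to tokens so that the same value
    always receives the same token and can be looked up again later.
    """
    if value in vault:
        return vault[value]
    if not any(ch.isdigit() for ch in value):
        raise ValueError("value contains no digits to replace")
    used = set(vault.values())
    while True:
        # Replace digits with random digits; keep other characters as they are.
        token = "".join(
            secrets.choice(DIGITS) if ch.isdigit() else ch for ch in value
        )
        if token != value and token not in used:
            vault[value] = token
            return token


vault = {}
# Invented 11-digit code used only for illustration.
token = tokenize("12345678901", vault)
```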

Anonymization is a process where data is completely transformed so that it is impossible to identify a specific person, even using additional information.

When to use anonymization?

- data will be made publicly available
- no need to maintain connection to a specific person
- no plans to supplement datasets

Anonymization methods:

Data deletion – simplest method – complete removal of sensitive data columns or rows from the dataset. For example, deleting names, personal codes, or contact information.
- advantages: very simple implementation, eliminates identification risk for deleted data
- disadvantages: may lose important information for analysis, reduces dataset value
Data generalization – specific values are replaced with broader categories. For example, precise age "32 years" becomes age group "30-39 years," or specific address becomes city name.
- advantages: preserves analytically useful information, easily understood and implemented
- disadvantages: loses precision, may affect analysis result quality
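The age and address examples above can be illustrated with a short Python sketch. The field names (`age`, `address`, `city`, `age_group`) and the sample record are assumptions made for this example, and the banding assumes non-negative ages.

```python
def age_to_band(age: int, width: int = 10) -> str:
    """Map an exact age to a band such as '30-39'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def generalize_record(record: dict) -> dict:
    """Return a copy with the exact age banded and the street address dropped."""
    out = dict(record)
    out["age_group"] = age_to_band(out.pop("age"))
    out.pop("address", None)  # only the broader "city" field is kept
    return out


# Invented sample record.
example = {"age": 32, "address": "12 Example Street", "city": "Riga", "score": 7}
generalized = generalize_record(example)
```

A wider band (for example, `width=20`) lowers identification risk further at the cost of analytical precision.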
Microaggregation – similar records are grouped, and individual values are replaced with group average, median, or another statistical indicator.
- advantages: preserves statistical properties, suitable for quantitative data
- disadvantages: may mask important individual differences, more complex implementation
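One simple way to microaggregate a single numeric variable is to sort the values, form groups of at least k neighbours, and replace each value with its group mean. The sketch below follows that idea; the function name, the choice of the mean, and the rule of folding a short tail into the last group are assumptions of this example rather than a standard prescribed by the text.

```python
from statistics import mean


def microaggregate(values, k=3):
    """Replace each value with the mean of a group of at least k similar values."""
    n = len(values)
    if k < 1 or n < k:
        raise ValueError("need k >= 1 and at least k values")
    # Indices of the values in ascending order, so neighbours are similar.
    order = sorted(range(n), key=lambda i: values[i])
    result = [None] * n
    start = 0
    while start < n:
        end = start + k
        if n - end < k:
            end = n  # fold a short tail into the last group
        group = order[start:end]
        group_mean = mean(values[i] for i in group)
        for i in group:
            result[i] = group_mean
        start = end
    return result


# Invented ages used only for illustration.
ages = [21, 50, 23, 48, 22, 52, 49]
aggregated = microaggregate(ages, k=3)
```

The overall mean of the variable is unchanged, but no value in the output is shared by fewer than k records.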
Data shuffling – the values within each column are reordered, so that, for example, person A's age is assigned to person B and person B's education is assigned to person C.
- advantages: preserves data distribution and statistical properties, individual connections become unrecognizable
- disadvantages: loses correlations between variables, may affect analyses based on variable interrelationships
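Shuffling can be sketched as an independent random permutation of each selected column. In the example below the function name, the `seed` parameter (included only to make a run repeatable), and the sample rows are hypothetical.

```python
import random


def shuffle_columns(records, columns, seed=None):
    """Return copies of the records with each listed column permuted independently."""
    rng = random.Random(seed)
    shuffled = [dict(r) for r in records]  # leave the input untouched
    for column in columns:
        values = [r[column] for r in records]
        rng.shuffle(values)
        for row, value in zip(shuffled, values):
            row[column] = value
    return shuffled


# Invented sample rows.
rows = [
    {"id": 1, "age": 25, "education": "secondary"},
    {"id": 2, "age": 41, "education": "bachelor"},
    {"id": 3, "age": 33, "education": "master"},
    {"id": 4, "age": 58, "education": "doctorate"},
]
mixed = shuffle_columns(rows, ["age", "education"], seed=1)
```

Each column keeps exactly the same set of values, so its distribution is preserved, while the link between age and education within a row is broken. With very small datasets a random permutation may leave some rows unchanged.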

Data evaluation:
- identify all sensitive data – go through all fields and determine which contain personal data, confidential information, or other sensitive content
- determine direct and indirect identifiers – direct (name, personal code) and indirect (age + gender + postal code in combination)
- assess reidentification risk – how easily someone could recognize a specific person by combining available information
Method selection:
- pseudonymization: if you need to maintain connection with persons for future contacts, data supplementation, or longitudinal studies
- anonymization: if data will be made publicly available, published in a repository, or restoration is not needed
Security measures:
- store key tables separately and securely – never store code tables in the same folder or system as pseudonymized data
- restrict access to only necessary persons, for example, only to the project leader or specially authorized persons
- document all performed actions – describe in the data management plan (DMP) which method was used, why, and how
Quality control:
- check whether anonymization is sufficient – whether it's still possible to identify persons, especially when combining with other data
- test reidentification risk – try to "guess" persons from anonymized data
- consult experts in complex cases – contact UL data stewards or data protection specialists.
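
One common way to approximate the reidentification test described above is to count how many records share each combination of indirect identifiers; combinations that occur only once or twice point to people who may be easy to single out. The sketch below is a minimal illustration of that idea (often discussed under the name k-anonymity, a term not used in the text above). The function names, field names, and sample records are hypothetical.

```python
from collections import Counter


def group_sizes(records, quasi_identifiers):
    """Count the records sharing each combination of indirect identifiers."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def smallest_group_size(records, quasi_identifiers):
    """Size of the rarest combination; 0 for an empty dataset."""
    return min(group_sizes(records, quasi_identifiers).values(), default=0)


def risky_records(records, quasi_identifiers, k=2):
    """Records whose combination of indirect identifiers occurs fewer than k times."""
    sizes = group_sizes(records, quasi_identifiers)
    return [
        r for r in records
        if sizes[tuple(r[q] for q in quasi_identifiers)] < k
    ]


# Invented sample records.
people = [
    {"age_group": "30-39", "gender": "F", "postal_code": "LV-1050"},
    {"age_group": "30-39", "gender": "F", "postal_code": "LV-1050"},
    {"age_group": "40-49", "gender": "M", "postal_code": "LV-1010"},
]
fields = ["age_group", "gender", "postal_code"]
smallest = smallest_group_size(people, fields)
flagged = risky_records(people, fields, k=2)
```

In this sample the third record is unique on all three fields, so it would be flagged for further generalization or removal. A check like this covers only the dataset itself and does not account for external data an attacker might combine with it.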
