Pseudonymization vs Tokenization Explained

Ariel Shiftan

CTO & Co-founder

November 14, 2021

Data leaks hurt consumers. Though the true number of breaches and compromised data remains unknown, those we know of have compromised billions of records--including highly sensitive customer data. Even the largest corporations with the most competitive security teams have failed to prevent data leaks. Until this is properly addressed, consumers cannot trust enterprises to keep their information safe and will continue to see the private information they share with enterprises dangerously exposed, stolen, and shared.

Mounting public concern over how enterprises interact with our personal information has led to the development of laws and regulations such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA). Personal data now officially requires special handling and care to meet today’s compliance standards and consumer expectations. This post discusses one of the most effective techniques for reducing the risk of compromised consumer data.

Data Pseudonymization and Data De-Identification

Personal data de-identification is the process of removing identifiers from a data set so that the information in it cannot be linked to individuals. When de-identification is applied in a way that makes it impossible to re-identify the data, that is, to link it back to individuals, the data is considered anonymized.

Today’s business analysts and data scientists require access to data. However, given limited security awareness and problematic infrastructure, direct access introduces unacceptable risk--especially if the data in use is subsequently copied outside of controlled systems. Applying anonymization or, at the very least, pseudonymization as a best practice dramatically reduces these risks.

What is Pseudonymization?

Anonymized data sets are out of the scope of privacy regulations like GDPR. However, full anonymization can disrupt many legitimate data uses that require identifiers (e.g., your bank probably needs to know who you are as a user). GDPR proposes pseudonymization as a practical alternative for reducing the risk of data exposure, relieving compliance obligations, and maintaining optimal data utilization:

‘pseudonymisation’ means the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and is subject to technical and organisational measures to ensure that the personal data are not attributed to an identified or identifiable natural person (GDPR, Article 4(5))

Pseudonymization is the process of replacing all sensitive identifiers with pseudonyms/aliases/tokens in a specific data set (e.g., a table in a database or a CSV file). The original identifiers have to be kept securely elsewhere. C’est tout. Unfortunately, it’s much more complicated to execute. Building systems that work with pseudonymized data requires a whole lot of designing from the get-go. This is incredibly difficult to do--so much so that GDPR can only recommend it due to unenforceability.

What is Tokenization?

Tokenization is the process of substituting a sensitive data element with a non-sensitive equivalent, referred to as a token, that has no exploitable meaning or value. The token is a reference that maps back to the sensitive data through a tokenization system.

Data Anonymization Techniques

Pseudonymization is typically implemented by using a tokenization technique (see below). Note that, like anonymization, pseudonymized data cannot be linked to a person's specific identity on its own. However, unlike anonymization, it is possible to re-identify pseudonymized data using the additional piece of information kept secure elsewhere. Effectively, when systems require the original plaintext identifiers, they can still translate the pseudonyms back.

Example of an original table (with PIIs of emails and SSNs):

| user_id | email_address | ssn | salary |
|---|---|---|---|
| 1 | liz@example.com | 000-11-1111 | 25K |
| 2 | darcy@altostrat.com | 000-22-2222 | 39K |
| 3 | fiona@example.com | 000-33-3333 | 32K |
| 4 | Jon@examplepetstore.com | 000-44-4444 | 29K |
| 5 | tomL@test.com | 000-55-5555 | 41K |

If you already get the concept and want to protect PII, you can sign up for a free account, set up a cloud-hosted vault, and tokenize data using our APIs in about two minutes.

Social Security Number is now tokenized:

| user_id | email_address | ssn | salary |
|---|---|---|---|
| 1 | liz@example.com | 1ffa0bf4002a968e7d87d7dc8815f551895378ac | 25K |
| 2 | darcy@altostrat.com | 1d55ec7079cb0a6aca2423ebb5c2b08e8a3fa1d8 | 39K |
| 3 | fiona@example.com | be85b326855e0e748a6e466ffa92dde8e34b3e5c | 32K |
| 4 | Jon@examplepetstore.com | a8018df9bf9b78a98da20058e59fe8d311fbfbf7 | 29K |
| 5 | tomL@test.com | 39512b47a68f4c3fb03845660fca79234270e946 | 41K |

Both emails and SSNs are tokenized, and the table is now pseudonymized using PII tokenization.

| user_id | email_address | ssn | salary |
|---|---|---|---|
| 1 | ccecd98685a699bd@0fe3485.com | 1ffa0bf4002a968e7d87d7dc8815f551895378ac | 25K |
| 2 | e74c15f3f39db602@4706c986.com | 1d55ec7079cb0a6aca2423ebb5c2b08e8a3fa1d8 | 39K |
| 3 | 55715565ab5e1378@f3b1c1bb.com | be85b326855e0e748a6e466ffa92dde8e34b3e5c | 32K |
| 4 | 5847be2298b245a@3970680.com | a8018df9bf9b78a98da20058e59fe8d311fbfbf7 | 29K |
| 5 | E85d62b7520b21@6eae1.com | 39512b47a68f4c3fb03845660fca79234270e946 | 41K |


* Email addresses are tokenized using a format-preserving tokenization mechanism.
* This table is now either anonymized or pseudonymized depending on whether it's possible to restore its original email or SSN values.

This is a 1:1 translation (tokens) table matching each token to its original email address:

| email_address_token | email_address |
|---|---|
| ccecd98685a699bd@0fe3485.com | liz@example.com |
| e74c15f3f39db602@4706c986.com | darcy@altostrat.com |
| 55715565ab5e1378@f3b1c1bb.com | fiona@example.com |
| 5847be2298b245a@3970680.com | Jon@examplepetstore.com |
| E85d62b7520b21@6eae1.com | tomL@test.com |

* Authorizing, auditing, and monitoring access to this table are critical for preserving the anonymity of the users in the original table.
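The email example above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the `EmailTokenTable` class, its method names, and the token shape are assumptions made for this sketch, and the in-memory dictionaries stand in for a translation table that would really live in a separate, access-controlled store.

```python
import secrets


class EmailTokenTable:
    """Hypothetical in-memory translation table (illustration only)."""

    def __init__(self) -> None:
        self._token_to_email: dict[str, str] = {}
        self._email_to_token: dict[str, str] = {}

    def tokenize(self, email: str) -> str:
        # Reuse an existing token so the mapping stays 1:1.
        if email in self._email_to_token:
            return self._email_to_token[email]
        # Random token that still looks like an email address (format-preserving).
        token = f"{secrets.token_hex(8)}@{secrets.token_hex(4)}.com"
        self._token_to_email[token] = email
        self._email_to_token[email] = token
        return token

    def detokenize(self, token: str) -> str:
        # In a real system this call would be authorized, audited, and monitored.
        return self._token_to_email[token]


users = [
    {"user_id": 1, "email_address": "liz@example.com", "salary": "25K"},
    {"user_id": 2, "email_address": "darcy@altostrat.com", "salary": "39K"},
]

email_tokens = EmailTokenTable()
# The pseudonymized rows carry tokens only; the originals stay in the token table.
pseudonymized = [
    {**row, "email_address": email_tokens.tokenize(row["email_address"])}
    for row in users
]

print(pseudonymized[0]["email_address"])                           # random, email-shaped token
print(email_tokens.detokenize(pseudonymized[0]["email_address"]))  # liz@example.com
```

Without access to `email_tokens`, the rows in `pseudonymized` cannot be traced back to a person, which is why protecting the translation table matters so much.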

Data Tokenization Explained

Tokenization is the process of substituting a single piece of sensitive information with non-sensitive information. The non-sensitive substitute is called a token. A token can be created using cryptography, a hash function, or a randomly generated index identifier, and it is used in place of the original sensitive information. For example, tokenizing sensitive information such as PII or credit card numbers is important before writing it to a log file.

The operation of detokenization, or translating a token to its corresponding data, should only be done on a need-to-know basis (using the right permissions). To support various use cases, tokenization engines that are used for pseudonymization must account for the following possible requirements:

  • Format-preserving tokens: These tokens preserve the format of the original data. They are often required in situations with strict storage formatting, such as when changing the database schema is impossible.
  • Deterministic vs. unique tokens: This refers to whether the same value is always mapped to the same token or whether each occurrence of a value is mapped to its own unique token. Deterministic tokenization enables users to look up exact matches and perform joins directly on the tokenized columns of a pseudonymized dataset. However, this means that it still leaks some information about the original identifiers and consequently reduces the data protection provided by the tokenization engine. For example, it exposes the fact that two records share the same identifier, which, combined with statistical methods, can be used to reveal the original data.
  • Ephemeral tokens: These tokens are valid for a limited amount of time. Their expiry limits exposure. They are often used for regular tasks, for example, identifiers exported on a nightly job from a transactional system to an analytical data pipeline.
  • Querying the data: The ability to perform lookups on the tokenized data, such as searching for all emails with the domain “@example.com” in the example above.
  • Efficiently updating/deleting identifiers: The ability to update or delete an identifier in one place (e.g., inside the translation/tokens table) while keeping the token as-is, instead of updating every table that holds that token.
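The trade-off between deterministic and unique tokens is easier to see side by side. The sketch below is an illustration under stated assumptions: the function names and the hard-coded `SECRET_KEY` are hypothetical, deterministic tokens are derived with a keyed hash (HMAC-SHA256), and unique tokens are random values recorded in a plain dictionary that stands in for a translation table.

```python
import hashlib
import hmac
import secrets

# Assumption for this sketch: a real key would come from a key manager.
SECRET_KEY = b"demo-key-not-for-production"


def deterministic_token(value: str) -> str:
    """Same input always yields the same token (keyed hash)."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()


def unique_token(value: str, token_table: dict[str, str]) -> str:
    """Every call yields a fresh random token; the table maps it back."""
    token = secrets.token_hex(16)
    token_table[token] = value
    return token


# The same customer appears in rows 0 and 2.
emails = ["liz@example.com", "fiona@example.com", "liz@example.com"]

det = [deterministic_token(e) for e in emails]
token_table: dict[str, str] = {}
uniq = [unique_token(e, token_table) for e in emails]

# Deterministic tokens support exact-match lookups and joins,
# but they also reveal that rows 0 and 2 belong to the same person.
print(det[0] == det[2])    # True
# Unique tokens hide that link; matching rows requires the token table.
print(uniq[0] == uniq[2])  # False (random 128-bit tokens)
```

Which behavior is preferable depends on whether the consumers of the pseudonymized data need to join or group by the tokenized column.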

Tokenization engines for pseudonymization can be implemented in two main ways: based on a translation table or based on encryption. Table-based tokenization maintains a mapping from identifiers to randomly generated tokens in a table that is stored in a centralized location. Encryption-based tokenization uses a cryptographic algorithm and a corresponding key to translate identifiers to tokens and vice versa. Whichever is used, the table or the key must be secured and stored separately from the original database.
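To make the encryption-based approach concrete, here is a toy sketch that needs only the Python standard library. It is not a vetted cipher: it builds a small keyed, reversible transformation (a Feistel-style network with HMAC-SHA256 as the round function) purely to show that tokens can be derived from, and reversed with, a key alone, with no table. All names and the round count are assumptions of this sketch; a production system would rely on a reviewed scheme such as AES-SIV or NIST FF1 instead.

```python
import hashlib
import hmac

ROUNDS = 8  # arbitrary choice for this illustration


def _round_bytes(key: bytes, round_no: int, data: bytes, n: int) -> bytes:
    """Derive n key-dependent bytes from the round number and the other half."""
    out = b""
    counter = 0
    while len(out) < n:  # identifiers are short, so counter stays tiny
        out += hmac.new(key, bytes([round_no, counter]) + data, hashlib.sha256).digest()
        counter += 1
    return out[:n]


def _xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def _feistel(key: bytes, data: bytes, round_order) -> bytes:
    mid = len(data) // 2
    left, right = data[:mid], data[mid:]
    for i in round_order:
        # Each round masks one half with bytes derived from the other half.
        if i % 2 == 0:
            left = _xor(left, _round_bytes(key, i, right, len(left)))
        else:
            right = _xor(right, _round_bytes(key, i, left, len(right)))
    return left + right


def tokenize(key: bytes, identifier: str) -> str:
    return _feistel(key, identifier.encode("utf-8"), range(ROUNDS)).hex()


def detokenize(key: bytes, token: str) -> str:
    # Running the same rounds in reverse order undoes the transformation.
    return _feistel(key, bytes.fromhex(token), reversed(range(ROUNDS))).decode("utf-8")


demo_key = b"demo-key-not-for-production"
ssn_token = tokenize(demo_key, "000-11-1111")
print(ssn_token)                        # hex token, same every time for this key
print(detokenize(demo_key, ssn_token))  # 000-11-1111
```

Because the token is a pure function of the key and the identifier, this variant is deterministic and only the key needs to be stored and protected.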



To Summarize

The following table summarizes important considerations for both approaches:

| Consideration | Table | Encryption |
|---|---|---|
| Operational cost | High - requires maintaining an always-available table | Low - requires a key that can be easily backed up and copied if needed |
| Security risk | Moderate - the table requires a high degree of protection. It contains all identifiers and represents a potential single point of failure | Low - requires protection for the key. Access to identifiers requires both the key and the tokens |
| Queries | Any query can be supported | Only supports exact matches for deterministic tokenization |
| Efficiently updating / deleting identifiers | Can update or delete identifiers without requiring a token change | Tokens must be updated or deleted |
| Format preserving | Supported | Supported |
| Deterministic tokens | Supported | Supported |
| Unique tokens | In some cases, it’s not possible to generate unique tokens for format-preserving tokens | Requires extra context. In some cases, it’s not possible to generate unique tokens for format-preserving tokens |
| Ephemeral tokens | Supported | Possible for non-format-preserving tokens in a trusted execution environment |

Sometimes it’s hard to tell which model is better for you; the answer depends on the use case at hand. Typically, it’s best to use a table when you need more searchability over the data. If you need more performance and data security, but less searchability, then encryption may be best. Sometimes, if you know the architecture and how it uses the data, you can roll your own solution to play around with these characteristics.

Whatever you choose to do around pseudonymization, you'll have to build the tokenization engine first. It takes time, effort, and expertise to build something production-grade that can easily scale to thousands of requests per second (RPS). Instead, you can sign up for a free Vault trial and check out our rich tokenization APIs.

About the author

Ariel, despite holding a PhD in Computer Science, doesn't strictly conform to the traditional academic archetype. His heart lies in the realm of hacking, a passion he has nurtured since his early years. As a proficient problem solver, Ariel brings unmatched practicality and resourcefulness to every mission he undertakes.
