Ariel Shiftan
November 14, 2021
Data leaks hurt consumers. Though the true number of breaches and compromised data remains unknown, those we know of have compromised billions of records, including highly sensitive customer data. Even the largest corporations with the most capable security teams have failed to prevent data leaks. Until this is properly addressed, consumers cannot trust enterprises to keep their information safe and will continue to see the private information they share with enterprises dangerously exposed, stolen, and shared.
Mounting public concern over how enterprises interact with our personal information has led to the development of laws and regulations, such as GDPR and CCPA, that define consumer privacy rights. Personal data now officially requires special handling and care to meet today’s compliance standards and consumer expectations. This post will discuss one of the most effective techniques for reducing the risk of compromised consumer data.
Personal data de-identification is the process of removing identifiers from a data set so that its records can no longer be linked to specific individuals. When de-identification is applied in a way that makes it impossible to link the data back to individuals, that is, to re-identify it, the data is considered anonymized.
Today’s business analysts and data scientists require access to data. However, given gaps in security awareness and problematic infrastructure, direct access introduces unacceptable risk, especially if the data in use is subsequently copied outside of controlled systems. Applying the best practice of anonymization, or pseudonymization at the very least, dramatically reduces these risks.
Anonymized data sets are outside the scope of privacy regulations like GDPR. However, full anonymization can disrupt many legitimate data uses that require identifiers (e.g., your bank probably needs to know who you are as a user). GDPR proposes pseudonymization as a practical alternative for reducing the risk of data exposure, relieving compliance obligations, and maintaining optimal data utilization:
‘pseudonymisation’ means the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and is subject to technical and organisational measures to ensure that the personal data are not attributed to an identified or identifiable natural person (GDPR, Article 4(5))
Pseudonymization is the process of replacing all sensitive identifiers in a specific data set (e.g., a table in a database or a CSV file) with pseudonyms/aliases/tokens. The original identifiers have to be kept securely elsewhere. That’s it. Unfortunately, it’s much more complicated to execute: building systems that work with pseudonymized data requires a great deal of design work from the get-go. This is incredibly difficult to do, so much so that GDPR can only recommend it, since requiring it would be unenforceable.
Tokenization is the process of substituting a sensitive data element with a non-sensitive equivalent, referred to as a token, that has no exploitable meaning or value. The token is a reference that maps back to the sensitive data through a tokenization system.
Pseudonymization is typically implemented by using a tokenization technique (see below). Note that, like anonymization, pseudonymized data cannot be linked to a person's specific identity on its own. However, unlike anonymization, it is possible to re-identify pseudonymized data using the additional piece of information kept secure elsewhere. Effectively, when systems require the original plaintext identifiers, they can still translate the pseudonyms back.
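To make this concrete, here is a minimal sketch of table-based pseudonymization in Python (not the article's implementation); the in-memory dict and the sample record are illustrative stand-ins for a separately secured translation store and real data:

```python
import secrets

translation_table = {}  # token -> original value; in practice this lives in a separate, secured store

def tokenize(value: str) -> str:
    """Replace a sensitive value with a random token and remember the mapping."""
    token = secrets.token_hex(8)
    translation_table[token] = value
    return token

def detokenize(token: str) -> str:
    """Translate a token back to the original value; requires access to the table."""
    return translation_table[token]

# Hypothetical sample record, for illustration only.
record = {"email": "jane@example.com", "ssn": "123-45-6789"}
pseudonymized = {field: tokenize(value) for field, value in record.items()}
print(pseudonymized)                     # e.g. {'email': '3f9c...', 'ssn': 'a1b2...'}
print(detokenize(pseudonymized["ssn"]))  # back to '123-45-6789'
```

As long as the translation table is stored and governed separately, the pseudonymized record on its own no longer identifies anyone.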
Example of an original table (with PII: email addresses and SSNs):
The SSN column is now tokenized:
Both emails and SSNs are tokenized, and the table is now pseudonymized using PII tokenization.
* Email addresses are tokenized using a format-preserving tokenization mechanism (see the sketch below).
* This table is now either anonymized or pseudonymized depending on whether it's possible to restore its original email or SSN values.
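As an illustration of the format-preserving mechanism mentioned above, here is a hedged sketch; the `tokenize_email` helper, its mapping dict, and the decision to keep the original domain are assumptions made for demonstration only:

```python
import secrets
import string

email_tokens = {}  # token -> original email; keep separate from the pseudonymized data set

def tokenize_email(email: str) -> str:
    """Replace the local part with random letters while keeping the email's shape."""
    local, _, domain = email.partition("@")
    fake_local = "".join(
        secrets.choice(string.ascii_lowercase) if ch.isalnum() else ch
        for ch in local
    )
    token = f"{fake_local}@{domain}"  # the domain is kept here only for illustration
    email_tokens[token] = email
    return token

print(tokenize_email("jane.doe@example.com"))  # e.g. "qwzr.ptk@example.com"
```

Because the token still looks like a valid email address, downstream systems that validate or display email formats keep working on the pseudonymized data.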
This is the 1:1 translation (token) table matching tokens to the original values:
* Authorizing, auditing, and monitoring access to this table are critical for preserving the anonymity of the users in the original table.
Tokenization is the process of substituting a single piece of sensitive information with non-sensitive information. The non-sensitive substitute is called a token. It can be created using cryptography, a hash function, or a randomly generated index identifier, and it is used to redact the original sensitive information. For example, tokenizing sensitive information like PII or credit card numbers is important when writing them to a log file.
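For instance, a deterministic token can stand in for a card number in log output. The sketch below assumes an HMAC with a secret key (one of the cryptographic options mentioned above); the key, the helper name, and the sample card number are hypothetical:

```python
import hashlib
import hmac

# Hypothetical secret; in practice, load it from a secret manager, never hard-code it.
SECRET_KEY = b"replace-me-with-a-managed-secret"

def log_token(value: str) -> str:
    """Derive a deterministic token that cannot be reversed from the log alone."""
    digest = hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()
    return f"tok_{digest[:12]}"

card_number = "4111 1111 1111 1111"  # sample value for illustration
print(f"charge approved for card {log_token(card_number)}")
# e.g. "charge approved for card tok_1c8f0b2a94d7"
```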
The operation of detokenization, or translating a token back to its corresponding data, should only be done on a need-to-know basis (using the right permissions). To support various use cases, tokenization engines that are used for pseudonymization must account for the following possible requirements:
Tokenization engines for pseudonymization can be implemented in two main ways: based on a translation table or based on encryption. Table-based tokenization maintains a mapping between identifiers and randomly generated tokens in a table that is stored in a centralized location. Encryption-based tokenization uses a cryptographic algorithm and a corresponding key to translate identifiers to tokens and vice versa. Both the table and the key must be secured and stored separately from the original database.
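As a sketch of the encryption-based approach, the snippet below uses the `cryptography` package's Fernet construction; that library choice, and the sample SSN, are assumptions rather than a prescribed implementation:

```python
from cryptography.fernet import Fernet

key = Fernet.generate_key()  # in practice, fetch the key from a KMS/HSM, not from code
engine = Fernet(key)

def tokenize(value: str) -> str:
    """Encrypt an identifier into an opaque token."""
    return engine.encrypt(value.encode()).decode()

def detokenize(token: str) -> str:
    """Decrypt a token back to the original identifier; requires the key."""
    return engine.decrypt(token.encode()).decode()

token = tokenize("123-45-6789")  # sample SSN for illustration
print(token)                     # opaque token, different on every run
print(detokenize(token))         # "123-45-6789"
```

Note that Fernet encryption is randomized, so the same value produces a different token on each call; this illustrates why the encryption-based approach tends to trade away searchability, as discussed below.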
The following table summarizes important considerations for both approaches:
Sometimes it’s hard to tell which model is better for you; in such cases, the use case at hand is the best guide. Typically, a translation table is the better choice when you need more searchability over the data. If you need better performance and stronger security but less searchability, then encryption may be best. And if you know the architecture and how it uses the data, you can roll your own solution to tune these characteristics.
CTO & Co-founder
Ariel, despite holding a PhD in Computer Science, doesn't strictly conform to the traditional academic archetype. His heart lies in the realm of hacking, a passion he has nurtured since his early years. As a proficient problem solver, Ariel brings unmatched practicality and resourcefulness to every mission he undertakes.
Key management challenges for encryption-based tokenization, and ways to address them:
* Challenge: increased complexity as the number of keys and systems grows. Mitigation: adopt a centralized key management solution, such as a Hardware Security Module (HSM) or a cloud-based KMS, to securely manage and control cryptographic keys at scale.
* Challenge: ensuring secure and timely key distribution and synchronization at scale. Mitigation: automate key rotation processes to maintain synchronization, reduce human intervention, and minimize errors as the system grows.