Everything You Need To Know About Data Masking Software (FAQs)
What Is Data Masking Software?
Most organizations today must comply with strict data privacy and protection regulations, which often require them to prove that they’re taking steps to secure any sensitive data they handle—including customer data such as personally identifiable information (PII), protected health information (PHI), and financial information. Data masking software, also known as “data obfuscation” or “data sanitization” software, can help achieve compliance with data privacy regulations by hiding the original values (letters and numbers) of sensitive data and replacing them with realistic, structurally similar, but fictitious counterparts.
There are numerous techniques for data masking, which you can read more about below. Once the data has been masked, only someone with the original dataset can restore it to its original values. This keeps the original data safe from unauthorized viewing and exfiltration, while maintaining most of its functional properties so it can still be used in situations where the real values aren’t needed.
Because of this, achieving compliance isn’t the only use case for data masking—it can also be applied in user training and sales demos, and is particularly useful in the world of software development. Software developers need to use real-world data for testing purposes. However, they need to do so without compromising security. Data masking enables developers to build and test their products effectively using realistic, but non-sensitive data—eliminating exposure of production data and allowing them to share and innovate freely.
How Does Data Masking Software Work?
There are a few different types of data masking:
Static data masking enables you to create a realistic, fictitious copy of an entire database. Usually, static data masking software creates a backup copy of the database, loads it to a separate masking environment, then removes any traces such as logs or changes. It then masks the data while it’s static in the masking environment. The masked dataset can be used to generate test and analytical results that mirror those of the original dataset. Because of this, static data masking is commonly used to create “sanitized” versions of production databases, which can then be used in non-production environments such as development and testing.
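The process above can be sketched in a few lines. This is a hypothetical illustration using SQLite: the table, columns, and masking rule are invented for the example, and a real static masking tool would handle many tables, data types, and log cleanup.

```python
import sqlite3

# Create a stand-in "production" database in memory.
prod = sqlite3.connect(":memory:")
prod.execute("CREATE TABLE customers (id INTEGER, name TEXT, ssn TEXT)")
prod.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Alice", "123-45-6789"), (2, "Bob", "987-65-4321")],
)

# Step 1: back the database up into a separate "masking environment".
masked = sqlite3.connect(":memory:")
prod.backup(masked)

# Step 2: mask the sensitive column in the copy while it sits at rest;
# the production database is never touched.
masked.execute("UPDATE customers SET ssn = 'XXX-XX-' || substr(ssn, -4)")
masked.commit()

rows = masked.execute("SELECT ssn FROM customers ORDER BY id").fetchall()
print(rows)  # the sanitized copy keeps only the last four digits
```

The masked copy can now be handed to development or testing environments without exposing the original values.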
On-the-fly data masking enables users to read and mask a small subset of data when required. It masks data while it’s being transferred from production environments to development or testing environments, before the data is saved. This means that the data is never present in its unmasked format in the dev/testing environment or the transaction log of that environment. On-the-fly data masking is commonly used in continuous software development environments, where developers need to be able to stream data continuously from production to test environments—without backing up the entire source database and masking it each time, as is done with static data masking.
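A minimal sketch of this idea: mask each record as it streams across, so an unmasked copy never lands on the test side. The record shape and the `mask_email` rule are assumptions made for illustration.

```python
def mask_email(email: str) -> str:
    """Replace the local part of an email address, keeping the domain."""
    _, _, domain = email.partition("@")
    return "user@" + domain

def stream_masked(records):
    """Yield masked copies one at a time; unmasked values are never stored."""
    for record in records:
        yield {**record, "email": mask_email(record["email"])}

production = [{"id": 1, "email": "alice@example.com"}]
test_env = list(stream_masked(production))
print(test_env)  # [{'id': 1, 'email': 'user@example.com'}]
```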
Dynamic data masking streams data from the production environment directly to a system in the dev/testing environment in response to a request, without saving it in a secondary database. As with on-the-fly masking, it masks data in real-time while the data is in transit.
Deterministic data masking uses two sets of data, then replaces the values from one dataset directly with a corresponding value from the other dataset wherever it appears. For example, you could use deterministic data masking to replace the name “Robert” with “Jack”, and that change would be made wherever “Robert” had previously appeared in the dataset.
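One common way to implement this (a hedged sketch, not any particular vendor’s method) is to derive the replacement from a hash of the original value, so the same input always maps to the same substitute and relationships across tables are preserved. The replacement list here is a made-up example.

```python
import hashlib

# Illustrative substitute pool; a real tool would draw from large name lists.
REPLACEMENTS = ["Jack", "Maria", "Wei", "Priya"]

def deterministic_mask(name: str) -> str:
    # Hashing makes the choice stable across runs and across datasets,
    # so "Robert" becomes the same alias everywhere it appears.
    digest = hashlib.sha256(name.encode()).hexdigest()
    return REPLACEMENTS[int(digest, 16) % len(REPLACEMENTS)]

alias = deterministic_mask("Robert")
print(alias == deterministic_mask("Robert"))  # True: the mapping is consistent
```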
Synthetic data generation is not a data masking technique, but some data masking software solutions still offer it, so it’s worth mentioning here. Instead of replacing the values of a dataset with fictitious ones, it generates a completely separate, synthetic dataset that captures and reflects the relationships and distributions within the original dataset. This enables the synthetic dataset to function just as the original dataset would, making synthetic data generation useful for application development.
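As a toy illustration of the idea, the sketch below generates a synthetic salary column that mirrors the mean and spread of a small original sample. It assumes a roughly normal distribution, which is a simplification; real synthetic data tools model relationships between columns, not just single distributions.

```python
import random
import statistics

# Made-up original values for the example.
original_salaries = [52_000, 61_500, 48_250, 75_000, 58_900]
mu = statistics.mean(original_salaries)
sigma = statistics.stdev(original_salaries)

# Generate a fully synthetic dataset with similar statistical properties.
rng = random.Random(42)  # seeded so the output is reproducible
synthetic = [round(rng.gauss(mu, sigma), 2) for _ in range(1000)]
```

No value in `synthetic` is taken from the original dataset, yet aggregate analysis on it behaves much like analysis on the original.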
What Are The Different Data Masking Techniques?
Most data masking solutions offer a variety of different masking techniques, i.e., ways in which they can make your original data unreadable. Here are some of the most common techniques used for data masking:
- Encryption uses a mathematical algorithm to turn the data into a seemingly random collection of characters (“ciphertext”) that’s completely illegible. The data can only be read by someone with the correct decryption key. Encryption is one of the most secure forms of masking, but it requires technology for continuous encryption and for encryption key management. It’s best applied when you plan on restoring the data to its original values later.
- Scrambling re-orders the alphanumeric characters in your dataset in a completely random order. For example, the phone number 3332221234 in a production environment could be replaced with 1223432332 in a dev/testing environment. Scrambling is an easy method of data masking, but it only works on some types of data and is also less secure than most other methods.
- Nulling out replaces data values with a “null” value, which causes the data to appear missing when viewed by an unauthorized person. Nulling out is easy to implement, but it makes the data less useful for development and testing as the nullified data cannot be used in queries or analysis.
- Value variance applies a variance to each value in the original dataset, which modifies that value based on the variance allowed. For example, if you wanted to mask salary information, you could add a variance of 5%, and the original values would be replaced with new ones within 5% of the original. You could also add a variance that enables new values to sit anywhere between the lowest and highest values in the original dataset. Value variance is good for providing useful, realistic datasets.
- Substitution swaps out values for fictitious but realistic alternatives, often using a lookup table. For example, you could swap out a list of names with a different list of names, or a list of phone numbers with a different list of numbers that all meet the criteria needed to be a phone number (e.g., correct length and format).
- Shuffling is similar to substitution, except that it swaps values out randomly for other values within the same dataset. The result is a dataset whose values have been re-ordered within each column, so the dataset looks accurate but no longer reveals the real associations between records. For example, “Bob Smith” and “Jack Jones” could be shuffled to “Bob Jones” and “Jack Smith”.
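The techniques above can each be sketched in a few lines. These are toy, self-contained illustrations applied to made-up values, not production-grade implementations; encryption is omitted because it should rely on a vetted cryptography library rather than a hand-rolled example.

```python
import random

rng = random.Random(0)  # seeded so the results are repeatable

def scramble(value: str) -> str:
    """Scrambling: re-order the characters at random."""
    chars = list(value)
    rng.shuffle(chars)
    return "".join(chars)

def null_out(value):
    """Nulling out: replace the value entirely."""
    return None

def vary(value: float, pct: float = 0.05) -> float:
    """Value variance: move the value by up to +/- pct of itself."""
    return round(value * (1 + rng.uniform(-pct, pct)), 2)

def substitute(value: str, lookup: dict) -> str:
    """Substitution: swap the value via a lookup table."""
    return lookup.get(value, value)

def shuffle_column(values: list) -> list:
    """Shuffling: re-order values within the same column."""
    shuffled = values[:]
    rng.shuffle(shuffled)
    return shuffled

phone = scramble("3332221234")                    # same digits, random order
salary = vary(60_000.0)                           # within 5% of 60,000
name = substitute("Robert", {"Robert": "Jack"})   # lookup-table swap
surnames = shuffle_column(["Smith", "Jones"])     # re-ordered within the column
```

Note how each function preserves the format of its input (digits stay digits, names stay names), which is what keeps masked data usable in non-production environments.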
What Features Should You Look For In Data Masking Software?
Data masking solutions offer a variety of different features and capabilities to meet specific use cases, so it’s important that you identify your most critical needs—such as the type of data you need to mask, how frequently you need to mask data, and how secure you need that data to be—before you start comparing solutions. That being said, there are a few key features that you should look for in any strong data masking software:
- While the data produced by data masking software will be fictitious, it still needs to be realistic, so that you can use it for non-production use cases. This means that the masked data needs to be the same format and structure as the original data; you can’t swap out a list of names with a list of numbers because that wouldn’t be functional.
- Your chosen solution needs to be compatible with all the different data sources and types that your organization uses.
- The masking process should be automatic and fast—particularly if you need to mask data continuously to reflect changes in the original dataset.
- If you’re working with particularly large datasets, you may want your masking solution to help you identify and classify sensitive information within your dataset that needs to be masked, such as names, contact details, and financial information.
- If you’re operating in a highly regulated industry and need to comply with strict data privacy regulations, you should look for policy-based data masking. This will enable you to tokenize and mask data in accordance with specific compliance requirements.