De-Identifying Data for Confidentiality |
Post Reply |
Author | ||||||||||||||||||||||||||||||||||||||||||||||||
Roger Carlson
Microsoft MVP Joined: 20 Feb 2008 Status: Offline Points: 261 |
Post Options
Thanks(0)
Posted: 02 Apr 2018 at 7:41am |
|||||||||||||||||||||||||||||||||||||||||||||||
This sample illustrates two different ways to de-identify data.
There
are a number of circumstances when you as a developer might need to randomize
data within a database. You may need to
create sample data to test or demonstrate a database application. Or you may need to mask or
"de-identify" confidential data to comply with regulatory agency
rules. This article demonstrates two
methods that can be used, either separately or in conjunction, to scramble your
database.
Creating test data for a database application is always a difficult task. You need to create enough records to adequately test database performance. A design that works well with a hundred records, might not work with ten thousand. You also need to create data varied enough to mimic real world situations, data that will test the limits of the business rules. Developers have a tendency to create data that will work with their application, not data that will break it.
The very best source for creating test data is actual customer data. Nothing mimics real-world data like the real-world data. Unfortunately, your customers may not appreciate their data being used in this way. Worse yet, there may be regulatory considerations. For instance, the health care industry has to comply with HIPAA (Health Insurance Portability and Accountability Act) regulations, which has very strict compliance rules.
Randomizing Records
Let's start by considering the simplest case. You simply want to mask customer data without the necessity of complying with regulations. It is probably sufficient to simply randomize each field that could identify a record to effectively mask the data.
Tables 1 and 2 show the before and after of a small Customer table that has been randomized.
Table 1: Original Customer table
Table 2: The Customer table after randomization
Of course, with only three records, the randomization is not very random, but the larger the result set, the better the randomization will be.
De-identifying Field Data
In certain instances, simply randomizing records is not sufficient. Certain field values have inherent information regardless of the record that holds them. Phone numbers, social security number, patient numbers, and the like all contain information all by themselves that may breach confidentiality. HIPAA regulations in particular are very strict about displaying any data that could identify a patient.
So to further mask your data, it may be necessary to scramble character data with in the field itself. Now, it doesn't make sense to randomize some types of fields. Name and address fields in particular will look strange if randomized in this way.
Not only does it look strange, the capitalization of the names makes it fairly easy to re-identify the field values. So this function needs to be used with some discretion. It is best used with character data that is composed of numbers.
A further complication is that sometimes character data uses an input mask that displays characters that may either be stored or not. For instance, when you see a social security number in a field 111-22-3333, you don't really know if the number was stored as displayed or stored as 111223333. If the dashes are stored, the number might be randomized as 2-33311-13. So any function that randomizes character data must take these complications into account.
|
||||||||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||||||||
Sponsored Links | Want a good read? Try The Summer of His Life (available on Amazon!) |
|||||||||||||||||||||||||||||||||||||||||||||||
Post Reply | |
Tweet
|
Forum Jump | Forum Permissions You cannot post new topics in this forum You cannot reply to topics in this forum You cannot delete your posts in this forum You cannot edit your posts in this forum You cannot create polls in this forum You cannot vote in polls in this forum |