Generating Synthetic Data with LLMs
A guide to using AI to generate synthetic data
May 21st, 2024
Postgres is one of the most popular databases in the world and it comes with a pretty extensive library of extensions and open source tools that the Postgres team and outside developers have created to make working with Postgres easier and safer. One of the more popular plugins is Postgres Anonymizer or usually called PGAnonymizer. PGAnonymizer is used to anonymize your Postgres database. This is useful if you're working with sensitive data and want a way to protect the security and privacy of that data as developers are working with it.
In this blog, we're going to dive deeper into PGAnonymizer, what it is, how it works and how you can install it and use it in your Postgres database.
PGAnonymizer is a open source Python-based extension built to anonymize data within Postgres. It allows you to obfuscate sensitive information such as names, addresses, emails, and other personally identifiable details (PII). You can then use this data for testing and development purposes without compromising the security of real-world information.
It's configured declaratively (more on this below) using the Postgres Data Definition Language (DDL). This makes it easy to configure and follows git-ops best practices. Once the masking rules are defined, you can access the anonymized data in three ways:
PGRESTORE
.Lastly, PGAnonymizer does come with some detection functions that will try to guess which columns need to be anonymized. This is a form of trying to understand which columns seem to hold sensitive data based on the name of the column.
There are a number of use cases for data anonymization and creating test data. Here are some of the main ones:
PGAnonymizer has a series of anonymization functions to transform sensitive data into anonymized data. You can implement one or multiple of these functions depending on how you want to anonymize your data. We can categorize the functions into a few buckets:
john
-> kwpz
john@gmail.com
-> wefw@gmail.com
john
-> bill
01/23/1990
-> 01/01/1990
$150,000
-> $100,000 - $500,000
The best way to understand how these functions work is by looking at a few examples. First let's install PGAnonymizer and then see how we can use it.
Depending on your set up, you can install PGAnonymizer in a few different ways. In this example, we'll walk through how to do it locally but if you're using something like docker, you can follow this guide.
Let's get started. First, we'll clone the repo down so that we can install install.
git clone https://gitlab.com/dalibo/postgresql_anonymizer.git
Next build it like any other Postgres extension.
make extension
sudo make install
Now we can load the extension using another bash script:
ALTER DATABASE foo SET session_preload_libraries = 'anon';
Close your existing session and then open a new one and then run:
CREATE EXTENSION anon CASCADE;
Lastly, initialize it using:
SELECT anon.init();
That's it! Easy enough to get started. Let's dig into some of the anonymization functions that PGAnonymizer has.
Partial scrambling is a masking function that leaves out some part of the data. For instance : a credit card number can be replaced by '40XX XXXX XXXX XX96'. There are two function available for partial scrambling:
By passing in the email function, PGAnonymizer will recognize that it's an email address and partially scramble the username and domain which are the identifying elements of the email address.
Static masking rules are ways of declaring rules that will mask an entire database. For each column in a table, you declare a rule that defines how you want to transform that data.
Say that we have a full_name
column that we want to anonymize. Here is what a rule could look like:
SECURITY LABEL FOR anon ON COLUMN customer.full_name
IS 'MASKED WITH FUNCTION anon.fake_first_name() || '' '' || anon.fake_last_name()';
Another example could be with an employer
column. Here is that rule:
SECURITY LABEL FOR anon ON COLUMN customer.employer
IS 'MASKED WITH FUNCTION anon.fake_company()';
Once you're ready to anonymize your table, you can run SELECT anon.anonymize_database();
or if you want to just anonymize a specific table, you can run SELECT anon.anonymize_table('customer');
This is just the start of what you can do with PGAnonymizer. There are more functions and ways to anonymize, encrypt, hash and obfuscate data. For more information, check out the documentation.
PGAnonymizer is pretty flexible but there are some limitations that you should be aware of. Here are the biggest ones:
PGanonymizer can be a great option depending on the use-case, but for those who need more power and flexibility, Neosync takes PGAnonymizer to the next level by addressing one many of the weaknesses of PGAnonymizer. Namely, referentially integrity and orchestration. Additionally, Neosync has a GUI and doesn't require the user to define everything in SQL which makes for a better developer experience.
PGAnonymizer is a powerful Postgres extension that allows you to anonymize sensitive data making it usable for development and testing. It has a lot of flexibility but also has some limitations to consider. At the end of the day, the tool that fits your use case is the right tool for you and you should consider all available options to protect customer data and privacy.
A guide to using AI to generate synthetic data
May 21st, 2024
The best way to protect sensitive data in LLMS - synthetic data and tokenization?
April 23rd, 2024