While you may be able to change or delete obvious PII, like names, every additional bit of real data narrows someone's identity down further. They're male? That's half the population. They also live in Seattle, are Hispanic, age 18-25? Down to a few hundred thousand. They use Firefox? That might be like 10 people.
This is why browser fingerprinting is so effective, and it's how ad targeting works.
Just stick with fuzzing random data during development. Many web frameworks already have libraries for doing this. Django for example has factory_boy[0]. You just tell it what model to use, and the factory class will generate data based on your schema. You'll catch more issues this way anyway because computers are better at making nonsensical data.
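For illustration, here's a dependency-free sketch of the idea. factory_boy does this declaratively against your models; this stand-in just shows the shape, and the field names are invented:

```python
import random
import string

# Minimal stand-in for a factory class: produces nonsense-but-schema-shaped
# rows, including edge-case values a human wouldn't bother typing.
def fake_user(rng: random.Random) -> dict:
    name = "".join(rng.choices(string.ascii_letters, k=rng.randint(1, 40)))
    return {
        "name": name,
        "email": f"{name.lower()}@example.com",
        "age": rng.randint(-1, 200),  # deliberately allows nonsensical ages
    }

rng = random.Random(42)  # seeded so test runs are reproducible
users = [fake_user(rng) for _ in range(100)]
```

A real factory_boy factory would declare these fields on a class tied to your Django model, so the data always tracks your schema.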
Keep production data in production.
This topic is relevant to what I'm currently working on, and I'm finding it exhausting, to be honest. After considering several options for anonymizing both Postgres and ClickHouse data, I've been evaluating clickhouse-obfuscator[1] for a few weeks now. The idea is great in principle: ClickHouse can export both its own data and Postgres data (via its named collections feature) to Parquet (or a bunch of other formats, but we settled on Parquet), then clickhouse-obfuscator anonymizes the data and writes it out as Parquet as well, which can be imported wherever it's needed.
The problem I'm running into is referential integrity: importing the anonymized data raises unique and foreign key violations. The obfuscator tool is pretty minimal, with few knobs to tweak its output, so it's difficult to work around this, and I'm considering other options at this point.
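For what it's worth, the shape of a fix I've been sketching is deterministic pseudonymization: run every key through a keyed hash so the same input always maps to the same output, which keeps foreign keys consistent across tables. A rough Python sketch (the secret and table shapes are made up, and note this is pseudonymization, not anonymization — a keyed hash can be brute-forced if the key leaks):

```python
import hmac
import hashlib

SECRET = b"rotate-me"  # illustrative; keep a real key out of source control

def pseudonymize(value: str) -> str:
    # Same input -> same output, so a user id rewritten in the users table
    # matches the same id rewritten in every referencing table.
    return hmac.new(SECRET, value.encode(), hashlib.sha256).hexdigest()[:16]

users = [{"id": "u1"}, {"id": "u2"}]
orders = [{"user_id": "u1"}, {"user_id": "u1"}, {"user_id": "u2"}]

anon_users = [{"id": pseudonymize(u["id"])} for u in users]
anon_orders = [{"user_id": pseudonymize(o["user_id"])} for o in orders]

# Foreign keys still resolve after the rewrite.
ids = {u["id"] for u in anon_users}
assert all(o["user_id"] in ids for o in anon_orders)
```

Determinism also avoids the unique-constraint violations: distinct inputs map to distinct outputs (up to negligible hash collisions), and duplicates stay duplicates.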
Your tool looks interesting, and it seems that you directly address the referential integrity issue, which is great.
I have a couple of questions:
1. Does Neosync ensure that anonymized data has the same statistical properties (distribution, cardinality, etc.) as the source data? This is something that clickhouse-obfuscator put quite a lot of effort into addressing, as you can see from their README. Generating synthetic data doesn't solve this, and some anonymization tools aren't this sophisticated either.
2. How does it differ from existing PG anonymization solutions, such as PostgreSQL Anonymizer[2]? Obviously you also handle MySQL, but I'm interested in PG specifically.
As a side note, I'm not sure I understand what the value proposition of your Cloud service would be. If the original data needs to be exported and sent to your Cloud for anonymization, it defeats the entire purpose of this process, and only adds more risk. I don't think that most companies looking for a solution like this would choose to rely on an external service. Thanks for releasing it as open source, but I can't say that I trust your business model to sustain a company around this product.
[1]: https://github.com/ClickHouse/ClickHouse/blob/master/program...
[2]: https://postgresql-anonymizer.readthedocs.io/en/stable/
Referential constraints refer to ensuring some coherence / basic logic in the output data (i.e. the anonymized street name must exist in the anonymized city). This was the most time-consuming phase of the pseudonymization process. They were working on introducing pseudonymization with cross-referential constraints, which is a mess, as the constraints were often strongly intertwined. Also, much of the time the client had no real idea what the fields were or what they truly contained (what format of phone number, for instance; we found a lot of unusual things).
[0] (LINO, PIMO, SIGO, etc.) https://github.com/CGI-FR/PIMO
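To make the street-must-exist-in-city constraint concrete, here's a toy illustration (not how PIMO actually implements it; the city/street data is invented):

```python
import hashlib
import random

# Illustrative lookup: for each fake city, streets that exist in it.
# Keeping the street consistent with the city is the referential constraint.
CITY_STREETS = {
    "Springfield": ["Evergreen Terrace", "Main St"],
    "Shelbyville": ["Elm St", "Oak Ave"],
}

def fake_city(real_city: str) -> str:
    # Deterministic: the same real city always maps to the same fake city,
    # so the mapping stays stable across rows and across runs.
    cities = sorted(CITY_STREETS)
    digest = hashlib.sha256(real_city.encode()).digest()
    return cities[digest[0] % len(cities)]

def fake_address(real_city: str, rng: random.Random) -> dict:
    city = fake_city(real_city)
    # The street is drawn from that city's own list, so the output
    # stays coherent: the street really does exist in the city.
    return {"city": city, "street": rng.choice(CITY_STREETS[city])}

addr = fake_address("Lyon", random.Random(0))
assert addr["street"] in CITY_STREETS[addr["city"]]
```

The mess with cross-referential constraints is that real schemas chain these lookups (street depends on city, city on region, postcode on both), and the dependencies rarely form a clean hierarchy.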
I am also among the many people who have created a similar solution[0] :). The approach I took, though, is metadata-driven (since most anonymisation solutions can't guarantee sensitive data won't leak, and they also open up network access from prod to test envs, security teams didn't accept them whilst I was working at a bank). It offers the option to validate based on the generated data (i.e. check whether your service or job has consumed the data correctly) and the ability to clean up the generated or consumed data.
Being metadata-driven opened up the possibility of linking to existing metadata services like data catalogs (OpenMetadata, Amundsen), data quality (Great Expectations, Soda), specification files (OpenAPI/Swagger), etc., which are often underutilized.
The other thing I found whilst building and getting feedback from customers was the need for referential integrity across data sources. For example, account-creation events come through Kafka and are consumed and stored in Postgres, whilst at the end of the day a CSV file of the same accounts is also consumed by a batch job.
I'm wondering if you have come across similar thoughts or feedback from your users?
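To make the cross-source case concrete, a minimal sketch (the sinks and field names are invented): generate the records once, then render that same set for each sink, so every consumer sees the same accounts.

```python
import csv
import io
import json

# Generate the accounts once, then render the same records for each sink,
# so the Kafka events, the Postgres rows, and the end-of-day CSV all agree.
accounts = [{"id": i, "name": f"acct-{i}"} for i in range(3)]

kafka_events = [json.dumps({"type": "account_created", **a}) for a in accounts]
pg_rows = [(a["id"], a["name"]) for a in accounts]  # e.g. executemany() params

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "name"])
writer.writeheader()
writer.writerows(accounts)
csv_file = buf.getvalue()

# Every sink refers to the same set of account ids.
ids = {a["id"] for a in accounts}
assert {json.loads(e)["id"] for e in kafka_events} == ids
assert {r[0] for r in pg_rows} == ids
```

The generation step is the single source of truth; each sink is just a serialization of it, which is what keeps the ids consistent.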
We built a tool which traverses data, extracts the PII, and puts a token back into the data in its place. Before one of our allowed systems reads the data, we swap the actual data back in, or an anonymized version if we no longer have permission to use it. So we sort of get the best of both worlds: we can use our customers' actual data where we require it, but can also safely use data for analytics while retaining a lot of its statistical variance.
Crazy complex project to work on given our limited resources but very fulfilling in the end.
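The core swap can be sketched in a few lines (this is an illustration of the pattern, not our actual implementation; the vault here is an in-memory dict standing in for a secured token store):

```python
import secrets

# Token -> real value; in practice this mapping lives in a secured store.
vault = {}

def tokenize(value: str) -> str:
    # Replace a PII value with an opaque token before the data is stored.
    token = f"tok_{secrets.token_hex(8)}"
    vault[token] = value
    return token

def detokenize(token: str, allowed: bool) -> str:
    # Allowed systems get the real value back on read;
    # everything else gets an anonymized stand-in.
    if allowed:
        return vault[token]
    return "REDACTED"

record = {"email": tokenize("alice@example.com")}
assert detokenize(record["email"], allowed=True) == "alice@example.com"
assert detokenize(record["email"], allowed=False) == "REDACTED"
```

The nice property is that revoking permission doesn't require rewriting stored data: the tokens stay put, and only the read path changes.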
It should be mentioned that I don't cover the difference between anonymization and pseudonymization in the article, mostly because I didn't know it was really a thing. I just implemented a solution given our requirements.
To ensure that we are marking columns as PII, we run a job that compares the anonymization configuration to a comment on the column: we keep a comment on every column marking it as PII (or not).
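A toy version of that check (the column names and the "PII"/"NOT_PII" comment convention here are invented; in Postgres the comments would come from `col_description()`):

```python
# Column comments as they might be read from the database.
column_comments = {
    "users.email": "PII",
    "users.name": "PII",
    "users.created_at": "NOT_PII",
}

# Columns the anonymization job is configured to scrub.
anonymization_config = {"users.email", "users.name"}

pii_columns = {c for c, tag in column_comments.items() if tag == "PII"}
missing = pii_columns - anonymization_config  # marked PII but not anonymized
stale = anonymization_config - pii_columns    # anonymized but not marked PII
assert not missing and not stale
```

Requiring a comment on every column means a new column fails the job until someone makes an explicit PII decision, rather than silently leaking.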
Other tools I found that do some anonymization but didn't meet my needs:
[1]: https://github.com/datanymizer/datanymizer
[2]: https://github.com/digitalmint/datanymizer/tree/digitalmint
[3]: https://github.com/GreenmaskIO/greenmask