gregwebs
Great to see such a project. We are using datanymizer [1] right now but it has gone unmaintained and we are using my patched version [2] and it is working pretty well for us. I saw a new project that is getting close in terms of having the feature set I need and has them on their roadmap [3].

To ensure that we are marking columns as PII, we run a job that compares the anonymization configuration against column comments: every column has a comment marking it as PII (or not).
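The core of that job can be sketched as a set comparison. This is a hypothetical illustration (the names pii_comments and anonymizer_rules are made up, not from datanymizer); in practice the comments would be read from the database, e.g. via Postgres's col_description():

```python
# Hypothetical sketch of a config-vs-comment audit. Inputs are plain dicts;
# a real job would pull the comments from the database catalog.

def find_unmarked_columns(pii_comments, anonymizer_rules):
    """Return (columns commented as PII but lacking an anonymizer rule,
    columns with a rule but not commented as PII)."""
    marked = {col for col, comment in pii_comments.items() if "PII" in comment}
    ruled = set(anonymizer_rules)
    return sorted(marked - ruled), sorted(ruled - marked)

missing_rules, missing_marks = find_unmarked_columns(
    {"users.email": "PII: email address", "users.id": "surrogate key"},
    {"users.email": "fake_email"},
)
# both lists empty here: config and comments agree
```

Failing the CI job whenever either list is non-empty keeps the two sources of truth from drifting apart.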

[1] https://github.com/datanymizer/datanymizer

[2] https://github.com/digitalmint/datanymizer/tree/digitalmint

[3] https://github.com/GreenmaskIO/greenmask

Other tools I found that do some anonymization but didn't meet my needs:

  * https://github.com/DivanteLtd/anonymizer
  * https://postgresql-anonymizer.readthedocs.io/en/stable
  * https://nitzano.github.io/dbzar/
  * https://github.com/Qovery/Replibyte
blopker
I don't know exactly how this works, but I wanted to share my experience trying to anonymize data. Don't.

While you may be able to change or delete obvious PII, like names, every bit of real data in aggregate leads to revealing someone's identity. They are male? That's half the population. They also live in Seattle, are Hispanic, age 18-25? Down to a few hundred thousand. They use Firefox? That might be like 10 people.

This is why browser fingerprinting is so effective. It's how ad targeting works.

Just stick with fuzzing random data during development. Many web frameworks already have libraries for doing this. Django for example has factory_boy[0]. You just tell it what model to use, and the factory class will generate data based on your schema. You'll catch more issues this way anyway because computers are better at making nonsensical data.
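The idea can be shown with nothing but the standard library (factory_boy does this per-model with far less ceremony; the field names below are purely illustrative):

```python
import random
import string

def make_fake_user(rng: random.Random) -> dict:
    """Generate deliberately nonsensical values to shake out validation bugs."""
    chars = string.ascii_letters + string.digits + string.punctuation
    return {
        "name": "".join(rng.choices(chars, k=rng.randint(1, 64))),
        "age": rng.randint(-1, 200),  # intentionally allows out-of-range values
        "city": rng.choice(["", "Seattle", "Ümläut", "x" * 255]),
    }

rng = random.Random(42)  # seeded so failures are reproducible
users = [make_fake_user(rng) for _ in range(3)]
```

Seeding the generator means a failing test case can be replayed exactly, which is something anonymized production snapshots can't give you.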

Keep production data in production.

[0]: https://factoryboy.readthedocs.io/en/stable/orms.html

imiric
Congrats on the launch!

This topic is relevant to what I'm currently working on, and I'm finding it exhausting to be honest. After considering several options for anonymizing both Postgres and ClickHouse data, I've been evaluating clickhouse-obfuscator[1] for a few weeks now. The idea in principle is great: ClickHouse can export both its own data and Postgres data (via its named collections feature) into Parquet format (among other formats, but we settled on Parquet), and clickhouse-obfuscator can then anonymize that data and write it back out as Parquet, ready to be imported wherever it's needed.

The problem I'm running into is referential integrity, as importing the anonymized data is raising unique and foreign key violations. The obfuscator tool is pretty minimal and has few knobs to tweak its output, so it's difficult to work around this, and I'm considering other options at this point.
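For what it's worth, one common workaround (a sketch under assumptions about the schema, not a feature of clickhouse-obfuscator) is deterministic keyed hashing: the same source value always maps to the same token, so unique keys stay unique and foreign keys still join:

```python
import hmac
import hashlib

SECRET = b"rotate-me"  # illustrative key; keep it out of source control

def pseudonymize(value: str) -> str:
    """Keyed, deterministic digest: identical inputs produce identical
    tokens across every table, preserving joins and uniqueness."""
    return hmac.new(SECRET, value.encode(), hashlib.sha256).hexdigest()[:16]

# users.id and orders.user_id anonymize to the same token, so the FK holds:
assert pseudonymize("user-123") == pseudonymize("user-123")
```

The trade-off is that determinism preserves linkability by design, so it's pseudonymization rather than anonymization, and the key must be protected accordingly.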

Your tool looks interesting, and it seems that you directly address the referential integrity issue, which is great.

I have a couple of questions:

1. Does Neosync ensure that anonymized data has the same statistical significance (distribution, cardinality, etc.) as the source data? This is something that clickhouse-obfuscator put quite a lot of effort into addressing, as you can see from their README. Generating synthetic data doesn't solve this, and some anonymization tools aren't this sophisticated either.

2. How does it differ from existing PG anonymization solutions, such as PostgreSQL Anonymizer[2]? Obviously you also handle MySQL, but I'm interested in PG specifically.

As a side note, I'm not sure I understand what the value proposition of your Cloud service would be. If the original data needs to be exported and sent to your Cloud for anonymization, it defeats the entire purpose of this process, and only adds more risk. I don't think that most companies looking for a solution like this would choose to rely on an external service. Thanks for releasing it as open source, but I can't say that I trust your business model to sustain a company around this product.

[1]: https://github.com/ClickHouse/ClickHouse/blob/master/program...

[2]: https://postgresql-anonymizer.readthedocs.io/en/stable/

mathisd
During an internship, I was part of a team that developed a collection of tools [0] for pseudonymizing production databases for testing and development purposes. These tools were developed while being used in parallel with clients that had a large number of databases.

Referential constraints refer to ensuring some coherence / basic logic in the output data (i.e. the anonymized street name must exist in the anonymized city). Handling them was the most time-consuming phase of the pseudonymization process. The team was working on introducing pseudonymization with cross-referential constraints, which is a mess, as constraints were often strongly intertwined. Also, a lot of the time clients had no proper idea of what the fields were or what they truly contained (what format of phone number, for instance; we found a lot of unusual things).

[0] (LINO, PIMO, SIGO, etc.) https://github.com/CGI-FR/PIMO

enahs-sf
I love that it's open-source. Great project and very applicable across a lot of industries, especially those deeply affected by compliance.
pitah1
Thanks for sharing. Happy to see another solution that doesn't just slap on AI/ML to try to solve it.

I am also among the many people who have created a solution similar[0] to this :). The approach I took, though, is metadata-driven (given that most anonymisation solutions cannot guarantee sensitive data won't leak, and also open up network access from prod to test envs, security teams did not accept them whilst I was working at a bank). It also offers the option to validate against the generated data (i.e. check whether your service or job has consumed the data correctly) and the ability to clean up the generated or consumed data.

Being metadata-driven opened up the possibility of linking to existing metadata services like data catalogs (OpenMetadata, Amundsen), data quality (Great Expectations, Soda), specification files (OpenAPI/Swagger), etc., which are often underutilized.

The other part that I found whilst building and getting feedback from customers, was having referential integrity across data sources. For example, account create events coming through Kafka, consumed and stored in Postgres whilst, at the end of the day, a CSV file of the same accounts would also be consumed by a job.

I'm wondering if you have come across similar thoughts or feedback from your users?

[0]: https://github.com/data-catering/data-caterer

kjuulh
I just published our approach to pseudo anonymization and sort of anonymization.

We built a tool which can traverse data, extract the PII, and put a token back into the data. Before one of our allowed systems reads the data, we swap in the actual data, or an anonymized version if we no longer have permission to use it. So we sort of get the best of both worlds: we can use our customers' actual data where we require it, but can safely use data for analytics while retaining a lot of the statistical variance of our data.
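The swap-in/swap-out idea above can be sketched roughly like this; the vault and the permission flag are illustrative stand-ins for the real services described in the article, not their actual API:

```python
import secrets

class TokenVault:
    """Toy token vault: stores real values, hands out opaque tokens."""

    def __init__(self):
        self._by_token = {}

    def tokenize(self, value: str) -> str:
        token = "tok_" + secrets.token_hex(8)
        self._by_token[token] = value
        return token

    def resolve(self, token: str, allowed: bool) -> str:
        """Swap the real value back in for permitted readers,
        an anonymized placeholder for everyone else."""
        if allowed:
            return self._by_token[token]
        return "<redacted>"

vault = TokenVault()
t = vault.tokenize("jane@example.com")
```

Because the stored record only ever contains the token, revoking permission is a policy change at read time rather than a rewrite of the data at rest.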

Crazy complex project to work on given our limited resources but very fulfilling in the end.

It should be mentioned that I don't cover the difference between anonymization and pseudo anonymization in the article, mostly because I didn't know it was really a thing. I just implemented a solution given our requirements.

https://tech.lunar.app/blog/data-anonymization-at-scale

aj__chan
Amazing open source project! I can see pretty broad application to basically every application developer's stack as they're building out their tools, but also to working with real-world production data in developer environments without breaking compliance. Great work, Evis & Nick!
chairmanwow1
Interesting, but why does it matter that I actually keep the same statistical distributions of data in development as in production? What are the use cases for that kind of feature?
ngcazz
Hey, this looks quite cool! Just spotted this link on your site's frontpage is 404ing https://www.neosync.dev/solutions/keep-environments-in-sync (was quite keen to read this one specifically)