solardev
Web dev here, but not cybersec focused... if I'm wrong, someone will be along to correct me shortly :)

That said, I'm reasonably confident that what you want isn't doable/practical, unfortunately :(

While there are certainly companies that make valuable datasets available over the web, the usual way they prevent mass scraping is by enforcing account limits, making retrieval expensive and restricting each request to one tiny slice of the data at a time. The mass data harvesting/targeting industry is a good example: companies like Meta, Alphabet, or the political-data shops (NGP VAN, ActBlue, etc.) cross-reference a lot of PII floating around the internet, and/or harvest their own, then sell it to advertisers or political campaigns, but only a slice at a time, at prices they set. You can of course pay to scrape any one slice of it, but if you wanted the whole dataset, you'd probably end up paying more than the entire company is worth.
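
To make the "account limits" idea concrete, here's a rough sketch of what that looks like server-side. It's not any real company's implementation, and every name in it (the /records route, the quota numbers, fetchRows) is invented; it just shows the pattern of a per-key daily quota plus a hard cap on page size:

```typescript
// Hypothetical sketch: each API key gets a daily row quota, and no single
// request can return more than a small page, so the data only ever leaves
// the server one tiny slice at a time.
import express from "express";

const DAILY_ROW_LIMIT = 1_000; // assumed quota: 1k rows per key per day
const MAX_PAGE_SIZE = 50;      // assumed cap: 50 rows per request

const usage = new Map<string, { day: string; rows: number }>();
const app = express();

app.get("/records", (req, res) => {
  const apiKey = req.header("x-api-key");
  if (!apiKey) return res.status(401).json({ error: "API key required" });

  const today = new Date().toISOString().slice(0, 10);
  const entry = usage.get(apiKey) ?? { day: today, rows: 0 };
  if (entry.day !== today) { entry.day = today; entry.rows = 0; } // new day, reset quota

  const pageSize = Math.min(Number(req.query.limit) || MAX_PAGE_SIZE, MAX_PAGE_SIZE);
  if (entry.rows + pageSize > DAILY_ROW_LIMIT) {
    return res.status(429).json({ error: "daily quota exceeded" });
  }

  entry.rows += pageSize;
  usage.set(apiKey, entry);
  res.json({ rows: fetchRows(pageSize) }); // fetchRows stands in for the real query
});

// Placeholder for whatever actually reads the dataset.
function fetchRows(n: number): unknown[] {
  return Array.from({ length: n }, (_, i) => ({ id: i }));
}

app.listen(3000);
```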

That, or their data is inherently time-sensitive, such that older copies of it aren't as valuable. Stocks, real estate sites, news tickers, etc. come to mind, where sure, you can scrape their stuff, but unless you perform some sort of value-added collation/analysis on top of it, it's going to be stale by the time you serve it to your own users. The data originators are always one step ahead of you.

If your data isn't proprietary to begin with (i.e. you're not the one producing it and adding updates) AND you want it to be publicly accessible without an account... it's only a matter of time before some botnet or other scrapes all of it.

You can do things to slow down the scraping, such as putting Cloudflare in front of your site, but realistically, bots and labor are very cheap in much of the world, and if someone really wants your data, they'll get it. The scraping is essentially free to them, especially if you've done all the hard work of collecting the data and putting it all on a single website.
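
For what it's worth, the throttling a CDN or reverse proxy does for you boils down to something like this per-IP token bucket (the capacity and refill numbers here are invented):

```typescript
// Hypothetical sketch: a naive per-IP token bucket. Each IP can burst up to
// CAPACITY requests, then gets refilled at REFILL_PER_SEC. This slows down
// one scraper; it does nothing against a botnet spread across many IPs.
type Bucket = { tokens: number; last: number };

const buckets = new Map<string, Bucket>();
const CAPACITY = 10;      // assumed burst size
const REFILL_PER_SEC = 1; // assumed sustained rate

function allow(ip: string): boolean {
  const now = Date.now();
  const b = buckets.get(ip) ?? { tokens: CAPACITY, last: now };
  // Refill proportionally to the time elapsed since this IP's last request.
  b.tokens = Math.min(CAPACITY, b.tokens + ((now - b.last) / 1000) * REFILL_PER_SEC);
  b.last = now;
  const ok = b.tokens >= 1;
  if (ok) b.tokens -= 1;
  buckets.set(ip, b);
  return ok;
}

// e.g. at the top of a request handler:
// if (!allow(req.ip)) return res.sendStatus(429);
```

And the catch is right there in the key: it's per IP, so a thousand cheap residential proxies means a thousand fresh buckets.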

It will always take you more time to manually add filter permutations than it takes a script & botnet to enumerate them. They can just tweak parameters and send them through thousands of headless browsers running in dispersed instances across the world.
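
From the scraper's side, that enumeration is embarrassingly little code. Here's a hypothetical sketch using Playwright; the URL, query parameters, and selector are all invented, and a real operation would shard these same loops across thousands of proxies and machines:

```typescript
// Hypothetical sketch: walk every filter combination with a headless browser.
import { chromium } from "playwright";

const CATEGORIES = ["a", "b", "c"];   // assumed filters exposed by the site
const REGIONS = ["us", "eu", "apac"];

async function scrapeAll() {
  const browser = await chromium.launch({ headless: true });
  const page = await browser.newPage();
  for (const category of CATEGORIES) {
    for (const region of REGIONS) {
      // Tweak the query string, load the page, grab whatever renders.
      await page.goto(`https://example.com/data?category=${category}&region=${region}`);
      const rows = await page.locator("table tr").allTextContents();
      console.log(category, region, `${rows.length} rows`);
    }
  }
  await browser.close();
}

scrapeAll().catch(console.error);
```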

You can require account signup and verification before accessing the data, but that's also trivially faked unless you're requiring real payments.

Identifying real users vs. bots is anything BUT trivial. Google, Cloudflare, and hCaptcha have spent years trying to solve it with huge teams and world-class researchers, and even they only have limited success rates, especially since anybody can spend pennies to hire real humans to run through your captchas. And that problem is only going to get harder, much harder, with all the advancements in machine learning, natural language processing, and computer vision.

Sorry for the bad news =/ I hope I'm wrong, but I'm fairly confident you can't really accomplish this.