My initial reaction is that I think one of the reasons you're seeing a hypertable being slower is almost certainly that it creates an index on the timestamp column by default. You don't have an index on your standard table which lets it go faster.
You can use create_hypertable with create_default_indexes=>false to skip creating the index, or you can just drop the index before you ingest data. You'll eventually want that index - but it's best created after ingestion in a one-shot load like this.
I'd also be interested in how the HDD you're reading data from is holding up in some of the highly parallel setups?
Most weather and climate datasets - including ERA5 - are highly structured on regular latitude-longitude grids. Even if you were solely doing timeseries analyses for specific locations plucked from this grid, the strength of this sort of dataset is its intrinsic spatiotemporal structure and context, and it makes very little sense to completely destroy the dataset's structure unless you were solely and exclusively to extract point timeseries. And even then, you'd probably want to decimate the data pretty dramatically, since there is very little use case for, say, a point timeseries of surface temperature in the middle of the ocean!
The vast majority of research and operational applications of datasets like ERA5 are probably better suited by leveraging cloud-optimized replicas of the original dataset, such as ARCO-ERA5 published on the Google Public Datasets program [1]. These versions of the dataset preserve the original structure, and chunk it in ways that are amenable to massively parallel access via cloud storage. In almost any case I've encountered in my career, a generically chunked Zarr-based archive of a dataset like this will be more than performant enough for the majority of use cases that one might care about.
[1]: https://cloud.google.com/storage/docs/public-datasets/era5
I think what's missing here is an analysis of what is gained by moving the weather data into a RDBMS. The motivation is to speed up queries. But what's the baseline?
As someone very familiar with this tech landscape (maintainer of Xarray and Zarr, founder of https://earthmover.io/), I know that serverless solutions + object storage can deliver very low latency performance (sub second) for timeseries queries on weather data--much faster than the 30 minutes cited here--_if_ the data are chunked appropriately in Zarr. Given the difficulty of data ingestion described in this post, it's worth seriously evaluating those solutions before going down the RDBMS path.
One suggestion aimed at the author here: some of your experiments are taking out WAL writing in a sort of indirect way, using pg_bulkload and COPY. There's one thing you could try that wasn't documented yet when my buddy Craig Ringer wrote the SO post you linked to: you can just turn off the WAL in the configuration. Yes, you will lose the tables in progress if there's a crash, and when things run for weeks those happen. With time scale data, it's not hard to structure the loading so you'll only lose the last chunk of work when that happens. WAL data isn't really necessary for bulk loading. Crash, clean up the right edge of the loaded data, start back up.
Here's the full set of postgresql.conf settings I run to disable the WAL and other overhead:
wal_level = minimal max_wal_senders = 0 synchronous_commit = off fsync = off full_page_writes = off autovacuum = off checkpoint_timeout = 60min
Finally, when loading in big chunks, to keep the vacuum work down I'd normally turn off autovac as above then issue periodic VACUUM FREEZE commands running behind the currently loading date partition. (Talking normal PG here) That skips some work of the intermediate step the database normally frets about where new transactions are written but not visible to everyone yet.
https://rdrn.me/optimising-sql/
Similar sequence of investigations, but using regular Postgres rather than Timescale. With my setup I got another ~3x speedup over COPY by copying binary data directly (assuming your data is already in memory).
Love this. It’s the exact opposite of all the other ‘well actually’ mainstream tech posts and I am here for all of it. Props for keeping the reader fully engaged on the journey.
> At a sustained ~462k inserts per second, we’re waiting ~20 days for our ~754 billion rows which is not bad I guess It’s less time than it took me to write this post.
Hah, as I've been gravitating more to writing larger and more in depth blog posts I can relate to the surprising effort it can require.
What I do is download each netCDF file, transpose, and insert into a massive 3D HDF file organized as [lat, lon, hour]. On my workstation it takes about 30 minutes to create one year for one variable (no parallel I/O or processes) but then takes milliseconds to pull a single (lat, lon) location. Initial pain for long-term gain. Simplistic, but I'm just a climatologist not a database guru.
This is a good dataset though and the level of detail in the post is appreciated. I'll give ClicKHouse a go on the same...
Disclaimer: I work for ClickHouse
1) I always wonder of there is a better way than COPY. I tend to quickly get 100% CPU without saturating I/O
2) The row overhead seems big. A row has 23 bytes overhead, this table has 48 bytes data per row, so even without page overhead, we lose ~1/3 of our storage. This is pure data storage, without any index.
[0] - Honestly some of the coolest stuff I ever got to do in it's own right. Building tools that could move data between AutoCAD, Microstation, and Google Earth while also importing other bits with metadata from Trimble units[1]. Also it was probably the most I ever used math in my entire career [2], so there's that.
[1] - I wound up finding a custom font maker, and one of our folks made a font library with the symbols, made it easy to write a parser too :D
[2] - One of those PDFs I always seem to wind up having floating on a hard drive is the USGS 'Map Projections, a working manual'. At one point I used it as a reference to implement a C# library to handle transforms between coordinate systems... alas it was internal.
This is a very good instinct! A pretty major portion of modern climate science is paleoclimatology, with a goal of reaching far beyond reliable human measurements. A lot of earth's previous climate states were wildly different from the range of conditions we have experienced in the past 10,000 years, and a better climate record is essential to predicting the effects of massive carbon emission.
Ice cores from Antarctica/Greenland are the most famous instance of this, but there's a lot of other cool ones - there are chemical records of climate change in cave stalactites, ocean floor sediments, etc.
After that I discovered ClickHouse and forgot about all this performance tuning, since ClickHouse could ingest the same as analytics data at 10x higher speed on the same hardware with default settings. On top of this, it could perform analytical queries by up to 1000 times faster than Postgresql on the same hardware. And, as a bonus, disk space usage dropped by more than 10 times compared to Postgresql.
My prediction: if the ingestion of 750 billion rows takes 20 days on highly tuned TimescaleDB and Postgresql, it will take less than a day on Clickhouse with standard settings on the same hardware.
P.S. I was conducting a similar benchmark on a single-node VictoriaMetrics a few years ago - it was capable ingesting 500 billion of samples in ~2 hours - https://valyala.medium.com/billy-how-victoriametrics-deals-w... . Every sample contains a single measurement, while every row in the original article contains 5 measurements. So, if extrapolating, it would take `(2h * 5)/500*750 = 15h` to ingest 750 billion of rows with weather measurements into a single-node VictoriaMetrics.
Remember that modern global surface temperatures are collected by satellites, and the dataset is comingled with recordings observed visually & made by hand using buckets by sailors who were not primarily academic researchers. Segments of high resolution, low noise data (satellites) are mixed with low resolution, low coverage, high noise records (hand records on a boat made surrounding the united kingdom).
My point is to be in awe of the technical aspects of this effort but also keep in mind that we are only making copies of low resolution, noisy manuscripts from sailors 170 years ago.
The great thing about this data is it is generally append only, unless errors are found in earlier data sets. But it’s something that usually only happens once a year if at all.
I haven't compared how this performs compared to using a low-level sql library but it outperforms everything else I've tried in sqlalchemy.
This is querying a model rather than real observations, so it may more reflect the model's understanding of the world. Still useful; just worth noting I think.
Maybe we should be recording and transmitting the parameters for a function or set of functions that describes these temperature etc changes rather than a set of points to drastically reduce the size of these data sets.
That's interesting. Why store it? Why not compute it as needed using the model?
FWIW, I am not an expert in this space and if someone is, it would be good to understand it.
Could someone post me to where I can download the whole dataset?
Is it free or available if I sign up?
I am curious if we query the data to give us temperature at a given time for all lat n long and plot it geo spatially , would the result give anything on heat distribution of energy received across the lat and long at that point in time?
Million - million
Milliarde - billion
Billion - trillion
> You can use parameterized queries to protect against SQL injection,
No, that't not what parameterized queries are for. That's just a side benefit.
Parameterized queries are so that the database doesn't have to parse the data from SQL to do the query. It's stupid to turn all the data to text, send the text over the wire, escaped and quoted as appropriate, then parse it back to whatever form it was in originally when you can just send it binary straight to the database.
And if you are doing many similar queries with different results, e.g. inserts as here, or maybe querying the user table by user id every page load, the database doesn't have to parse any SQL each time and can even reuse the query plan.
This may be why psycopg3 performed better than pandas df.to_sql() function in the single row insert case.
One thing you quickly realize with geospatial data is that it's incredibly nuanced. You have to be quite careful about understanding which coordinate reference system (CRS) and for visualization which projection is being used. The CRS is somewhat paranoia inducing if you don't have great infrastructure setup with the right tools to carry that metadata with your geospatial data.
I've tested everything AWS has to offer, Postgres/PostGIS, Spark/DB, Snowflake, Trini, and ARCGis.
I'm convinced the best tool for large scale geospatial work is Google BigQuery and it's not even close. It took an expensive multi hour query running on PostGIS deployed on an enormous m6a EC2 instance to less than 5 seconds that ran in the BigQuery free tier. It does make sense if you think about it, Google was early with Maps, but it is just stunning how much better they are in this specific niche domain.
This was using publicly available FEMA data that Snowflake and AWS services would just choke on, because the geometry column exceeded their maximum byte size. Spark doesn't even have geospatial data types and the open source packages that add support leave a lot to be desired.
This guy is running on-prem, so maybe this made sense, but I just would never bother. The storage for BQ would probably be less than $100/months for 20 TB.