ZeroCool2u
I've done a good amount of geospatial analysis for work.

One thing you quickly realize with geospatial data is that it's incredibly nuanced. You have to be quite careful to understand which coordinate reference system (CRS) is in use and, for visualization, which projection. The CRS is somewhat paranoia inducing if you don't have great infrastructure set up with the right tools to carry that metadata along with your geospatial data.

I've tested everything AWS has to offer, Postgres/PostGIS, Spark/DB, Snowflake, Trino, and ArcGIS.

I'm convinced the best tool for large scale geospatial work is Google BigQuery and it's not even close. An expensive multi-hour query running on PostGIS deployed on an enormous m6a EC2 instance dropped to less than 5 seconds in the BigQuery free tier. It does make sense if you think about it, since Google was early with Maps, but it is still stunning how much better they are in this specific niche domain.

This was using publicly available FEMA data that Snowflake and AWS services would just choke on, because the geometry column exceeded their maximum byte size. Spark doesn't even have geospatial data types and the open source packages that add support leave a lot to be desired.

This guy is running on-prem, so maybe this made sense, but I just would never bother. The storage for BQ would probably be less than $100/month for 20 TB.
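For a flavour of what that looks like, here's a rough sketch of the kind of query BigQuery chews through (the project, dataset, table, and column names are made up; it just uses the standard google-cloud-bigquery client):

    # Hypothetical spatial join: count observation points inside each flood zone.
    from google.cloud import bigquery

    client = bigquery.Client()

    sql = """
    SELECT z.zone_id, COUNT(*) AS n_points
    FROM `my_project.my_dataset.flood_zones` AS z        -- GEOGRAPHY column z.geom
    JOIN `my_project.my_dataset.observations` AS o       -- plain lat/lon floats
      ON ST_CONTAINS(z.geom, ST_GEOGPOINT(o.lon, o.lat))
    GROUP BY z.zone_id
    """

    for row in client.query(sql).result():
        print(row.zone_id, row.n_points)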

jamesgresql
This is super cool! I run DevRel @ Timescale, and I love seeing our community create well-written posts like this!

My initial reaction is that one of the reasons you're seeing the hypertable being slower is almost certainly that it creates an index on the timestamp column by default. Your standard table doesn't have that index, which lets it go faster.

You can use create_hypertable with create_default_indexes=>false to skip creating the index, or you can just drop the index before you ingest data. You'll eventually want that index - but it's best created after ingestion in a one-shot load like this.
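Roughly like this (table and column names are placeholders, and you could equally run the SQL straight from psql):

    import psycopg

    with psycopg.connect("dbname=weather") as conn, conn.cursor() as cur:
        # Skip the default time index so the bulk load isn't maintaining it.
        cur.execute(
            "SELECT create_hypertable('era5', 'ts', create_default_indexes => false)"
        )
        # ... run the COPY-based ingestion here ...
        # Then build the index once, after the data is in.
        cur.execute("CREATE INDEX ON era5 (ts DESC)")
        conn.commit()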

I'd also be interested in how the HDD you're reading data from is holding up in some of the highly parallel setups?

counters
Why?

Most weather and climate datasets - including ERA5 - are highly structured on regular latitude-longitude grids. Even if you were solely doing timeseries analyses for specific locations plucked from this grid, the strength of this sort of dataset is its intrinsic spatiotemporal structure and context, and it makes very little sense to completely destroy that structure unless your sole purpose is to extract point timeseries. And even then, you'd probably want to decimate the data pretty dramatically, since there is very little use for, say, a point timeseries of surface temperature in the middle of the ocean!

The vast majority of research and operational applications of datasets like ERA5 are probably better suited by leveraging cloud-optimized replicas of the original dataset, such as ARCO-ERA5 published on the Google Public Datasets program [1]. These versions of the dataset preserve the original structure, and chunk it in ways that are amenable to massively parallel access via cloud storage. In almost any case I've encountered in my career, a generically chunked Zarr-based archive of a dataset like this will be more than performant enough for the majority of use cases that one might care about.

[1]: https://cloud.google.com/storage/docs/public-datasets/era5
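To give a sense of the access pattern (the store path and variable name below are illustrative; check the docs above for the current layout):

    import xarray as xr

    # Lazily open the cloud-optimized ERA5 replica straight from GCS (needs gcsfs).
    ds = xr.open_zarr(
        "gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3"
    )

    # Nearest grid point to Berlin, two decades of hourly 2 m temperature.
    series = (
        ds["2m_temperature"]
        .sel(latitude=52.5, longitude=13.4, method="nearest")
        .sel(time=slice("2000-01-01", "2020-12-31"))
        .load()
    )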

rabernat
Great post! Hi Ali!

I think what's missing here is an analysis of what is gained by moving the weather data into an RDBMS. The motivation is to speed up queries. But what's the baseline?

As someone very familiar with this tech landscape (maintainer of Xarray and Zarr, founder of https://earthmover.io/), I know that serverless solutions + object storage can deliver very low latency performance (sub second) for timeseries queries on weather data--much faster than the 30 minutes cited here--_if_ the data are chunked appropriately in Zarr. Given the difficulty of data ingestion described in this post, it's worth seriously evaluating those solutions before going down the RDBMS path.

postgresperf
Contributor to the PG bulk loading docs you referenced - good survey of the techniques here. I've done a good bit of this trying to speed up loading the OpenStreetMap database. Presentation at https://www.youtube.com/watch?v=BCMnu7xay2Y for my last public update. Since then the advance of hardware, GIS improvements in PG15, and osm2pgsql adopting their middle-way-node-index-id-shift technique (makes the largest but rarely used index 1/32 the size) have gotten my time to load the full planet data set below 4 hours.

One suggestion aimed at the author here: some of your experiments are taking out WAL writing in a sort of indirect way, using pg_bulkload and COPY. There's one thing you could try that wasn't documented yet when my buddy Craig Ringer wrote the SO post you linked to: you can just turn off the WAL in the configuration. Yes, you will lose the tables in progress if there's a crash, and when things run for weeks those happen. With time scale data, it's not hard to structure the loading so you'll only lose the last chunk of work when that happens. WAL data isn't really necessary for bulk loading. Crash, clean up the right edge of the loaded data, start back up.

Here's the full set of postgresql.conf settings I run to disable the WAL and other overhead:

    wal_level = minimal
    max_wal_senders = 0
    synchronous_commit = off
    fsync = off
    full_page_writes = off
    autovacuum = off
    checkpoint_timeout = 60min

Finally, when loading in big chunks, to keep the vacuum work down I'd normally turn off autovac as above and then issue periodic VACUUM FREEZE commands running behind the currently loading date partition. (Talking normal PG here.) That skips some of the intermediate bookkeeping the database normally frets about, where new transactions are written but not yet visible to everyone.
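As a rough sketch of that load-then-freeze loop (partition names are made up, and VACUUM has to run outside a transaction, hence autocommit):

    import psycopg

    def freeze_partition(conninfo: str, partition: str) -> None:
        # Freeze a partition that has finished loading while the loader
        # works on the next one. VACUUM can't run inside a transaction block.
        with psycopg.connect(conninfo, autocommit=True) as conn:
            conn.execute(f"VACUUM (FREEZE) {partition}")

    # e.g. January is done, February is now loading:
    freeze_partition("dbname=weather", "era5_y2000m01")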

carderne
Hey OP (assuming you're the author), you might be interested in this similar experiment I did about four years ago, same dataset, same target, similar goal!

https://rdrn.me/optimising-sql/

Similar sequence of investigations, but using regular Postgres rather than Timescale. With my setup I got another ~3x speedup over COPY by copying binary data directly (assuming your data is already in memory).
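A minimal sketch of binary COPY with psycopg 3, if you want to try it (table and column names are placeholders):

    import psycopg

    rows = [(1, 20.5), (2, 21.0)]  # (station_id, temperature), already in memory

    with psycopg.connect("dbname=weather") as conn, conn.cursor() as cur:
        with cur.copy(
            "COPY readings (station_id, temperature) FROM STDIN (FORMAT BINARY)"
        ) as copy:
            # Binary COPY needs explicit types so values are encoded correctly.
            copy.set_types(["int4", "float4"])
            for row in rows:
                copy.write_row(row)
        conn.commit()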

d416
“Is a relational database even appropriate for gridded weather data? No idea but we’ll find out.”

Love this. It’s the exact opposite of all the other ‘well actually’ mainstream tech posts and I am here for all of it. Props for keeping the reader fully engaged on the journey.

lawn
What an interesting post!

> At a sustained ~462k inserts per second, we’re waiting ~20 days for our ~754 billion rows, which is not bad I guess. It’s less time than it took me to write this post.

Hah, as I've been gravitating toward writing larger and more in-depth blog posts, I can relate to the surprising effort they can require.

RyanHamilton
If you want to plot time-series charts or many other charts directly from SQL queries, qStudio is a free SQL IDE that works with everything including TimescaleDB: https://www.timestored.com/qstudio/database/timescale

Disclaimer: I am the author.

roter
I too use the ERA5 reanalysis data and I too need quick time series. As the data comes in [lat, lon] grids, stacked by whatever period you've chosen, e.g. [month of hourly data, lat, lon], it becomes a massive matrix transpose problem if you want 20+ years.

What I do is download each netCDF file, transpose, and insert into a massive 3D HDF file organized as [lat, lon, hour]. On my workstation it takes about 30 minutes to create one year for one variable (no parallel I/O or processes) but then takes milliseconds to pull a single (lat, lon) location. Initial pain for long-term gain. Simplistic, but I'm just a climatologist not a database guru.
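For what it's worth, the core of it looks something like this (file names, variable names, and grid sizes are illustrative, and 2001 is used to dodge the leap-year wrinkle):

    import h5py
    import xarray as xr

    NLAT, NLON, NHOURS = 721, 1440, 24 * 365  # one year of hourly 0.25-degree data

    with h5py.File("t2m_2001.h5", "w") as f:
        dset = f.create_dataset(
            "t2m",
            shape=(NLAT, NLON, NHOURS),
            dtype="float32",
            chunks=(1, 1, NHOURS),  # one grid cell's full timeseries per chunk
        )
        hour = 0
        for month in range(1, 13):
            ds = xr.open_dataset(f"era5_t2m_2001_{month:02d}.nc")
            block = ds["t2m"].values                      # [time, lat, lon]
            n = block.shape[0]
            dset[:, :, hour:hour + n] = block.transpose(1, 2, 0)
            hour += n

    # Later, a single (lat, lon) timeseries is one contiguous chunked read:
    with h5py.File("t2m_2001.h5", "r") as f:
        series = f["t2m"][210, 560, :]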

gingerwizard
A few in this thread have suggested ClickHouse would do well here. We tested 1 trillion rows recently, albeit with much simpler data - https://clickhouse.com/blog/clickhouse-1-trillion-row-challe...

This is a good dataset though, and the level of detail in the post is appreciated. I'll give ClickHouse a go on the same...

Disclaimer: I work for ClickHouse

hyperman1
Two remarks about Postgres and lots of data:

1) I always wonder if there is a better way than COPY. I tend to quickly hit 100% CPU without saturating I/O.

2) The row overhead seems big. A row has 23 bytes of overhead and this table has 48 bytes of data per row, so even without page overhead we lose ~1/3 of our storage. This is pure data storage, without any index.

to11mtm
As someone who used to do some GIS hacking in an office job[0] before I was a 'Software Developer/Engineer', this is super cool.

[0] - Honestly some of the coolest stuff I ever got to do in its own right. Building tools that could move data between AutoCAD, Microstation, and Google Earth while also importing other bits with metadata from Trimble units[1]. Also it was probably the most I ever used math in my entire career[2], so there's that.

[1] - I wound up finding a custom font maker, and one of our folks made a font library with the symbols, which made it easy to write a parser too :D

[2] - One of those PDFs I always seem to wind up having floating on a hard drive is the USGS 'Map Projections, a working manual'. At one point I used it as a reference to implement a C# library to handle transforms between coordinate systems... alas it was internal.

semiquaver
Any idea why hypertable insert rates were slower? I thought hypertables were supposed to _increase_ insert rates?

tmiku
> I think it would be cool to have historical weather data from around the world to analyze for signals of climate change we’ve already had rather than think about potential future change.

This is a very good instinct! A pretty major portion of modern climate science is paleoclimatology, with a goal of reaching far beyond reliable human measurements. A lot of earth's previous climate states were wildly different from the range of conditions we have experienced in the past 10,000 years, and a better climate record is essential to predicting the effects of massive carbon emission.

Ice cores from Antarctica/Greenland are the most famous instance of this, but there's a lot of other cool ones - there are chemical records of climate change in cave stalactites, ocean floor sediments, etc.

sammy2255
ClickHouse will eat this for breakfast. And it has built-in compression, even at the column level.

valyala
Very well written article! I remember the time when I was tuning PostgreSQL to achieve the maximum possible ingestion speed for ad analytics data. I ended up with the following recommendations - https://gist.github.com/valyala/ae3cbfa4104f1a022a2af9b8656b... .

After that I discovered ClickHouse and forgot about all this performance tuning, since ClickHouse could ingest the same ad analytics data at 10x higher speed on the same hardware with default settings. On top of this, it could perform analytical queries up to 1000 times faster than PostgreSQL on the same hardware. And, as a bonus, disk space usage dropped by more than 10 times compared to PostgreSQL.

My prediction: if the ingestion of 750 billion rows takes 20 days on highly tuned TimescaleDB and PostgreSQL, it will take less than a day on ClickHouse with standard settings on the same hardware.

P.S. I conducted a similar benchmark on single-node VictoriaMetrics a few years ago - it was capable of ingesting 500 billion samples in ~2 hours - https://valyala.medium.com/billy-how-victoriametrics-deals-w... . Every sample contains a single measurement, while every row in the original article contains 5 measurements. So, extrapolating, it would take `(2h * 5) / 500 * 750 = 15h` to ingest 750 billion rows of weather measurements into a single-node VictoriaMetrics.

tonymet
I encourage people to look into the ERA5 dataset's provenance, especially as you approach the observations made toward the "pre-industrial" date of 1850.

Remember that modern global surface temperatures are collected by satellites, and the dataset is commingled with recordings observed visually and made by hand, using buckets, by sailors who were not primarily academic researchers. Segments of high resolution, low noise data (satellites) are mixed with low resolution, low coverage, high noise records (hand records made on boats around the United Kingdom).

My point is to be in awe of the technical aspects of this effort but also keep in mind that we are only making copies of low resolution, noisy manuscripts from sailors 170 years ago.

hendiatris
If you’re going to work with weather data, use a columnar database like BigQuery. If you set things up right, your performance will generally be a few seconds for aggregation queries. I set up a data platform like this at my previous company and we were able to vastly outperform our competitors at a much lower cost.

The great thing about this data is that it is generally append only, unless errors are found in earlier data sets. But that's something that usually only happens once a year, if at all.

islandert
If you don't have access to COPY because the Postgres instance is managed, I've had a lot of luck with encoding a batch of rows as a JSON string, sending the string as a single query parameter, and using `json_to_recordset` to turn the JSON back into a list of rows in the db.

I haven't compared this to using a low-level SQL library, but it outperforms everything else I've tried in SQLAlchemy.
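A stripped-down version of the pattern (table and column names are placeholders):

    import json
    import psycopg

    rows = [
        {"ts": "2000-01-01T00:00:00", "lat": 52.5, "lon": 13.4, "t2m": 271.3},
        {"ts": "2000-01-01T01:00:00", "lat": 52.5, "lon": 13.4, "t2m": 271.1},
    ]

    sql = """
    INSERT INTO readings (ts, lat, lon, t2m)
    SELECT * FROM json_to_recordset(%s::json)
        AS r(ts timestamptz, lat float4, lon float4, t2m float4)
    """

    with psycopg.connect("dbname=weather") as conn:
        # One round trip, one parameter, however many rows are in the batch.
        conn.execute(sql, (json.dumps(rows),))
        conn.commit()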

robertlagrant
> I think it would be cool to have historical weather data from around the world to analyze for signals of climate change we’ve already had rather than think about potential future change.

This is querying a model rather than real observations, so it may reflect the model's understanding of the world more than the world itself. Still useful; just worth noting, I think.

user3939382
> 727,080 snapshots in time for each variable like temperature

Maybe we should be recording and transmitting the parameters for a function, or set of functions, that describes these temperature (etc.) changes, rather than a set of points, to drastically reduce the size of these data sets.

koliber
Curious if he could squeeze out more performance by using a different structure to store the same data. Some of these float4 cols could probably be stored as int2. Depending on how many decimal places are needed, you can divide the integer to recover the floating-point value.
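For example (keeping in mind int2 tops out at 32767, so Celsius-times-100 fits but Kelvin-times-100 would not):

    def encode_centi(value: float) -> int:
        # Store e.g. temperature in degrees C as hundredths in an int2 column.
        scaled = round(value * 100)
        if not -32768 <= scaled <= 32767:
            raise ValueError(f"{value} does not fit in a scaled int2")
        return scaled

    def decode_centi(stored: int) -> float:
        return stored / 100.0

    assert decode_centi(encode_centi(21.37)) == 21.37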
curious_cat_163
> The data is output from a climate model run that is constrained to match weather observations.

That's interesting. Why store it? Why not compute it as needed using the model?

FWIW, I am not an expert in this space and if someone is, it would be good to understand it.

dunefox
OT: Does anyone know if DuckDB would be of use here?
smellybigbelly
Can anyone give some advice on how they run Timescale in Kubernetes? I see they dropped support for their Helm chart.
rkwasny
Yeah, don't use TimescaleDB, use ClickHouse - I have 10 years of NOAA climate data on my desktop that I query when I want to go on holidays ;-)
nojvek
This should be a benchmark.

Could someone point me to where I can download the whole dataset?

dcreater
Nice! Can you upload the data to hf, oxen.ai, kaggle or something?
Aeroi
What did Google use to train GraphCast?
sigmonsays
Where can I download the weather data?

Is it free or available if I sign up?

kinj28
It's a tangent.

I am curious: if we queried the data for the temperature at a given time across all lat/long points and plotted it geospatially, would the result show anything about the distribution of heat energy received across latitude and longitude at that point in time?

jhoechtl
For the German speakers among us: that should be 1 000 000 000 000 rows.

Million - million

Milliarde - billion

Billion - trillion

rjmunro
This annoys me:

> You can use parameterized queries to protect against SQL injection,

No, that's not what parameterized queries are for. That's just a side benefit.

Parameterized queries exist so that the database doesn't have to parse the data out of the SQL text to run the query. It's stupid to turn all the data into text, send the text over the wire, escaped and quoted as appropriate, then parse it back into whatever form it was in originally, when you can just send it in binary straight to the database.

And if you are doing many similar queries with different parameters, e.g. the inserts here, or querying the user table by user id on every page load, the database doesn't have to parse any SQL each time and can even reuse the query plan.

This may be why psycopg3 performed better than pandas df.to_sql() function in the single row insert case.