magicalhippo
I'll add some of mine:

Learn your DB server. Check the query plans often. You might get surprised. Tweak and recheck.

Usually EXISTS is faster than IN. Beware that NOT EXISTS behaves differently from EXCEPT with regard to NULL values.
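
To illustrate the NULL difference, a sketch with hypothetical tables a(x) and b(x) that each contain a NULL row:

    -- set operators treat NULLs as equal, so the NULL row is eliminated:
    SELECT x FROM a
    EXCEPT
    SELECT x FROM b;

    -- NULL = NULL evaluates to unknown, so the NULL row survives here:
    SELECT x FROM a
    WHERE NOT EXISTS (SELECT 1 FROM b WHERE b.x = a.x);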

Instead of joining tables and using DISTINCT or similar to filter rows, consider using scalar subqueries in the SELECT list. This can be much faster even if you're pulling 10+ values from the same table, and even if your database server supports lateral joins. Just make sure the subqueries return at most one row.
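
A sketch of the idea, using hypothetical orders/customers tables (the primary key on customers guarantees each subquery returns at most one row):

    SELECT o.order_id,
           (SELECT c.name  FROM customers c WHERE c.customer_id = o.customer_id) AS customer_name,
           (SELECT c.email FROM customers c WHERE c.customer_id = o.customer_id) AS customer_email
      FROM orders o;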

Any query that's not a one-off should not perform any table scans. A table scan today can mean an outage tomorrow. Add indexes. Keep in mind GROUP BY clause usually dictates index use.

If you need to filter on expressions, say where a substring is equal to something, you can add a computed column and an index on it. Alternatively, some databases support indexing expressions directly.
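
Both variants sketched against a hypothetical users(email) table:

    -- SQL Server: computed column plus index
    ALTER TABLE users ADD email_domain AS SUBSTRING(email, CHARINDEX('@', email) + 1, 200);
    CREATE INDEX ix_users_email_domain ON users (email_domain);

    -- Postgres: index the expression directly (note the extra parentheses)
    CREATE INDEX ix_users_email_domain ON users ((split_part(email, '@', 2)));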

Often UNION ALL can be much faster than OR, even for non-trivial queries and/or multiple OR conditions.
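
For example, with a hypothetical orders table (assuming status is NOT NULL so the two branches partition cleanly):

    -- OR across different columns often blocks index use:
    SELECT * FROM orders WHERE status = 'open' OR customer_id = 42;

    -- each branch can use its own index; the second branch excludes
    -- rows already returned by the first, to avoid duplicates:
    SELECT * FROM orders WHERE status = 'open'
    UNION ALL
    SELECT * FROM orders WHERE customer_id = 42 AND status <> 'open';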

edit: You can JOIN subqueries. This can be useful to force the filtering order if the DB isn't being clever about the order.
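
A minimal sketch, again with hypothetical tables: filter inside the derived table first, then join.

    SELECT f.order_id, c.name
      FROM (SELECT order_id, customer_id
              FROM orders
             WHERE created_at >= '2024-01-01') AS f
      JOIN customers c ON c.customer_id = f.customer_id;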

silveraxe93
The "readability" section has 3 examples. The first 2 literally sacrifice readability to make the query easier to write, and the last is an unreadable abomination that indentation really isn't doing much for.
rawgabbit
My tips for working with complex Stored Procedures.

1. At the beginning of the proc, immediately copy any permanent tables into temporary tables and specify/limit/filter only for the rows you need.

2. In the middle of the proc, manipulate the temporary tables as needed.

3. At the end of the proc, update the permanent tables inside a transaction. Immediately roll back the transaction and exit the proc if an error is detected. (Following all three steps improves concurrency and lets you restart the proc without manually cleaning up any data messes.)

4. Use extreme caution when working with remote tables. Remote tables do not reside in your RDBMS and most likely will not utilize any statistics/indexes your RDBMS has. In many cases it is more performant to dump/copy the entire remote table into a temporary table and then work with that. The most you can expect from a remote table is that it executes a WHERE clause. If you attempt JOINs or anything complicated, it will likely time out.

5. The query planner is easily confused. In some cases it will resort to row-by-row processing, which brings performance to a halt. In many cases it is better to break a complex stored procedure into smaller steps using temporary tables.

6. Always check the Query Plan to see what the RDBMS is actually doing.
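
Steps 1-3 might look like this in T-SQL, against a hypothetical dbo.orders table:

    -- step 1: copy only the rows you need into a temp table
    SELECT order_id, status
      INTO #work
      FROM dbo.orders
     WHERE status = 'pending';

    -- step 2: manipulate the temp table as needed
    UPDATE #work SET status = 'processed';

    -- step 3: write back inside a transaction, rolling back on error
    BEGIN TRY
        BEGIN TRANSACTION;
        UPDATE o SET o.status = w.status
          FROM dbo.orders o
          JOIN #work w ON w.order_id = o.order_id;
        COMMIT TRANSACTION;
    END TRY
    BEGIN CATCH
        IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;
        THROW;
    END CATCH;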

WuxiFingerHold
I'm not a fan of "just in case" development. Not when it comes to interfaces, and not regarding this `where 1=1` placeholder. Do things when you need them, not when you think you might need them someday in the future. Also, production code is not the place to keep dev helpers around. Do what you want at dev time, but for prod code, readability and clear intent are much more important.
youdeet
One more point on the "Anti Join": use EXISTS instead of IN or LEFT JOIN if you only want to check for the existence of a row in another large table/subquery based on the conditions. EXISTS returns true as soon as it finds a hit. With LEFT JOIN and IN, the engine collects all results before evaluating.
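
A sketch of the two approaches, with hypothetical orders/order_items tables:

    -- EXISTS: can stop probing at the first matching item
    SELECT o.order_id
      FROM orders o
     WHERE EXISTS (SELECT 1 FROM order_items i WHERE i.order_id = o.order_id);

    -- LEFT JOIN: may materialize every matching item before deduplicating
    SELECT DISTINCT o.order_id
      FROM orders o
      LEFT JOIN order_items i ON i.order_id = o.order_id
     WHERE i.order_id IS NOT NULL;
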
Semaphor
Regarding "Comment your code!": At least for MSSQL, it’s often recommended not to use -- for comments but instead /**/, because many features like the query store save queries without line breaks, so if you get the query from there, you need to manually fix everything instead of simply using your IDE's formatter.
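
A small illustration of why:

    /* block comments stay intact when line breaks are stripped */
    SELECT employee_id  /* primary key */
      FROM employees;
    -- a line comment here would swallow everything after it once the breaks are gone
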
the_gorilla
Leading comma is nice in SELECT statements because you can comment toggle individual lines. Indenting your code to make it more readable is basically what anyone with room temperature IQ does automatically. A lot of these other tips look like they're designed to deal with SQL design flaws, like how handling nulls isn't well defined in the spec so it's up to each implementation to do whatever it wants.
wodenokoto
Everybody is up in arms about the comma suggestion but everyone thinks the 1=1 is a good idea in the where clause? If I saw that in a code review I don’t know what I’d think of the author.
sgarland
Not shown: stop using SELECT *. You almost certainly do not need the entire width of the table, and by doing so, you add more data to filter and transmit, and also prevent semijoins, which are awesome.
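
A sketch with hypothetical tables, naming only the needed columns and letting the EXISTS become a semijoin:

    SELECT o.order_id, o.total
      FROM orders o
     WHERE EXISTS (SELECT 1 FROM order_items i WHERE i.order_id = o.order_id);
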
shmerl
I don't get the point of the dummy value. How does it help doing anything? I can add conditions with ease without it.
password4321
Is anyone willing to share general guidance on where to draw the line when it comes to using DB configuration to speed things up (almost "buy") vs. basically doing things manually ("build")? In my limited experience it often falls to app developers because competent DB admins are all getting paid much more to work elsewhere (as mentioned above, it is important to know the DB).

My canonical example is large volumes of data that accrue over time with the most recent accessed most often, where the DB admins can partition things or do partial indexes to keep access fast, but the app developers can move records into a separate archive table sometimes behind the scenes while still supporting things like (eventual) search of the whole data set. (A note here that it feels like a tool could do a lot of the initial heavy lifting to automate splitting one table into many when it makes sense -- perhaps when limited by a cloud DB's missing features)

Another management option sometimes accommodated by the DB vs. doing manually is to store all large blobs/files in their own separate database (filesystem?!) for a different storage configuration etc.

I imagine it can go as far as basically implementing an index manually: one massive table with just an auto-incrementing primary key but tons of columns then setting up a table with that ID and a few searchable columns (including up to going full text search/vectors I guess).

Edit: one useful tip manually implementing the Materialized View pattern with MSSQL 2016+: use partition switching as well explained and implemented by https://github.com/cajuncoding/SqlBulkHelpers?tab=readme-ov-... (incidentally the most commercially useful out-SEO'd tiny-star-count library I've ever found, focused on bulk inserts into MSSQL using .NET). I think this is a good example of drawing the buy/build line in the right place with the automation of the partition switching.

MK2k
Maybe off-topic, but: is "just closing without comment or discussion" an acceptable way for a maintainer to deal with pull requests?

Asking as someone who only occasionally contributed (or tried to) to a repository.

Examples: https://github.com/ben-n93/SQL-tips-and-tricks/pulls?q=is%3A...

brikym
I like SQL but I think it's time for the big players like MySQL, MSSQL, Postgres etc to start using FROM-first and piping syntax. I've had the pleasure of using Kusto query language and it's a huge leap forward in DX.
otteromkram
Ugh. Don't put opening braces on new lines.

As for formatting, indent the first field, too

  SELECT
     employee_id
  ,  employee_name
  ,  job
  ,  salary
  FROM employees
  ;
galkk
Idk, I feel it's missing really useful convenience stuff that exists here and there...

Examples, from the top of my head:

1. JOIN USING, for databases that support it. In some databases you can replace

    FROM t1 JOIN t2 ON t1.c1 = t2.c1 AND t1.c2 = t2.c2 ...
with

    FROM t1 JOIN t2 USING (c1, c2)
Much shorter and cleaner.

2. Ability to exclude columns in select *

DuckDB:

    SELECT * EXCLUDE (c1)
Spanner:

    SELECT * EXCEPT (c1)
elchief
use sqlfluff linter and do what it says
philippta
I really like the formatting presented in this article:

https://www.sqlstyle.guide/#spaces

AtNightWeCode
A common mistake I see is that people think foreign keys will automatically create indexes. Missing indexes is a general problem in SQL. Missing indexes on columns that are in foreign keys are even worse.
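
For example (hypothetical tables; MySQL/InnoDB is a notable exception that does create the index automatically):

    -- the FK constraint alone does not index customer_id on many RDBMSs:
    ALTER TABLE orders
      ADD CONSTRAINT fk_orders_customer
      FOREIGN KEY (customer_id) REFERENCES customers (customer_id);

    -- add the supporting index yourself:
    CREATE INDEX ix_orders_customer_id ON orders (customer_id);
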
rldjbpin
i noticed a lot of this advice used by senior devs on my team or in legacy code. but as someone just starting out, a lot of it (like "1=1") was very odd and made the queries less accessible.

nice to finally find the term for the "anti-join"; learning about it really changed how i write queries. equally good to see that most of these apply regardless of the RDBMS of choice.

isoprophlex
Wow, that EXCEPT trick is neat! ~10 years of using SQL almost daily, and I never knew...
l5870uoo9y
And I take it CTEs are implicitly being discouraged.
mergisi
Great post! If you're looking to speed up your SQL queries, you might want to check out AI2sql https://ai2sql.io/. It can generate SQL queries quickly from plain English prompts, which can be super helpful when you're in a rush or dealing with complex queries. Definitely worth giving a try for anyone looking to streamline their workflow!
alex5207
Never knew about QUALIFY. That's great
jason-phillips
Also, window functions.
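
For anyone else who hadn't seen it: QUALIFY filters on a window function directly, without a wrapping subquery. It's supported by Snowflake, BigQuery, DuckDB, and Teradata, but not (as of this writing) by Postgres or MSSQL. A sketch with a hypothetical orders table:

    -- latest order per customer
    SELECT *
      FROM orders
     QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id
                                ORDER BY created_at DESC) = 1;
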
dooer
I am bad at SQL so this is great
dspillett
On readability, I often find aligning things in two columns is more readable. To modify the two examples in TFA:

    SELECT e.employee_id
         , e.employee_name
         , e.job
         , e.salary
      FROM employees e
     WHERE 1=1 -- Dummy value.
       AND e.job IN ('Clerk', 'Manager')
       AND e.dept_no != 5
         ;
and with a JOIN:

    SELECT e.employee_id
         , e.employee_name
         , e.job
         , e.salary
         , d.name
         , d.location
      FROM employees e
      JOIN departments d
           ON d.dept_no = e.dept_no
     WHERE 1=1 -- Dummy value.
       AND e.job IN ('Clerk', 'Manager')
       AND e.dept_no != 5
         ;
In the join example, for a simple ON clause like that I'll usually just have JOIN ... ON in the one line, but if there are multiple conditions they are usually clearer on separate lines IMO.

In more complicated queries I might further indent the joins too, like:

    SELECT *
      FROM employees e
           JOIN departments d
             ON d.dept_no = e.dept_no
     WHERE 1=1 -- Dummy value.
       AND e.job IN ('Clerk', 'Manager')
       AND e.dept_no != 5
         ;
YMMV. Some people strongly agree with me here, others vehemently hate the way I align such code…

WRT “Always specify which column belongs to which table”: this is particularly important for correlated sub-queries, because if you put the wrong column name in and it happens to match a name in an object in the outer query you have a potentially hard to find error. Also, if the table in the inner query is updated to include a column of the same name as the one you are filtering on in the outer, the meaning of your sub-query suddenly changes quite drastically without it having changed itself.

A few other things off the top of my head:

1. Remember that as well as UNION [ALL], EXCEPT and INTERSECT exist. I've seen (and even written myself) some horrendous SQL that badly implements these behaviours. TFA covers EXCEPT, but I find people who know about that don't always know about INTERSECT. It is rarely useful IME, but when it is useful it is really useful.

2. UPDATEs that change nothing still do everything else: create entries in your transaction log (could be an issue if using log-shipping for backups or read-only replicas etc.), fire triggers, create history rows if using system-versioned tables, and so forth. UPDATE a_table SET a_column = 'a value' WHERE a_column <> 'a value' can be a lot faster than without the WHERE.

3. Though of course be very careful with NULLable columns and/or setting a value NULL with point 2. “WHERE a_column IS DISTINCT FROM 'a value'” is much more maintainable if your DB supports that syntax (added in MS SQL Server 2022 and Azure SQL DB a little earlier, supported by Postgres years before, I don't know about other DBs without checking) than the more verbose alternatives.
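
Putting points 2 and 3 together:

    -- skips the write entirely when nothing would change,
    -- and handles NULLs correctly (unlike a plain <>)
    UPDATE a_table
       SET a_column = 'a value'
     WHERE a_column IS DISTINCT FROM 'a value';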

4. Trying to force the sort order of NULLs with something like “ORDER BY ISNULL(a_column, 0)”, or doing similar with GROUP BY, can be very inefficient in some cases. If you expect few rows to be returned and there are relatively few NULLs in the sort target column, it can be more performant to SELECT the non-NULL case and the NULL case, UNION ALL the two, and then sort. Though if you do expect many rows this can backfire badly and you end up with excess spooling to disk, so test, test, and test again, when hacking around like this.
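
A sketch of point 4, using TFA's employees table and a helper flag column so the outer ORDER BY needs no expression:

    SELECT 0 AS null_flag, employee_id, salary
      FROM employees WHERE salary IS NOT NULL
    UNION ALL
    SELECT 1, employee_id, salary
      FROM employees WHERE salary IS NULL
     ORDER BY null_flag, salary;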

eezing
[flagged]
egeozcan
I remember doing the "WHERE 1=1" trick in my last job and it causing a... let's say "unproductive", discussion in the pull-request.
AtNightWeCode
Never use WHERE 1=1. It is both a security risk and a performance risk to run dynamic ad-hoc queries.