1. At the beginning of the proc, immediately copy any permanent tables you need into temporary tables, filtering to only the rows you need.
2. In the middle of the proc, manipulate the temporary tables as needed.
3. At the end of the proc, update the permanent tables, enclosed within a transaction. Immediately roll back the transaction and exit the proc if an error is detected. (By following all three steps, you improve concurrency and can restart the proc without manually cleaning up any data messes.)
4. Use extreme caution when working with remote tables. Remote tables do not reside in your RDBMS and most likely will not utilize any statistics/indexes your RDBMS has. In many cases it is more performant to dump/copy the entire remote table into a temporary table and work with that. The most you can expect from a remote table is to execute a WHERE clause. If you attempt joins or anything complicated, it will likely time out.
5. The Query Plan is easily confused. In some cases the Query Plan will resort to row-by-row processing, which brings performance to a halt. In many cases it is better to break up a complex stored procedure into smaller steps using temporary tables.
6. Always check the Query Plan to see what the RDBMS is actually doing.
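A minimal sketch of steps 1-3, using SQLite via Python for illustration (the `orders` table and its columns are invented): copy only the rows you need into a temp table, manipulate the copy, then apply the result to the permanent table inside a single transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, total INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "open", 10), (2, "open", 20), (3, "closed", 30)])

# Step 1: copy, filtered to only the rows we need.
conn.execute("CREATE TEMP TABLE work AS SELECT * FROM orders WHERE status = 'open'")

# Step 2: manipulate the temp table; the permanent table is untouched meanwhile.
conn.execute("UPDATE work SET total = total * 2")

# Step 3: apply the changes inside one transaction; an error rolls everything
# back, so the proc can simply be re-run with no manual cleanup.
try:
    with conn:  # commits on success, rolls back on any exception
        conn.execute("""
            UPDATE orders
            SET total = (SELECT w.total FROM work w WHERE w.id = orders.id)
            WHERE id IN (SELECT id FROM work)""")
except sqlite3.Error:
    pass  # rolled back: permanent data untouched

totals = [row[0] for row in conn.execute("SELECT total FROM orders ORDER BY id")]
```

The permanent table is only locked for the duration of step 3, which is where the concurrency win comes from.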
My canonical example is large volumes of data that accrue over time, with the most recent records accessed most often. The DB admins can partition things or do partial indexes to keep access fast, but the app developers can also move records into a separate archive table behind the scenes while still supporting things like (eventual) search of the whole data set. (A note here: it feels like a tool could do a lot of the initial heavy lifting to automate splitting one table into many when it makes sense -- perhaps when limited by a cloud DB's missing features.)
Another management option sometimes accommodated by the DB vs. doing manually is to store all large blobs/files in their own separate database (filesystem?!) for a different storage configuration etc.
I imagine it can go as far as basically implementing an index manually: one massive table with just an auto-incrementing primary key but tons of columns then setting up a table with that ID and a few searchable columns (including up to going full text search/vectors I guess).
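A small runnable sketch of that "manual index" idea, in SQLite via Python (the schema is entirely invented): a wide main table keyed by an auto-incrementing id, plus a narrow companion table holding only the id and the searchable columns, which is what actually gets indexed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE big (id INTEGER PRIMARY KEY AUTOINCREMENT, title TEXT, body TEXT)")
conn.execute("CREATE TABLE big_search (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("CREATE INDEX ix_big_search_title ON big_search (title)")

conn.execute("INSERT INTO big (title, body) VALUES ('annual report', '...many wide columns...')")
conn.execute("INSERT INTO big_search SELECT id, title FROM big")

# The search only touches the narrow, indexed table; the wide row is then
# fetched by primary key.
hit = conn.execute("SELECT id FROM big_search WHERE title = 'annual report'").fetchone()
row = conn.execute("SELECT id, body FROM big WHERE id = ?", (hit[0],)).fetchone()
```

The same two-table shape extends naturally to a full-text or vector index as the companion table.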
Edit: one useful tip manually implementing the Materialized View pattern with MSSQL 2016+: use partition switching as well explained and implemented by https://github.com/cajuncoding/SqlBulkHelpers?tab=readme-ov-... (incidentally the most commercially useful out-SEO'd tiny-star-count library I've ever found, focused on bulk inserts into MSSQL using .NET). I think this is a good example of drawing the buy/build line in the right place with the automation of the partition switching.
Asking as someone who has only occasionally contributed (or tried to) to a repository.
Examples: https://github.com/ben-n93/SQL-tips-and-tricks/pulls?q=is%3A...
As for formatting, indent the first field, too:
SELECT
employee_id
, employee_name
, job
, salary
FROM employees
;
Examples, from the top of my head:
1. JOIN USING, for databases that support it. In some databases you can replace
FROM t1 JOIN t2 ON t1.c1 = t2.c1 AND t1.c2 = t2.c2 ...
with FROM t1 JOIN t2 USING (c1, c2)
much shorter and cleaner.
2. Ability to exclude columns in SELECT *
DuckDB:
SELECT * EXCLUDE (c1)
Spanner:
SELECT * EXCEPT (c1)
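SQLite is another database that supports USING; a runnable sketch (invented tables, via Python) showing the two join spellings return the same rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (c1 INTEGER, c2 INTEGER, a TEXT)")
conn.execute("CREATE TABLE t2 (c1 INTEGER, c2 INTEGER, b TEXT)")
conn.execute("INSERT INTO t1 VALUES (1, 2, 'x')")
conn.execute("INSERT INTO t2 VALUES (1, 2, 'y')")

# Explicit ON with every column qualified and compared...
on_rows = conn.execute(
    "SELECT a, b FROM t1 JOIN t2 ON t1.c1 = t2.c1 AND t1.c2 = t2.c2").fetchall()
# ...versus the shorter USING form over the shared column names.
using_rows = conn.execute(
    "SELECT a, b FROM t1 JOIN t2 USING (c1, c2)").fetchall()
```

A bonus of USING: with SELECT * the shared join columns are emitted only once.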
Nice to finally find the term for the "anti-query"; learning about it really changed how I write queries. Equally good to see that most of these apply regardless of the RDBMS of choice.
SELECT e.employee_id
, e.employee_name
, e.job
, e.salary
FROM employees e
WHERE 1=1 -- Dummy value.
AND e.job IN ('Clerk', 'Manager')
AND e.dept_no != 5
;
and with a JOIN:
SELECT e.employee_id
, e.employee_name
, e.job
, e.salary
, d.name
, d.location
FROM employees e
JOIN departments d
ON d.dept_no = e.dept_no
WHERE 1=1 -- Dummy value.
AND e.job IN ('Clerk', 'Manager')
AND e.dept_no != 5
;
In the join example, for a simple ON clause like that I'll usually just have JOIN ... ON in the one line, but if there are multiple conditions they are usually clearer on separate lines IMO. In more complicated queries I might further indent the joins too, like:
SELECT *
FROM employees e
JOIN departments d
ON d.dept_no = e.dept_no
WHERE 1=1 -- Dummy value.
AND e.job IN ('Clerk', 'Manager')
AND e.dept_no != 5
;
YMMV. Some people strongly agree with me here, others vehemently hate the way I align such code…

WRT “Always specify which column belongs to which table”: this is particularly important for correlated sub-queries, because if you put the wrong column name in and it happens to match a name in an object in the outer query, you have a potentially hard-to-find error. Also, if the table in the inner query is updated to include a column of the same name as the one you are filtering on in the outer, the meaning of your sub-query suddenly changes quite drastically without the sub-query itself having changed.
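A demonstration of that correlated-subquery hazard, in SQLite via Python (invented schema): `terminations` has no column named `employee_id`, so the unqualified name silently binds to the OUTER table and the filter degenerates to "employee_id IN (employee_id)".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (employee_id INTEGER)")
conn.execute("CREATE TABLE terminations (term_employee_id INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?)", [(1,), (2,), (3,)])
conn.execute("INSERT INTO terminations VALUES (2)")

# No error is raised -- this silently matches EVERY employee, not just the
# terminated one, because employee_id resolves to the outer table.
buggy = conn.execute("""
    SELECT employee_id FROM employees
    WHERE employee_id IN (SELECT employee_id FROM terminations)
    ORDER BY employee_id""").fetchall()

# Qualifying every column: the same typo would now be an immediate error,
# and the correct column name gives the intended single row.
correct = conn.execute("""
    SELECT e.employee_id FROM employees e
    WHERE e.employee_id IN (SELECT t.term_employee_id FROM terminations t)
    ORDER BY e.employee_id""").fetchall()
```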
A few other things off the top of my head:
1. Remember that as well as UNION [ALL], EXCEPT and INTERSECT exist. I've seen (and even written myself) some horrendous SQL that badly implements these behaviours. TFA covers EXCEPT, but I find people who know about that don't always know about INTERSECT. It is rarely useful IME, but when it is useful it is really useful.
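EXCEPT and INTERSECT side by side, sketched in SQLite via Python with an invented pair of tables (customers active this year vs. last year):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE this_year (customer_id INTEGER)")
conn.execute("CREATE TABLE last_year (customer_id INTEGER)")
conn.executemany("INSERT INTO this_year VALUES (?)", [(1,), (2,), (3,)])
conn.executemany("INSERT INTO last_year VALUES (?)", [(2,), (3,), (4,)])

# New customers: in this_year but not last_year.
new = conn.execute("""
    SELECT customer_id FROM this_year
    EXCEPT
    SELECT customer_id FROM last_year
    ORDER BY customer_id""").fetchall()

# Retained customers: in both years.
retained = conn.execute("""
    SELECT customer_id FROM this_year
    INTERSECT
    SELECT customer_id FROM last_year
    ORDER BY customer_id""").fetchall()
```

Either one written as a join with DISTINCT and NULL checks is the "horrendous SQL" version of the same behaviour.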
2. UPDATEs that change nothing still do everything else: create entries in your transaction log (could be an issue if using log-shipping for backups or read-only replicas etc.), fire triggers, create history rows if using system-versioned tables, and so forth. UPDATE a_table SET a_column = 'a value' WHERE a_column <> 'a value' can be a lot faster than without the WHERE.
3. Though of course be very careful with NULLable columns and/or setting a value NULL with point 2. “WHERE a_column IS DISTINCT FROM 'a value'” is much more maintainable if your DB supports that syntax (added in MS SQL Server 2022 and Azure SQL DB a little earlier, supported by Postgres years before, I don't know about other DBs without checking) than the more verbose alternatives.
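Points 2 and 3 in miniature, using SQLite via Python (invented table; `rowcount` shows how many rows each UPDATE actually touched). Note SQLite spells the null-safe comparison `IS NOT`, with recent versions also accepting IS DISTINCT FROM:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a_table (a_column TEXT)")
conn.executemany("INSERT INTO a_table VALUES (?)",
                 [("a value",), ("other",), (None,)])

# Plain <> skips the row that already holds the value... but it ALSO skips
# the NULL row, because NULL <> 'a value' is not true.
plain = conn.execute(
    "UPDATE a_table SET a_column = 'a value' WHERE a_column <> 'a value'").rowcount

# The null-safe form catches the NULL row as well.
null_safe = conn.execute(
    "UPDATE a_table SET a_column = 'a value' WHERE a_column IS NOT 'a value'").rowcount

values = [row[0] for row in conn.execute("SELECT a_column FROM a_table")]
```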
4. Trying to force the sort order of NULLs with something like “ORDER BY ISNULL(a_column, 0)”, or doing similar with GROUP BY, can be very inefficient in some cases. If you expect few rows to be returned and there are relatively few NULLs in the sort target column, it can be more performant to SELECT the non-NULL case and the NULL case, UNION ALL the two, and then sort. Though if you do expect many rows this can backfire badly and you end up with excess spooling to disk, so test, test, and test again when hacking around like this.
Learn your DB server. Check the query plans often. You might get surprised. Tweak and recheck.
Usually EXISTS is faster than IN. Beware that NOT EXISTS behaves differently from EXCEPT with regard to NULL values.
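The NULL difference between NOT EXISTS and EXCEPT, sketched in SQLite via Python (invented tables): EXCEPT compares NULLs as equal to each other, while the `=` inside NOT EXISTS never matches a NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE s (v INTEGER)")
conn.execute("CREATE TABLE t (v INTEGER)")
conn.executemany("INSERT INTO s VALUES (?)", [(1,), (None,)])
conn.execute("INSERT INTO t VALUES (NULL)")

# EXCEPT treats the two NULLs as duplicates, so only the 1 survives.
except_rows = conn.execute(
    "SELECT v FROM s EXCEPT SELECT v FROM t").fetchall()

# t.v = s.v is never true when either side is NULL, so BOTH rows survive.
not_exists_rows = conn.execute("""
    SELECT v FROM s
    WHERE NOT EXISTS (SELECT 1 FROM t WHERE t.v = s.v)
    ORDER BY v""").fetchall()
```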
Instead of joining tables and using DISTINCT or similar to filter rows, consider using subquery "columns", i.e. scalar subqueries in the SELECT list. This can be much faster even if you're pulling 10+ values from the same table, and even if your database server supports lateral joins. Just make sure the subqueries return at most one row.
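What that looks like, sketched in SQLite via Python with an invented employees/departments schema; each parenthesised subquery sits in the SELECT list and returns at most one row per outer row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (employee_id INTEGER, dept_no INTEGER)")
conn.execute("CREATE TABLE departments (dept_no INTEGER PRIMARY KEY, name TEXT, location TEXT)")
conn.execute("INSERT INTO employees VALUES (1, 10)")
conn.execute("INSERT INTO departments VALUES (10, 'Sales', 'NYC')")

# Scalar subqueries in the SELECT list instead of JOIN + DISTINCT; dept_no is
# the departments primary key, so each subquery returns at most one row.
rows = conn.execute("""
    SELECT e.employee_id,
           (SELECT d.name     FROM departments d WHERE d.dept_no = e.dept_no) AS dept_name,
           (SELECT d.location FROM departments d WHERE d.dept_no = e.dept_no) AS dept_location
    FROM employees e""").fetchall()
```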
Any query that's not a one-off should not perform any table scans. A table scan today can mean an outage tomorrow. Add indexes. Keep in mind that the GROUP BY clause usually dictates index use.
If you need to filter on expressions, say where a substring is equal to something, you can add a computed column and an index on that. Alternatively, some DBs support indexing expressions directly.
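SQLite is one of the DBs that indexes expressions directly; a sketch via Python (invented table) where EXPLAIN QUERY PLAN confirms the planner uses the expression index when the WHERE clause repeats the exact same expression:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE codes (code TEXT)")
conn.execute("CREATE INDEX ix_codes_prefix ON codes (substr(code, 1, 3))")
conn.execute("INSERT INTO codes VALUES ('ABC-001')")

# The filter repeats the indexed expression verbatim, so the planner can use
# ix_codes_prefix instead of scanning the table.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM codes WHERE substr(code, 1, 3) = 'ABC'").fetchall()
plan_text = " ".join(str(step[-1]) for step in plan)
```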
Often using UNION ALL can be much faster than using OR, even for non-trivial queries and/or multiple OR clauses.
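The rewrite in miniature, via SQLite in Python (invented table): an OR across two different columns versus a UNION ALL of two single-column filters. The second branch excludes the first branch's rows so the result has no duplicates; be careful with NULLs in that exclusion predicate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, a INTEGER, b INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?)",
                 [(1, 1, 0), (2, 0, 2), (3, 1, 2), (4, 0, 0)])

or_rows = conn.execute(
    "SELECT id FROM t WHERE a = 1 OR b = 2 ORDER BY id").fetchall()

# Same rows as two simple filters glued with UNION ALL; each branch can use
# its own single-column index where the OR form often cannot.
union_rows = conn.execute("""
    SELECT id FROM t WHERE a = 1
    UNION ALL
    SELECT id FROM t WHERE b = 2 AND a <> 1
    ORDER BY id""").fetchall()
```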
edit: You can JOIN subqueries. This can be useful to force the filtering order if the DB isn't being clever about the order.
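A sketch of joining a subquery, in SQLite via Python (invented schema): aggregate per department first, then join that result back to the detail rows, which forces the aggregation to happen before the join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (employee_id INTEGER, dept_no INTEGER, salary INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)",
                 [(1, 10, 100), (2, 10, 200), (3, 20, 300)])

# Employees earning above their department's average: the subquery is joined
# like any other table.
rows = conn.execute("""
    SELECT e.employee_id, e.salary, d.avg_salary
    FROM employees e
    JOIN (SELECT dept_no, AVG(salary) AS avg_salary
          FROM employees
          GROUP BY dept_no) d
      ON d.dept_no = e.dept_no
    WHERE e.salary > d.avg_salary""").fetchall()
```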