Jeff wrote "Mastering Regular Expressions", which grew from that talk. You probably want a copy even though it was first released in 1997. For the mindset of RE, you can't beat it.
Learning REs is a roll through:
* how matching happens (advancing, matching, backtracking)
* using * ? and {} to match repetitions
* greediness and stinginess within the RE
* character classes, both [manual] and escapes like \s \W etc
* anchors and "what a line is"
* grouping and backreferences
* accessing groups outside the RE
* substitution and accessing backrefs in substitutions
* find ALL the matches
* complex parsing (just don't, it's rare not to regret it)
and then it's an absolutely epic deep-dive into the minutiae of what line starts and ends might be, Unicode and regex, code to be executed from within the regex engine, using code to BUILD regex and worrying about when escaping happens or doesn't, denial-of-service regex, etc. That will take you through ASCII, various Unix tool chains over time, and a bunch of other fun stuff.
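A few of the items in that list, sketched in Python as a rough illustration (the sample strings are made up, and other engines spell some of this differently):

```python
import re

text = "<b>bold</b> and <i>italic</i>"

# Greedy: .* grabs as much as it can, then backtracks to the LAST '>'.
greedy = re.search(r"<.*>", text).group(0)

# Stingy (lazy): .*? grabs as little as it can, stopping at the FIRST '>'.
lazy = re.search(r"<.*?>", text).group(0)

# Grouping plus a backreference: \1 must repeat whatever group 1 matched.
pair = re.search(r"<(\w+)>(.*?)</\1>", text)
tag, body = pair.group(1), pair.group(2)  # accessing groups outside the RE

# Substitution that reuses groups in the replacement string.
swapped = re.sub(r"(\w+), (\w+)", r"\2 \1", "Friedl, Jeffrey")

# Find ALL the matches (findall returns the group when there is exactly one).
tags = re.findall(r"<(\w+)>", text)

print(greedy, lazy, tag, body, swapped, tags, sep="\n")
```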
`perldoc perlre` from your terminal.
or https://perldoc.perl.org/perlre
A simple way to test a regex you're building is this website, which offers immediate parsing and documentation of your regex, lets you test it against various inputs, and lets you choose which language's regex parser you are targeting.
Regex101 is an excellent tool.
https://github.com/qntm/greenery
So, you can observe what kind of state machine is produced from any given regular expression. You can also use it to merge and otherwise manipulate state machines, or to simplify regular expressions.
Quite helpful.
Then I forget it, and have unreadable mystery functions lying around that I hope don’t have bugs.
But at least it’s a single line!
Seriously though, my actual need for them is low, so I avoid the things as much as I would avoid inlining assembly.
Then, as an expert in linguistic morphology, I started learning about things like subregular languages, as talked about in works such as Aural Pattern Recognition Experiments and the Subregular Hierarchy, by Rogers and Pullum (https://www.cs.earlham.edu/~jrogers/JoLLI.pdf). And I continue to wonder what the relationship is between these classes of languages and word formation.
Then I learned Perl and started learning RegEx properly. Now somehow I've turned into one of those wizards I admired in the Stack Overflow answers section. It wasn't until I had to teach RegEx to a junior that I realized how far I'd come.
One of the things I remember being difficult at the beginning was the subtle differences between implementations, like `^` meaning "beginning of line" in Ruby (and others) but meaning "beginning of string" in JavaScript (and others).
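A quick way to see this kind of difference for yourself, sketched in Python. Python happens to behave like JavaScript by default here: `^` only becomes "beginning of line" once you pass the `re.MULTILINE` flag. The sample text is made up.

```python
import re

text = "first\nsecond"

# Default: ^ anchors only at the very start of the string.
default_match = re.search(r"^second", text)

# With re.MULTILINE, ^ also anchors right after each newline.
multiline_match = re.search(r"^second", text, flags=re.MULTILINE)

# \A always means "start of string", whatever the flags say.
strict_match = re.search(r"\Asecond", text, flags=re.MULTILINE)

print(default_match, multiline_match, strict_match)
```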
If you're just starting out, it'd be helpful to read about how a regex engine evaluates an expression against a string so that you can understand the "order of operations" and how repeating elements are matched.
It's been many years but I remember it as both thorough and easy to understand.
Then it was about knowing a situation or a problem when regexes would apply and knowing how to look up the things I needed to solve that problem. Some regex 'phrases' are good for grepping, others for find and replace. Some will help you swap names around, some to reformat phone numbers.
After a while the phrases give way to general understanding and certain things become fluent.
I still only really write short or basic regexes, but I use them all the time when editing text, or when doing something a little bit complicated where a short regex turns a hard problem into an easy one.
To learn it, I played a lot of regex golf [1]
I also enabled regular expressions in my code editor's Find feature so every search I'd make used regex. Having it enabled in my editor made learning it more immersive and useful, especially when combined with things like find-and-replace. I highly recommend permanently enabling that in your editor as well.
Also, challenge your coworkers to see who can make the shortest patterns for a variety of cases and see whose is the most versatile. It's always a fun time.
Later, grepping logs was a pretty similar application that needed and extended those skills.
The former I can never remember beyond the basics (*, +, ?, |). Even with the | I go extra cautious and put in tons of parentheses. If I ever need matching and grouping I resort to rtfm.
Now the latter, that's the more interesting and fun one!! Learnt it in college decades ago but really drilled it in by reimplementing Russ Cox's amazing Thompson NFA blog and breakdown in TypeScript!
For example,
"\\`\\(?:[^^]\\|\\^\\(?: \\*\\|\\[\\)\\)"
can be written as
(: bos
   (| (not (in "^"))
      (: "^"
         (| " *" "["))))
Emacs also has other features to highlight matches and groups to help understand regexes better. https://www.emacswiki.org/emacs/Rx_Notation https://github.com/mattiase/xr https://www.emacswiki.org/emacs/RegularExpression
Mid way through my 20 year career I realised that every job I'd had really came down to parsing data and outputting something a company finds value in. It's regex all the way down.
Simply try to parse some complex information like movie strings. As an exercise, you can try to parse these movie names to produce a result like this:
```
{
  "name": "Dawn Of The Planet of The Apes",
  "year": "2014",
  "resolution": "1080p",
  "codec": "h264",
  "source": "web-dl",
  "audio": "AAC5.1",
  "group": "RARBG"
}
```
https://raw.githubusercontent.com/dobladov/video-parser/main...
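One possible starting point for that exercise, as a rough Python sketch. The input filename below is hypothetical (I made it up to fit the JSON above; the real list will have plenty of names this pattern doesn't handle), and `parse_release` is just an illustrative helper name.

```python
import re

# Verbose mode lets the pattern carry its own comments.
PATTERN = re.compile(
    r"""
    ^(?P<name>.+?)                 # title, lazily, up to the year
    \.(?P<year>(?:19|20)\d{2})     # four-digit year
    \.(?P<resolution>\d{3,4}p)     # e.g. 1080p
    \.(?P<source>[A-Za-z-]+)       # e.g. WEB-DL
    \.(?P<audio>[A-Za-z]+\d\.\d)   # e.g. AAC5.1
    \.(?P<codec>[A-Za-z]\d{3})     # e.g. H264
    -(?P<group>\w+)$               # release group after the last hyphen
    """,
    re.VERBOSE,
)


def parse_release(filename):
    """Return a dict of fields, or None if the name doesn't fit the pattern."""
    m = PATTERN.match(filename)
    if m is None:
        return None
    info = m.groupdict()
    info["name"] = info["name"].replace(".", " ")  # dots stand in for spaces
    info["source"] = info["source"].lower()
    info["codec"] = info["codec"].lower()
    return info


# Hypothetical input, chosen to match the example result.
sample = "Dawn.Of.The.Planet.of.The.Apes.2014.1080p.WEB-DL.AAC5.1.H264-RARBG"
print(parse_release(sample))
```

Real release names put these fields in varying orders and leave some out, so expect to loosen the pattern or split it into several smaller ones.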
That and practice. I frequently check them with online regex tools to make sure the regex does what I want before I implement them.
Perl 5 regex familiarity seems like it futureproofed.
Now I suppose I mostly use JS or Vim which is such a subset.
Textmate grammars are basically hundreds of nested regex snippets that recursively apply tags to regions of text. This is made worse by the fact that the grammar is written in JSON, so any escapes need to be double escaped, which means you can't easily copy your regex into something like regex101. You sort of just have to suffer until you get good at it.
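To make the double-escaping concrete, here is a small Python sketch that round-trips a pattern through JSON. The pattern itself is only an illustration, not taken from any real grammar.

```python
import json
import re

# The regex as you would type it into a tester like regex101.
pattern = r"\b\d{3}-\d{4}\b"

# What the same regex has to look like inside a JSON file:
# every backslash is doubled.
as_json = json.dumps(pattern)
print(as_json)

# Decoding the JSON string gives back the original pattern.
recovered = json.loads(as_json)
found = re.search(recovered, "call 555-1234 now").group(0)
print(found)
```

Going the other way, `json.loads` on a string copied out of the grammar file is a cheap way to get something you can paste into a tester.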
TextSoap has been around forever and must be the most underrated app on the Mac. It’s amazing — I rank it alongside Keyboard Maestro, if that tells you anything. It’s also available on SetApp. I can’t say enough good things about it.
If you get into it, there is an Alfred workflow that lets you search for and apply cleaners to selected text.
I had also done a tiny amount of regex in a college programming course, but really I didn't get "good" at them until I used them on the job.
And then about 6 months later I had completely forgotten it.
It's one of those things you need to use regularly to keep it in memory. At least that's the case for me.
I tend to shy away from it these days for a lot of cases (ever try to regex validate an email??) but when I do use it, it's honestly just a process of re-learning for about 15 minutes each time.
Learn the rest "on demand" whenever you need it, it's not something to spend a lot of practice time on. Because if you don't use it a lot, you'll forget most of what you learned anyway, and if you do use it a lot, then you don't need to spend dedicated learning time, you'll get good quite quickly.
Just build the regex you need with ChatGPT along with an online regex tester.
The most common uses in JavaScript are in the RegExp test method and the String replace method. The replace method is cool because it can receive a function as the second argument and the argument of that function is the value matched by the RegExp that can be modified and returned.
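The function-as-replacement idea isn't unique to JavaScript. A minimal sketch of the same thing in Python (not JavaScript), where `re.sub` also accepts a callable; one difference is that Python hands the function a match object rather than the matched string. The helper name `double` and the sample text are made up.

```python
import re


def double(match):
    # match is a re.Match object; group(0) is the text that matched.
    return str(int(match.group(0)) * 2)


# Every run of digits is replaced by whatever the function returns.
result = re.sub(r"\d+", double, "3 apples and 10 pears")
print(result)
```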
Not sure they ever get easier to read though.
Actually that book is also what helped demystify async programming for me.
Then Perl followed by Python regex when I needed to recognise specific strings.
I didn't use books for this. I remember reading Python docs and howtos.
These days, ChatGPT is pretty good at both making and explaining them.
https://regexr.com/ is one of the most amazing interactive resources, I can't recommend it enough. Back in the day I used it to go from beginner to intermediate. And while I never used this next site to learn, https://regex-vis.com/ is a great place to check out. From intermediate to master I've pretty much relied on rexegg.com/ for discovering the advanced stuff and engine differences. After that https://regex101.com/ was helpful for performance analysis. I first learned regex just mucking around in the CLI with some guidance from a programmer friend. Pure trial and error learning.
While I am inclined to say "the only way to learn regex is to use it", after reading the comments I must agree it would've been nice to have examples of pitfalls and misconceptions. There's a lot of them that can take a very very long time to learn without direct examples. I've never even heard of Jeffrey Friedl (not till this post at least). So props to people who can actually read those kinds of books.
Cheat sheets are the way to go though, especially because of the different versions. If you do enough with them one day the main stuff will just stick. Once you're fairly productive you will realize you missed a feature or trick that is particularly useful for what you've been doing, and after getting mad at yourself for missing it you will add that into your repertoire. Repeat.
Also, don't be afraid to just split/cut and do it in your language of choice instead of regex. Most of the time it doesn't make too much of a difference performance wise. Many times it can be faster and/or more readable. The best approach is often a combination. Nobody likes the wizard that tries to put everything into one regex to rule them all.
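As a small illustration of the split-instead-of-regex point, here is the same job done both ways in Python. The log line is a made-up example.

```python
import re

line = "2024-05-01 ERROR disk full on /dev/sda1"

# Plain string methods: split off the first two fields, keep the rest whole.
date, level, message = line.split(" ", 2)

# The regex version does the same job, but asks more of the reader.
m = re.match(r"(\S+) (\S+) (.*)", line)

print(date, level, message)
print(m.groups())
```

For a fixed, simple layout like this the split reads more plainly; the regex starts to pay off once fields are optional or separators vary.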
Regarding versions, I learned with PCRE, have mostly worked with Python, and have hit problems using various other implementations over the years. Though it's never enough of a problem that I can remember what those differences are, I just look it up and move on. Unless it's going to be an ongoing project, in which case I print out a new cheatsheet and hang it up.
Then wrote a regex engine. It's now extremely obvious how regular expressions work as they're very simple. The spurious divergence in syntax and semantics is still infuriating but at least I know what they're supposed to be desugaring to now. Recommended as a worthwhile exercise.
Regular expressions have "generate a letter of the alphabet" as a primitive. It might be ASCII and use 'b' to generate that letter, or a hex escape like \x62 or similar. The notation varies a bit. Another primitive is "generate the empty string".
Then there are compound operations. One regex or another, one followed by another. Intersection, complement. All the set operations, for the reason that a regular expression is literally a notation for a set of strings.
Some things like "lookahead" are notation for intersection. The match previous construct, \2 or similar, takes you out of regular expressions but works like checking equality on the fly.
Finally anchors, $ or ^ etc, are specific to the match problem. It's still finding an element of the denoted set, but with some extra constraints on where the element can occur.
I'm pretty sure that's all of it. How anchors interact with the set description is a nuisance but seems well formed - I haven't bothered to work through that part yet because I'm mostly interested in string generation, not matching.
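The "regex as a set of strings" view can be sketched in a few lines of Python. This is a toy under stated assumptions, not how a real engine works: each expression is a function from a maximum length to the set of strings it denotes up to that length, and the names (`lit`, `empty`, `alt`, `cat`, `star`, `both`) are my own.

```python
import re


def lit(ch):
    # Primitive: the set holding just the one-letter string ch.
    return lambda n: {ch} if n >= len(ch) else set()


def empty():
    # Primitive: the set holding just the empty string.
    return lambda n: {""}


def alt(a, b):
    # "One regex or another" is set union.
    return lambda n: a(n) | b(n)


def both(a, b):
    # Intersection: strings denoted by both expressions.
    return lambda n: a(n) & b(n)


def cat(a, b):
    # "One followed by another": pair up strings, staying within length n.
    return lambda n: {x + y for x in a(n) for y in b(n - len(x))}


def star(a):
    # Zero or more repetitions, built up one repetition at a time.
    def lang(n):
        result = {""}
        frontier = {""}
        while frontier:
            # Only non-empty pieces, so every round makes strings longer.
            frontier = {x + y for x in frontier for y in a(n - len(x)) if y} - result
            result |= frontier
        return result
    return lang


# a(b|c)* , enumerated up to length 3
regex = cat(lit("a"), star(alt(lit("b"), lit("c"))))
strings = regex(3)
print(sorted(strings, key=lambda s: (len(s), s)))

# Cross-check against a real engine: every generated string should match.
assert all(re.fullmatch(r"a[bc]*", s) for s in strings)
```

Complement would need the alphabet spelled out, and backreferences don't fit this model at all, which lines up with the point that they take you out of regular expressions.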
{ ... } CHATSTR "(.*)" GREPSTR IF
Since I was about 11 years old, my brain was plastic enough to take the postfix notation in stride, and regular expression syntax is still second nature to me.
# a regex that selects <something>
regex =
And then copilot/supermaven auto-completes it. If that doesn't work, I ask GPT-4o/Sonnet. If that doesn't work, I assume that whatever I'm asking for is not really a natural fit for regex and I should accomplish my task in a different way.
In general I try not to use regex in production code. IMO it is an obsolete technology at this point. Most people do not know it well and trying to debug it is a nightmare. May I suggest a simple function or loop that is readable?
I only used the bare minimum for years.
I also hung out on a #regex IRC channel, so I got exposed to questions and answers by many people.
Later I read up on https://www.regular-expressions.info/ which has a lot of very good explanations.
The #regex IRC channel had an IRC bot with a quiz with 28 levels.
All sensibility ended after level 14 or so. At that point it was just "how deep does the PCRE rabbit-hole go?"
But there was a lot of useful, non-trivial stuff, too. Most specifically: look-aheads/look-behinds, non-greedy matching, back-references, named capture groups, character classes, and anchors.
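For anyone who hasn't met a few of these yet, a small Python sketch (syntax varies a little between engines, and the sample strings are made up):

```python
import re

# Named capture groups: refer to parts of the match by name.
m = re.search(r"(?P<key>\w+)=(?P<value>\w+)", "mode=fast")
key, value = m.group("key"), m.group("value")

# Look-ahead: digits only when "px" follows; the "px" is not consumed.
pixels = re.findall(r"\d+(?=px)", "10px 20em 30px")

# Look-behind: digits only when a "$" comes right before them.
prices = re.findall(r"(?<=\$)\d+", "cost $45, qty 3")

print(key, value, pixels, prices)
```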
When I learned jq, I went much the same way: Started hanging out on #jq IRC channels and started trying to answer jq questions on StackOverflow. Sadly, I got outperformed the first six months, until it finally clicked.