What Are Regular Expressions?
Regular expressions (regex or regexp) are sequences of characters that define search patterns. They're one of the most powerful tools in a developer's toolkit for text processing, validation, and data extraction. While they can appear cryptic at first, understanding regex fundamentals opens up efficient solutions to complex text manipulation problems.
Regular expressions originated in theoretical computer science and formal language theory in the 1950s. Today, they're implemented in virtually every programming language and text editor, making them an essential skill for developers.
Basic Syntax
Literal Characters
Most characters match themselves literally. The pattern cat matches the string "cat".
Metacharacters
Special characters with specific meanings in regex:
| Character | Meaning | Example |
|---|---|---|
| . | Any single character (except newline) | c.t matches "cat", "cot", "cut" |
| ^ | Start of string/line | ^Hello matches "Hello" at start |
| $ | End of string/line | world$ matches "world" at end |
| * | Zero or more of preceding | ab*c matches "ac", "abc", "abbc" |
| + | One or more of preceding | ab+c matches "abc", "abbc" (not "ac") |
| ? | Zero or one of preceding | colou?r matches "color", "colour" |
| \| | Alternation (OR) | cat\|dog matches "cat" or "dog" |
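To see the metacharacters in action, here is a minimal Python sketch using the standard re module. The checks list and its sample strings are illustrative choices, not part of any API.

```python
import re

# Each pair is (pattern, text); re.search returns a match object or None.
checks = [
    (r"c.t", "cut"),             # . matches any single character
    (r"^Hello", "Hello world"),  # ^ anchors at the start
    (r"world$", "Hello world"),  # $ anchors at the end
    (r"ab*c", "ac"),             # * allows zero b's
    (r"ab+c", "ac"),             # + needs at least one b, so no match
    (r"colou?r", "color"),       # ? makes the u optional
    (r"cat|dog", "hotdog"),      # | matches either alternative
]

results = [bool(re.search(pattern, text)) for pattern, text in checks]
print(results)  # [True, True, True, True, False, True, True]
```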
Character Classes
Character classes match any single character from a set:
[abc] - matches 'a', 'b', or 'c'
[a-z] - matches any lowercase letter
[A-Z] - matches any uppercase letter
[0-9] - matches any digit
[a-zA-Z] - matches any letter
[^abc] - matches anything EXCEPT 'a', 'b', or 'c'
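A short Python sketch of character classes, assuming the sample string "Cab 42!" (chosen only for illustration):

```python
import re

text = "Cab 42!"

lowercase = re.findall(r"[a-z]", text)    # ['a', 'b']
letters = re.findall(r"[a-zA-Z]", text)   # ['C', 'a', 'b']
digits = re.findall(r"[0-9]", text)       # ['4', '2']
not_abc = re.findall(r"[^abc]", text)     # ['C', ' ', '4', '2', '!']
```

Note that [^abc] is case-sensitive here, so the uppercase "C" is not excluded.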
Shorthand Character Classes
\d - digit [0-9]
\D - non-digit [^0-9]
\w - word character [a-zA-Z0-9_]
\W - non-word character
\s - whitespace (space, tab, newline)
\S - non-whitespace
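The shorthand classes can be tried out with a small Python sketch. The sample text is made up for the example. In Python, \d and \w on str patterns also match non-ASCII digits and letters unless the re.ASCII flag is passed, so the ASCII ranges listed above describe the flagged behavior.

```python
import re

text = "Order 66, aisle_7\tready"

numbers = re.findall(r"\d+", text)              # runs of digits
words = re.findall(r"\w+", text, re.ASCII)      # runs of [a-zA-Z0-9_]
fields = re.split(r"\s+", text)                 # split on any whitespace

print(numbers)  # ['66', '7']
print(words)    # ['Order', '66', 'aisle_7', 'ready']
print(fields)   # ['Order', '66,', 'aisle_7', 'ready']
```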
Quantifiers
Quantifiers specify how many times a pattern should match:
{n} - exactly n times
{n,} - n or more times
{n,m} - between n and m times
* - 0 or more (same as {0,})
+ - 1 or more (same as {1,})
? - 0 or 1 (same as {0,1})
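A minimal Python sketch of counted quantifiers. The helper name matches_fully is hypothetical; it wraps re.fullmatch so the whole string has to fit the pattern.

```python
import re

def matches_fully(pattern, text):
    """Return True if the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

exactly_five = matches_fully(r"\d{5}", "90210")    # True: exactly 5 digits
too_short = matches_fully(r"\d{5}", "9021")        # False: only 4 digits
in_range = matches_fully(r"a{2,4}", "aaa")         # True: 3 is within 2..4
over_range = matches_fully(r"a{2,4}", "aaaaa")     # False: 5 exceeds 4
open_ended = matches_fully(r"a{2,}", "aaaaa")      # True: 2 or more
```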
Greedy vs Lazy Matching
By default, quantifiers are greedy: they match as much as possible. Adding ? after a quantifier makes it lazy (matching as little as possible):
// Given: <div>content</div>
<.*> - greedy: matches "<div>content</div>"
<.*?> - lazy: matches "<div>" only
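The same comparison as a runnable Python sketch, using the example string from above:

```python
import re

html = "<div>content</div>"

greedy = re.search(r"<.*>", html).group()    # runs to the last ">"
lazy = re.search(r"<.*?>", html).group()     # stops at the first ">"
all_tags = re.findall(r"<.*?>", html)        # every lazy match in turn

print(greedy)    # <div>content</div>
print(lazy)      # <div>
print(all_tags)  # ['<div>', '</div>']
```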
Groups and Capturing
Parentheses create groups for applying quantifiers and capturing matches:
// Grouping for quantifiers
(ab)+ - matches "ab", "abab", "ababab"
// Capturing groups for extraction
(\d{3})-(\d{4}) - captures the two parts of a number like 555-1234 separately
// Non-capturing groups (grouping without capturing)
(?:ab)+ - groups but doesn't capture
// Named groups (in supported languages; Python's re module spells this (?P<year>...))
(?<year>\d{4})-(?<month>\d{2})
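A Python sketch of the group types above. Python's re module writes named groups as (?P<name>...), so the pattern differs slightly from the one shown. The sample strings are invented for the example.

```python
import re

# Capturing groups: each pair of parentheses is extracted separately.
phone = re.search(r"(\d{3})-(\d{4})", "Call 555-1234 today")
parts = phone.groups()                    # ('555', '1234')

# Named groups, using Python's (?P<name>...) spelling.
date = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})", "Released 2024-03")
named = date.groupdict()                  # {'year': '2024', 'month': '03'}

# With a capturing group, findall returns the group's last capture;
# with a non-capturing group, it returns the whole match.
capturing = re.findall(r"(ab)+", "abab ab")        # ['ab', 'ab']
non_capturing = re.findall(r"(?:ab)+", "abab ab")  # ['abab', 'ab']
```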
Backreferences
Reference captured groups later in the pattern:
// Find repeated words
\b(\w+)\s+\1\b - matches "the the", "is is"
// \1 refers to whatever was captured by the first group
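As a rough Python sketch, the repeated-word pattern can both find and remove doubled words. The sample sentence is an assumption made for the example.

```python
import re

DOUBLED_WORD = re.compile(r"\b(\w+)\s+\1\b")

text = "This is is the the best"

# group(1) is the word that was repeated.
repeated = [m.group(1) for m in DOUBLED_WORD.finditer(text)]

# In the replacement string, \1 also refers to the first group,
# so each doubled word collapses to a single copy.
cleaned = DOUBLED_WORD.sub(r"\1", text)

print(repeated)  # ['is', 'the']
print(cleaned)   # This is the best
```

The \b at the start keeps the "is" inside "This" from being treated as the first half of a repeat.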
Lookahead and Lookbehind
These assertions match a position without consuming characters:
// Lookahead
foo(?=bar) - matches "foo" only if followed by "bar"
foo(?!bar) - matches "foo" only if NOT followed by "bar"
// Lookbehind
(?<=foo)bar - matches "bar" only if preceded by "foo"
(?<!foo)bar - matches "bar" only if NOT preceded by "foo"
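A small Python sketch that records where each assertion matches. The helper name match_starts and the sample strings are hypothetical. Python's re module requires lookbehind patterns to have a fixed width, which the literal "foo" satisfies.

```python
import re

def match_starts(pattern, text):
    """Return the start index of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]

ahead_text = "foobar foobaz"
followed = match_starts(r"foo(?=bar)", ahead_text)      # [0]: the "foo" in "foobar"
not_followed = match_starts(r"foo(?!bar)", ahead_text)  # [7]: the "foo" in "foobaz"

behind_text = "foobar bazbar"
preceded = match_starts(r"(?<=foo)bar", behind_text)      # [3]: "bar" after "foo"
not_preceded = match_starts(r"(?<!foo)bar", behind_text)  # [10]: "bar" after "baz"

# The assertion itself is not part of the match: only "foo" is returned.
matched_text = re.search(r"foo(?=bar)", ahead_text).group()  # 'foo'
```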
Practical Example: Password Validation
// Password must contain at least one digit, one lowercase,
// one uppercase, and be 8+ characters
^(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{8,}$
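The pattern above can be wrapped in a small Python function. The name is_valid_password is a hypothetical helper. This sketch uses fullmatch so that a trailing newline is not silently accepted by $.

```python
import re

PASSWORD_RE = re.compile(r"^(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{8,}$")

def is_valid_password(candidate):
    """Check digit, lowercase, uppercase, and minimum length of 8."""
    return PASSWORD_RE.fullmatch(candidate) is not None

print(is_valid_password("Passw0rd"))    # True
print(is_valid_password("password1"))   # False: no uppercase letter
print(is_valid_password("Sh0rt"))       # False: fewer than 8 characters
```

Each lookahead scans from the start of the string independently, which is why the three requirements can appear in any order in the password.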
Common Regex Flags
i - Case Insensitive: /hello/i matches "Hello", "HELLO", "hello"
g - Global: Find all matches, not just the first one
m - Multiline: ^ and $ match at line starts/ends, not just at the start/end of the string
s - Dotall: . matches newline characters too
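In Python the flags are passed as constants rather than letters after the pattern. There is no g flag: re.search returns the first match, while re.findall and re.finditer return all of them. A minimal sketch, with sample strings chosen for illustration:

```python
import re

# i: case-insensitive matching
any_case = re.findall(r"hello", "Hello HELLO hello", re.IGNORECASE)

# m: ^ matches at the start of every line, not just the string
lines = "one two\nthree four"
first_only = re.findall(r"^\w+", lines)
each_line = re.findall(r"^\w+", lines, re.MULTILINE)

# s: . also matches a newline
without_dotall = re.search(r"a.b", "a\nb")
with_dotall = re.search(r"a.b", "a\nb", re.DOTALL)

print(any_case)     # ['Hello', 'HELLO', 'hello']
print(first_only)   # ['one']
print(each_line)    # ['one', 'three']
```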
Common Patterns
Email Validation (Basic)
^[\w.-]+@[\w.-]+\.\w{2,}$
URL Matching
https?:\/\/[\w.-]+(?:\/[\w.-]*)*\/?
Phone Number (US)
^\(?(\d{3})\)?[-.\s]?(\d{3})[-.\s]?(\d{4})$
Date (YYYY-MM-DD)
^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$
IPv4 Address
^(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)$
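A Python sketch that exercises two of the patterns above. The date pattern checks only the shape of the string, so it accepts impossible dates such as February 30. The helper is_real_date is a hypothetical name showing one way to add a calendar check with the standard datetime module.

```python
import re
from datetime import datetime

DATE_RE = re.compile(r"^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$")
IPV4_RE = re.compile(
    r"^(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)$"
)

def is_real_date(text):
    """Shape check with the regex, then a calendar check with datetime."""
    if not DATE_RE.match(text):
        return False
    try:
        datetime.strptime(text, "%Y-%m-%d")
    except ValueError:
        return False
    return True

shape_ok = bool(DATE_RE.match("2024-02-30"))     # True: the shape is valid
bad_month = bool(DATE_RE.match("2024-13-01"))    # False: month 13 rejected
real_date = is_real_date("2024-02-30")           # False: no such day
valid_ip = bool(IPV4_RE.match("192.168.1.1"))    # True
invalid_ip = bool(IPV4_RE.match("256.1.1.1"))    # False: octet above 255
```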
Performance Considerations
Poorly written regex can cause severe performance issues:
- Catastrophic backtracking: Nested quantifiers like (a+)+ can cause exponential time complexity in backtracking engines, typically on input that almost matches.
- Be specific: Use [0-9] instead of .* when you know the format.
- Anchor patterns: Use ^ and $ when matching whole strings.
- Compile once: In performance-critical code, compile regex patterns once and reuse them.
- Test with edge cases: Test patterns against long strings and worst-case inputs.
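Two of the points above, sketched in Python. The names DIGITS_RE, extract_numbers, NESTED, and FLAT are hypothetical. The nested pattern is run only on very short inputs here: on a long run of "a" followed by a non-matching character it can take exponentially long.

```python
import re

# Compile once at module level and reuse the compiled object on every call.
DIGITS_RE = re.compile(r"[0-9]+")

def extract_numbers(lines):
    """Return the digit runs found in each line."""
    return [DIGITS_RE.findall(line) for line in lines]

numbers = extract_numbers(["abc123def456", "none"])
print(numbers)  # [['123', '456'], []]

# A nested quantifier and a flat pattern that accepts the same strings.
NESTED = re.compile(r"^(a+)+$")   # prone to catastrophic backtracking
FLAT = re.compile(r"^a+$")        # same language, linear-time matching

samples = ["a", "aaaa", "aaab", ""]
same_result = all(
    bool(NESTED.match(s)) == bool(FLAT.match(s)) for s in samples
)
print(same_result)  # True
```

When a nested quantifier can be flattened like this without changing what the pattern accepts, the flat form is usually the safer choice.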
Regex in Different Languages
JavaScript
const pattern = /\d+/g;
const matches = "abc123def456".match(pattern); // ["123", "456"]
const result = "hello".replace(/l/g, "L"); // "heLLo"
Python
import re
pattern = re.compile(r'\d+')
matches = pattern.findall('abc123def456') # ['123', '456']
result = re.sub(r'l', 'L', 'hello') # 'heLLo'
PHP
preg_match_all('/\d+/', 'abc123def456', $matches);
// $matches[0] = ['123', '456']
$result = preg_replace('/l/', 'L', 'hello'); // 'heLLo'
Tips for Learning Regex
- Start simple: Master basic patterns before tackling complex ones.
- Use a tester: Visual regex testers help you understand how patterns match.
- Read patterns aloud: "Match one or more digits followed by a space" helps comprehension.
- Build incrementally: Test each part of a complex pattern separately.
- Comment complex patterns: Use verbose mode or comments to explain complex regex.
- Know when not to use regex: Sometimes simple string methods are clearer and faster.
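As an illustration of commenting a complex pattern, here is the US phone number pattern from earlier written in Python's verbose mode (re.VERBOSE), which ignores whitespace outside character classes and allows # comments. The name PHONE_RE and the labels in the comments are choices made for this sketch.

```python
import re

PHONE_RE = re.compile(r"""
    ^\(?(\d{3})\)?   # area code, with optional parentheses
    [-.\s]?          # optional separator
    (\d{3})          # exchange
    [-.\s]?          # optional separator
    (\d{4})$         # line number
""", re.VERBOSE)

with_parens = PHONE_RE.match("(555) 123-4567").groups()
with_dots = PHONE_RE.match("555.123.4567").groups()

print(with_parens)  # ('555', '123', '4567')
print(with_dots)    # ('555', '123', '4567')
```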