Why Encoding Matters
Computers store everything as numbers—including text. Character encoding is the system that maps characters (letters, symbols, emojis) to numeric values. Understanding encoding prevents garbled text, broken applications, and data corruption.
If you've ever seen "Ã©" instead of "é" or "????" instead of emojis, you've encountered an encoding problem. These issues occur when text is written with one encoding but read with another.
ASCII: The Foundation
ASCII (American Standard Code for Information Interchange) was created in 1963 and remains the foundation of text encoding. It uses 7 bits to represent 128 characters:
- 0-31: Control characters (newline, tab, etc.)
- 32-126: Printable characters (letters, digits, punctuation)
- 127: Delete character
Character → ASCII Code
'A' → 65
'a' → 97
'0' → 48
' ' → 32
'\n' → 10
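The mapping above can be checked from Python with the built-in ord() and chr() functions. This is a minimal sketch; the names samples and codes are chosen here for illustration only.

```python
# Map a few characters to their ASCII codes using the built-in ord().
samples = ["A", "a", "0", " ", "\n"]
codes = {ch: ord(ch) for ch in samples}

for ch, code in codes.items():
    # repr() makes the space and the newline visible in the output
    print(f"{ch!r} -> {code}")

# chr() is the inverse of ord()
assert chr(65) == "A"

# Every ASCII code fits in 7 bits (values 0 to 127)
assert all(code < 128 for code in codes.values())
```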
ASCII Limitations
ASCII covers only unaccented English letters, digits, and basic punctuation. It has no accented letters (é, ñ), no non-Latin scripts (中文, العربية, ελληνικά), and certainly no emojis. This limitation led to a proliferation of incompatible encoding schemes.
The Encoding Chaos (1980s-1990s)
Different regions created their own extended ASCII variants:
- ISO-8859-1 (Latin-1): Western European languages
- ISO-8859-5: Cyrillic alphabet
- Windows-1252: Microsoft's Western European variant
- Shift_JIS: Japanese
- GB2312: Simplified Chinese
This created chaos: a document from one system appeared as garbage on another. The byte 0xE9 meant "é" in Latin-1 but "щ" in ISO-8859-5 and "й" in Windows-1251 (both Cyrillic).
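To see how a single byte changes meaning with the encoding, the sketch below decodes 0xE9 under three of the legacy encodings listed above, using Python's standard codec names. The variable names are illustrative.

```python
# One byte, several interpretations: decode 0xE9 under different legacy encodings.
raw = bytes([0xE9])

decoded = {name: raw.decode(name) for name in ("latin-1", "cp1252", "iso-8859-5")}

for name, ch in decoded.items():
    print(f"{name:>10}: {ch} (U+{ord(ch):04X})")
```

The byte never changes; only the table used to interpret it does.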
Unicode: The Universal Standard
Unicode aims to assign a unique number (called a "code point") to every character ever used—from ancient scripts to modern emojis. Currently, Unicode defines over 150,000 characters.
Code Points
Each character has a unique code point written as U+ followed by hexadecimal:
U+0041 → A
U+00E9 → é
U+4E2D → 中
U+1F600 → 😀
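In Python, ord() returns a character's code point as an integer, so the U+ notation can be produced with a format string. The helper name code_point is hypothetical.

```python
def code_point(ch: str) -> str:
    """Format a single character as U+XXXX (at least four hex digits)."""
    return f"U+{ord(ch):04X}"

for ch in ["A", "\u00e9", "\u4e2d", "\U0001F600"]:  # A, é, 中, 😀
    print(ch, code_point(ch))
```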
Unicode Planes
Unicode organizes code points into 17 planes:
- BMP (Basic Multilingual Plane): U+0000 to U+FFFF, most common characters
- SMP (Supplementary Multilingual Plane): U+10000 to U+1FFFF, emojis and historic scripts
- SIP (Supplementary Ideographic Plane): U+20000 to U+2FFFF, rare CJK characters
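Because each plane holds 65,536 (0x10000) code points, the plane number is simply the code point divided by 0x10000. A small sketch, with plane as an illustrative helper name:

```python
def plane(ch: str) -> int:
    """Return the Unicode plane number (0 to 16) of a single character."""
    # Each plane spans 0x10000 code points, so shifting right by 16 bits
    # is the same as integer division by 0x10000.
    return ord(ch) >> 16

print(plane("A"))           # plane 0 (BMP)
print(plane("\U0001F600"))  # plane 1 (SMP), the emoji 😀
print(plane("\U00020000"))  # plane 2 (SIP), a rare CJK ideograph
```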
UTF-8: The Web Standard
Unicode defines what number each character gets. UTF-8 defines how to store those numbers as bytes. It's a variable-width encoding that uses 1 to 4 bytes per character:
- 1 byte (U+0000 to U+007F): ASCII characters (A-Z, 0-9, etc.), identical to ASCII
- 2 bytes (U+0080 to U+07FF): Latin extensions, Greek, Cyrillic, Hebrew, Arabic
- 3 bytes (U+0800 to U+FFFF): Most CJK characters, other BMP characters
- 4 bytes (U+10000 to U+10FFFF): Emojis, historic scripts, rare characters
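To make the variable-width scheme concrete, here is a hand-written encoder that follows the standard UTF-8 bit layout: one byte up to U+007F, two up to U+07FF, three up to U+FFFF, and four up to U+10FFFF. It is a teaching sketch only (the function name utf8_encode is made up here); real code should call str.encode("utf-8").

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 bytes."""
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded in UTF-8")
    if cp < 0x80:          # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:         # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:       # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp < 0x110000:      # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point out of Unicode range")

for ch in ["A", "\u00e9", "\u4e2d", "\U0001F600"]:  # A, é, 中, 😀
    encoded = utf8_encode(ord(ch))
    # The hand-rolled result should match Python's built-in encoder
    assert encoded == ch.encode("utf-8")
    print(ch, len(encoded), encoded.hex(" ").upper())
```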
Why UTF-8 Won
- Backward compatible: ASCII text is valid UTF-8
- Self-synchronizing: Easy to find character boundaries
- No byte-order issues: Unlike UTF-16, no endianness concerns
- Efficient for English: ASCII characters use only 1 byte
Over 98% of web pages use UTF-8. It's the default for HTML5, JSON, and most modern systems.
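The self-synchronizing property comes from the byte layout: every continuation byte has the bit pattern 10xxxxxx, and no lead byte does. Starting from any byte offset, you can therefore step backward until you reach a byte that is not a continuation byte. A minimal sketch, with char_start as a hypothetical helper and valid UTF-8 input assumed:

```python
def char_start(data: bytes, i: int) -> int:
    """Return the offset of the first byte of the character containing offset i."""
    # Continuation bytes look like 10xxxxxx, i.e. (byte & 0xC0) == 0x80.
    while i > 0 and (data[i] & 0xC0) == 0x80:
        i -= 1
    return i

data = "a\u4e2d\U0001F600".encode("utf-8")  # "a中😀": 1 + 3 + 4 bytes
print(data.hex(" ").upper())   # 61 E4 B8 AD F0 9F 98 80
print(char_start(data, 3))     # offset 3 is inside 中, which starts at 1
print(char_start(data, 6))     # offset 6 is inside 😀, which starts at 4
```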
UTF-16 and UTF-32
UTF-16
Uses 2 or 4 bytes per character. Common in Windows, Java, and JavaScript (internal string representation):
- BMP characters: 2 bytes
- Characters outside BMP: 4 bytes (surrogate pairs)
- Has endianness issues (UTF-16LE vs UTF-16BE)
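The surrogate pair for a character outside the BMP can be derived by hand: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves added to 0xD800 and 0xDC00. The sketch below (to_surrogates is an illustrative name) also shows how byte order changes the stored bytes.

```python
def to_surrogates(cp: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points outside the BMP need a surrogate pair")
    offset = cp - 0x10000                # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low

high, low = to_surrogates(0x1F600)       # 😀
print(f"{high:04X} {low:04X}")           # D83D DE00

# The pair matches what Python's big-endian UTF-16 encoder produces
assert "\U0001F600".encode("utf-16-be") == high.to_bytes(2, "big") + low.to_bytes(2, "big")

# Endianness: the same character, two byte orders
print("A".encode("utf-16-le").hex(" "))  # 41 00
print("A".encode("utf-16-be").hex(" "))  # 00 41
```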
UTF-32
Uses exactly 4 bytes per character. Simple but wasteful—rarely used for storage or transmission.
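A quick way to compare the three encodings is to encode the same string with each and count the bytes. The sample string is arbitrary; the "-le" codec variants are used so that no byte order mark is added to the count.

```python
text = "Caf\u00e9 \U0001F600"  # "Café 😀": 6 code points

sizes = {
    name: len(text.encode(name))
    for name in ("utf-8", "utf-16-le", "utf-32-le")
}
print(sizes)

# UTF-32 always uses exactly 4 bytes per code point
assert sizes["utf-32-le"] == 4 * len(text)
```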
Common Encoding Problems
Mojibake (文字化け)
When text is decoded with the wrong encoding:
"Café" in UTF-8: 43 61 66 C3 A9
Decoded as Latin-1: "CafÃ©"
"日本語" in UTF-8: E6 97 A5 E6 9C AC E8 AA 9E
Decoded as Latin-1: "æ¥æ¬èª" (plus invisible control characters)
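The mismatch can be reproduced in Python by encoding with one codec and decoding with another. A minimal sketch; the helper name mojibake is made up for this example.

```python
def mojibake(text: str) -> str:
    """Encode text as UTF-8, then (wrongly) decode the bytes as Latin-1."""
    return text.encode("utf-8").decode("latin-1")

cafe = "Caf\u00e9"                           # "Café"
print(cafe.encode("utf-8").hex(" ").upper())  # 43 61 66 C3 A9
print(mojibake(cafe))                         # CafÃ©

garbled = mojibake("\u65e5\u672c\u8a9e")      # "日本語"
# Bytes 0x97, 0x9C and 0x9E become invisible C1 control characters in Latin-1,
# so only six of the nine decoded characters are printable.
print(len(garbled), "".join(ch for ch in garbled if ch.isprintable()))
```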
Replacement Characters
When bytes don't form valid characters in the target encoding, you see:
- � (U+FFFD): Unicode replacement character
- ? or □: System-dependent substitutes
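Python exposes this behavior through the errors argument of decode() and encode(). By default invalid input raises an exception; with errors="replace" the codec substitutes U+FFFD when decoding and "?" when encoding. A short sketch with illustrative variable names:

```python
latin1_bytes = b"Caf\xe9"  # "Café" encoded as Latin-1; not valid UTF-8

# Strict decoding (the default) refuses invalid bytes
try:
    latin1_bytes.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Lenient decoding substitutes U+FFFD for the bytes it cannot interpret
lenient = latin1_bytes.decode("utf-8", errors="replace")
print(lenient)

# Encoding to a smaller character set substitutes "?" instead
ascii_only = "Caf\u00e9 \U0001F600".encode("ascii", errors="replace")
print(ascii_only)
```

Replacement is lossy: once a character has become U+FFFD or "?", the original byte cannot be recovered.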
Double Encoding
When already-encoded text is encoded again:
"é" → UTF-8 → C3 A9 → treated as Latin-1 → UTF-8 again → C3 83 C2 A9
Result: "Ã©"
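The same chain can be replayed in Python, and reversing the steps often repairs the damage. This sketch assumes the wrong intermediate encoding really was Latin-1; with Windows-1252 or another codec the repair step would need to change accordingly.

```python
once = "\u00e9".encode("utf-8")                  # "é" -> C3 A9
twice = once.decode("latin-1").encode("utf-8")   # misread as Latin-1, encoded again
print(twice.hex(" ").upper())                    # C3 83 C2 A9

visible = twice.decode("utf-8")                  # what the user sees: Ã©
print(visible)

# Undo the damage by reversing the wrong step
repaired = visible.encode("latin-1").decode("utf-8")
print(repaired)
```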
Best Practices
Always Specify UTF-8
<!-- HTML -->
<meta charset="UTF-8">
/* HTTP Header */
Content-Type: text/html; charset=utf-8
/* MySQL */
CREATE DATABASE mydb CHARACTER SET utf8mb4;
# Python
with open('file.txt', 'r', encoding='utf-8') as f:
    text = f.read()
Use utf8mb4 in MySQL
MySQL's utf8 (an alias for utf8mb3) stores at most 3 bytes per character, so it cannot hold characters outside the BMP such as emojis. Use utf8mb4 for full Unicode support:
ALTER TABLE users CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
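Before migrating, it can help to know which strings actually contain 4-byte characters. The check below is an application-side sketch in Python, not a MySQL feature, and needs_utf8mb4 is a hypothetical helper name. It relies on the fact that exactly the code points above U+FFFF take 4 bytes in UTF-8.

```python
def needs_utf8mb4(text: str) -> bool:
    """True if any character in text takes 4 bytes in UTF-8 (code point above U+FFFF)."""
    return any(ord(ch) > 0xFFFF for ch in text)

print(needs_utf8mb4("Caf\u00e9"))          # accented Latin fits in 2 bytes
print(needs_utf8mb4("\u65e5\u672c\u8a9e"))  # 日本語 fits in 3 bytes each
print(needs_utf8mb4("hi \U0001F600"))       # the emoji needs 4 bytes
```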
Handle String Length Carefully
// JavaScript
"Café".length // 4 (code units)
"😀".length // 2 (surrogate pair!)
[..."😀"].length // 1 (code points)
# Python
len("Café")  # 4 (code points)
len("Café".encode())  # 5 (bytes in UTF-8)
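Even code point counts can surprise you, because the same visible text may be stored in composed or decomposed form. The sketch below uses the standard unicodedata module to show how normalization changes the length of "Café", and how the UTF-16 code unit count (what JavaScript's .length reports) can be computed in Python.

```python
import unicodedata

composed = "Caf\u00e9"                                # é as one code point
decomposed = unicodedata.normalize("NFD", composed)   # e + combining acute accent

print(len(composed), len(decomposed))                 # code points
print(len(composed.encode("utf-8")),
      len(decomposed.encode("utf-8")))                # UTF-8 bytes

# Normalizing to NFC before comparing makes the two forms equal again
assert unicodedata.normalize("NFC", decomposed) == composed

emoji = "\U0001F600"
utf16_units = len(emoji.encode("utf-16-le")) // 2     # 2 bytes per code unit
print(len(emoji), utf16_units)
```

Normalizing input to a single form (commonly NFC) before storing or comparing avoids strings that look identical but compare unequal.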
Detecting Encoding
When you receive text without encoding information:
- BOM (Byte Order Mark): Some files start with special bytes indicating encoding
- Heuristics: Libraries like chardet (Python) guess based on byte patterns
- Try UTF-8 first: Most modern content is UTF-8
# BOM bytes at file start
UTF-8: EF BB BF
UTF-16 LE: FF FE
UTF-16 BE: FE FF
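These steps can be combined into a small decoder: check for a BOM, then try UTF-8, then fall back to a single-byte encoding. This is a sketch under stated assumptions: decode_with_fallback is a hypothetical helper, the Latin-1 fallback is an arbitrary choice (it never fails but may be wrong), and the UTF-32 BOMs are an addition here, checked first because the UTF-32 LE BOM begins with the same bytes as the UTF-16 LE one.

```python
import codecs

# (BOM bytes, Python codec that consumes that BOM). Order matters.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

def decode_with_fallback(data: bytes) -> tuple[str, str]:
    """Return (text, codec_used), preferring a BOM, then UTF-8, then Latin-1."""
    for bom, codec in BOMS:
        if data.startswith(bom):
            return data.decode(codec), codec
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Assumption: treat anything else as Latin-1, which accepts every byte
        return data.decode("latin-1"), "latin-1"

print(decode_with_fallback(codecs.BOM_UTF8 + "Caf\u00e9".encode("utf-8")))
print(decode_with_fallback("Caf\u00e9".encode("utf-8")))
print(decode_with_fallback(b"Caf\xe9"))
```

A statistical detector such as chardet could replace the final fallback when the legacy encoding is not known in advance.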
Encoding in URLs
URLs can only contain a limited set of ASCII characters. Non-ASCII characters must be encoded as UTF-8 and then percent-encoded byte by byte:
Café → UTF-8 bytes: 43 61 66 C3 A9
URL encoded: Caf%C3%A9
日本 → UTF-8 bytes: E6 97 A5 E6 9C AC
URL encoded: %E6%97%A5%E6%9C%AC
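Python's standard library performs this conversion with urllib.parse.quote() and reverses it with unquote(); both use UTF-8 by default. A brief sketch using the same two strings:

```python
from urllib.parse import quote, unquote

cafe = quote("Caf\u00e9")       # "Café"
nihon = quote("\u65e5\u672c")   # "日本"
print(cafe)
print(nihon)

# unquote() decodes the percent-escapes back to text
assert unquote(cafe) == "Caf\u00e9"
assert unquote(nihon) == "\u65e5\u672c"
```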
Key Takeaways
- Unicode assigns unique numbers to all characters; UTF-8 is how we store them
- UTF-8 is backward compatible with ASCII and is the web standard
- Always explicitly specify encoding—never assume
- Use utf8mb4 (not utf8) in MySQL for full Unicode support
- String length ≠ byte length ≠ display width
- When in doubt, use UTF-8 everywhere