Ordinary Utils Fast, free tools that respect your time.

Character Encoding: ASCII, UTF-8, and Unicode

Understanding how computers represent text.

Development 12 min read Last updated: June 19, 2026

Why Encoding Matters

Computers store everything as numbers—including text. Character encoding is the system that maps characters (letters, symbols, emojis) to numeric values. Understanding encoding prevents garbled text, broken applications, and data corruption.

If you've ever seen "Ã©" instead of "é" or "????" instead of emojis, you've encountered an encoding problem. These issues occur when text is written with one encoding but read with another.

ASCII: The Foundation

ASCII (American Standard Code for Information Interchange) was created in 1963 and remains the foundation of text encoding. It uses 7 bits to represent 128 characters:

  • 0-31: Control characters (newline, tab, etc.)
  • 32-126: Printable characters (letters, digits, punctuation)
  • 127: Delete character
Character → ASCII Code
'A' → 65
'a' → 97
'0' → 48
' ' → 32
'\n' → 10
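
The table above can be checked in Python, whose built-in `ord` and `chr` convert between a character and its numeric code. This is only a small sketch, and the variable names are illustrative:

```python
# Look up the numeric code for each character from the table above.
samples = ["A", "a", "0", " ", "\n"]
codes = {ch: ord(ch) for ch in samples}

for ch, code in codes.items():
    # repr() (via !r) makes whitespace such as '\n' visible when printed
    print(f"{ch!r} -> {code}")

# chr() goes the other way: from a code back to the character.
round_trip = [chr(code) for code in codes.values()]

# Every ASCII character fits in 7 bits, so its code is below 128.
all_seven_bit = all(code < 128 for code in codes.values())
```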

ASCII Limitations

ASCII only includes English characters. No accented letters (é, ñ), no non-Latin scripts (中文, العربية, ελληνικά), and certainly no emojis. This limitation led to many encoding schemes.

The Encoding Chaos (1980s-1990s)

Different regions created their own extended ASCII variants:

  • ISO-8859-1 (Latin-1): Western European languages
  • ISO-8859-5: Cyrillic alphabet
  • Windows-1252: Microsoft's Western European variant
  • Shift_JIS: Japanese
  • GB2312: Simplified Chinese

This created chaos: a document from one system appeared as garbage on another. The byte 0xE9 meant "é" in Latin-1, "щ" in ISO-8859-5, and "й" in Windows-1251 (both Cyrillic).
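
To see how a single byte reads under different legacy encodings, it can be decoded with several of Python's built-in codecs. A minimal sketch using the standard codec names `latin-1`, `iso8859_5`, and `cp1251` (Python's name for Windows-1251):

```python
# One byte, three legacy encodings, three different characters.
raw = bytes([0xE9])

decoded = {
    "latin-1": raw.decode("latin-1"),      # Western European
    "iso8859_5": raw.decode("iso8859_5"),  # ISO Cyrillic
    "cp1251": raw.decode("cp1251"),        # Windows Cyrillic
}

for name, ch in decoded.items():
    print(f"0xE9 as {name}: {ch} (U+{ord(ch):04X})")
```

The byte never changes; only the table used to interpret it does.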

Unicode: The Universal Standard

Unicode aims to assign a unique number (called a "code point") to every character ever used—from ancient scripts to modern emojis. Currently, Unicode defines over 150,000 characters.

Code Points

Each character has a unique code point written as U+ followed by hexadecimal:

U+0041  →  A
U+00E9  →  é
U+4E2D  →  中
U+1F600 →  😀
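
The U+ notation is just the code point in hexadecimal, padded to at least four digits. The helper below is a hypothetical name used for illustration, not a standard function:

```python
def code_point_label(ch: str) -> str:
    """Format a single character's code point in U+XXXX notation."""
    # At least four hex digits; more are used when the value needs them.
    return f"U+{ord(ch):04X}"

# A, é, 中, 😀 written as escapes so the source file encoding does not matter
labels = [code_point_label(ch) for ch in "A\u00e9\u4e2d\U0001F600"]

# chr() turns a code point number back into a character.
smiley = chr(0x1F600)
```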

Unicode Planes

Unicode organizes code points into 17 planes of 65,536 code points each. The three you are most likely to meet:

  • BMP (Basic Multilingual Plane): U+0000 to U+FFFF—most common characters
  • SMP (Supplementary Multilingual Plane): U+10000 to U+1FFFF—emojis, historic scripts
  • SIP (Supplementary Ideographic Plane): U+20000 to U+2FFFF (rare CJK characters)
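
Because every plane spans 0x10000 code points, a character's plane is its code point divided by 0x10000. A short sketch, where `plane_of` is an illustrative name:

```python
def plane_of(ch: str) -> int:
    """Return the Unicode plane number (0 to 16) of a single character."""
    # Each plane holds 0x10000 (65,536) code points.
    return ord(ch) // 0x10000

planes = [
    plane_of("A"),           # BMP
    plane_of("\u4e2d"),      # 中, also in the BMP
    plane_of("\U0001F600"),  # 😀, in the SMP
    plane_of("\U00020000"),  # a rare CJK ideograph, in the SIP
]

# The highest valid code point, U+10FFFF, sits in the last plane.
last_plane = plane_of(chr(0x10FFFF))
```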

UTF-8: The Web Standard

Unicode defines what number each character gets. UTF-8 defines how to store those numbers as bytes. It's a variable-width encoding:

1 byte: U+0000 to U+007F

ASCII characters (A-Z, 0-9, etc.) — identical to ASCII

2 bytes: U+0080 to U+07FF

Latin extensions, Greek, Cyrillic, Hebrew, Arabic

3 bytes: U+0800 to U+FFFF

Most CJK characters, other BMP characters

4 bytes: U+10000 to U+10FFFF

Emojis, historic scripts, rare characters
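
The ranges above can be turned into a small function and compared against Python's own UTF-8 encoder. This is a sketch for illustration; `utf8_width` is a made-up helper, and real code would normally just call `str.encode`:

```python
def utf8_width(ch: str) -> int:
    """Predict the UTF-8 byte count of one character from its code point."""
    cp = ord(ch)
    if cp <= 0x7F:
        return 1
    if cp <= 0x7FF:
        return 2
    if cp <= 0xFFFF:
        return 3
    return 4

samples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]  # A, é, 中, 😀

predicted = [utf8_width(ch) for ch in samples]
actual = [len(ch.encode("utf-8")) for ch in samples]

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}: {encoded.hex(' ').upper()} ({len(encoded)} bytes)")
```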

Why UTF-8 Won

  • Backward compatible: ASCII text is valid UTF-8
  • Self-synchronizing: Easy to find character boundaries
  • No byte-order issues: Unlike UTF-16, no endianness concerns
  • Efficient for English: ASCII characters use only 1 byte

Over 98% of web pages use UTF-8. It's the default for HTML5, JSON, and most modern systems.
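
Self-synchronization follows from the byte layout: continuation bytes always have the bit pattern 10xxxxxx, and lead bytes never do. Assuming the input is valid UTF-8, a hypothetical helper can back up from any byte offset to the start of the character containing it:

```python
def char_start(data: bytes, index: int) -> int:
    """Back up from an arbitrary byte offset to the start of its character."""
    # Continuation bytes look like 0b10xxxxxx; skip backwards over them.
    while index > 0 and (data[index] & 0xC0) == 0x80:
        index -= 1
    return index

# "Café 😀" as UTF-8 bytes:
# 43 61 66 | C3 A9 | 20 | F0 9F 98 80
data = "Caf\u00e9 \U0001F600".encode("utf-8")

inside_e_acute = char_start(data, 4)  # offset 4 is the A9 continuation byte
inside_emoji = char_start(data, 9)    # offset 9 is the last byte of the emoji
already_at_start = char_start(data, 1)
```

At most three steps backwards are ever needed, since a UTF-8 character is at most four bytes long.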

UTF-16 and UTF-32

UTF-16

Uses 2 or 4 bytes per character. Common in Windows, Java, and JavaScript (internal string representation):

  • BMP characters: 2 bytes
  • Characters outside BMP: 4 bytes (surrogate pairs)
  • Has endianness issues (UTF-16LE vs UTF-16BE)
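
The surrogate pair for a character outside the BMP can be computed by hand and compared with Python's `utf-16-le` and `utf-16-be` codecs. A sketch under the assumption that the input really is above U+FFFF; `to_surrogates` is an illustrative name:

```python
def to_surrogates(ch: str) -> tuple[int, int]:
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    offset = ord(ch) - 0x10000       # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)   # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits
    return high, low

high, low = to_surrogates("\U0001F600")  # 😀

# The same two 16-bit units, stored in opposite byte orders.
little_endian = "\U0001F600".encode("utf-16-le")
big_endian = "\U0001F600".encode("utf-16-be")
```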

UTF-32

Uses exactly 4 bytes per character. Simple but wasteful—rarely used for storage or transmission.
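
To get a feel for the size trade-off, the same short string can be encoded three ways. The sample text is arbitrary, and the `-le` codec variants are used here so that no BOM is added to the byte counts:

```python
text = "Caf\u00e9 \U0001F600"  # "Café 😀": 6 code points

sizes = {
    codec: len(text.encode(codec))
    for codec in ("utf-8", "utf-16-le", "utf-32-le")
}

# UTF-32 is fixed width: always 4 bytes per code point.
fixed_width = sizes["utf-32-le"] == 4 * len(text)

for codec, size in sizes.items():
    print(f"{codec}: {size} bytes")
```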

Common Encoding Problems

Mojibake (文字化け)

When text is decoded with the wrong encoding:

"Café" in UTF-8: 43 61 66 C3 A9
Decoded as Latin-1: "CafÃ©"

"日本語" in UTF-8: E6 97 A5 E6 9C AC E8 AA 9E
Decoded as Latin-1: "æ¥æ¬èª" (bytes 97, 9C, and 9E become invisible control characters)
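
Mojibake is easy to reproduce on purpose: encode with one codec and decode with another. A minimal sketch:

```python
original = "Caf\u00e9"                  # "Café"
utf8_bytes = original.encode("utf-8")   # 43 61 66 C3 A9

# Latin-1 maps every byte to exactly one character, so decoding never
# fails; it just silently produces the wrong text.
garbled = utf8_bytes.decode("latin-1")

print(utf8_bytes.hex(" ").upper())
print(garbled)
```

Four characters went in and five came out, because the two bytes of "é" were each read as a separate Latin-1 character.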

Replacement Characters

When bytes don't form valid characters in the target encoding, you see:

  • � (U+FFFD): Unicode replacement character
  • ? or □: System-dependent substitutes
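
In Python, whether you get an exception or a substitute character depends on the `errors` argument. The sketch below feeds Latin-1 bytes to the UTF-8 decoder, which is one common way these characters appear:

```python
bad = b"Caf\xe9"  # Latin-1 bytes for "Café"; E9 alone is not valid UTF-8

# The default (strict) mode refuses to guess.
try:
    bad.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors="replace" substitutes U+FFFD for the undecodable byte.
replaced = bad.decode("utf-8", errors="replace")

# Encoding has the same option; unencodable characters become "?".
ascii_only = "Caf\u00e9".encode("ascii", errors="replace")
```

Either substitution loses the original byte for good, so it is usually better to fix the decoding step than to replace after the fact.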

Double Encoding

When already-encoded text is encoded again:

"é" → UTF-8 → C3 A9 → treated as Latin-1 → UTF-8 again → C3 83 C2 A9
Result: "Ã©"
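
The chain above can be replayed in Python, and in this specific case also undone. The repair step is a sketch that assumes the wrong decoding was Latin-1 and that no bytes were lost along the way; text misread as Windows-1252 would need `cp1252` instead, and may not be fully recoverable:

```python
original = "\u00e9"  # "é"

once = original.encode("utf-8")                 # C3 A9
twice = once.decode("latin-1").encode("utf-8")  # C3 83 C2 A9

# What a reader sees after decoding the double-encoded bytes as UTF-8.
seen = twice.decode("utf-8")

# Reverse the mistaken step: back to bytes via Latin-1, then decode as UTF-8.
repaired = seen.encode("latin-1").decode("utf-8")
```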

Best Practices

Always Specify UTF-8

<!-- HTML -->
<meta charset="UTF-8">

/* HTTP Header */
Content-Type: text/html; charset=utf-8

/* MySQL */
CREATE DATABASE mydb CHARACTER SET utf8mb4;

# Python
with open('file.txt', 'r', encoding='utf-8') as f:
    text = f.read()
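
As a slightly fuller Python illustration, the round trip below writes and reads a file with the encoding stated on both sides. The file name and sample text are placeholders, and a temporary directory is used so nothing is left behind:

```python
import tempfile
from pathlib import Path

text = "Caf\u00e9 \u65e5\u672c\u8a9e \U0001F600"  # "Café 日本語 😀"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.txt"  # hypothetical file name

    # Name the encoding explicitly instead of relying on the platform default.
    path.write_text(text, encoding="utf-8")
    read_back = path.read_text(encoding="utf-8")

    # 5 + 1 + 9 + 1 + 4 bytes for the five pieces of the string
    size_on_disk = path.stat().st_size
```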

Use utf8mb4 in MySQL

MySQL's utf8 charset (an alias for utf8mb3) stores at most 3 bytes per character, so it cannot hold 4-byte characters such as emojis. Use utf8mb4 for full Unicode support:

ALTER TABLE users CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

Handle String Length Carefully

// JavaScript
"Café".length        // 4 (code units)
"😀".length          // 2 (surrogate pair!)
[..."😀"].length     // 1 (actual characters)

# Python
len("Café")          # 4 (code points)
len("Café".encode()) # 5 (bytes in UTF-8)

Detecting Encoding

When you receive text without encoding information:

  • BOM (Byte Order Mark): Some files start with special bytes indicating encoding
  • Heuristics: Libraries like chardet (Python) guess based on byte patterns
  • Try UTF-8 first: Most modern content is UTF-8
# BOM bytes at file start
UTF-8:    EF BB BF
UTF-16 LE: FF FE
UTF-16 BE: FE FF
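
The BOM check and the "try UTF-8 first" rule can be combined into a simple decoder. This is a rough sketch, not a full detector: `decode_guess` is a hypothetical helper, the Latin-1 fallback is an assumption chosen because it never fails, and UTF-32 (whose little-endian BOM also begins with FF FE) is deliberately ignored:

```python
import codecs

# BOM -> codec that understands and strips that BOM
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

def decode_guess(data: bytes) -> str:
    """Decode bytes using a BOM if present, else UTF-8, else Latin-1."""
    for bom, codec in BOMS:
        if data.startswith(bom):
            return data.decode(codec)
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Assumed last resort: Latin-1 accepts any byte sequence,
        # but the result may be mojibake.
        return data.decode("latin-1")

with_bom = decode_guess(b"\xef\xbb\xbfhi")
utf16_le = decode_guess(b"\xff\xfeh\x00i\x00")
utf16_be = decode_guess(b"\xfe\xff\x00h\x00i")
plain_utf8 = decode_guess("Caf\u00e9".encode("utf-8"))
legacy = decode_guess(b"Caf\xe9")
```

For anything beyond this, a heuristic library such as chardet is the more realistic choice.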

Encoding in URLs

URLs can only contain a limited set of ASCII characters. Non-ASCII characters must be converted to UTF-8 bytes and then percent-encoded:

Café → UTF-8 bytes: 43 61 66 C3 A9
URL encoded: Caf%C3%A9

日本 → UTF-8 bytes: E6 97 A5 E6 9C AC
URL encoded: %E6%97%A5%E6%9C%AC
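
Python's standard library does this conversion with `urllib.parse.quote`, which encodes to UTF-8 by default, and `unquote` reverses it. A minimal sketch reproducing the two examples above:

```python
from urllib.parse import quote, unquote

encoded_cafe = quote("Caf\u00e9")        # "Café"
encoded_nihon = quote("\u65e5\u672c")    # "日本"

# unquote() turns the percent-escapes back into UTF-8 text.
decoded_cafe = unquote(encoded_cafe)

print(encoded_cafe)
print(encoded_nihon)
```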

Key Takeaways

  • Unicode assigns unique numbers to all characters; UTF-8 is how we store them
  • UTF-8 is backward compatible with ASCII and is the web standard
  • Always explicitly specify encoding—never assume
  • Use utf8mb4 (not utf8) in MySQL for full Unicode support
  • String length ≠ byte length ≠ display width
  • When in doubt, use UTF-8 everywhere
