Character Encoding Explained: UTF-8, ASCII, Unicode
UTF-8, ASCII, and Unicode explained with byte-level examples. Learn why mojibake happens, how to fix encoding issues, and why UTF-8 powers 98.2% of the web.
Every character on your screen is stored as a number. The rules mapping numbers to characters are called encodings, and getting them wrong produces garbled text called mojibake. According to W3Techs, UTF-8 is now used by 98.2% of all websites. Yet encoding bugs still account for a stubborn share of production issues because most developers never look below the surface.
This guide breaks down ASCII, Unicode, and UTF-8 at the byte level. You’ll see exactly what happens when encodings mismatch, and how to fix it when they do.
Key Takeaways
- UTF-8 encodes 98.2% of the web and is backward-compatible with ASCII's 128 characters (W3Techs, 2025).
- Unicode defines 154,998 characters across 168 scripts. UTF-8, UTF-16, and UTF-32 are different ways to encode them.
- Mojibake happens when text is decoded with the wrong encoding. The fix is always: identify the original encoding and re-decode.
- UTF-8 uses 1 to 4 bytes per character, making it space-efficient for Latin text while supporting every script on Earth.
What Is Character Encoding?
Character encoding is a system that maps characters to numbers and then to bytes. The Unicode Consortium maintains the standard used by virtually every modern system. Without an agreed-upon encoding, the byte sequence 0xC3 0xA9 could mean “é” (decoded as UTF-8) or “Ã©” (the same bytes decoded as Latin-1).
Think of it as a two-step process. First, each character gets a unique number, called a code point. Second, that code point gets translated into one or more bytes for storage or transmission. ASCII, UTF-8, and Latin-1 are all different answers to that second step.
The confusion starts when the sender uses one encoding and the receiver assumes another. That’s the entire cause of mojibake. It’s not corruption. The bytes are fine. The interpretation is wrong.
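The two steps are easy to see in Python, where `ord()` exposes the code point and `.encode()` performs the byte translation:

```python
# Step 1: character → code point. Step 2: code point → bytes.
ch = "é"
print(hex(ord(ch)))          # 0xe9 — the Unicode code point U+00E9
print(ch.encode("utf-8"))    # b'\xc3\xa9' — two bytes under UTF-8
print(ch.encode("latin-1"))  # b'\xe9'     — one byte under Latin-1
# Same character, same code point, different bytes: the encoding is step two.
```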
Citation capsule: Character encoding maps characters to bytes. The Unicode Consortium defines 154,998 characters across 168 scripts (Unicode 16.0, 2024), and UTF-8 is used on 98.2% of the web (W3Techs, 2025).
How Does ASCII Work?
ASCII defines 128 characters using 7 bits per character. Published in 1963 by the American Standards Association, it became the foundation for nearly every encoding that followed. The 128 code points cover English letters, digits, punctuation, and 33 control characters like newline and tab.
Here’s what ASCII looks like at the byte level:
| Character | Decimal | Hex | Binary |
|---|---|---|---|
| A | 65 | 0x41 | 01000001 |
| z | 122 | 0x7A | 01111010 |
| 0 | 48 | 0x30 | 00110000 |
| Space | 32 | 0x20 | 00100000 |
| Newline | 10 | 0x0A | 00001010 |
| ~ | 126 | 0x7E | 01111110 |
The problem? 128 characters only covers English. No accented letters, no Chinese, no Arabic, no emoji. By the 1980s, dozens of incompatible “extended ASCII” encodings had appeared: Latin-1, Windows-1252, ISO-8859-5, Shift_JIS. Each used the upper 128 values (positions 128-255), unlocked by the eighth bit, differently. A file encoded in Latin-1 and opened as Shift_JIS produced garbage. The proliferation of regional encodings is what made the web’s early years a minefield. If you worked with multilingual data before 2000, you remember the pain.
ASCII is a subset of UTF-8
Every valid ASCII byte (0x00 through 0x7F) is also valid UTF-8 with the same meaning. This backward compatibility is the single biggest reason UTF-8 won the encoding wars.
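You can verify this compatibility directly; for pure ASCII text, both encodings produce byte-for-byte identical output:

```python
s = "Hello, World!"
# ASCII text encodes identically under ASCII and UTF-8
assert s.encode("ascii") == s.encode("utf-8")
print(s.encode("utf-8").hex(" "))  # every byte is <= 0x7f
```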
What Is Unicode and Why Does It Matter?
Unicode assigns a unique code point to 154,998 characters across 168 scripts, according to Unicode 16.0 released in September 2024. It separates the problem of “which characters exist” from “how to store them as bytes.” That separation is the key insight.
A code point is written as U+XXXX. The letter “A” is U+0041. The euro sign “€” is U+20AC. The emoji “🎉” is U+1F389. Unicode doesn’t say how many bytes each one takes. That’s the job of an encoding format like UTF-8 or UTF-16.
The Unicode planes
Unicode organizes code points into 17 planes of 65,536 code points each:
| Plane | Range | Name | What It Contains |
|---|---|---|---|
| 0 | U+0000 to U+FFFF | Basic Multilingual Plane (BMP) | Most common characters: Latin, Greek, Cyrillic, CJK, symbols |
| 1 | U+10000 to U+1FFFF | Supplementary Multilingual Plane | Emoji, historic scripts, musical symbols |
| 2 | U+20000 to U+2FFFF | Supplementary Ideographic Plane | Rare CJK characters |
| 3-13 | U+30000 to U+DFFFF | Unassigned | Reserved for future use |
| 14 | U+E0000 to U+EFFFF | Supplementary Special-purpose | Tag characters, variation selectors |
| 15-16 | U+F0000 to U+10FFFF | Private Use Areas | Custom characters for private agreements |
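Because each plane holds exactly 65,536 (2^16) code points, a character’s plane is just the bits of its code point above the low 16:

```python
# The plane number is the code point value shifted right by 16 bits
for ch in ("A", "€", "🎉"):
    cp = ord(ch)
    print(f"{ch} = U+{cp:04X}, plane {cp >> 16}")
# A = U+0041, plane 0
# € = U+20AC, plane 0
# 🎉 = U+1F389, plane 1
```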
But can a single standard really cover every script? Yes. Unicode is the closest thing to a solved problem in computing. The remaining debates are about emoji selection, not the architecture itself.
Citation capsule: Unicode 16.0 defines 154,998 characters spanning 168 writing systems (Unicode Consortium, 2024). It separates character identity (code points) from byte representation (encodings like UTF-8 and UTF-16).
How Does UTF-8 Encoding Actually Work?
UTF-8 uses 1 to 4 bytes per character, depending on the code point’s value. Invented by Ken Thompson and Rob Pike in 1992 at Bell Labs, it’s now the dominant encoding on the web, powering 98.2% of all websites (W3Techs, 2025). Its design is elegant: ASCII characters stay one byte, while rarer characters use more.
Here are the rules:
| Code Point Range | Bytes | Byte Pattern | Example |
|---|---|---|---|
| U+0000 to U+007F | 1 | 0xxxxxxx | A → 0x41 |
| U+0080 to U+07FF | 2 | 110xxxxx 10xxxxxx | é → 0xC3 0xA9 |
| U+0800 to U+FFFF | 3 | 1110xxxx 10xxxxxx 10xxxxxx | € → 0xE2 0x82 0xAC |
| U+10000 to U+10FFFF | 4 | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | 🎉 → 0xF0 0x9F 0x8E 0x89 |
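Each row of the table can be checked in one line of Python per character:

```python
# One character from each row of the byte-pattern table
for ch in "Aé€🎉":
    b = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} → {len(b)} byte(s): {b.hex(' ').upper()}")
# U+0041 → 1 byte(s): 41
# U+00E9 → 2 byte(s): C3 A9
# U+20AC → 3 byte(s): E2 82 AC
# U+1F389 → 4 byte(s): F0 9F 8E 89
```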
Decoding UTF-8 step by step
Let’s decode “é” (U+00E9) manually. The code point 0xE9 (decimal 233) falls in the 2-byte range (U+0080 to U+07FF). The pattern is 110xxxxx 10xxxxxx.
- Convert 233 to binary: `11101001`
- Zero-extend to 11 bits and split into 5 + 6: `00011` and `101001`
- Fill the pattern: `11000011 10101001`
- In hex: `0xC3 0xA9`
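The same steps can be written as two lines of bit arithmetic:

```python
# Manually encode U+00E9 into the 110xxxxx 10xxxxxx pattern
cp = 0x00E9                             # 233 = 0b11101001
byte1 = 0b11000000 | (cp >> 6)          # 110xxxxx carries the top 5 bits
byte2 = 0b10000000 | (cp & 0b00111111)  # 10xxxxxx carries the low 6 bits
print(f"{byte1:02X} {byte2:02X}")       # C3 A9
assert bytes([byte1, byte2]).decode("utf-8") == "é"
```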
That’s it. When a UTF-8 decoder sees 0xC3, the leading bits 110 tell it “this is a 2-byte sequence, read one more byte.” UTF-8 is also self-synchronizing: continuation bytes always start with 10, so you can jump to any byte offset in a stream and find the next character boundary within at most 3 bytes by scanning for a byte that doesn’t start with 10. That underappreciated property is why UTF-8 works so well for streaming, chunked transfer, and parallel processing.
Citation capsule: UTF-8 encodes characters in 1 to 4 bytes using a self-synchronizing bit pattern. Designed by Ken Thompson and Rob Pike in 1992, it now dominates with 98.2% web usage (W3Techs, 2025).
What About UTF-16 and UTF-32?
UTF-16 uses 2 or 4 bytes per character, while UTF-32 uses exactly 4 bytes for every character. According to the ICU Project documentation, UTF-16 is the internal encoding of Java, JavaScript, .NET, and Windows. UTF-32 sees almost no use on the web but is common in internal processing where fixed-width access matters.
| Encoding | Bytes per Character | ASCII Efficiency | BMP Efficiency | Used By |
|---|---|---|---|---|
| UTF-8 | 1-4 | 1 byte (excellent) | 1-3 bytes | Web, Linux, macOS, most APIs |
| UTF-16 | 2-4 | 2 bytes (wasteful) | 2 bytes | Java, JavaScript, .NET, Windows internals |
| UTF-32 | 4 | 4 bytes (very wasteful) | 4 bytes | Internal processing, some databases |
The surrogate pair problem
UTF-16 needs special handling for characters outside the BMP (code points above U+FFFF). It uses surrogate pairs: two 16-bit code units that together represent one character. The emoji “🎉” (U+1F389) becomes the surrogate pair 0xD83C 0xDF89 in UTF-16.
This is why "🎉".length returns 2 in JavaScript, not 1. JavaScript strings are UTF-16 internally. If you’ve ever had string length calculations break on emoji, now you know why.
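Python counts code points rather than UTF-16 units, but it can show the same surrogate pair by encoding explicitly:

```python
# The surrogate pair is visible in the UTF-16 (big-endian) byte stream
party = "🎉"  # U+1F389
print(party.encode("utf-16-be").hex(" ").upper())  # D8 3C DF 89
print(len(party))  # 1 — Python's len() counts code points, not UTF-16 units
```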
JavaScript string length is misleading
In JavaScript, "café".length returns 4, but "🎉".length returns 2. Use the spread operator ([..."🎉"].length) or Array.from("🎉").length to count code points instead of UTF-16 units. The Intl.Segmenter API handles grapheme clusters properly.
Citation capsule: UTF-16 is the internal encoding of Java, JavaScript, and .NET (ICU Project). Characters above U+FFFF require surrogate pairs, which is why "🎉".length === 2 in JavaScript.
Why Does Mojibake Happen?
Mojibake occurs when bytes encoded in one character set are decoded using a different one. A Stack Overflow Developer Survey from 2023 found that encoding issues ranked among the top 10 most frustrating bugs developers face. The word “mojibake” itself comes from Japanese, roughly translating to “character transformation.”
Here’s what common mojibake looks like in practice:
| You See | Original Text | What Happened |
|---|---|---|
| Ã© | é | UTF-8 bytes (0xC3 0xA9) decoded as Latin-1 |
| Ã¼ | ü | UTF-8 bytes (0xC3 0xBC) decoded as Latin-1 |
| â€™ | ’ | UTF-8 right single quote (0xE2 0x80 0x99) decoded as Windows-1252 |
| â€œ | – | UTF-8 en dash (0xE2 0x80 0x93) decoded as Windows-1252 |
| ÐŸÑ€Ð¸Ð²ÐµÑ‚ | Привет | UTF-8 Russian text decoded as Windows-1252 |
| æ—¥æœ¬èªž | 日本語 | UTF-8 Japanese decoded as Windows-1252 |
The double-encoding trap
The worst mojibake comes from double encoding. Here’s how it happens:
- Text “café” is encoded as UTF-8: `63 61 66 C3 A9`
- A system incorrectly treats those bytes as Latin-1 and re-encodes them to UTF-8
- The `0xC3` byte (which Latin-1 reads as “Ã”) becomes UTF-8 `C3 83`, and `0xA9` (which Latin-1 reads as “©”) becomes `C2 A9`
- Result: the bytes `63 61 66 C3 83 C2 A9`, which render as “cafÃ©”
Each round of double-encoding makes the problem harder to reverse. Two rounds is recoverable. Three or more often isn’t.
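The full round trip, and the reversal of a single round, looks like this in Python:

```python
text = "café"
once = text.encode("utf-8")                     # 63 61 66 c3 a9
# Bug: the bytes are mistaken for Latin-1 and re-encoded as UTF-8
twice = once.decode("latin-1").encode("utf-8")
print(twice.hex(" "))                           # 63 61 66 c3 83 c2 a9
# Repair: undo exactly one round of the mistake
repaired = twice.decode("utf-8").encode("latin-1").decode("utf-8")
print(repaired)                                 # café
```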
How Do You Fix Encoding Issues?
The fix for encoding problems is always the same: identify the original encoding and re-decode the bytes correctly. According to Mozilla’s MDN Web Docs, the Content-Type header’s charset parameter is the primary mechanism browsers use to determine encoding. Getting this header right prevents most web-facing mojibake.
Step-by-step diagnosis
- Check the raw bytes. Open the file in a hex editor. If you see `C3 A9`, the source is UTF-8 for “é.” If you see `E9` alone, it’s Latin-1.
- Check the declared encoding. Look at the HTTP `Content-Type` header, the HTML `<meta charset>` tag, or the file’s BOM.
- Try re-decoding. In Python: `broken_text.encode('latin-1').decode('utf-8')` often reverses Latin-1 misinterpretation of UTF-8 bytes.
- Check your database connection. MySQL’s `SET NAMES utf8mb4` and PostgreSQL’s `client_encoding` must match the actual encoding of your data.
The golden rule of encoding
Declare UTF-8 everywhere. Set <meta charset="utf-8"> in HTML, Content-Type: text/html; charset=utf-8 in headers, utf8mb4 in MySQL, and save all source files as UTF-8 without BOM.
Quick fixes by language
```python
# Python: fix UTF-8 bytes that were decoded as Latin-1
broken = "cafÃ©"
fixed = broken.encode('latin-1').decode('utf-8')
# Result: "café"
```

```javascript
// JavaScript: decode a Uint8Array as UTF-8
const decoder = new TextDecoder('utf-8');
const text = decoder.decode(uint8Array);
```

```bash
# Bash: convert a file from Latin-1 to UTF-8
iconv -f ISO-8859-1 -t UTF-8 input.txt > output.txt
```
Citation capsule: Mojibake happens when text bytes are decoded with the wrong encoding. The Content-Type header’s charset parameter is the primary mechanism for declaring encoding on the web (MDN Web Docs, Mozilla).
How Do Programming Languages Handle Encoding?
Most modern languages default to UTF-8, but the details vary. According to the Go Blog, Go source files are defined as UTF-8. Python 3 also defaults to UTF-8 for source files (Python docs). Java and JavaScript use UTF-16 internally but accept UTF-8 input through their I/O APIs.
| Language | Internal Encoding | String Type | Gotcha |
|---|---|---|---|
| Python 3 | UTF-8 (source), flexible (runtime) | str (Unicode) | len() counts code points, not graphemes |
| JavaScript | UTF-16 | string (UTF-16 code units) | length counts UTF-16 units, not characters |
| Java | UTF-16 | String (char = UTF-16) | charAt() returns UTF-16 code units |
| Go | UTF-8 | string (byte slice) | len() counts bytes, not characters. Use utf8.RuneCountInString() |
| Rust | UTF-8 | String / &str | len() counts bytes. Use .chars().count() for code points |
| C# | UTF-16 | string (char = UTF-16) | Similar to Java. Use StringInfo for grapheme clusters |
The pattern is clear: languages that predated Unicode’s dominance (Java, JavaScript, C#) chose UTF-16. Languages designed later (Go, Rust) chose UTF-8. Knowing your language’s internal encoding prevents subtle string-handling bugs.
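The code-point-versus-grapheme gotcha from the table is easy to reproduce. A sketch using the standard library's `unicodedata` module, with “é” built from two code points:

```python
import unicodedata

s = "e\u0301"  # 'e' + U+0301 combining acute: one grapheme, two code points
print(len(s))  # 2 — Python's len() counts code points, not graphemes
nfc = unicodedata.normalize("NFC", s)
print(len(nfc), nfc == "é")  # 1 True — NFC composes it into U+00E9
```

Normalizing to NFC before comparing or counting avoids many of these surprises, though true grapheme-cluster counting needs a segmentation library.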
What Is a BOM (Byte Order Mark)?
A BOM is the Unicode character U+FEFF placed at the start of a file to signal its encoding and byte order. According to the Unicode Standard, Section 23.8, the BOM is optional for UTF-8 but required in some contexts for UTF-16. In practice, it causes more problems than it solves for UTF-8 files.
| Encoding | BOM Bytes | Required? |
|---|---|---|
| UTF-8 | EF BB BF | No, and generally discouraged |
| UTF-16 BE | FE FF | Recommended |
| UTF-16 LE | FF FE | Recommended |
| UTF-32 BE | 00 00 FE FF | Recommended |
| UTF-32 LE | FF FE 00 00 | Recommended |
Why UTF-8 BOM causes problems
UTF-8 has no byte-order ambiguity (it’s a byte-oriented encoding), so the BOM serves no technical purpose. But its 3 bytes (EF BB BF) can cause real headaches:
- PHP files with a BOM output those 3 bytes before any content, breaking
header()calls and sessions. - Shell scripts with a BOM fail because the shebang line
#!/bin/bashgets prefixed with invisible bytes. - CSV files open incorrectly in some spreadsheet applications when a BOM is present.
- JSON is explicitly forbidden from having a BOM by RFC 8259, Section 8.1.
The one exception: Microsoft Excel sometimes needs a UTF-8 BOM to correctly detect encoding when opening CSV files. It’s an annoying edge case.
Check for hidden BOMs
If a file “looks fine” in your editor but breaks at runtime, check the first 3 bytes with a hex editor. An invisible UTF-8 BOM (EF BB BF) might be the culprit. In Bash: xxd file.txt | head -1.
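The same check takes a few lines of Python; `has_utf8_bom` here is an illustrative helper, not a standard-library function:

```python
import os
import tempfile

UTF8_BOM = b"\xef\xbb\xbf"

def has_utf8_bom(path):
    """Return True if the file starts with the 3-byte UTF-8 BOM (EF BB BF)."""
    with open(path, "rb") as f:
        return f.read(3) == UTF8_BOM

# Demo: write a throwaway file with a BOM and detect it
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(UTF8_BOM + "looks fine in an editor".encode("utf-8"))
print(has_utf8_bom(f.name))  # True
os.remove(f.name)
```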
Citation capsule: The Byte Order Mark (U+FEFF) signals encoding at the start of a file. For UTF-8, it’s 3 bytes (EF BB BF) and generally discouraged because UTF-8 has no byte-order ambiguity (Unicode Standard, Section 23.8).
Frequently Asked Questions
What is the difference between Unicode and UTF-8?
Unicode is a character set, a catalog that assigns a unique number (code point) to every character. UTF-8 is an encoding format, the rules for converting those code points into bytes. Unicode defines what characters exist. UTF-8 defines how to store them. Other encodings like UTF-16 and UTF-32 encode the same Unicode code points differently. According to the Unicode Consortium, Unicode 16.0 covers 154,998 characters.
Why is UTF-8 the most popular encoding?
UTF-8 dominates because it’s backward-compatible with ASCII, space-efficient for English text, and capable of encoding every Unicode character. It requires no byte-order mark and self-synchronizes, making it robust for network transmission. W3Techs reports 98.2% web adoption as of 2025. No other encoding combines these properties.
How do I detect a file’s encoding?
There’s no 100% reliable way to detect encoding from bytes alone. Tools like chardet (Python) and file (Linux CLI) use statistical heuristics. Check for a BOM at the file’s start. Check HTTP headers or HTML <meta charset> declarations. If all else fails, try decoding as UTF-8 first since it’s the most common encoding. Invalid byte sequences in UTF-8 are easy to detect.
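A minimal sketch of that fallback strategy (the function name `sniff_encoding` is illustrative, and real detectors like chardet are far more thorough):

```python
def sniff_encoding(data: bytes) -> str:
    """Crude heuristic: check for a BOM, then try UTF-8, then fall back."""
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8-sig"
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "latin-1"  # Latin-1 maps every byte, so it never fails

print(sniff_encoding("café".encode("utf-8")))    # utf-8
print(sniff_encoding("café".encode("latin-1")))  # latin-1
```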
Can UTF-8 encode every character?
Yes. UTF-8 can encode all 1,112,064 valid Unicode code points using 1 to 4 bytes. This covers every script, symbol, and emoji in the Unicode standard. There is no character you can represent in UTF-16 or UTF-32 that you can’t represent in UTF-8. The three encodings are equivalent in coverage, differing only in byte representation.
What is utf8mb4 in MySQL?
MySQL’s utf8 type only supports characters up to 3 bytes, covering the Basic Multilingual Plane but excluding emoji and rare CJK characters. utf8mb4 supports the full 4-byte UTF-8 range. According to MySQL documentation, utf8mb4 is the recommended character set and has been the default since MySQL 8.0. Always use utf8mb4, never utf8.
Wrapping Up
Character encoding isn’t glamorous, but it’s foundational. Every garbled email, every broken CSV import, every mysterious question mark in your database traces back to an encoding mismatch. The rules are simple: use UTF-8 everywhere, declare it explicitly in headers and meta tags, and check your database connection settings.
UTF-8 won the encoding war for good reasons. It’s backward-compatible with ASCII, efficient for the web’s predominantly Latin text, and capable of encoding every character humanity has standardized. With 98.2% web adoption, it’s the safe default for every new project.
When mojibake strikes, don’t panic. Check the raw bytes. Identify the original encoding. Re-decode. And then fix the root cause so it doesn’t happen again.