Character Encoding Explained: UTF-8, ASCII, Unicode
UTF-8, ASCII, and Unicode explained with byte-level examples. Learn why mojibake happens, how to fix encoding issues, and why UTF-8 powers 98.2% of the web.
Every character on your screen is stored as a number. The rules mapping numbers to characters are called encodings, and getting them wrong produces garbled text called mojibake. According to W3Techs, UTF-8 is now used by 98.2% of all websites. Yet encoding bugs still account for a stubborn share of production issues because most developers never look below the surface.
This guide breaks down ASCII, Unicode, and UTF-8 at the byte level. You’ll see exactly what happens when encodings mismatch, and how to fix it when they do.
Key Takeaways
- UTF-8 encodes 98.2% of the web and is backward-compatible with ASCII's 128 characters (W3Techs, 2025).
- Unicode defines 154,998 characters across 168 scripts. UTF-8, UTF-16, and UTF-32 are different ways to encode them.
- Mojibake happens when text is decoded with the wrong encoding. The fix is always: identify the original encoding and re-decode.
- UTF-8 uses 1 to 4 bytes per character, making it space-efficient for Latin text while supporting every script on Earth.
What Is Character Encoding?
Character encoding is a system that maps characters to numbers and then to bytes. The Unicode Consortium maintains the standard used by virtually every modern system. Without an agreed-upon encoding, the byte sequence 0xC3 0xA9 could mean “é” (decoded as UTF-8) or “Ã©” (the same bytes decoded as Latin-1).
Think of it as a two-step process. First, each character gets a unique number, called a code point. Second, that code point gets translated into one or more bytes for storage or transmission. ASCII, UTF-8, and Latin-1 are all different answers to that second step.
The confusion starts when the sender uses one encoding and the receiver assumes another. That’s the entire cause of mojibake. It’s not corruption. The bytes are fine. The interpretation is wrong.
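The two steps are easy to see in Python, where `ord()` exposes the code point and `.encode()` performs the byte translation:

```python
# Step 1: character → code point. Step 2: code point → bytes.
ch = "é"
print(hex(ord(ch)))          # 0xe9 — the Unicode code point U+00E9
print(ch.encode("utf-8"))    # b'\xc3\xa9' — two bytes under UTF-8
print(ch.encode("latin-1"))  # b'\xe9'     — one byte under Latin-1
# Same character, same code point, different bytes: the encoding is step two.
```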
Citation capsule: Character encoding maps characters to bytes. The Unicode Consortium defines 154,998 characters across 168 scripts (Unicode 16.0, 2024), and UTF-8 is used on 98.2% of the web (W3Techs, 2025).
How Does ASCII Work?
ASCII defines 128 characters using 7 bits per character. Published in 1963 by the American Standards Association, it became the foundation for nearly every encoding that followed. The 128 code points cover English letters, digits, punctuation, and 33 control characters like newline and tab.
Here’s what ASCII looks like at the byte level:
| Character | Decimal | Hex | Binary |
|---|---|---|---|
| A | 65 | 0x41 | 01000001 |
| z | 122 | 0x7A | 01111010 |
| 0 | 48 | 0x30 | 00110000 |
| Space | 32 | 0x20 | 00100000 |
| Newline | 10 | 0x0A | 00001010 |
| ~ | 126 | 0x7E | 01111110 |
The problem? 128 characters only covers English. No accented letters, no Chinese, no Arabic, no emoji. By the 1980s, dozens of incompatible “extended ASCII” encodings had appeared: Latin-1, Windows-1252, ISO-8859-5, Shift_JIS. Each used the upper 128 values (positions 128-255), unlocked by the eighth bit, differently. A file encoded in Latin-1 and opened as Shift_JIS produced garbage. The proliferation of regional encodings is what made the web’s early years a minefield. If you worked with multilingual data before 2000, you remember the pain.
ASCII is a subset of UTF-8
Every valid ASCII byte (0x00 through 0x7F) is also valid UTF-8 with the same meaning. This backward compatibility is the single biggest reason UTF-8 won the encoding wars.
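You can verify this compatibility directly; for pure ASCII text, both encodings produce byte-for-byte identical output:

```python
s = "Hello, World!"
# ASCII text encodes identically under ASCII and UTF-8
assert s.encode("ascii") == s.encode("utf-8")
print(s.encode("utf-8").hex(" "))  # every byte is <= 0x7f
```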
What Is Unicode and Why Does It Matter?
Unicode assigns a unique code point to 154,998 characters across 168 scripts, according to Unicode 16.0 released in September 2024. It separates the problem of “which characters exist” from “how to store them as bytes.” That separation is the key insight.
A code point is written as U+XXXX. The letter “A” is U+0041. The euro sign “€” is U+20AC. The emoji “🎉” is U+1F389. Unicode doesn’t say how many bytes each one takes. That’s the job of an encoding format like UTF-8 or UTF-16.
The Unicode planes
Unicode organizes code points into 17 planes of 65,536 code points each:
| Plane | Range | Name | What It Contains |
|---|---|---|---|
| 0 | U+0000 to U+FFFF | Basic Multilingual Plane (BMP) | Most common characters: Latin, Greek, Cyrillic, CJK, symbols |
| 1 | U+10000 to U+1FFFF | Supplementary Multilingual Plane | Emoji, historic scripts, musical symbols |
| 2 | U+20000 to U+2FFFF | Supplementary Ideographic Plane | Rare CJK characters |
| 3-13 | U+30000 to U+DFFFF | Unassigned | Reserved for future use |
| 14 | U+E0000 to U+EFFFF | Supplementary Special-purpose | Tag characters, variation selectors |
| 15-16 | U+F0000 to U+10FFFF | Private Use Areas | Custom characters for private agreements |
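Because each plane holds exactly 65,536 (2^16) code points, a character’s plane is just the bits of its code point above the low 16:

```python
# The plane number is the code point value shifted right by 16 bits
for ch in ("A", "€", "🎉"):
    cp = ord(ch)
    print(f"{ch} = U+{cp:04X}, plane {cp >> 16}")
# A = U+0041, plane 0
# € = U+20AC, plane 0
# 🎉 = U+1F389, plane 1
```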
But can a single standard really cover every script? Yes. Unicode is the closest thing to a solved problem in computing. The remaining debates are about emoji selection, not the architecture itself.
Citation capsule: Unicode 16.0 defines 154,998 characters spanning 168 writing systems (Unicode Consortium, 2024). It separates character identity (code points) from byte representation (encodings like UTF-8 and UTF-16).
How Does UTF-8 Encoding Actually Work?
UTF-8 uses 1 to 4 bytes per character, depending on the code point’s value. Invented by Ken Thompson and Rob Pike in 1992 at Bell Labs, it’s now the dominant encoding on the web, powering 98.2% of all websites (W3Techs, 2025). Its design is elegant: ASCII characters stay one byte, while rarer characters use more.
Here are the rules:
| Code Point Range | Bytes | Byte Pattern | Example |
|---|---|---|---|
| U+0000 to U+007F | 1 | 0xxxxxxx | A → 0x41 |
| U+0080 to U+07FF | 2 | 110xxxxx 10xxxxxx | é → 0xC3 0xA9 |
| U+0800 to U+FFFF | 3 | 1110xxxx 10xxxxxx 10xxxxxx | € → 0xE2 0x82 0xAC |
| U+10000 to U+10FFFF | 4 | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | 🎉 → 0xF0 0x9F 0x8E 0x89 |
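Each row of the table can be checked in one line of Python per character:

```python
# One character from each row of the byte-pattern table
for ch in "Aé€🎉":
    b = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} → {len(b)} byte(s): {b.hex(' ').upper()}")
# U+0041 → 1 byte(s): 41
# U+00E9 → 2 byte(s): C3 A9
# U+20AC → 3 byte(s): E2 82 AC
# U+1F389 → 4 byte(s): F0 9F 8E 89
```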
Decoding UTF-8 step by step
Let’s decode “é” (U+00E9) manually. The code point 0xE9 (decimal 233) falls in the 2-byte range (U+0080 to U+07FF). The pattern is 110xxxxx 10xxxxxx.
- Convert 233 to binary: `11101001`
- Zero-extend to 11 bits and split into 5 + 6: `00011` and `101001`
- Fill the pattern: `11000011 10101001`
- In hex: `0xC3 0xA9`
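The same steps can be written as two lines of bit arithmetic:

```python
# Manually encode U+00E9 into the 110xxxxx 10xxxxxx pattern
cp = 0x00E9                             # 233 = 0b11101001
byte1 = 0b11000000 | (cp >> 6)          # 110xxxxx carries the top 5 bits
byte2 = 0b10000000 | (cp & 0b00111111)  # 10xxxxxx carries the low 6 bits
print(f"{byte1:02X} {byte2:02X}")       # C3 A9
assert bytes([byte1, byte2]).decode("utf-8") == "é"
```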
That’s it. When a UTF-8 decoder sees 0xC3, the leading bits 110 tell it “this is a 2-byte sequence, read one more byte.” UTF-8 is also self-synchronizing: continuation bytes always start with 10, so you can jump to any byte offset in a stream and find the next character boundary within at most 3 bytes by scanning for a byte that doesn’t start with 10. That underappreciated property is why UTF-8 works so well for streaming, chunked transfer, and parallel processing.
Citation capsule: UTF-8 encodes characters in 1 to 4 bytes using a self-synchronizing bit pattern. Designed by Ken Thompson and Rob Pike in 1992, it now dominates with 98.2% web usage (W3Techs, 2025).
What About UTF-16 and UTF-32?
UTF-16 uses 2 or 4 bytes per character, while UTF-32 uses exactly 4 bytes for every character. According to the ICU Project documentation, UTF-16 is the internal encoding of Java, JavaScript, .NET, and Windows. UTF-32 sees almost no use on the web but is common in internal processing where fixed-width access matters.
| Encoding | Bytes per Character | ASCII Efficiency | BMP Efficiency | Used By |
|---|---|---|---|---|
| UTF-8 | 1-4 | 1 byte (excellent) | 1-3 bytes | Web, Linux, macOS, most APIs |
| UTF-16 | 2-4 | 2 bytes (wasteful) | 2 bytes | Java, JavaScript, .NET, Windows internals |
| UTF-32 | 4 | 4 bytes (very wasteful) | 4 bytes | Internal processing, some databases |
The surrogate pair problem
UTF-16 needs special handling for characters outside the BMP (code points above U+FFFF). It uses surrogate pairs: two 16-bit code units that together represent one character. The emoji “🎉” (U+1F389) becomes the surrogate pair 0xD83C 0xDF89 in UTF-16.
This is why "🎉".length returns 2 in JavaScript, not 1. JavaScript strings are UTF-16 internally. If you’ve ever had string length calculations break on emoji, now you know why.
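Python counts code points rather than UTF-16 units, but it can show the same surrogate pair by encoding explicitly:

```python
# The surrogate pair is visible in the UTF-16 (big-endian) byte stream
party = "🎉"  # U+1F389
print(party.encode("utf-16-be").hex(" ").upper())  # D8 3C DF 89
print(len(party))  # 1 — Python's len() counts code points, not UTF-16 units
```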
JavaScript string length is misleading
In JavaScript, "café".length returns 4, but "🎉".length returns 2. Use the spread operator ([..."🎉"].length) or Array.from("🎉").length to count code points instead of UTF-16 units. The Intl.Segmenter API handles grapheme clusters properly.
Citation capsule: UTF-16 is the internal encoding of Java, JavaScript, and .NET (ICU Project). Characters above U+FFFF require surrogate pairs, which is why "🎉".length === 2 in JavaScript.
Why Does Mojibake Happen?
Mojibake occurs when bytes encoded in one character set are decoded using a different one. A Stack Overflow Developer Survey from 2023 found that encoding issues ranked among the top 10 most frustrating bugs developers face. The word “mojibake” itself comes from Japanese, roughly translating to “character transformation.”
Here’s what common mojibake looks like in practice:
| You See | Original Text | What Happened |
|---|---|---|
| Ã© | é | UTF-8 bytes (0xC3 0xA9) decoded as Latin-1 |
| Ã¼ | ü | UTF-8 bytes (0xC3 0xBC) decoded as Latin-1 |
| â€™ | ’ | UTF-8 right single quote (0xE2 0x80 0x99) decoded as Windows-1252 |
| â€œ | – | UTF-8 en dash (0xE2 0x80 0x93) decoded as Windows-1252 |
| ÐŸÑ€Ð¸Ð²ÐµÑ‚ | Привет | UTF-8 Russian text decoded as Windows-1252 |
| æ—¥æœ¬èªž | 日本語 | UTF-8 Japanese decoded as Windows-1252 |
The double-encoding trap
The worst mojibake comes from double encoding. Here’s how it happens:
- Text “café” is encoded as UTF-8: `63 61 66 C3 A9`
- A system incorrectly treats those bytes as Latin-1 and re-encodes them to UTF-8
- The `0xC3` byte (which Latin-1 reads as “Ã”) becomes UTF-8 `C3 83`, and `0xA9` (which Latin-1 reads as “©”) becomes `C2 A9`
- Result: the bytes `63 61 66 C3 83 C2 A9`, which render as “cafÃ©”
Each round of double-encoding makes the problem harder to reverse. Two rounds is recoverable. Three or more often isn’t.
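The full round trip, and the reversal of a single round, looks like this in Python:

```python
text = "café"
once = text.encode("utf-8")                     # 63 61 66 c3 a9
# Bug: the bytes are mistaken for Latin-1 and re-encoded as UTF-8
twice = once.decode("latin-1").encode("utf-8")
print(twice.hex(" "))                           # 63 61 66 c3 83 c2 a9
# Repair: undo exactly one round of the mistake
repaired = twice.decode("utf-8").encode("latin-1").decode("utf-8")
print(repaired)                                 # café
```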
How Do You Fix Encoding Issues?
The fix for encoding problems is always the same: identify the original encoding and re-decode the bytes correctly. According to Mozilla’s MDN Web Docs, the Content-Type header’s charset parameter is the primary mechanism browsers use to determine encoding. Getting this header right prevents most web-facing mojibake.
Step-by-step diagnosis
- Check the raw bytes. Open the file in a hex editor. If you see `C3 A9`, the source is UTF-8 for “é.” If you see `E9` alone, it’s Latin-1.
- Check the declared encoding. Look at the HTTP `Content-Type` header, the HTML `<meta charset>` tag, or the file’s BOM.
- Try re-decoding. In Python: `broken_text.encode('latin-1').decode('utf-8')` often reverses Latin-1 misinterpretation of UTF-8 bytes.
- Check your database connection. MySQL’s `SET NAMES utf8mb4` and PostgreSQL’s `client_encoding` must match the actual encoding of your data.
The golden rule of encoding
Declare UTF-8 everywhere. Set <meta charset="utf-8"> in HTML, Content-Type: text/html; charset=utf-8 in headers, utf8mb4 in MySQL, and save all source files as UTF-8 without BOM.
Quick fixes by language
```python
# Python: fix UTF-8 bytes that were decoded as Latin-1
broken = "cafÃ©"
fixed = broken.encode('latin-1').decode('utf-8')
# Result: "café"
```

```javascript
// JavaScript: decode a Uint8Array as UTF-8
const decoder = new TextDecoder('utf-8');
const text = decoder.decode(uint8Array);
```

```bash
# Bash: convert a file from Latin-1 to UTF-8
iconv -f ISO-8859-1 -t UTF-8 input.txt > output.txt
```
Citation capsule: Mojibake happens when text bytes are decoded with the wrong encoding. The Content-Type header’s charset parameter is the primary mechanism for declaring encoding on the web (MDN Web Docs, Mozilla).
How Do Programming Languages Handle Encoding?
Most modern languages default to UTF-8, but the details vary. According to the Go Blog, Go source files are defined as UTF-8. Python 3 also defaults to UTF-8 for source files (Python docs). Java and JavaScript use UTF-16 internally but accept UTF-8 input through their I/O APIs.
| Language | Internal Encoding | String Type | Gotcha |
|---|---|---|---|
| Python 3 | UTF-8 (source), flexible (runtime) | str (Unicode) | len() counts code points, not graphemes |
| JavaScript | UTF-16 | string (UTF-16 code units) | length counts UTF-16 units, not characters |
| Java | UTF-16 | String (char = UTF-16) | charAt() returns UTF-16 code units |
| Go | UTF-8 | string (byte slice) | len() counts bytes, not characters. Use utf8.RuneCountInString() |
| Rust | UTF-8 | String / &str | len() counts bytes. Use .chars().count() for code points |
| C# | UTF-16 | string (char = UTF-16) | Similar to Java. Use StringInfo for grapheme clusters |
The pattern is clear: languages that predated Unicode’s dominance (Java, JavaScript, C#) chose UTF-16. Languages designed later (Go, Rust) chose UTF-8. Knowing your language’s internal encoding prevents subtle string-handling bugs.
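The code-point-versus-grapheme gotcha from the table is easy to reproduce. A sketch using the standard library's `unicodedata` module, with “é” built from two code points:

```python
import unicodedata

s = "e\u0301"  # 'e' + U+0301 combining acute: one grapheme, two code points
print(len(s))  # 2 — Python's len() counts code points, not graphemes
nfc = unicodedata.normalize("NFC", s)
print(len(nfc), nfc == "é")  # 1 True — NFC composes it into U+00E9
```

Normalizing to NFC before comparing or counting avoids many of these surprises, though true grapheme-cluster counting needs a segmentation library.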
What Is a BOM (Byte Order Mark)?
A BOM is the Unicode character U+FEFF placed at the start of a file to signal its encoding and byte order. According to the Unicode Standard, Section 23.8, the BOM is optional for UTF-8 but required in some contexts for UTF-16. In practice, it causes more problems than it solves for UTF-8 files.
| Encoding | BOM Bytes | Required? |
|---|---|---|
| UTF-8 | EF BB BF | No, and generally discouraged |
| UTF-16 BE | FE FF | Recommended |
| UTF-16 LE | FF FE | Recommended |
| UTF-32 BE | 00 00 FE FF | Recommended |
| UTF-32 LE | FF FE 00 00 | Recommended |
Why UTF-8 BOM causes problems
UTF-8 has no byte-order ambiguity (it’s a byte-oriented encoding), so the BOM serves no technical purpose. But its 3 bytes (EF BB BF) can cause real headaches:
- PHP files with a BOM output those 3 bytes before any content, breaking
header()calls and sessions. - Shell scripts with a BOM fail because the shebang line
#!/bin/bashgets prefixed with invisible bytes. - CSV files open incorrectly in some spreadsheet applications when a BOM is present.
- JSON is explicitly forbidden from having a BOM by RFC 8259, Section 8.1.
The one exception: Microsoft Excel sometimes needs a UTF-8 BOM to correctly detect encoding when opening CSV files. It’s an annoying edge case.
Check for hidden BOMs
If a file “looks fine” in your editor but breaks at runtime, check the first 3 bytes with a hex editor. An invisible UTF-8 BOM (EF BB BF) might be the culprit. In Bash: xxd file.txt | head -1.
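The same check takes a few lines of Python; `has_utf8_bom` here is an illustrative helper, not a standard-library function:

```python
import os
import tempfile

UTF8_BOM = b"\xef\xbb\xbf"

def has_utf8_bom(path):
    """Return True if the file starts with the 3-byte UTF-8 BOM (EF BB BF)."""
    with open(path, "rb") as f:
        return f.read(3) == UTF8_BOM

# Demo: write a throwaway file with a BOM and detect it
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(UTF8_BOM + "looks fine in an editor".encode("utf-8"))
print(has_utf8_bom(f.name))  # True
os.remove(f.name)
```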
Citation capsule: The Byte Order Mark (U+FEFF) signals encoding at the start of a file. For UTF-8, it’s 3 bytes (EF BB BF) and generally discouraged because UTF-8 has no byte-order ambiguity (Unicode Standard, Section 23.8).
Frequently Asked Questions
What is the difference between Unicode and UTF-8?
Unicode is a character set, a catalog that assigns a unique number (code point) to every character. UTF-8 is an encoding format, the rules for converting those code points into bytes. Unicode defines what characters exist. UTF-8 defines how to store them. Other encodings like UTF-16 and UTF-32 encode the same Unicode code points differently. According to the Unicode Consortium, Unicode 16.0 covers 154,998 characters.
Why is UTF-8 the most popular encoding?
UTF-8 dominates because it’s backward-compatible with ASCII, space-efficient for English text, and capable of encoding every Unicode character. It requires no byte-order mark and self-synchronizes, making it robust for network transmission. W3Techs reports 98.2% web adoption as of 2025. No other encoding combines these properties.
How do I detect a file’s encoding?
There’s no 100% reliable way to detect encoding from bytes alone. Tools like chardet (Python) and file (Linux CLI) use statistical heuristics. Check for a BOM at the file’s start. Check HTTP headers or HTML <meta charset> declarations. If all else fails, try decoding as UTF-8 first since it’s the most common encoding. Invalid byte sequences in UTF-8 are easy to detect.
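A minimal sketch of that fallback strategy (the function name `sniff_encoding` is illustrative, and real detectors like chardet are far more thorough):

```python
def sniff_encoding(data: bytes) -> str:
    """Crude heuristic: check for a BOM, then try UTF-8, then fall back."""
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8-sig"
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "latin-1"  # Latin-1 maps every byte, so it never fails

print(sniff_encoding("café".encode("utf-8")))    # utf-8
print(sniff_encoding("café".encode("latin-1")))  # latin-1
```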
Can UTF-8 encode every character?
Yes. UTF-8 can encode all 1,112,064 valid Unicode code points using 1 to 4 bytes. This covers every script, symbol, and emoji in the Unicode standard. There is no character you can represent in UTF-16 or UTF-32 that you can’t represent in UTF-8. The three encodings are equivalent in coverage, differing only in byte representation.
What is utf8mb4 in MySQL?
MySQL’s utf8 type only supports characters up to 3 bytes, covering the Basic Multilingual Plane but excluding emoji and rare CJK characters. utf8mb4 supports the full 4-byte UTF-8 range. According to MySQL documentation, utf8mb4 is the recommended character set and has been the default since MySQL 8.0. Always use utf8mb4, never utf8.
Wrapping Up
Character encoding isn’t glamorous, but it’s foundational. Every garbled email, every broken CSV import, every mysterious question mark in your database traces back to an encoding mismatch. The rules are simple: use UTF-8 everywhere, declare it explicitly in headers and meta tags, and check your database connection settings.
UTF-8 won the encoding war for good reasons. It’s backward-compatible with ASCII, efficient for the web’s predominantly Latin text, and capable of encoding every character humanity has standardized. With 98.2% web adoption, it’s the safe default for every new project.
When mojibake strikes, don’t panic. Check the raw bytes. Identify the original encoding. Re-decode. And then fix the root cause so it doesn’t happen again.