Kordu Tools

Character Encoding Explained: UTF-8, ASCII, Unicode

UTF-8, ASCII, and Unicode explained with byte-level examples. Learn why mojibake happens, how to fix encoding issues, and why UTF-8 powers 98.2% of the web.

iyda · 12 min read
Tags: character encoding, utf-8, ascii, unicode, mojibake

Every character on your screen is stored as a number. The rules mapping numbers to characters are called encodings, and getting them wrong produces garbled text called mojibake. According to W3Techs, UTF-8 is now used by 98.2% of all websites. Yet encoding bugs still account for a stubborn share of production issues because most developers never look below the surface.

This guide breaks down ASCII, Unicode, and UTF-8 at the byte level. You’ll see exactly what happens when encodings mismatch, and how to fix it when they do.


Key Takeaways

  • UTF-8 encodes 98.2% of the web and is backward-compatible with ASCII's 128 characters (W3Techs, 2025).
  • Unicode defines 154,998 characters across 168 scripts. UTF-8, UTF-16, and UTF-32 are different ways to encode them.
  • Mojibake happens when text is decoded with the wrong encoding. The fix is always: identify the original encoding and re-decode.
  • UTF-8 uses 1 to 4 bytes per character, making it space-efficient for Latin text while supporting every script on Earth.


What Is Character Encoding?

Character encoding is a system that maps characters to numbers and then to bytes. The Unicode Consortium maintains the standard used by virtually every modern system. Without an agreed-upon encoding, the bytes 0xC3 0xA9 could mean “é” (decoded as UTF-8) or “Ã©” (the same bytes decoded as Latin-1).

Think of it as a two-step process. First, each character gets a unique number, called a code point. Second, that code point gets translated into one or more bytes for storage or transmission. ASCII, UTF-8, and Latin-1 are all different answers to that second step.
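The two steps are easy to see in Python, where `ord()` performs step one and `str.encode()` performs step two. A minimal sketch:

```python
# Step 1: character -> code point (encoding-independent number)
ch = "é"
code_point = ord(ch)                  # 233, written U+00E9

# Step 2: code point -> bytes (depends on the chosen encoding)
utf8_bytes = ch.encode("utf-8")       # b'\xc3\xa9' (two bytes)
latin1_bytes = ch.encode("latin-1")   # b'\xe9' (one byte)

print(hex(code_point), utf8_bytes.hex(), latin1_bytes.hex())
```

The same code point produces different bytes under different encodings, which is exactly where mismatches begin.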

The confusion starts when the sender uses one encoding and the receiver assumes another. That’s the entire cause of mojibake. It’s not corruption. The bytes are fine. The interpretation is wrong.


Citation capsule: Character encoding maps characters to bytes. The Unicode Consortium defines 154,998 characters across 168 scripts (Unicode 16.0, 2024), and UTF-8 is used on 98.2% of the web (W3Techs, 2025).

How Does ASCII Work?

ASCII defines 128 characters using 7 bits per character. Published in 1963 by the American Standards Association, it became the foundation for nearly every encoding that followed. The 128 code points cover English letters, digits, punctuation, and 33 control characters like newline and tab.

Here’s what ASCII looks like at the byte level:

| Character | Decimal | Hex | Binary |
|---|---|---|---|
| A | 65 | 0x41 | 01000001 |
| z | 122 | 0x7A | 01111010 |
| 0 | 48 | 0x30 | 00110000 |
| Space | 32 | 0x20 | 00100000 |
| Newline | 10 | 0x0A | 00001010 |
| ~ | 126 | 0x7E | 01111110 |
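You can verify rows of this table yourself. A quick sketch in Python, which also previews why ASCII's simplicity mattered later:

```python
# Verify a few rows of the ASCII table; each character is one 7-bit value.
for ch, dec, hx in [("A", 65, 0x41), ("z", 122, 0x7A), ("0", 48, 0x30), ("~", 126, 0x7E)]:
    assert ord(ch) == dec == hx
    # The same character produces the same single byte in ASCII and UTF-8.
    assert ch.encode("ascii") == ch.encode("utf-8")

print(format(ord("A"), "08b"))  # 01000001
```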

The problem? 128 characters only covers English. No accented letters, no Chinese, no Arabic, no emoji. By the 1980s, dozens of incompatible “extended ASCII” encodings had appeared: Latin-1, Windows-1252, ISO-8859-5, Shift_JIS. Each used the upper half of the byte (positions 128-255) differently. A file encoded in Latin-1 and opened as Shift_JIS produced garbage. This proliferation of regional encodings is what made the web’s early years a minefield. If you worked with multilingual data before 2000, you remember the pain.

ASCII is a subset of UTF-8

Every valid ASCII byte (0x00 through 0x7F) is also valid UTF-8 with the same meaning. This backward compatibility is the single biggest reason UTF-8 won the encoding wars.

What Is Unicode and Why Does It Matter?

Unicode assigns a unique code point to 154,998 characters across 168 scripts, according to Unicode 16.0 released in September 2024. It separates the problem of “which characters exist” from “how to store them as bytes.” That separation is the key insight.

A code point is written as U+XXXX. The letter “A” is U+0041. The euro sign “€” is U+20AC. The emoji “🎉” is U+1F389. Unicode doesn’t say how many bytes each one takes. That’s the job of an encoding format like UTF-8 or UTF-16.
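The separation is visible in code: the code point is fixed, while the byte count varies by encoding. A minimal Python sketch using the same three characters:

```python
# Code points are encoding-independent; byte counts depend on the encoding.
for ch, cp in [("A", 0x0041), ("€", 0x20AC), ("🎉", 0x1F389)]:
    assert ord(ch) == cp
    print(f"U+{cp:04X} -> {len(ch.encode('utf-8'))} UTF-8 byte(s)")
```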

The Unicode planes

Unicode organizes code points into 17 planes of 65,536 code points each:

| Plane | Range | Name | What It Contains |
|---|---|---|---|
| 0 | U+0000 to U+FFFF | Basic Multilingual Plane (BMP) | Most common characters: Latin, Greek, Cyrillic, CJK, symbols |
| 1 | U+10000 to U+1FFFF | Supplementary Multilingual Plane | Emoji, historic scripts, musical symbols |
| 2 | U+20000 to U+2FFFF | Supplementary Ideographic Plane | Rare CJK characters |
| 3-13 | U+30000 to U+DFFFF | Unassigned | Reserved for future use |
| 14 | U+E0000 to U+EFFFF | Supplementary Special-purpose | Tag characters, variation selectors |
| 15-16 | U+F0000 to U+10FFFF | Private Use Areas | Custom characters for private agreements |

But can a single standard really cover every script? Yes. Unicode is the closest thing to a solved problem in computing. The remaining debates are about emoji selection, not the architecture itself.

Citation capsule: Unicode 16.0 defines 154,998 characters spanning 168 writing systems (Unicode Consortium, 2024). It separates character identity (code points) from byte representation (encodings like UTF-8 and UTF-16).


How Does UTF-8 Encoding Actually Work?

UTF-8 uses 1 to 4 bytes per character, depending on the code point’s value. Invented by Ken Thompson and Rob Pike in 1992 at Bell Labs, it’s now the dominant encoding on the web, powering 98.2% of all websites (W3Techs, 2025). Its design is elegant: ASCII characters stay one byte, while rarer characters use more.

Here are the rules:

| Code Point Range | Bytes | Byte Pattern | Example |
|---|---|---|---|
| U+0000 to U+007F | 1 | 0xxxxxxx | A → 0x41 |
| U+0080 to U+07FF | 2 | 110xxxxx 10xxxxxx | é → 0xC3 0xA9 |
| U+0800 to U+FFFF | 3 | 1110xxxx 10xxxxxx 10xxxxxx | € → 0xE2 0x82 0xAC |
| U+10000 to U+10FFFF | 4 | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | 🎉 → 0xF0 0x9F 0x8E 0x89 |

Decoding UTF-8 step by step

Let’s decode “é” (U+00E9) manually. The code point 0xE9 (decimal 233) falls in the 2-byte range (U+0080 to U+07FF). The pattern is 110xxxxx 10xxxxxx.

  1. Convert 233 to binary and pad it to the pattern’s 11 payload bits: 00011101001
  2. Split into 5 + 6 bits: 00011 and 101001
  3. Fill the pattern: 11000011 10101001
  4. In hex: 0xC3 0xA9
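The four steps above can be sketched in a few lines of Python, with bit masks standing in for the byte patterns:

```python
cp = 0xE9                           # code point for "é"
bits = format(cp, "011b")           # pad to 11 bits: '00011101001'
b1 = 0b11000000 | int(bits[:5], 2)  # fill 110xxxxx with the high 5 bits -> 0xC3
b2 = 0b10000000 | int(bits[5:], 2)  # fill 10xxxxxx with the low 6 bits  -> 0xA9
assert bytes([b1, b2]) == "é".encode("utf-8")
```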

That’s it. When a UTF-8 decoder sees 0xC3, the leading bits 110 tell it “this is a 2-byte sequence, read one more byte.”

This design also makes UTF-8 self-synchronizing, an underappreciated property: continuation bytes always start with 10, so you can jump to any arbitrary byte offset in a UTF-8 stream and find the next valid character boundary within at most 3 bytes by scanning for a byte that doesn’t start with 10. That’s why UTF-8 works so well for streaming, chunked transfer, and parallel processing.

Citation capsule: UTF-8 encodes characters in 1 to 4 bytes using a self-synchronizing bit pattern. Designed by Ken Thompson and Rob Pike in 1992, it now dominates with 98.2% web usage (W3Techs, 2025).
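A small Python sketch of the self-synchronization trick: land at an arbitrary byte offset, skip continuation bytes, and decode cleanly from the next boundary.

```python
data = "héllo wörld".encode("utf-8")
i = 2  # an arbitrary offset that lands mid-character (inside the bytes for "é")

# Scan forward past continuation bytes (10xxxxxx) to the next boundary.
while i < len(data) and (data[i] & 0b11000000) == 0b10000000:
    i += 1

print(data[i:].decode("utf-8"))  # decodes cleanly: "llo wörld"
```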

What About UTF-16 and UTF-32?

UTF-16 uses 2 or 4 bytes per character, while UTF-32 uses exactly 4 bytes for every character. According to the ICU Project documentation, UTF-16 is the internal encoding of Java, JavaScript, .NET, and Windows. UTF-32 sees almost no use on the web but is common in internal processing where fixed-width access matters.

| Encoding | Bytes per Character | ASCII Efficiency | BMP Efficiency | Used By |
|---|---|---|---|---|
| UTF-8 | 1-4 | 1 byte (excellent) | 1-3 bytes | Web, Linux, macOS, most APIs |
| UTF-16 | 2-4 | 2 bytes (wasteful) | 2 bytes | Java, JavaScript, .NET, Windows internals |
| UTF-32 | 4 | 4 bytes (very wasteful) | 4 bytes | Internal processing, some databases |

The surrogate pair problem

UTF-16 needs special handling for characters outside the BMP (code points above U+FFFF). It uses surrogate pairs: two 16-bit code units that together represent one character. The emoji “🎉” (U+1F389) becomes the surrogate pair 0xD83C 0xDF89 in UTF-16.
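The surrogate-pair arithmetic can be checked in Python (which exposes UTF-16 through `str.encode`, even though its own strings aren't UTF-16): subtract 0x10000, then split the remaining 20 bits across the two surrogates.

```python
cp = ord("🎉") - 0x10000        # 0xF389: offset above the BMP
high = 0xD800 + (cp >> 10)      # high surrogate takes the top 10 bits -> 0xD83C
low = 0xDC00 + (cp & 0x3FF)     # low surrogate takes the bottom 10 bits -> 0xDF89
assert (high, low) == (0xD83C, 0xDF89)

# Cross-check against Python's own UTF-16 encoder (big-endian, no BOM):
assert "🎉".encode("utf-16-be").hex() == "d83cdf89"
```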

This is why "🎉".length returns 2 in JavaScript, not 1. JavaScript strings are UTF-16 internally. If you’ve ever had string length calculations break on emoji, now you know why.

JavaScript string length is misleading

In JavaScript, "café".length returns 4, but "🎉".length returns 2. Use the spread operator, [..."🎉"].length, or Array.from("🎉").length for code-point counts. The Intl.Segmenter API handles grapheme clusters properly.

Citation capsule: UTF-16 is the internal encoding of Java, JavaScript, and .NET (ICU Project). Characters above U+FFFF require surrogate pairs, which is why "🎉".length === 2 in JavaScript.

Why Does Mojibake Happen?

Mojibake occurs when bytes encoded in one character set are decoded using a different one. A Stack Overflow Developer Survey from 2023 found that encoding issues ranked among the top 10 most frustrating bugs developers face. The word “mojibake” itself comes from Japanese, roughly translating to “character transformation.”

Here’s what common mojibake looks like in practice:

| You See | Original Text | What Happened |
|---|---|---|
| Ã© | é | UTF-8 bytes (0xC3 0xA9) decoded as Latin-1 |
| Ã¼ | ü | UTF-8 bytes (0xC3 0xBC) decoded as Latin-1 |
| â€™ | ' | UTF-8 right single quote (3 bytes) decoded as Windows-1252 |
| â€“ | – | UTF-8 en dash (0xE2 0x80 0x93) decoded as Windows-1252 |
| ÐŸÑ€Ð¸Ð²ÐµÑ‚ | Привет | UTF-8 Russian text decoded as Latin-1 |
| æ—¥æœ¬èªž | 日本語 | UTF-8 Japanese decoded as Latin-1 |

The most insidious mojibake I’ve seen wasn’t garbled characters. It was “smart quotes” silently replaced with question marks in a database migration. The data looked fine in the web UI because the browser substituted missing characters. But downstream systems processing the raw bytes choked. Always check your data at the byte level, not just visually.

The double-encoding trap

The worst mojibake comes from double encoding. Here’s how it happens:

  1. Text “café” is encoded as UTF-8: 63 61 66 C3 A9
  2. A system incorrectly treats those bytes as Latin-1 and re-encodes them to UTF-8
  3. The 0xC3 byte (which Latin-1 reads as “Ã”) becomes C3 83, and 0xA9 (which Latin-1 reads as “©”) becomes C2 A9
  4. Result: “café” is stored as 63 61 66 C3 83 C2 A9, which displays as “cafÃ©”

Each round of double-encoding makes the problem harder to reverse. Two rounds is recoverable. Three or more often isn’t.
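The double-encoding sequence, and its reversal, can be reproduced in a few lines of Python:

```python
text = "café"
once = text.encode("utf-8")                     # 63 61 66 C3 A9

# A buggy system misreads those bytes as Latin-1 and re-encodes as UTF-8:
twice = once.decode("latin-1").encode("utf-8")  # 63 61 66 C3 83 C2 A9
assert twice.hex() == "636166c383c2a9"

# One round is reversible by undoing the mistaken step in the other direction:
assert twice.decode("utf-8").encode("latin-1").decode("utf-8") == text
```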

How Do You Fix Encoding Issues?

The fix for encoding problems is always the same: identify the original encoding and re-decode the bytes correctly. According to Mozilla’s MDN Web Docs, the Content-Type header’s charset parameter is the primary mechanism browsers use to determine encoding. Getting this header right prevents most web-facing mojibake.

Step-by-step diagnosis

  1. Check the raw bytes. Open the file in a hex editor. If you see C3 A9, the source is UTF-8 for “é.” If you see E9 alone, it’s Latin-1.
  2. Check the declared encoding. Look at the HTTP Content-Type header, the HTML <meta charset> tag, or the file’s BOM.
  3. Try re-decoding. In Python: broken_text.encode('latin-1').decode('utf-8') often reverses Latin-1 misinterpretation of UTF-8 bytes.
  4. Check your database connection. MySQL’s SET NAMES utf8mb4 and PostgreSQL’s client_encoding must match the actual encoding of your data.

The golden rule of encoding

Declare UTF-8 everywhere. Set <meta charset="utf-8"> in HTML, Content-Type: text/html; charset=utf-8 in headers, utf8mb4 in MySQL, and save all source files as UTF-8 without BOM.

Quick fixes by language

# Python: fix UTF-8 decoded as Latin-1
broken = "cafÃ©"
fixed = broken.encode('latin-1').decode('utf-8')
# Result: "café"

// JavaScript: decode a Uint8Array as UTF-8
const decoder = new TextDecoder('utf-8');
const text = decoder.decode(uint8Array);

# Bash: convert a file from Latin-1 to UTF-8
iconv -f ISO-8859-1 -t UTF-8 input.txt > output.txt


Citation capsule: Mojibake happens when text bytes are decoded with the wrong encoding. The Content-Type header’s charset parameter is the primary mechanism for declaring encoding on the web (MDN Web Docs, Mozilla).

How Do Programming Languages Handle Encoding?

Most modern languages default to UTF-8, but the details vary. According to the Go Blog, Go source files are defined as UTF-8. Python 3 also defaults to UTF-8 for source files (Python docs). Java and JavaScript use UTF-16 internally but accept UTF-8 input through their I/O APIs.

| Language | Internal Encoding | String Type | Gotcha |
|---|---|---|---|
| Python 3 | UTF-8 (source), flexible (runtime) | str (Unicode) | len() counts code points, not graphemes |
| JavaScript | UTF-16 | string (UTF-16 code units) | length counts UTF-16 units, not characters |
| Java | UTF-16 | String (char = UTF-16) | charAt() returns UTF-16 code units |
| Go | UTF-8 | string (byte slice) | len() counts bytes, not characters. Use utf8.RuneCountInString() |
| Rust | UTF-8 | String / &str | len() counts bytes. Use .chars().count() for code points |
| C# | UTF-16 | string (char = UTF-16) | Similar to Java. Use StringInfo for grapheme clusters |

The pattern is clear: languages that predated Unicode’s dominance (Java, JavaScript, C#) chose UTF-16. Languages designed later (Go, Rust) chose UTF-8. Knowing your language’s internal encoding prevents subtle string-handling bugs.
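Even in UTF-8-friendly Python, the table's "code points, not graphemes" gotcha bites: a character built from combining marks has length greater than one. A small sketch using the standard library's unicodedata:

```python
import unicodedata

composed = "é"          # one code point: U+00E9
decomposed = "e\u0301"  # two code points: "e" + combining acute accent
assert len(composed) == 1 and len(decomposed) == 2  # len() counts code points
assert composed != decomposed                       # yet both render as "é"

# NFC normalization folds the pair back into the single code point:
assert unicodedata.normalize("NFC", decomposed) == composed
```

Normalizing user input (usually to NFC) before comparing or hashing strings avoids this class of bug.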

What Is a BOM (Byte Order Mark)?

A BOM is the Unicode character U+FEFF placed at the start of a file to signal its encoding and byte order. According to the Unicode Standard, Section 23.8, the BOM is optional for UTF-8 but required in some contexts for UTF-16. In practice, it causes more problems than it solves for UTF-8 files.

| Encoding | BOM Bytes | Required? |
|---|---|---|
| UTF-8 | EF BB BF | No, and generally discouraged |
| UTF-16 BE | FE FF | Recommended |
| UTF-16 LE | FF FE | Recommended |
| UTF-32 BE | 00 00 FE FF | Recommended |
| UTF-32 LE | FF FE 00 00 | Recommended |

Why UTF-8 BOM causes problems

UTF-8 has no byte-order ambiguity (it’s a byte-oriented encoding), so the BOM serves no technical purpose. But its 3 bytes (EF BB BF) can cause real headaches:

  • PHP files with a BOM output those 3 bytes before any content, breaking header() calls and sessions.
  • Shell scripts with a BOM fail because the shebang line #!/bin/bash gets prefixed with invisible bytes.
  • CSV files open incorrectly in some spreadsheet applications when a BOM is present.
  • JSON is explicitly forbidden from having a BOM by RFC 8259, Section 8.1.

The one exception: Microsoft Excel sometimes needs a UTF-8 BOM to correctly detect encoding when opening CSV files. It’s an annoying edge case.

Check for hidden BOMs

If a file “looks fine” in your editor but breaks at runtime, check the first 3 bytes with a hex editor. An invisible UTF-8 BOM (EF BB BF) might be the culprit. In Bash: xxd file.txt | head -1.
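Checking for a BOM programmatically is a one-liner per encoding. A minimal sketch (the `sniff_bom` helper is illustrative, not a standard API); note the BOM table above: longer patterns must be tested first, since the UTF-32 LE BOM begins with the UTF-16 LE BOM.

```python
# Longest patterns first: FF FE 00 00 (UTF-32 LE) starts with FF FE (UTF-16 LE).
BOMS = [
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\xef\xbb\xbf", "utf-8-sig"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
]

def sniff_bom(raw: bytes):
    """Return the codec name implied by a leading BOM, or None."""
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name
    return None

assert sniff_bom(b"\xef\xbb\xbfhello") == "utf-8-sig"
assert sniff_bom(b"#!/bin/bash\n") is None
```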

Citation capsule: The Byte Order Mark (U+FEFF) signals encoding at the start of a file. For UTF-8, it’s 3 bytes (EF BB BF) and generally discouraged because UTF-8 has no byte-order ambiguity (Unicode Standard, Section 23.8).


Frequently Asked Questions

What is the difference between Unicode and UTF-8?

Unicode is a character set, a catalog that assigns a unique number (code point) to every character. UTF-8 is an encoding format, the rules for converting those code points into bytes. Unicode defines what characters exist. UTF-8 defines how to store them. Other encodings like UTF-16 and UTF-32 encode the same Unicode code points differently. According to the Unicode Consortium, Unicode 16.0 covers 154,998 characters.

Why did UTF-8 become the dominant encoding?

UTF-8 dominates because it’s backward-compatible with ASCII, space-efficient for English text, and capable of encoding every Unicode character. It requires no byte-order mark and self-synchronizes, making it robust for network transmission. W3Techs reports 98.2% web adoption as of 2025. No other encoding combines these properties.

How do I detect a file’s encoding?

There’s no 100% reliable way to detect encoding from bytes alone. Tools like chardet (Python) and file (Linux CLI) use statistical heuristics. Check for a BOM at the file’s start. Check HTTP headers or HTML <meta charset> declarations. If all else fails, try decoding as UTF-8 first since it’s the most common encoding. Invalid byte sequences in UTF-8 are easy to detect.
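The "try UTF-8 first" strategy is easy to implement without any third-party library. A minimal sketch (the `guess_decode` name is illustrative):

```python
def guess_decode(raw: bytes) -> str:
    # UTF-8 first: its strict byte patterns make accidental matches unlikely,
    # so invalid sequences raise rather than silently mis-decode.
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 never fails (every byte maps to a character), so it is a
        # last-resort fallback, not real detection.
        return raw.decode("latin-1")

assert guess_decode("café".encode("utf-8")) == "café"
assert guess_decode(b"caf\xe9") == "café"  # lone 0xE9 is invalid UTF-8, valid Latin-1
```

For real heuristic detection across many encodings, reach for chardet or the `file` CLI mentioned above.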

Can UTF-8 encode every character?

Yes. UTF-8 can encode all 1,112,064 valid Unicode code points using 1 to 4 bytes. This covers every script, symbol, and emoji in the Unicode standard. There is no character you can represent in UTF-16 or UTF-32 that you can’t represent in UTF-8. The three encodings are equivalent in coverage, differing only in byte representation.

What is utf8mb4 in MySQL?

MySQL’s utf8 type only supports characters up to 3 bytes, covering the Basic Multilingual Plane but excluding emoji and rare CJK characters. utf8mb4 supports the full 4-byte UTF-8 range. According to MySQL documentation, utf8mb4 is the recommended character set and has been the default since MySQL 8.0. Always use utf8mb4, never utf8.


Wrapping Up

Character encoding isn’t glamorous, but it’s foundational. Every garbled email, every broken CSV import, every mysterious question mark in your database traces back to an encoding mismatch. The rules are simple: use UTF-8 everywhere, declare it explicitly in headers and meta tags, and check your database connection settings.

UTF-8 won the encoding war for good reasons. It’s backward-compatible with ASCII, efficient for the web’s predominantly Latin text, and capable of encoding every character humanity has standardized. With 98.2% web adoption, it’s the safe default for every new project.

When mojibake strikes, don’t panic. Check the raw bytes. Identify the original encoding. Re-decode. And then fix the root cause so it doesn’t happen again.