🌐 Unicode

🌐 Universal Character Encoding

Unicode is a computing industry standard for consistent encoding, representation, and handling of text expressed in most of the world's writing systems. It provides a unique number for every character, regardless of platform, program, or language.

📝 What is Unicode?

Unicode is a character encoding standard that:

Aims to include all characters from all writing systems of the world
Assigns a unique code point (number) to each character
Supports modern and historical scripts, symbols, and emoji
Enables consistent text representation across different platforms
Is maintained by the Unicode Consortium

🔑 Key Concepts

🔢 Code Points

A code point is a numerical value that represents a specific character:

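Written as U+ followed by four to six hexadecimal digits (for example, U+0041 or U+1F600)
Range from U+0000 to U+10FFFF
That range contains 1,114,112 code points, of which 2,048 are reserved for surrogates and never represent characters on their own
About 144,000 characters were assigned as of Unicode 14.0, and each new version adds more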

Examples:

U+0041: Latin capital letter A
U+4E2D: Chinese character for "middle" (中)
U+1F600: Grinning face emoji (😀)
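
The mapping between characters and code points can be checked directly in most languages. The following is a minimal Python sketch; the helper name `describe` is made up for illustration and is not part of any library.

```python
def describe(ch: str) -> str:
    """Format a single character in U+XXXX notation (at least 4 hex digits)."""
    return f"U+{ord(ch):04X}"

examples = ["A", "中", "😀"]
notations = [describe(ch) for ch in examples]
print(notations)  # ['U+0041', 'U+4E2D', 'U+1F600']

# chr() goes the other way: from a code point back to a character.
print(chr(0x4E2D))  # 中
```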

🗺️ Planes

Unicode divides its code points into 17 planes, each containing 65,536 code points:

Basic Multilingual Plane (BMP): U+0000 to U+FFFF
- Contains most commonly used characters
- Includes Latin, Greek, Cyrillic, Chinese, Japanese, Korean, etc.
Supplementary Planes: U+10000 to U+10FFFF
- Include less common scripts, historical symbols, emoji, etc.
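
Because each plane holds exactly 65,536 (2^16) code points, the plane number is simply the code point shifted right by 16 bits. A small sketch, with `plane_of` and `is_bmp` as hypothetical helper names:

```python
MAX_CODE_POINT = 0x10FFFF

def plane_of(code_point: int) -> int:
    """Return the plane number (0 to 16) that contains the code point."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError(f"not a Unicode code point: {code_point:#x}")
    return code_point >> 16  # each plane spans 0x10000 code points

def is_bmp(code_point: int) -> bool:
    """True if the code point lies in the Basic Multilingual Plane."""
    return plane_of(code_point) == 0

print(plane_of(ord("A")))   # 0  (BMP)
print(plane_of(0x1F600))    # 1  (a supplementary plane)
print(plane_of(0x10FFFF))   # 16 (the last plane)
```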

⚙️ Character Properties

Unicode assigns various properties to each character:

Category (letter, number, punctuation, symbol, etc.)
Case information (uppercase, lowercase, titlecase)
Directionality (left-to-right, right-to-left)
Combining behavior (how characters combine with others)
Decomposition (equivalent character sequences)
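
Several of these properties can be inspected with Python's standard `unicodedata` module. The sketch below is only one way to look them up; the `properties` helper is a name invented for this example, and the values reflect whichever Unicode version ships with the Python interpreter in use.

```python
import unicodedata

def properties(ch: str) -> dict:
    """Collect a few Unicode properties of a single character."""
    return {
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),            # e.g. Lu = uppercase letter
        "bidi": unicodedata.bidirectional(ch),           # e.g. L, R, NSM
        "combining": unicodedata.combining(ch),          # 0 means "not a combining mark"
        "decomposition": unicodedata.decomposition(ch),  # "" if there is none
    }

# Latin A, Hebrew alef, combining acute accent, precomposed é
for ch in ["A", "\u05D0", "\u0301", "\u00E9"]:
    print(f"U+{ord(ch):04X}", properties(ch))
```

Case information is exposed through ordinary string methods such as `str.upper()` and `str.lower()` rather than through `unicodedata`.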

💾 Unicode Encoding Forms

Unicode defines several encoding forms to represent code points in computer memory:

📊 UTF-8

Variable-length encoding using 1 to 4 bytes per code point
ASCII characters (U+0000 to U+007F) use just 1 byte
Most common encoding on the web and in modern systems
Backward compatible with ASCII
Space-efficient for Latin script text
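
The variable length is easy to observe by encoding characters from different ranges. A short Python sketch (the sample characters are chosen only for illustration):

```python
samples = ["A", "\u00e9", "中", "😀"]  # 1-, 2-, 3-, and 4-byte cases

utf8_lengths = {}
for ch in samples:
    encoded = ch.encode("utf-8")
    utf8_lengths[ch] = len(encoded)
    print(f"U+{ord(ch):04X} -> {encoded.hex()} ({len(encoded)} bytes)")

# U+0041 -> 41 (1 bytes)
# U+00E9 -> c3a9 (2 bytes)
# U+4E2D -> e4b8ad (3 bytes)
# U+1F600 -> f09f9880 (4 bytes)

# ASCII text produces identical bytes in ASCII and UTF-8.
print("hello".encode("ascii") == "hello".encode("utf-8"))  # True
```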

📋 UTF-16

Variable-length encoding using 2 or 4 bytes per code point
BMP code points use one 16-bit code unit (2 bytes)
Code points outside the BMP use two 16-bit code units (4 bytes), known as a surrogate pair
Used internally by Windows, Java, JavaScript, etc.
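
The surrogate pair for a supplementary code point is computed by subtracting 0x10000 and splitting the remaining 20 bits into two 10-bit halves. The sketch below implements that arithmetic by hand (the function name `to_surrogates` is hypothetical) and compares the result with Python's own UTF-16 encoder.

```python
def to_surrogates(code_point: int) -> tuple[int, int]:
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points outside the BMP need surrogates")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low

high, low = to_surrogates(0x1F600)
print(f"{high:04X} {low:04X}")  # D83D DE00

# Cross-check against the built-in encoder (big-endian, no byte order mark).
builtin = "\U0001F600".encode("utf-16-be")
print(builtin == high.to_bytes(2, "big") + low.to_bytes(2, "big"))  # True
```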

📈 UTF-32

Fixed-length encoding using 4 bytes per code point
Each code unit holds the code point value directly, so no decoding arithmetic is needed
Less space-efficient than UTF-8 or UTF-16, but the nth code point can be found by simple indexing
Used in some programming environments
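
Since every code point occupies exactly four bytes, decoding UTF-32 amounts to reading fixed-size integers. A minimal sketch using the big-endian variant without a byte order mark:

```python
text = "A中😀"
encoded = text.encode("utf-32-be")
print(len(encoded))  # 12 (3 code points x 4 bytes)

# Read each 4-byte group back as an integer: it is the code point itself.
code_points = [
    int.from_bytes(encoded[i:i + 4], "big")
    for i in range(0, len(encoded), 4)
]
print([f"U+{cp:04X}" for cp in code_points])  # ['U+0041', 'U+4E2D', 'U+1F600']
```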

🔄 Unicode vs. Earlier Encodings

Unicode addresses limitations of earlier character encodings:

Feature	ASCII	ISO-8859	Big5/GB	Unicode
Approximate number of characters	128	256	~20,000	Over 140,000
Multilingual support	No	Limited	Limited	Comprehensive
Consistency across systems	Yes	No	No	Yes
Backward compatibility	-	With ASCII	With ASCII	With ASCII (UTF-8)
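
The practical effect of these differences shows up as soon as text leaves the ASCII range. The sketch below uses Python's `latin-1` codec (ISO-8859-1) as a stand-in for the ISO-8859 family; the sample word is arbitrary.

```python
word = "caf\u00e9"  # "café" with a precomposed é

# ASCII has no é at all.
try:
    word.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)  # False

# ISO-8859-1 and UTF-8 both encode it, but with different bytes.
latin1_bytes = word.encode("latin-1")
utf8_bytes = word.encode("utf-8")
print(latin1_bytes)  # b'caf\xe9'
print(utf8_bytes)    # b'caf\xc3\xa9'

# Reading UTF-8 bytes with the wrong legacy decoder yields garbled text.
mojibake = utf8_bytes.decode("latin-1")
print(mojibake)  # cafÃ©
```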

💻 Unicode in Practice

🔍 Text Processing

Enables consistent sorting, searching, and comparison
Provides rules for text segmentation (word boundaries, line breaking)
Defines normalization forms for equivalent character sequences
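
As one illustration of Unicode-aware comparison, the sketch below normalizes two strings and then applies case folding before comparing them. It is a simplification: `caseless_equal` is a hypothetical helper, and full caseless matching and language-specific sorting involve more steps than shown here.

```python
import unicodedata

def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings ignoring case and composed/decomposed differences."""
    def key(s: str) -> str:
        # Normalize first so é (one code point) and e + U+0301 match,
        # then case-fold so ß and SS compare equal.
        return unicodedata.normalize("NFC", s).casefold()
    return key(a) == key(b)

print("Straße".lower() == "STRASSE".lower())     # False: lower() keeps ß
print(caseless_equal("Straße", "STRASSE"))       # True
print(caseless_equal("caf\u00e9", "CAFE\u0301"))  # True
```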

🌍 Internationalization

Allows software to support multiple languages
Enables text rendering for complex scripts
Supports bidirectional text (mixing left-to-right and right-to-left)

🕸️ Web Development

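HTML5 expects UTF-8; declare it with <meta charset="utf-8"> so browsers do not fall back to a legacy encoding
CSS supports Unicode in selectors and values
JavaScript strings are sequences of UTF-16 code units, so a character outside the BMP counts as 2 in String.length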
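
A common surprise when exchanging data between a backend and JavaScript is that string lengths disagree, because JavaScript's `String.length` counts UTF-16 code units while many other languages count code points. The Python sketch below estimates the JavaScript length of a string; `utf16_code_units` is a name invented for this example.

```python
def utf16_code_units(s: str) -> int:
    """Number of UTF-16 code units in s (what JavaScript's .length reports)."""
    return len(s.encode("utf-16-le")) // 2  # 2 bytes per code unit

for s in ["A中", "😀", "hi😀"]:
    print(repr(s), "code points:", len(s), "UTF-16 units:", utf16_code_units(s))

# 'A中' code points: 2 UTF-16 units: 2
# '😀' code points: 1 UTF-16 units: 2
# 'hi😀' code points: 3 UTF-16 units: 4
```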

📚 Common Unicode Blocks

Unicode organizes characters into logical blocks:

🔤 Basic Latin (ASCII): U+0000 to U+007F
🔡 Latin-1 Supplement: U+0080 to U+00FF
🈶 CJK Unified Ideographs (Chinese, Japanese, Korean): U+4E00 to U+9FFF
😀 Emoticons (emoji faces): U+1F600 to U+1F64F (other emoji are spread across several more blocks)

⚠️ Challenges and Considerations

🔄 Normalization

Some characters have more than one representation (e.g., é can be the single code point U+00E9 or e followed by the combining acute accent U+0301)
Normalization forms (NFC, NFD, NFKC, NFKD) provide standardized representations
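
The four forms can be compared with `unicodedata.normalize` from Python's standard library. In this sketch the sample strings are chosen for illustration: NFC composes, NFD decomposes, and the "K" forms additionally replace compatibility characters such as the ﬁ ligature (U+FB01).

```python
import unicodedata

composed = "\u00e9"      # é as one code point
decomposed = "e\u0301"   # e + combining acute accent

print(composed == decomposed)  # False: different code point sequences
print(unicodedata.normalize("NFC", decomposed) == composed)   # True
print(unicodedata.normalize("NFD", composed) == decomposed)   # True

ligature = "\ufb01"      # the single-character ligature ﬁ
nfc_ligature = unicodedata.normalize("NFC", ligature)
nfkc_ligature = unicodedata.normalize("NFKC", ligature)
print(nfc_ligature == ligature)  # True: NFC leaves compatibility characters alone
print(nfkc_ligature)             # fi (two ordinary letters)
```

Normalizing to a single form before storing or comparing text avoids treating visually identical strings as different.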

🎨 Rendering Complexity

Some scripts require complex rendering rules
Combining marks, ligatures, and contextual forms add complexity
Proper font support is necessary for correct display
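
One consequence is that what a reader sees as a single character may be several code points. The sketch below examines a decomposed é and a family emoji built from three emoji joined by zero width joiners (U+200D); whether the latter renders as one glyph depends on the font, as noted above.

```python
import unicodedata

accented = "e\u0301"  # e + combining acute accent
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"  # man, ZWJ, woman, ZWJ, girl

# Count code points versus combining marks in the accented letter.
marks = [c for c in accented if unicodedata.combining(c) != 0]
print(len(accented), len(marks))  # 2 1

# The family emoji is five code points that a capable font draws as one glyph.
names = [unicodedata.name(c) for c in family]
print(len(family))  # 5
print(names)
```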

⌨️ Input Methods

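Scripts with thousands of characters cannot map one key to each character, so they need specialized input methods
Input method editors (IMEs) compose characters from keystroke sequences, for example in Chinese, Japanese, and Korean
Virtual keyboards and character pickers give access to special symbols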

Understanding Unicode is essential for:

🌐 Developing internationalized software
🗣️ Working with multilingual text data
🔄 Ensuring proper text handling across different systems
🌍 Supporting global users with diverse language needs

Unicode represents a significant achievement in computing, enabling truly global communication and information processing.
