Unicode
Universal Character Encoding
Unicode is a computing industry standard for consistent encoding, representation, and handling of text expressed in most of the world's writing systems. It provides a unique number for every character, regardless of platform, program, or language.
What is Unicode?
Unicode is a character encoding standard that:
- Aims to include all characters from all writing systems of the world
- Assigns a unique code point (number) to each character
- Supports modern and historical scripts, symbols, and emoji
- Enables consistent text representation across different platforms
- Is maintained by the Unicode Consortium
Key Concepts
Code Points
A code point is a numerical value that represents a specific character:
- Expressed as U+XXXX (where XXXX is a hexadecimal number)
- Range from U+0000 to U+10FFFF
- Provides 1,114,112 code points in total, so over 1.1 million characters could be represented
- About 144,000 characters were assigned as of Unicode 14.0 (2021); later versions add more
Examples:
- U+0041: Latin capital letter A
- U+4E2D: Chinese character for "middle" (中)
- U+1F600: Grinning face emoji (😀)
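As a small illustration of the notation above, the following Python sketch converts between characters and their code points. The helper name `format_code_point` is made up for this example; `ord`, `chr`, and `unicodedata.name` are standard Python.

```python
import unicodedata


def format_code_point(ch: str) -> str:
    """Return the U+XXXX notation for a single character."""
    return f"U+{ord(ch):04X}"


# The three examples from the list above, written as escapes so the
# source file itself stays plain ASCII.
samples = ["\u0041", "\u4e2d", "\U0001F600"]

for ch in samples:
    # unicodedata.name() returns the official character name.
    print(format_code_point(ch), unicodedata.name(ch))

# chr() goes the other way: from a number back to a character.
assert chr(0x0041) == "A"
```

The `:04X` format pads to at least four hexadecimal digits, which matches the usual convention of writing U+0041 rather than U+41.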
Planes
Unicode divides its code points into 17 planes, each containing 65,536 code points:
- Basic Multilingual Plane (BMP): U+0000 to U+FFFF
  - Contains most commonly used characters
  - Includes Latin, Greek, Cyrillic, Chinese, Japanese, Korean, etc.
- Supplementary Planes: U+10000 to U+10FFFF
  - Include less common scripts, historical symbols, emoji, etc.
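Because each plane holds exactly 65,536 (0x10000) code points, the plane number is simply the code point shifted right by 16 bits. A minimal sketch, with `plane_of` and `is_bmp` as illustrative names rather than any library API:

```python
MAX_CODE_POINT = 0x10FFFF


def plane_of(code_point: int) -> int:
    """Return the plane number (0 to 16) that contains code_point."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError(f"not a Unicode code point: {code_point:#x}")
    # Each plane spans 0x10000 code points, so dropping the low 16 bits
    # leaves the plane index.
    return code_point >> 16


def is_bmp(code_point: int) -> bool:
    """True if code_point lies in the Basic Multilingual Plane."""
    return plane_of(code_point) == 0


print(plane_of(0x4E2D))   # a BMP character
print(plane_of(0x1F600))  # a supplementary-plane character
```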
Character Properties
Unicode assigns various properties to each character:
- Category (letter, number, punctuation, symbol, etc.)
- Case information (uppercase, lowercase, titlecase)
- Directionality (left-to-right, right-to-left)
- Combining behavior (how characters combine with others)
- Decomposition (equivalent character sequences)
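Several of these properties can be inspected with Python's standard `unicodedata` module. The sketch below is only one way to look them up; the `describe` helper is a name invented for this example.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Collect a few Unicode properties of a single character."""
    return {
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),        # e.g. Lu = uppercase letter
        "bidi": unicodedata.bidirectional(ch),       # e.g. L, R, NSM
        "combining": unicodedata.combining(ch),      # 0 for non-combining characters
        "decomposition": unicodedata.decomposition(ch),
    }


# Latin capital A, e with acute, combining acute accent, Hebrew alef
for ch in ["A", "\u00e9", "\u0301", "\u05d0"]:
    print(f"U+{ord(ch):04X}", describe(ch))

# Case information is exposed through ordinary string methods.
assert "a".upper() == "A"
```

In the output, U+00E9 reports the decomposition `0065 0301`, meaning it is canonically equivalent to "e" followed by the combining acute accent.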
Unicode Encoding Forms
Unicode defines several encoding forms to represent code points in computer memory:
UTF-8
- Variable-length encoding using 1 to 4 bytes per character
- ASCII characters (U+0000 to U+007F) use just 1 byte
- Most common encoding on the web and in modern systems
- Backward compatible with ASCII
- Space-efficient for Latin script text
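The variable length is easy to observe by encoding a few characters and counting bytes. This is a minimal sketch using Python's built-in `str.encode`; the sample characters and the `utf8_length` helper are chosen for illustration.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes UTF-8 needs for the given text."""
    return len(ch.encode("utf-8"))


samples = {
    "U+0041 (ASCII letter)": "\u0041",
    "U+00E9 (Latin-1 letter)": "\u00e9",
    "U+4E2D (CJK ideograph)": "\u4e2d",
    "U+1F600 (emoji)": "\U0001F600",
}

for label, ch in samples.items():
    encoded = ch.encode("utf-8")
    print(f"{label}: {len(encoded)} byte(s), hex {encoded.hex()}")

# ASCII text produces identical bytes in ASCII and UTF-8.
assert "hello".encode("ascii") == "hello".encode("utf-8")
```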
UTF-16
- Variable-length encoding using 2 or 4 bytes per character
- BMP characters use 2 bytes
- Characters outside BMP use 4 bytes (surrogate pairs)
- Used internally by Windows, Java, JavaScript, etc.
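The surrogate-pair arithmetic for characters outside the BMP can be sketched as follows. The function name `to_surrogate_pair` is hypothetical; the result is cross-checked against Python's own `utf-16-be` codec.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(f"U+1F600 -> {high:04X} {low:04X}")

# Cross-check against the built-in encoder (big-endian, no byte order mark).
expected = "\U0001F600".encode("utf-16-be")
assert expected == high.to_bytes(2, "big") + low.to_bytes(2, "big")
```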
UTF-32
- Fixed-length encoding using 4 bytes per character
- Simple mapping between code points and encoded form
- Less space-efficient but easier to process
- Used in some programming environments
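Since every character occupies exactly four bytes, the character at index i starts at byte offset 4 * i. A minimal sketch of that direct mapping, assuming big-endian data without a byte order mark; `code_point_at` is an illustrative helper, not a standard function.

```python
import struct


def code_point_at(utf32_be: bytes, index: int) -> int:
    """Read the code point at a character index from big-endian UTF-32 bytes."""
    # Fixed width: the character at position `index` starts at byte 4 * index.
    (value,) = struct.unpack_from(">I", utf32_be, 4 * index)
    return value


text = "A\u4e2d\U0001F600"
encoded = text.encode("utf-32-be")

print(len(text), "characters ->", len(encoded), "bytes")
for i in range(len(text)):
    print(i, f"U+{code_point_at(encoded, i):04X}")
```

The trade-off is visible in the byte count: the single ASCII letter costs four bytes here, compared with one byte in UTF-8.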
Unicode vs. Earlier Encodings
Unicode addresses limitations of earlier character encodings:
Feature | ASCII | ISO-8859 | Big5/GB | Unicode |
---|---|---|---|---|
Approximate character capacity | 128 | Up to 256 per part | ~20,000 | Over 140,000 |
Multilingual support | No | Limited | Limited | Comprehensive |
Consistency across systems | Yes | No | No | Yes |
Backward compatibility | - | With ASCII | With ASCII | With ASCII (UTF-8) |
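The "consistency across systems" row can be illustrated with a single byte that means different things in different ISO-8859 parts. The sketch below uses Python's built-in codecs; the byte value 0xE9 is just a convenient example.

```python
import unicodedata

raw = bytes([0xE9])  # one byte with no encoding label attached

# The same byte decodes to different characters depending on which
# legacy encoding the reader assumes.
as_latin1 = raw.decode("iso8859_1")
as_greek = raw.decode("iso8859_7")

print("ISO-8859-1:", unicodedata.name(as_latin1))
print("ISO-8859-7:", unicodedata.name(as_greek))

# In Unicode the two characters have distinct code points, so their
# UTF-8 byte sequences cannot be confused with each other.
assert as_latin1.encode("utf-8") != as_greek.encode("utf-8")
```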
Unicode in Practice
Text Processing
- Enables consistent sorting, searching, and comparison
- Provides rules for text segmentation (word boundaries, line breaking)
- Defines normalization forms for equivalent character sequences
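As one small example of Unicode-aware comparison, Python's `str.casefold` applies Unicode case folding, which goes further than simple lowercasing. This is a simplified sketch: `caseless_equal` is a made-up helper, and language-sensitive sorting (collation) is a separate topic it does not cover.

```python
def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings while ignoring case, using Unicode case folding."""
    return a.casefold() == b.casefold()


german = "Stra\u00dfe"  # "Strasse" spelled with sharp s (U+00DF)

# Lowercasing keeps the sharp s, so a naive comparison fails.
print(german.lower() == "strasse")

# Case folding maps the sharp s to "ss", so the comparison succeeds.
print(caseless_equal(german, "STRASSE"))
```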
Internationalization
- Allows software to support multiple languages
- Enables text rendering for complex scripts
- Supports bidirectional text (mixing left-to-right and right-to-left)
Web Development
- The HTML standard requires documents to be encoded as UTF-8, typically declared with `<meta charset="utf-8">`
- CSS supports Unicode for selectors and values
- JavaScript strings are sequences of UTF-16 code units
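One practical consequence of UTF-16 strings is that a string's length in code units can differ from its number of code points. JavaScript's `length` property counts UTF-16 code units, whereas Python's `len` counts code points; the sketch below reproduces the code-unit count in Python. The helper name `utf16_code_units` is an assumption made for this example.

```python
def utf16_code_units(text: str) -> int:
    """Count the UTF-16 code units needed for text (2 bytes per unit)."""
    # "utf-16-le" avoids the byte order mark that plain "utf-16" prepends.
    return len(text.encode("utf-16-le")) // 2


text = "A\U0001F600"  # one BMP character plus one emoji outside the BMP

print("code points:", len(text))
print("UTF-16 code units:", utf16_code_units(text))
```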
Common Unicode Blocks
Unicode organizes characters into logical blocks:
- Basic Latin (ASCII): U+0000 to U+007F
- Latin-1 Supplement: U+0080 to U+00FF
- CJK Unified Ideographs (Chinese, Japanese, Korean): U+4E00 to U+9FFF
- Emoticons: U+1F600 to U+1F64F (emoji are also spread across several other blocks)
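A block lookup is just a range check. The sketch below hard-codes only the four ranges listed above, so it is a toy table rather than the full Unicode block list; note that the range U+1F600 to U+1F64F is formally named "Emoticons". `BLOCKS` and `block_of` are names invented for this example.

```python
from typing import Optional

# (first code point, last code point, block name): sample entries only
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0x1F600, 0x1F64F, "Emoticons"),
]


def block_of(ch: str) -> Optional[str]:
    """Return the block name for ch, or None if it is not in the sample table."""
    code_point = ord(ch)
    for first, last, name in BLOCKS:
        if first <= code_point <= last:
            return name
    return None


for ch in ["A", "\u00e9", "\u4e2d", "\U0001F600", "\u03c0"]:
    print(f"U+{ord(ch):04X}", block_of(ch))
```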
Challenges and Considerations
Normalization
- Multiple ways to represent some characters (e.g., é can be the single code point U+00E9 or the letter e followed by the combining acute accent U+0301)
- Normalization forms (NFC, NFD, NFKC, NFKD) provide standardized representations
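The effect of the normalization forms can be seen with Python's standard `unicodedata.normalize`. This is a minimal sketch; the sample strings are chosen for illustration.

```python
import unicodedata

composed = "\u00e9"     # e with acute as a single code point
decomposed = "e\u0301"  # letter e followed by a combining acute accent

# The two strings look the same when rendered but compare as different.
print(composed == decomposed)

# NFC composes where possible; NFD decomposes.
print(unicodedata.normalize("NFC", decomposed) == composed)
print(unicodedata.normalize("NFD", composed) == decomposed)

# The "K" forms additionally replace compatibility characters, such as
# the single-character "fi" ligature (U+FB01), with plain equivalents.
ligature = "\ufb01"
print(unicodedata.normalize("NFKC", ligature))
```

Normalizing both sides to the same form before comparing or storing text is a common way to avoid mismatches between visually identical strings.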
Rendering Complexity
- Some scripts require complex rendering rules
- Combining marks, ligatures, and contextual forms add complexity
- Proper font support is necessary for correct display
Input Methods
- Inputting thousands of characters requires specialized methods
- Input method editors (IMEs) compose characters for languages such as Chinese, Japanese, and Korean
- Virtual keyboards and character pickers for special symbols
Understanding Unicode is essential for:
- Developing internationalized software
- Working with multilingual text data
- Ensuring proper text handling across different systems
- Supporting global users with diverse language needs
Unicode represents a significant achievement in computing, enabling truly global communication and information processing.