Skip to main content

🌐 Unicode

🌐 Universal Character Encoding

Unicode is a computing industry standard for consistent encoding, representation, and handling of text expressed in most of the world's writing systems. It provides a unique number for every character, regardless of platform, program, or language.

πŸ“ What is Unicode?​

Unicode is a character encoding standard that:

  • Aims to include all characters from all writing systems of the world
  • Assigns a unique code point (number) to each character
  • Supports modern and historical scripts, symbols, and emoji
  • Enables consistent text representation across different platforms
  • Is maintained by the Unicode Consortium

πŸ”‘ Key Concepts​

πŸ”’ Code Points​

A code point is a numerical value that represents a specific character:

  • Expressed as U+XXXX (where XXXX is a hexadecimal number)
  • Range from U+0000 to U+10FFFF
  • Can represent over 1.1 million different characters
  • Currently, about 144,000 characters are assigned

Examples:

  • U+0041: Latin capital letter A
  • U+4E2D: Chinese character for "middle" (δΈ­)
  • U+1F600: Grinning face emoji (πŸ˜€)

πŸ—ΊοΈ Planes​

Unicode divides its code points into 17 planes, each containing 65,536 code points:

  • Basic Multilingual Plane (BMP): U+0000 to U+FFFF
    • Contains most commonly used characters
    • Includes Latin, Greek, Cyrillic, Chinese, Japanese, Korean, etc.
  • Supplementary Planes: U+10000 to U+10FFFF
    • Include less common scripts, historical symbols, emoji, etc.

βš™οΈ Character Properties​

Unicode assigns various properties to each character:

  • Category (letter, number, punctuation, symbol, etc.)
  • Case information (uppercase, lowercase, titlecase)
  • Directionality (left-to-right, right-to-left)
  • Combining behavior (how characters combine with others)
  • Decomposition (equivalent character sequences)

πŸ’Ύ Unicode Encoding Forms​

Unicode defines several encoding forms to represent code points in computer memory:

πŸ“Š UTF-8​

  • Variable-length encoding using 1 to 4 bytes per character
  • ASCII characters (U+0000 to U+007F) use just 1 byte
  • Most common encoding on the web and in modern systems
  • Backward compatible with ASCII
  • Space-efficient for Latin script text

πŸ“‹ UTF-16​

  • Variable-length encoding using 2 or 4 bytes per character
  • BMP characters use 2 bytes
  • Characters outside BMP use 4 bytes (surrogate pairs)
  • Used internally by Windows, Java, JavaScript, etc.

πŸ“ˆ UTF-32​

  • Fixed-length encoding using 4 bytes per character
  • Simple mapping between code points and encoded form
  • Less space-efficient but easier to process
  • Used in some programming environments

πŸ”„ Unicode vs. Earlier Encodings​

Unicode addresses limitations of earlier character encodings:

FeatureASCIIISO-8859Big5/GBUnicode
Character range128256~20,000Over 140,000
Multilingual supportNoLimitedLimitedComprehensive
Consistency across systemsYesNoNoYes
Backward compatibility-With ASCIIWith ASCIIWith ASCII (UTF-8)

πŸ’» Unicode in Practice​

πŸ” Text Processing​

  • Enables consistent sorting, searching, and comparison
  • Provides rules for text segmentation (word boundaries, line breaking)
  • Defines normalization forms for equivalent character sequences

🌍 Internationalization​

  • Allows software to support multiple languages
  • Enables text rendering for complex scripts
  • Supports bidirectional text (mixing left-to-right and right-to-left)

πŸ•ΈοΈ Web Development​

  • HTML5 uses UTF-8 by default
  • CSS supports Unicode for selectors and values
  • JavaScript uses UTF-16 internally

πŸ“š Common Unicode Blocks​

Unicode organizes characters into logical blocks:

  • πŸ”€ Basic Latin (ASCII): U+0000 to U+007F
  • πŸ”‘ Latin-1 Supplement: U+0080 to U+00FF
  • 🈢 CJK Unified Ideographs (Chinese, Japanese, Korean): U+4E00 to U+9FFF
  • πŸ˜€ Emoji: U+1F600 to U+1F64F (and others)

⚠️ Challenges and Considerations​

πŸ”„ Normalization​

  • Multiple ways to represent some characters (e.g., Γ© can be a single code point or e + accent)
  • Normalization forms (NFC, NFD, NFKC, NFKD) provide standardized representations

🎨 Rendering Complexity​

  • Some scripts require complex rendering rules
  • Combining marks, ligatures, and contextual forms add complexity
  • Proper font support is necessary for correct display

⌨️ Input Methods​

  • Inputting thousands of characters requires specialized methods
  • Input method editors (IMEs) for Asian languages
  • Virtual keyboards and character pickers for special symbols

Understanding Unicode is essential for:

  • 🌐 Developing internationalized software
  • πŸ—£οΈ Working with multilingual text data
  • πŸ”„ Ensuring proper text handling across different systems
  • 🌍 Supporting global users with diverse language needs

Unicode represents a significant achievement in computing, enabling truly global communication and information processing.