Opinion

Encryption and Tokenization of International Unicode Data

May 2021 by Ulf Mattsson, CTO, Protegrity Corporation

Protecting the growing volume of international Unicode characters is required by an increasing number of privacy laws in many countries, and by general privacy concerns about private data. Current approaches to protecting international Unicode characters increase the size of the data and change its format. This breaks many applications and slows down business operations. Current approaches can also unpredictably return data in new and unexpected languages. A new approach with significantly higher performance and a small memory footprint can be customized to fit on small IoT devices. We will discuss new approaches that achieve portability, security, performance, a small memory footprint and language preservation for privacy protection of Unicode data. These new approaches provide granular protection for all Unicode languages, customizable alphabets and byte-length-preserving protection of privacy-protected characters. We will focus on UTF-8, since a 2020 survey of character encodings for websites reported that UTF-8 is used by 95.4% of them.

Unicode code points for these scripts are stored in UTF-8 using one to four bytes:
1) One byte: the 128 US-ASCII characters.
2) Two bytes: the next 1,920 characters, covering Latin-script (beyond ASCII), Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac, Thaana and N’Ko alphabets.
3) Three bytes: characters in common use, including most Chinese, Japanese and Korean characters. The commonly used Hanzi/Kanji characters are in the "CJK Unified Ideographs" block.
4) Four bytes: less common CJK characters.
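
The byte lengths listed above can be checked directly. The following is a minimal Python sketch; the helper name `utf8_length` and the sample characters are illustrative choices, not taken from the article.

```python
def utf8_length(ch: str) -> int:
    """Return the number of bytes UTF-8 uses to encode a single character."""
    return len(ch.encode("utf-8"))


# Illustrative sample characters, one or more per UTF-8 length class.
samples = {
    "A": "US-ASCII letter",
    "é": "Latin-script letter with diacritic",
    "Ж": "Cyrillic letter",
    "日": "common CJK ideograph (CJK Unified Ideographs)",
    "𠀀": "less common CJK ideograph (CJK Extension B)",
}

for ch, description in samples.items():
    print(f"U+{ord(ch):04X} {description}: {utf8_length(ch)} byte(s)")
```

A protection scheme that preserves byte length has to map each character to another character in the same length class, which is why these classes matter for the approaches discussed here.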

Examples of Tokenization of Unicode

Token Fabric generated from input of Unicode Code Points

A fabric of intermediate tokens is created to increase the entropy of each final token. The blue tokens represent temporary results and the final token values are green:

Forward and backward chaining of tokens

The tokenization function can be based on randomized lookup tables or encryption. In each step, chaining can add entropy by feeding additional input into the tokenization process. This short example uses the two-character input string “AA”: the middle layer holds the intermediate tokens, which are temporary results, and the bottom layer holds the final tokens. The tokens are chained forward and backward to increase the entropy:
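
One way to picture forward and backward chaining is the Python sketch below. It is a hypothetical illustration, not the article's actual algorithm: the alphabet, the function names, the modular-addition chaining rule and the two tables are all assumptions, and a seeded shuffle stands in for a securely generated secret lookup table.

```python
import random
import string

# Assumption: a small uppercase alphabet, enough for the "AA" example.
ALPHABET = string.ascii_uppercase
N = len(ALPHABET)


def make_table(seed):
    """Build a randomized lookup table (a permutation of alphabet indices).

    Assumption: random.Random is NOT cryptographically secure; a real
    system would generate and store the table from a secure source.
    """
    table = list(range(N))
    random.Random(seed).shuffle(table)
    return table


def invert(table):
    """Return the inverse permutation of a lookup table."""
    inverse = [0] * len(table)
    for i, v in enumerate(table):
        inverse[v] = i
    return inverse


FORWARD, BACKWARD = make_table(1), make_table(2)
FORWARD_INV, BACKWARD_INV = invert(FORWARD), invert(BACKWARD)


def tokenize(text):
    idx = [ALPHABET.index(c) for c in text]
    # Forward chaining: each intermediate token depends on the one to its left.
    middle, prev = [], 0
    for x in idx:
        prev = FORWARD[(x + prev) % N]
        middle.append(prev)
    # Backward chaining: each final token depends on the one to its right.
    final, nxt = [0] * len(middle), 0
    for i in range(len(middle) - 1, -1, -1):
        nxt = BACKWARD[(middle[i] + nxt) % N]
        final[i] = nxt
    return "".join(ALPHABET[i] for i in final)


def detokenize(token):
    final = [ALPHABET.index(c) for c in token]
    # Undo the backward pass to recover the intermediate tokens.
    middle = []
    for i, f in enumerate(final):
        nxt = final[i + 1] if i + 1 < len(final) else 0
        middle.append((BACKWARD_INV[f] - nxt) % N)
    # Undo the forward pass to recover the original characters.
    idx = []
    for i, m in enumerate(middle):
        prev = middle[i - 1] if i > 0 else 0
        idx.append((FORWARD_INV[m] - prev) % N)
    return "".join(ALPHABET[i] for i in idx)


token = tokenize("AA")
print(token, detokenize(token))
```

Because every final token depends on both its left and right neighbours, the two identical input characters in “AA” pass through different chaining inputs at each position. The output keeps the same length and stays within the same alphabet.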

These are examples of European Scripts

Examples of languages with one- to two-byte characters

These are examples of East Asian Scripts

Example of tokenizing five Japanese Scripts in an address label
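
Language and byte-length preservation for such a label can be sketched in Python as follows. This is a simplified, hypothetical illustration rather than the article's method: the script ranges, the function names, the seed and the sample address are all assumptions, the substitution is a plain per-character lookup without chaining, and a seeded shuffle stands in for a secure table.

```python
import random

# Assumption: each script is approximated by one inclusive code point range
# that lies entirely within a single UTF-8 byte-length class.
SCRIPTS = {
    "digits": (0x30, 0x39),        # 1 byte each
    "latin_upper": (0x41, 0x5A),   # 1 byte each
    "latin_lower": (0x61, 0x7A),   # 1 byte each
    "hiragana": (0x3041, 0x3096),  # 3 bytes each
    "katakana": (0x30A1, 0x30FA),  # 3 bytes each
    "kanji": (0x4E00, 0x9FFF),     # 3 bytes each (CJK Unified Ideographs)
}


def build_tables(seed):
    """Build one randomized permutation per script, plus its inverse.

    Assumption: random.Random is not cryptographically secure.
    """
    rng = random.Random(seed)
    forward, inverse = {}, {}
    for lo, hi in SCRIPTS.values():
        source = list(range(lo, hi + 1))
        target = source[:]
        rng.shuffle(target)
        for s, t in zip(source, target):
            forward[s] = t
            inverse[t] = s
    return forward, inverse


FORWARD, INVERSE = build_tables(seed=2021)


def script_of(ch):
    """Return the name of the script range containing ch, or None."""
    cp = ord(ch)
    for name, (lo, hi) in SCRIPTS.items():
        if lo <= cp <= hi:
            return name
    return None


def tokenize(text):
    # Characters outside every range (spaces, hyphens) pass through unchanged.
    return "".join(chr(FORWARD.get(ord(c), ord(c))) for c in text)


def detokenize(token):
    return "".join(chr(INVERSE.get(ord(c), ord(c))) for c in token)


# Made-up address mixing kanji, digits, hiragana, katakana and Latin letters.
address = "東京都渋谷区1-2-3 さくらマンション Room 5"
token = tokenize(address)
print(token)
print(len(address.encode("utf-8")), len(token.encode("utf-8")))
```

Since each character is replaced only by another character from its own range, the token keeps the script of every position and the total UTF-8 byte length of the original label.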

Summary
We discussed an approach that does not return tokenized data in new and unexpected languages. A new approach with significantly higher performance and a small memory footprint can be customized to fit on small IoT devices. The new approaches can achieve portability, security, performance, a small memory footprint and language preservation for privacy protection of Unicode data. They provide granular protection for all Unicode languages, customizable alphabets and byte-length-preserving protection of privacy-protected characters.

Global Security Mag Copyright 2011