Detox Unicode Table

I made this Unicode 9.0 to ASCII conversion table for a program called Detox. Detox renames files to make them easier to handle on Unix-like systems.

Download

unicode_detox.txt ~1.1 MiB

Conversion Guidelines

Detox wasn't designed for complex conversions. It handles one character at a time, from left to right, so combining characters, subscripts, and the like are converted out of context. It also knows nothing about which language the text is in or about the many other Unicode rules. Even if it could handle all of that, I am not sure I would want it to just for fixing file names.

Although Detox converts only one character at a time, it can replace that character with more than one character, so a symbol can become an entire word or even a phrase. That can be a problem if a name contains many symbols with long replacements, because the result may exceed the maximum file name length. If that happens, use the max length options described in the Detox manuals.
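
To see which entries are most likely to push a name past the limit, the table can be scanned for long replacements. This is a minimal sketch, assuming the tab-separated layout described under File Format. The function name long_replacements and the default threshold of 20 characters are illustrative choices, not Detox features.

```shell
# List table entries whose replacement text is longer than a threshold.
# Assumed layout per entry: 0xHEX <TAB> replacement <TAB> # name
# long_replacements and its default limit are hypothetical, not part of Detox.
long_replacements() {
    table=$1
    limit=${2:-20}
    # Print: replacement length, code point, replacement text.
    awk -F'\t' -v limit="$limit" '
        /^0x/ && length($2) > limit { print length($2) "\t" $1 "\t" $2 }
    ' "$table"
}
```

For example, long_replacements unicode_detox.txt 20 | sort -rn | head would show the ten longest replacements first.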

All this means you should not use this table for anything serious or assume it is correct. I wanted output for file names that I could type on a keyboard.

Multiple Unicode Blocks

The links are for me and sometimes have a conversion table, but you might find them useful. All of them are from Wikipedia unless the link says otherwise.

Unicode Blocks

File Format

As you read this, remember that for security reasons you should never copy and paste anything from a browser into a terminal.

I mostly followed the format of the original Detox Unicode table. The format is

0xUnicode_hexadecimal_number TAB replacement_character TAB # Unicode_name

Here is an example line.

0x0041      A   # LATIN CAPITAL LETTER A

The detox.tbl manual describes other ways of doing this, including language-specific tables. I didn't use those and don't plan to any time soon, if ever.

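I used AWK to process the Unicode data text file (UnicodeData.txt) from the Unicode Consortium. The -F\; option tells AWK that the fields in the original file are separated by semicolons. The command prints 0x followed by column 1, the code point: the Unicode data is in hexadecimal, and Detox requires the 0x prefix for hexadecimal numbers. The replacement column is left empty to be filled in, and column 2, the Unicode name, becomes the comment.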

$ awk -F\; '{printf "%06s\t%s\n", $1,$2}' UnicodeData.txt | \
  awk -F'\t' '{print "0x"$1"\t\t# "$2}'
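
For comparison, the same skeleton can be produced in a single pass. This is a sketch rather than the command used to build the table. It drops the %06s padding, because the 0 flag is only defined for numeric printf conversions, so with %s an AWK implementation may pad with spaces or with zeros. The function name make_skeleton is hypothetical.

```shell
# Build an empty Detox table skeleton from UnicodeData.txt in one awk pass.
# Output per line: 0xHEX <TAB> (empty replacement) <TAB> # NAME
# make_skeleton is an illustrative name, not part of Detox.
make_skeleton() {
    awk -F';' '{ printf "0x%s\t\t# %s\n", $1, $2 }' "$1"
}
```

Running make_skeleton UnicodeData.txt writes the skeleton to standard output, where it can be redirected to a file and the replacements filled in by hand.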

When Unicode Adds New Characters

When a new version of Unicode is released, it is simple to add the new characters to the existing table. Only the first two columns of UnicodeData.txt need to be compared to find new items: the first column is the code point in hexadecimal and the second column is the name. From what I understand, a character's name never changes once it is assigned, so it should be safe to do this without creating duplicate entries.

I used AWK to print the first two columns of the old data.

$ cat UnicodeData.txt | \
  awk -F\; '{print $1";"$2}' > UnicodeDataOld.txt

Then repeat the same command for the new data from the Unicode Consortium.

$ cat UnicodeData.txt | \
  awk -F\; '{print $1";"$2}' > UnicodeDataNew.txt

I used comm to print only the lines that are unique to file 1, which is the new data in this example. The AWK commands are the same as in the File Format section above.

$ comm -23 UnicodeDataNew.txt UnicodeDataOld.txt | \
  awk -F\; '{printf "%06s\t%s\n", $1,$2}' | \
  awk -F\t '{print "0x"$1"\t\t# "$2}' \
  >> unsorted_detox.txt
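
One caveat: comm expects both inputs to be sorted, and UnicodeData.txt is in code point order, which stops matching plain text order once the code points grow past four digits (10000 follows FFFD in the file but precedes it in a byte-wise sort). A hedged variant sorts both files first. The helper name new_entries is hypothetical, and its output still needs the AWK formatting step shown above.

```shell
# Print lines that appear only in the new file (first argument).
# Both inputs are sorted byte-wise first because comm requires sorted input.
# new_entries is an illustrative name, not part of Detox.
new_entries() {
    new=$1
    old=$2
    sorted_new=$(mktemp)
    sorted_old=$(mktemp)
    LC_ALL=C sort "$new" > "$sorted_new"
    LC_ALL=C sort "$old" > "$sorted_old"
    # -23 suppresses lines unique to the old file and lines common to both.
    LC_ALL=C comm -23 "$sorted_new" "$sorted_old"
    rm -f "$sorted_new" "$sorted_old"
}
```

It would be called as new_entries UnicodeDataNew.txt UnicodeDataOld.txt. The output comes out in byte-wise order rather than code point order, which does not matter here because the table is sorted afterwards anyway.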

Now that file can be sorted. Watch out for the header comments, default character, start point, and end point getting sorted out of place.
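
One way to keep those lines in place is to sort only the entries that start with 0x. The sketch below makes some assumptions: every non-entry line before the first entry is treated as header, every non-entry line after it is treated as footer, and the hexadecimal digits use consistent letter case. The function name sort_table is hypothetical. Code points are padded to a fixed width for the sort key so that, for example, 0x1F600 sorts after 0x2000.

```shell
# Sort only the 0x entries of a Detox table by code point.
# Non-entry lines (comments, default, start, end) keep their relative order.
# sort_table is an illustrative name, not part of Detox.
sort_table() {
    in=$1
    # Header: everything before the first 0x entry.
    awk '/^0x/ { exit } { print }' "$in"
    # Entries: prepend a zero-padded 8-digit key, sort on it, then strip it.
    grep '^0x' "$in" |
        awk '{
            hex = substr($1, 3)
            key = substr("00000000" hex, length(hex) + 1)
            print key "\t" $0
        }' |
        LC_ALL=C sort -k1,1 |
        cut -f2-
    # Footer: non-entry lines that come after the first entry.
    awk '/^0x/ { seen = 1; next } seen { print }' "$in"
}
```

For example, sort_table unsorted_detox.txt > unicode_detox.txt. Comments that sit between entries would be moved to the footer, so check the result before using it.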

Resources

Unicode

Language

Copyright

The Unicode Consortium requires me to include the Unicode License when I use their data files. I used their blocks file in the blocks section and their data file for the Detox table.

Wikipedia has the same requirement to include their Wikipedia License.


Made by Mr. Satterly
With help from Mrs. Satterly