Best Way To Transliterate Unicode to ASCII? Python Help Needed With Solution.

For creating audio-books I use a text-to-speech engine. One problem is that the application dies on Unicode text. The documents that I encode are too long to correct manually, so I want it automated. The correction isn’t as simple as removing all non-ASCII characters, though, because where possible I don’t want to lose the meaning of a character that is easily converted to ASCII.

For example here are some transliterations that ought to occur:

  • ¢ → cents
  • © → copyright
  • ™ → trademark
  • ∀ → for all
  • ♥ → heart
  • ∂ → derivative
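
The list above amounts to a lookup table. Here is a minimal sketch of that idea, assuming a hand-written table and a placeholder for anything unmapped; the names `REPLACEMENTS` and `to_ascii` are made up for illustration and do not come from any of the packages discussed below.

```python
# -*- coding: utf-8 -*-
# Hypothetical replacement table built from the examples in the list above.
REPLACEMENTS = {
    u"\u00a2": "cents",       # ¢
    u"\u00a9": "copyright",   # ©
    u"\u2122": "trademark",   # ™
    u"\u2200": "for all",     # ∀
    u"\u2665": "heart",       # ♥
    u"\u2202": "derivative",  # ∂
}


def to_ascii(text, fallback="?"):
    """Replace known symbols with words; unmapped non-ASCII becomes fallback."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)  # plain ASCII passes through untouched
        else:
            out.append(REPLACEMENTS.get(ch, fallback))
    return "".join(out)


print(to_ascii(u"I \u2665 Unicode"))
```

A table like this keeps the engine from choking, but it only knows the characters you list by hand, which is why a package with broad coverage is attractive.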

I’m more concerned with not breaking the text-to-speech engine, but having a large breadth of transliterations would be nice. With that in mind I started looking for solutions and whittling them down to one:

  • Revision 1
    • PKG/URL
      • Package name
      • GitHub URL
    • Lang
      • Programming language
    • Str
      • Number of stars
    • Notes
  • Revision 2 options. I want well supported and easy to run.
    • #C: Number of committers
    • C: Most recent commit: Hours, Days, Months, Years
PKG/URL                 Lang    Str  Notes            #C  C
iki/unidecode           Python  75   Clone of. +. +.  8   Y
Text-Unidecode          Perl    1    The original.    1   Y
rainycape/unidecode     Go      12   NA.
xuender/unidecode       Java    35   NA.
node-unidecode          JS      70   Curious.
UnidecodeR              R       58   Good to know!
sindikat/unidecode      Elisp   2    NA.
silverstripe-unidecode  PHP     8    NA.

The Python port looks like the most actively maintained, and Python is always a good choice. The author’s discussion of his port is interesting for programmers. In theory we design systems that use Unicode even though we know that they’ll have to interoperate with ASCII-only systems. In practice it is usually an afterthought that results in well-hidden bugs and exploits. It makes you wonder whether we would be better off building only ASCII systems today.
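
As a rough sketch of what using the Python port might look like: the block below assumes the package is installed and exposes a `unidecode()` function, and falls back to the standard library when it is not. The wrapper name `speakable` and the fallback are my own assumptions, not part of the package.

```python
import unicodedata

try:
    # Third-party package; assumed to be installed separately.
    from unidecode import unidecode
except ImportError:
    unidecode = None  # fall back to the standard library below


def speakable(text):
    """Return an ASCII-only version of text for the text-to-speech engine."""
    if unidecode is not None:
        return unidecode(text)
    # Fallback (an assumption, not part of Unidecode): split accented
    # letters into base letter + combining mark, then drop everything
    # that still is not ASCII.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(speakable(u"na\u00efve \u00a9 2016"))
```

The fallback loses symbols such as © entirely, so it only protects the engine from crashing; the package is what provides the actual transliterations.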

Here is how to get it set up with virtualenv on OS X and brew:

Review this and this and verify that you have a Python build with Unicode support for “wide” characters. For transliterating Blackboard bold, you need this.

This code should print 1114111 (not 65535):

import sys
print(sys.maxunicode)  # parenthesized form runs on both Python 2 and Python 3
65535

This is the wrong Python build. It needs to be ucs4 instead of ucs2. Seems like a fair number of people use ucs4 (here, here, here, here).
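
To make the difference concrete, here is a small sketch that reports which kind of build is running and how it sees a Blackboard bold letter. The helper names are made up for illustration. Note that on Python 3.3 and later `sys.maxunicode` is always 1114111, so the narrow case only shows up on older interpreters.

```python
import sys

# U+1D538 MATHEMATICAL DOUBLE-STRUCK CAPITAL A, a Blackboard bold letter
# that lives outside the Basic Multilingual Plane.
BLACKBOARD_A = u"\U0001D538"


def is_wide_build():
    """True when one Python character can hold any Unicode code point."""
    return sys.maxunicode > 0xFFFF


def describe_build():
    """Summarize the build and how many units it uses for BLACKBOARD_A."""
    kind = "wide (ucs4)" if is_wide_build() else "narrow (ucs2)"
    return "%s build: len(BLACKBOARD_A) == %d" % (kind, len(BLACKBOARD_A))


print(describe_build())
```

On a narrow build the letter is stored as a surrogate pair of length 2, which is why per-character lookups for such characters need a wide build.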

This explains common CFFI errors from systems with both ucs2 and ucs4 installations that are “mixed up”:

Here is how you know that there is a problem:

This is about getting an ImportError about _cffi_backend.so with a message like Symbol not found: _PyUnicodeUCS2_AsASCIIString. This error occurs in Python 2 as soon as you mix “ucs2” and “ucs4” builds of Python. It means that you are now running a Python compiled with “ucs4”, but the extension module _cffi_backend.so was compiled by a different Python: one that was running “ucs2”. (If the opposite problem occurs, you get an error about _PyUnicodeUCS4_AsASCIIString instead.)
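
The symptom described above can be checked from Python itself. This is only a sketch: the function names are hypothetical, and it relies solely on the symbol names quoted above appearing in the ImportError message.

```python
def classify_cffi_import_error(message):
    """Guess a ucs2/ucs4 mismatch from an ImportError message (hypothetical helper)."""
    if "PyUnicodeUCS2" in message:
        # The extension wants ucs2 symbols that this interpreter lacks.
        return "interpreter is ucs4, extension was built for ucs2"
    if "PyUnicodeUCS4" in message:
        # The opposite case: the extension wants ucs4 symbols.
        return "interpreter is ucs2, extension was built for ucs4"
    return "some other import problem"


def check_cffi():
    """Try to import the CFFI backend and report what went wrong, if anything."""
    try:
        import _cffi_backend  # noqa: F401
    except ImportError as exc:
        return classify_cffi_import_error(str(exc))
    return "cffi backend imported cleanly"


print(check_cffi())
```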

Here is the solution for doing a custom build with a custom CFFI and virtualenv, though pyenv is also mentioned.

More generally, the solution that should always work is to download the sources of CFFI (instead of a prebuilt binary) and make sure that you build it with the same version of Python than the one that will use it. For example, with virtualenv:

virtualenv ~/venv
cd ~/path/to/sources/of/cffi
~/venv/bin/python setup.py build --force # forcing a rebuild to make sure
~/venv/bin/python setup.py install

This will compile and install CFFI in this virtualenv, using the Python from this virtualenv.

This post explains another approach to get it running. Here is another one. This all looks fragile. Yuck. I’m going to set up a Vagrant box instead.

Here is a start. It doesn’t build right now and I’m stuck. Pythonistas, what am I doing wrong here?

2 thoughts on “Best Way To Transliterate Unicode to ASCII? Python Help Needed With Solution.”

  1. I took a crack at it in Emacs Lisp for fun:
    http://pastebin.com/Tat9xqcK
    It draws the replacement name from the character’s Unicode name, combining characters are dropped, and there’s a table for custom replacements (em dash, en dash, etc.). Spacing around replacements might be needed (“copyright2016” vs. “copyright 2016”). It also doesn’t pluralize, so you’d end up with things like “12¢” → “12cent”, which is where the custom table helps.
    I said “for fun” since I imagine your need is part of a larger build process and Emacs wouldn’t really fit in.
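
For comparison, the same idea can be sketched in Python with the standard `unicodedata` module. This is not a port of the Emacs Lisp above, just an illustration under my own assumptions: the `CUSTOM` table, the lowercase names, and the padding spaces are all made up here.

```python
import unicodedata

# Hypothetical custom table for characters whose Unicode name reads badly.
CUSTOM = {
    u"\u2014": "-",       # em dash
    u"\u2013": "-",       # en dash
    u"\u00a2": " cents",  # cent sign, pluralized by hand
}


def name_based_ascii(text):
    """Transliterate using Unicode character names, dropping combining marks."""
    out = []
    # NFD splits accented letters into base letter + combining mark.
    for ch in unicodedata.normalize("NFD", text):
        if ord(ch) < 128:
            out.append(ch)
        elif ch in CUSTOM:
            out.append(CUSTOM[ch])
        elif unicodedata.combining(ch):
            continue  # drop accents and other combining characters
        else:
            name = unicodedata.name(ch, "")
            # Pad with spaces so the name does not run into its neighbors.
            out.append(" " + name.lower() + " " if name else "")
    return "".join(out)


print(name_based_ascii(u"caf\u00e9 \u00a9 2016"))
```

Names such as “copyright sign” are wordier than a hand-picked “copyright”, which is the trade-off between the name lookup and the custom table.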

    1. Fun is appropriate here because that got me thinking about the process and how it is not a simple search and replace. Exploratory programming is indeed perfect here.
      Cool, thank you Chris!
