Best Way To Transliterate Unicode to ASCII? Python Help Needed With Solution.

For creating audio-books I use a text-to-speech engine. One problem is that the application dies on Unicode text. The documents that I encode are too long to correct manually, so I want the correction automated. It isn’t as simple as removing all Unicode characters, though, because where possible I don’t want to lose the meaning of a character that is easily converted to ASCII.

For example here are some transliterations that ought to occur:

¢ → cents
© → copyright
™ → trademark
∀ → for all
♥ → heart
∂ → derivative
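
A table-driven replacement like the one above could be sketched in Python 3 as follows. This is only a minimal sketch: the `SPOKEN` table and the `speakable` function are hypothetical names, not part of any library, and any non-ASCII character that is not in the table is simply dropped so that the text-to-speech engine never sees it.

```python
# Hypothetical table of spoken replacements, taken from the examples above.
SPOKEN = {
    "¢": "cents",
    "©": "copyright",
    "™": "trademark",
    "∀": "for all",
    "♥": "heart",
    "∂": "derivative",
}

# Pad each replacement with spaces so words do not fuse with their neighbours.
_TABLE = str.maketrans({symbol: " " + word + " " for symbol, word in SPOKEN.items()})


def speakable(text):
    """Replace known symbols with words and drop any other non-ASCII character."""
    replaced = text.translate(_TABLE)
    ascii_only = replaced.encode("ascii", "ignore").decode("ascii")
    # Collapse whitespace runs (including newlines) left behind by the padding.
    return " ".join(ascii_only.split())
```

Calling `speakable("12¢")` yields `12 cents`. A hand-written table like this only covers the symbols you think of in advance, which is why a library with broad coverage is attractive.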

I’m more concerned with not breaking the text-to-speech engine, but having a large breadth of transliterations would be nice. With that in mind I started looking for solutions and whittling them down to one:

Revision 1
- PKG/URL
  - Package name
  - Github URL
- Lang
  - Programming language
- Str
  - Number of stars
- Notes
Revision 2 options. I want well supported and easy to run.
- #C: Number of committers
- C: Most recent commit: Hours, Days, Months, Years

PKG/URL	Lang	Str	Notes	#C	C
iki/unidecode	Python	75	Clone of. +. +.	8	Y
Text-Unidecode	Perl	1	The original.	1	Y
rainycape/unidecode	Go	12	NA.
xuender/unidecode	Java	35	NA.
node-unidecode	JS	70	Curious.
UnidecodeR	R	58	Good to know!
sindikat/unidecode	Elisp	2	NA.
silverstripe-unidecode	PHP	8	NA.

The Python port looks like the most actively maintained, and Python is always a good choice. The author’s discussion of his port is interesting for programmers. In theory we design systems that use Unicode even though we know that they’ll have to inter-operate with ASCII-only systems. In practice it is usually an afterthought that results in well-hidden bugs and exploits. Kind of gets you wondering whether or not we would be better off only building ASCII-only systems today.
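
Once the package is installed, calling it is short. The sketch below assumes Python 3 and that the third-party `unidecode` package may or may not be importable. The `to_ascii` wrapper is a hypothetical name, and the standard-library fallback is my own assumption, not something the port provides: it only strips accents and drops everything else, so its coverage is far narrower.

```python
import unicodedata

try:
    # Third-party package (pip install unidecode); assumed, not guaranteed, to be present.
    from unidecode import unidecode
except ImportError:
    unidecode = None


def to_ascii(text):
    """Return an ASCII-only approximation of text."""
    if unidecode is not None:
        return unidecode(text)
    # Fallback: split accented letters into base letter + combining mark,
    # then discard every character that is still outside ASCII.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")
```

Note that the library transliterates toward short ASCII look-alikes rather than spoken words, so a custom table for symbols that should be read aloud may still be needed on top of it.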

Here is how to get it set up with virtualenv on OS X and brew:

Review this and this and verify that you have a Python build with the Unicode support for “wide” characters. For transliterating Blackboard bold, you need this.

This code should print 1114111 (not 65535):

import sys
print(sys.maxunicode)  # 1114111 on a wide (ucs4) build, 65535 on a narrow (ucs2) build

If it prints 65535, you have the wrong Python build: it needs to be ucs4 instead of ucs2. A fair number of people seem to use ucs4 (here, here, here, here).
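
To see why the wide build matters for Blackboard bold, here is a small sketch written for Python 3, where every build has been wide since version 3.3. The function names are hypothetical. It uses only the standard library's compatibility normalization (NFKD), not unidecode, to show that a double-struck letter is a single code point outside the Basic Multilingual Plane that maps back to a plain letter.

```python
import sys
import unicodedata

# U+1D538 MATHEMATICAL DOUBLE-STRUCK CAPITAL A lies outside the Basic Multilingual Plane.
DOUBLE_STRUCK_A = "\U0001D538"


def is_wide_build():
    """True when the interpreter stores every code point as a single unit."""
    return sys.maxunicode == 0x10FFFF


def flatten_math_letters(text):
    """Apply NFKD, which maps styled mathematical letters to their plain forms."""
    return unicodedata.normalize("NFKD", text)
```

On a narrow Python 2 build the same character is stored as a surrogate pair, so code that walks a string one unit at a time sees two halves instead of one letter.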

This explains common CFFI errors from systems with both ucs2 and ucs4 installations that are “mixed up”:

Here is how you know that there is a problem:

This is about getting an ImportError about _cffi_backend.so with a message like Symbol not found: _PyUnicodeUCS2_AsASCIIString. This error occurs in Python 2 as soon as you mix “ucs2” and “ucs4” builds of Python. It means that you are now running a Python compiled with “ucs4”, but the extension module _cffi_backend.so was compiled by a different Python: one that was running “ucs2”. (If the opposite problem occurs, you get an error about _PyUnicodeUCS4_AsASCIIString instead.)

Here is the solution for doing a custom build with a custom CFFI and virtualenv though pyenv is also mentioned.

More generally, the solution that should always work is to download the sources of CFFI (instead of a prebuilt binary) and make sure that you build it with the same version of Python as the one that will use it. For example, with virtualenv:

virtualenv ~/venv
cd ~/path/to/sources/of/cffi
~/venv/bin/python setup.py build --force # forcing a rebuild to make sure
~/venv/bin/python setup.py install

This will compile and install CFFI in this virtualenv, using the Python from this virtualenv.

This post explains another approach to get it running. Here is another one. This all looks like it is fragile. Yuck. I’m going to set up a vagrant box instead.

Here is a start. It doesn’t build right now and I’m stuck. Pythonistas, what am I doing wrong here?

2 thoughts on “Best Way To Transliterate Unicode to ASCII? Python Help Needed With Solution.”

Chris Wellons says:

2016-09-04 at 06:22

I took a crack at it in Emacs Lisp for fun:
http://pastebin.com/Tat9xqcK
It draws the replacement name from the character’s Unicode name, combining characters are dropped, and there’s a table for custom replacements (em dash, en dash, etc.). Spacing around replacements might be needed (“copyright2016” vs. “copyright 2016”). It also doesn’t pluralize, so you’d end up with things like “12¢” → “12cent”, which is where the custom table helps.
I said “for fun” since I imagine your need is part of a larger build process and Emacs wouldn’t really fit in.
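
The approach described in this comment can be sketched in Python 3 roughly as follows. This is not a port of the linked Emacs Lisp. The `CUSTOM` table and the `name_based` function are hypothetical, and decomposing with NFD first (so that precomposed accented letters lose their accents) is an added assumption.

```python
import re
import unicodedata

# Hypothetical custom replacements, consulted before the Unicode name.
CUSTOM = {"\u2014": "-", "\u2013": "-", "\u00a2": "cents"}


def name_based(text):
    """Replace each non-ASCII character with its custom word or Unicode name."""
    out = []
    # NFD splits precomposed letters into base letter + combining mark.
    for ch in unicodedata.normalize("NFD", text):
        if ord(ch) < 128:
            out.append(ch)
        elif ch in CUSTOM:
            out.append(" " + CUSTOM[ch] + " ")
        elif unicodedata.combining(ch):
            continue  # drop combining marks such as accents
        else:
            name = unicodedata.name(ch, "")  # empty string if the character is unnamed
            if name:
                out.append(" " + name.lower() + " ")
    # Padding avoids "copyright2016"; collapse the extra spaces it introduces.
    return re.sub(r" +", " ", "".join(out)).strip()
```

With this sketch `name_based("12¢")` gives `12 cents` via the custom table, while a symbol without a custom entry falls back to its lower-cased Unicode name.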

1. Grant says:
  
  2016-09-04 at 12:00
  
  Fun is appropriate because here that got me thinking about the process and how it is not a simple search and replace. Exploratory programming is indeed perfect here.
  Cool, thank you Chris!
