Friday, October 25, 2013

Arbitrary Sort Orders in Python (including digraphs!)

Unicode: everyone wants it, until they get it.
Barry Warsaw

I know I'm due to do another post about LaTeX, but that'll have to wait for next week.

I've recently discovered two nice tools for my iPad which let me do some programming, and sophisticated editing and text processing, Editorial and Pythonista. So, I've been working on some code related to conlanging.

I know some people hate them, but I'm a big fan of word generators for three reasons. First, they help me avoid overusing certain sounds, something I'm normally prone to. Second, it helps you verify that the rules you've given for your syllable shapes actually describe what you want. Finally, while I might have phonaesthetic concerns about some vocabulary, I don't want to agonize over the word for "toe" or "napkin" most of the time, so I like having a random pool of words to grab from. I still might change the word, or decide a random selection is not right for the word, so it's not like I'm giving up aesthetic control of my language.

In any case, while it is a bit odd to write new software on a tablet, last night in about an hour I created a good tool for generating random new word shapes based on rules. But one serious problem came up — the sorted list was sorted terribly! For a person using a computer intended for English speakers, "á" is sorted after "z", which is not what I want at all. So I spent some time trying to come up with a way to sort arbitrarily.

In addition to the sort order of "á", I wanted to be able to correctly sort digraphs. In some languages, "ng" comes after the entirety of "n" in dictionaries and phone books.

It turns out there is a terrifying Perl library to accomplish this, Sort::ArbBiLex. As far as I can see, no such library exists for Python, so I had to write my own.

The code could probably be more efficient, but it works for my purposes, and turned out to be fairly simple. I rely on two bits of trickery. First, Python lets you sort ordered collections like lists and tuples. This makes it easy to follow the "decorate-sort-undecorate" pattern for sorting complex items. Second, I use a bit of a regular expression hack. If you split using a regular expression in a group, you get a strided array as a result, with the split pattern interwoven with the regular expression match.

>>> import re
>>> m = re.compile(r'(ch|t|p|k|a|i|o)')
>>> m.split("tapachi")
['', 't', '', 'a', '', 'p', '', 'a', '', 'ch', '', 'i', '']
>>> m.split("tapachi")[1::2]
['t', 'a', 'p', 'a', 'ch', 'i']

Basically, I split on every single character in the language, which gives me a lot of empty strings, but they're easily filtered out. Notice how it recognizes "ch" as a separate letter of the language.

So, the central algorithm of this little bit of code is: convert the unicode string to a sequence of "letters" (however defined in your language), convert those letters into a numerical code, sort the list of numerical codes, turn the collections of numerical codes back into words, spit back the complete result.

import re

class ArbSorter:
    def __init__(self, order):
        elts = re.split('\s*', order, flags=re.UNICODE)
        # Create a regex to split on each character or multicharacter
        # sort key.  (As in "ch" after all "c"s, for example.)
        # Gosh, this is not especially efficient, but it works.
        split_order = sorted(elts, key=len, reverse=True)
        self.splitter = re.compile(u"(%s)" % "|".join(split_order), re.UNICODE)
        # Next, collect weights for the ordering.
        self.ords = {}
        self.vals = []
        for i in range(len(elts)):
            self.ords[elts[i]] = i

    # Turns a word into a list of ints representing the new
    # lexicographic ordering.  Python, helpfully, allows one to
    # sort ordered collections of all types, including lists.
    def word_as_values(self, word):
        w = self.splitter.split(word)[1::2]
        return [self.ords[char] for char in w]

    def values_as_word(self, values):
        return "".join([self.vals[v] for v in values])

    def __call__(self, l):
        l2 = [self.word_as_values(item) for item in l]
        return [self.values_as_word(item) for item in l2]

if __name__ == '__main__':
    mysorter = ArbSorter(u"a á c ch e h i k l m n ng o p r s t u")
    m = u"chica ciha no áru ngo na nga sangal ahi ná mochi moco"
    s = mysorter(m.split())
    print " ".join(s).encode('utf-8')

(A more attractive presentation.)

Just run the code and it prints out "ahi áru ciha chica moco mochi na ná no nga ngo sangal", exactly what you want. Much better than the "ahi chica ciha mochi moco na nga ngo no ná sangal áru" you'll get on a computer localized for an English speaker.

It is vital that you tell Python you're working with unicode text here, so make sure to include this in a comment near the top of your code: -*- coding: utf-8 -*-.

No comments:

Post a Comment