Monday, June 16, 2014

The Ultimate Dictionary Database System

Is text. End of post.

Ok, it's not quite that simple. You probably want some sort of structured text, semantically marked up if possible. But at the end of the day, all you can really rely on is text.

Why Spreadsheets Suck

First, the format is proprietary and often inconsistent across even minor version changes. You will be in a world of hurt if you want to share your dictionary with anyone else.

Second — and this is the biggest problem by far, assuming you're trying to make a naturalistic conlang — a real dictionary for a real language does not look like this:

  • kətaŋ sleep
  • kətap book
  • kətəs hangnail on the left little finger which interferes with one's needlework
  • kəwa tree
  • kəwah noodle
  • kəwe computer
  • kəweŋ hard

A few words between two languages might have (nearly) perfect overlap, and the early history of a word in a conlang might start as a simple gloss, but a simple word-to-word matching is profoundly lying to you for a real language, and in a conlang signals a relex.

A real dictionary entry looks like this: δίδωμι. It has multiple meanings defined, examples of use, collocations, grammar and morphology notes, references, etc., etc.

The spreadsheet format forces you into a very limited structure for each word. That structure can never hope to cope reliably with all the different words of a single language, much less the variety of things conlangers come up with (to say nothing of natlang variety). A spreadsheet is too rigid a format to grow the meaning and uses of a word over the lifetime of your conlang.
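
To make the rigidity concrete, here is a minimal Python sketch. It is not taken from any particular tool: the class names `Sense` and `Entry`, their fields, and the second sense of kətap are all invented for illustration. A two-column row has room for exactly one gloss, while even a modest entry needs a variable number of senses, each with its own examples.

```python
from dataclasses import dataclass, field


@dataclass
class Sense:
    """One meaning of a word, with optional usage examples."""
    definition: str
    examples: list[str] = field(default_factory=list)


@dataclass
class Entry:
    """A headword with any number of senses and free-form notes."""
    headword: str
    part_of_speech: str
    senses: list[Sense] = field(default_factory=list)
    notes: str = ""


# The spreadsheet view: one row, one gloss, nowhere to grow.
flat_row = ("kətap", "book")

# The same word once it starts acquiring real meaning.
# (The second sense and the note are invented for this example.)
entry = Entry(
    headword="kətap",
    part_of_speech="noun",
    senses=[
        Sense("book", examples=["(example sentence goes here)"]),
        Sense("written record, ledger"),
    ],
    notes="Plural formation still undecided.",
)

print(len(flat_row) - 1, "gloss in the flat row")
print(len(entry.senses), "senses in the structured entry")
```

You can fake the nested version in a spreadsheet with columns like "sense 2" and "example 3", but then every word pays for the needs of the most complicated one.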

Why Databases Suck

First, they share the spreadsheet's problems with respect to format. Technically, SQL is a standard. In reality, all but the most trivial of databases tend to use non-standard SQL conveniences offered by whichever database server the software's author decided to use. So, you may get something almost portable, but often not.

Second, and again like the spreadsheet problem, a truly universal dictionary tool, a piece of software that could handle everything from Indonesian to Ancient Greek to Navajo — or Toki Pona to Na'vi to High Valyrian to Ithkuil — is going to require a very complex database structure. The SIL "Toolbox" dictionary tool has more than 100 fields available (Making Dictionaries), and all those possibilities need to be in both the database design and the software that talks to the database.

I have, over the years, spent some time trying to design a database that could really be a good language dictionary. The schema for even a simple design was quite complex, and I would not have wanted to write the software to control it. There's this huge problem in that different languages vary wildly in their definitional needs. For Mandarin, for example, you need to cover all the usual purely semantic matters — polysemy, idiom, collocation, multiple definitions, examples, etc. — but there aren't too many morphological worries. But once you add morphological complexity you've got a whole new layer of issues. The Ancient Greek example I link to above is for a fairly irregular verb, with dialectal worries to boot. And for Navajo and related Athabaskan languages the situation is so dire that people write papers called things like Making Athabaskan Dictionaries Usable and Design Issues in Athabaskan Dictionaries (do look at those to get a feel for the issues).

Any truly general dictionary database, one capable of handling enough sorts of languages to be genuinely useful, would have vast tracts of empty space to accommodate information not needed in many languages, with these fields of whitespace in different places for different languages. Even if you target your database and software design to something like Ancient Greek, there will be lots of fields left blank most of the time. It's not like all the verbs are irregular, though it may sometimes seem that way to beginners.
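
The empty-space problem is easy to see even at toy scale. The sketch below uses Python's built-in `sqlite3` module with a single wide table; the column names are hypothetical stand-ins for language-specific needs (a real general schema would need far more), and the helper names `add_entry` and `empty_fraction` are invented for the example.

```python
import sqlite3

# Hypothetical "one size fits all" columns; purely illustrative.
COLUMNS = [
    "headword", "gloss", "part_of_speech",
    "aorist_stem", "perfect_stem", "dialect_forms",  # irregular-verb details
    "classifier", "tone_pattern",                    # needed by some languages only
    "verb_theme", "classificatory_stem",             # needed by others only
]

conn = sqlite3.connect(":memory:")
conn.execute(f"CREATE TABLE entry ({', '.join(c + ' TEXT' for c in COLUMNS)})")


def add_entry(**fields):
    """Insert one entry; every column not supplied is stored as NULL."""
    names = ", ".join(fields)
    marks = ", ".join("?" for _ in fields)
    conn.execute(
        f"INSERT INTO entry ({names}) VALUES ({marks})", tuple(fields.values())
    )


add_entry(headword="kəwa", gloss="tree", part_of_speech="noun")
add_entry(headword="kətaŋ", gloss="sleep", part_of_speech="verb",
          dialect_forms="(none yet)")


def empty_fraction():
    """Fraction of all cells in the table that are NULL."""
    rows = conn.execute("SELECT * FROM entry").fetchall()
    cells = [cell for row in rows for cell in row]
    return sum(cell is None for cell in cells) / len(cells)


print(f"{empty_fraction():.0%} of cells are empty")
```

With only ten columns and two simple words, most of the table is already blank. A normalized design moves the blanks around rather than removing the underlying variety.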

If you had a very good team of developers, you could probably overcome these problems, assuming users were willing to configure a complex tool so that it exposed only the things their language needed. But it's never going to be a money-making venture. I don't expect to see such a tool in my lifetime.

Enter Stage Right: Text

So, we're back to simple text. The benefits:

  • the file is still readable if Microsoft/Apple/Whoever releases a New and Improved (tm) version of this or that proprietary bit of software; a file you find from 10 years ago will still be readable
  • there are zillions of text editors, usually with built-in search functions, which will work on the file
  • if part of the file is destroyed, the rest of the file will generally be recoverable (proprietary formats tend to be brittle when bitrot sets in)

Bare text, of course, is not very attractive. The way around this is to use a text-based markup of some sort. You could use HTML. Or even XML with a little more work. I strongly favor LaTeX, which requires more typing than I might like, but it gives me maximum flexibility to change my mind and spits out very attractive results. The point of this is that even though HTML and LaTeX are presentation formats, the underlying basis is still just plain text. If something goes horribly wrong, you'll have a modestly ugly text file to read, but all your hard work will still be recoverable.
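
As a rough illustration of keeping the data as text and letting markup handle the presentation, the Python sketch below turns one entry into a line of LaTeX. The function names and the choice of `\textbf` and `\textit` are assumptions made for the example, not the layout I actually use.

```python
# Characters that LaTeX treats specially in running text, with safe replacements.
LATEX_SPECIALS = {
    "\\": r"\textbackslash{}",
    "&": r"\&", "%": r"\%", "$": r"\$", "#": r"\#",
    "_": r"\_", "{": r"\{", "}": r"\}",
    "~": r"\textasciitilde{}", "^": r"\textasciicircum{}",
}


def latex_escape(text: str) -> str:
    """Escape LaTeX special characters one character at a time."""
    return "".join(LATEX_SPECIALS.get(ch, ch) for ch in text)


def entry_to_latex(headword: str, pos: str, definition: str) -> str:
    """Format one entry as: bold headword, italic part of speech, definition."""
    return (
        rf"\textbf{{{latex_escape(headword)}}} "
        rf"\textit{{{latex_escape(pos)}}} "
        f"{latex_escape(definition)}"
    )


print(entry_to_latex("kəwe", "n.", "computer & similar devices"))
```

Non-ASCII letters such as ə generally want a Unicode-aware engine (XeLaTeX or LuaLaTeX) and a font that has them, but the source file stays readable either way.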

If you are disorganized, a computer will not help you. If you can impose a little order on yourself, though, a computer can make your life a lot easier. And a little thought can make even a plain old .txt file into the best dictionary tool you could ever want.
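
Here is one way "a little order" might look in practice, as a sketch rather than a recommendation. The format (blank-line-separated records of `key: value` lines) and the field names `word`, `pos`, and `def` are assumptions made up for this example, as is the extra definition for kətap. The parser skips a damaged record instead of giving up, which is the recoverability benefit listed above.

```python
SAMPLE = """\
word: kətaŋ
pos: verb
def: to sleep

word: kətap
pos: noun
def: book
def: written record

@@@ damaged bytes, no recognizable fields @@@

word: kəwa
pos: noun
def: tree
"""


def parse_entries(text: str) -> list[dict[str, list[str]]]:
    """Split on blank lines; keep only records that still have a `word` field."""
    entries = []
    for block in text.split("\n\n"):
        record: dict[str, list[str]] = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:  # lines without a colon are ignored rather than fatal
                record.setdefault(key.strip(), []).append(value.strip())
        if "word" in record:
            entries.append(record)
    return entries


def search(entries, needle: str):
    """Return headwords whose definitions contain `needle` (case-insensitive)."""
    needle = needle.lower()
    return [
        e["word"][0]
        for e in entries
        if any(needle in d.lower() for d in e.get("def", []))
    ]


entries = parse_entries(SAMPLE)
print(len(entries), "entries recovered")
print(search(entries, "record"))
```

Repeating a key (two `def:` lines) is all it takes to add a sense, and any text editor's search works on the same file without the script.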

9 comments:

  1. I completely agree with this post: anything that is not underlyingly text is unfit for purpose of safekeeping, and thus for dictionaries.

    On the other hand, I feel you misrepresented the SIL Toolbox database scheme, as it is actually *exactly* what you want: plain text with a little bit of mark up. Of course it's complex mark up, but that's because, as you said, making dictionaries is complicated. And that problem is intractable: you'll never be able to create a format that makes all dictionaries easy. SIL simplifies it by making most markers optional: you only use them physically in your plain text file if they are needed. They basically leave the actual design decisions where they belong: to the dictionary maker.

    That's why, despite everything, my dictionary is in the SIL Toolbox format. It handles everything I need, and more. And the complexity is only in the beginning, when you devise your own entry scheme. Once that's done (and the format is flexible enough that one can correct mistakes when one realises one's design decision was wrong), it's exceedingly simple to handle.

    1. Ah, I could have been clearer about that. I think the SIL Toolbox format is another very good text-based system for storing a dictionary. Some day, when I have more time, I may write a toolset to convert a good subset of the Toolbox codes into nice LaTeX.

      I just included that in the database section to show just how much you really have to include in the database schema for a reasonably complete and flexible dictionary database.

    2. I strongly disagree. I'd argue that the best tool for building a dictionary would be a triple store. It frees you from the schema problems of a SQL database and is much more powerful at organizing information than a flatfile. The argument that a plain text file is safer is even odder. If you have your data in some kind of database it's trivial (at least compared to designing the storage system itself) to generate plain text backups.

    3. A triple store is a fairly good match for the SIL Toolbox format actually. A text export of that would often be ugly to reconstitute your work from, though I'm guessing there are smarter and friendlier exporters these days. (Are triple stores getting much use these days? The Semantic Web seems further away every year.)

      And I would never recommend a SQL dump of dictionary as a friendly backup format to recover ten year old work from, especially a well-normalized one. Good readability for humans matters, too.
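
For readers who have not seen it, the backslash-marker style discussed in this thread can be read with very little code. The Python sketch below assumes one field per line, a new record at each `\lx`, and the MDF-style markers `\lx`, `\ps`, `\ge`, and `\xv`. The sample entries and the function name `parse_sfm` are illustrative only, and real Toolbox files have conventions this sketch does not cover.

```python
SAMPLE = r"""
\lx kətap
\ps n
\ge book
\xv (example sentence)

\lx kəwa
\ps n
\ge tree
"""


def parse_sfm(text: str) -> list[list[tuple[str, str]]]:
    """Parse backslash-marked lines into records of (marker, value) pairs."""
    records: list[list[tuple[str, str]]] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("\\"):
            marker, _, value = line[1:].partition(" ")
            if marker == "lx" or not records:
                records.append([])  # a new headword starts a new record
            records[-1].append((marker, value.strip()))
        elif records and records[-1]:
            # A line without a marker continues the previous field.
            prev_marker, prev_value = records[-1][-1]
            records[-1][-1] = (prev_marker, f"{prev_value} {line}".strip())
    return records


for record in parse_sfm(SAMPLE):
    fields = dict(record)  # fine here; repeated markers would need a list
    print(fields["lx"], "=", fields.get("ge", "?"))
```

Note that the second entry simply omits `\xv`: a marker that is not needed is never written, so there are no blank fields to carry around.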

  2. I think if I'm gonna roll my own LSJ, I'd rather use a wiki. It's much easier to search, interconnect, and revise. It's also easy to down-grade to plain text with just a couple of regex search-and-replaces. e.g.
    http://lsj.translatum.gr/wiki/δίδωμι
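
A sketch of what "a couple of regex search-and-replaces" might look like in Python, assuming MediaWiki-style markup (the comment does not say which wiki syntax is meant). The rule list and the sample line are invented for illustration, and real pages with templates or tables would need more rules than this.

```python
import re

# Each pair is (pattern, replacement); order matters: bold before italic.
RULES = [
    (re.compile(r"\[\[[^\]|]*\|([^\]]*)\]\]"), r"\1"),  # [[target|label]] -> label
    (re.compile(r"\[\[([^\]]*)\]\]"), r"\1"),           # [[target]] -> target
    (re.compile(r"'''(.*?)'''"), r"\1"),                # '''bold''' -> bold
    (re.compile(r"''(.*?)''"), r"\1"),                  # ''italic'' -> italic
]


def wiki_to_text(markup: str) -> str:
    """Apply each search-and-replace rule in turn."""
    for pattern, replacement in RULES:
        markup = pattern.sub(replacement, markup)
    return markup


sample = "'''kətap''' ''n.'' book; see also [[kəwa]] and [[kəwe|computer]]."
print(wiki_to_text(sample))
```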

  3. I don't mean a raw SQL/triple dump, I mean using the data to generate a document in whatever format you want. Plain text, markdown, latex, xml, whatever.

    There are a lot of triple store options, none that have really gained widespread adoption that I know of. I've been working on a dictionary web app off and on for a while now and have toyed with the idea of trying to use SparkleDB, but hosting is an issue. It's easy to find cheap hosting for a site using a relational database or something like mongodb, SparkleDB not so much.

    1. "I don't mean a raw SQL/triple dump, I mean using the data to generate a document in whatever format you want. Plain text, markdown, latex, xml, whatever."

      If you can program, many options are available to you, and the effort can have value besides simply concocting a complex way to store a dictionary. For everyone else, the benefits of this complexity seem less obvious to me.
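
To give non-programmers a feel for what is being proposed in this thread, here is a toy Python version of the idea: dictionary facts stored as (subject, predicate, object) triples, a simple query, and a generated plain-text document. It is an in-memory stand-in, not the API of SparkleDB or any real triple store, and the predicate names are invented.

```python
from collections import defaultdict

# (subject, predicate, object) triples; predicate names are invented for this sketch.
TRIPLES = [
    ("kətap", "pos", "noun"),
    ("kətap", "gloss", "book"),
    ("kəwa", "pos", "noun"),
    ("kəwa", "gloss", "tree"),
    ("kəwa", "see_also", "kətap"),
]


def query(triples, subject=None, predicate=None, obj=None):
    """Return triples matching every argument that is not None."""
    return [
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]


def export_text(triples) -> str:
    """Group triples by subject and write one readable block per headword."""
    by_subject = defaultdict(list)
    for subject, predicate, obj in triples:
        by_subject[subject].append(f"  {predicate}: {obj}")
    return "\n\n".join(
        "\n".join([subject, *lines]) for subject, lines in sorted(by_subject.items())
    )


print([t[0] for t in query(TRIPLES, predicate="pos", obj="noun")])
print(export_text(TRIPLES))
```

Adding a new kind of fact is just another triple, with no schema change. The cost is that the readable document only exists once someone writes and maintains the export step.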

  4. I am actually working on a program right now that addresses these issues exactly. It's in early stages, but I plan on implementing a graphical system of your thesaurus (http://lingweenie.org/conlang/ConlangersThesaurus.pdf) into the program, as I think that would help people with this problem immensely.

    1-1 word overlap is still possible if the conlang creator wishes, of course, but larger definition boxes (searchable) are available as well, to give multiple definitions, or the like.

    An old version (very stripped down, please don't judge by the clunker) is available at http://Sulmere.tumblr.com/PolyGlot. I'll be releasing 0.7 soon, though. If you have questions, concerns or suggestions, please feel free to contact me!

  5. The solution is likely NoSQL, such as a document database (MongoDB) or at least a key:value store.
