Capitalization is a kind-of pseudo-morphology #690

@linas

This is a meta-issue, a design-change request, to treat capitalization (and possibly other things) as a kind of pseudo-morphology. See issue #42 for context. The general issue is about reformulating tokenization (and related issues) as a collection of rules that could be encoded in a file. Capitalization is the example.

Take capitalization: we have a rule (coded in C) that if a word is at the beginning of a sentence, then we should search for a lower-case version of it. The same applies if the word comes after a semicolon, or after a quote (Lincoln said, "Four-score and seven..."). There is an obvious solution: design a new file, and list semicolons, quotes and LEFT-WALL in it as markers for capitalization. Maybe this is kind-of-like a "new kind of affix rule"? All of the other affix rules state "if there is a certain sequence, then insert whitespace", while this new rule is "if there is a certain sequence, then look for downcase".
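As a rough illustration of what such a rule could look like once it is data rather than C code, here is a minimal Python sketch. The names `DOWNCASE_AFTER` and `lookup_candidates`, and the idea of passing the dictionary as a plain set of strings, are assumptions made up for this example; they are not the existing tokenizer API or file format.

```python
# Hypothetical rule table: tokens after which a word should also be
# looked up in its downcased form. In the proposal this set would be
# read from a file instead of being hard-coded.
LEFT_WALL = "LEFT-WALL"
DOWNCASE_AFTER = {LEFT_WALL, ";", '"'}

def lookup_candidates(tokens, dictionary):
    """Pair each token with the dictionary forms that should be tried."""
    result = []
    prev = LEFT_WALL  # the sentence start counts as a preceding marker
    for tok in tokens:
        forms = [tok] if tok in dictionary else []
        lower = tok[:1].lower() + tok[1:]
        # The rule: "if there is a certain sequence, then look for downcase".
        if prev in DOWNCASE_AFTER and lower != tok and lower in dictionary:
            forms.append(lower)
        result.append((tok, forms))
        prev = tok
    return result

# Toy dictionary, for illustration only.
toy_dict = {"Lincoln", "said", ",", '"', "four-score", "and", "seven"}
sentence = ["Lincoln", "said", ",", '"', "Four-score", "and", "seven"]
for token, forms in lookup_candidates(sentence, toy_dict):
    print(token, forms)
```

The point of the sketch is only that the trigger set is ordinary data: adding a new marker means editing the table, not the code.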

In the language learning code, I don't downcase any data in advance. Instead, the system eventually learns that certain words behave the same way, grammatically, whether they are uppercased or not. The system is blind to uppercasing: it just sees two different UTF8 strings that happen to fall into the same grammatical class.

To "solve" this problem, one can imagine three steps. First, a "morphological" analysis: given a certain grammatical class, compare pairs of strings to see if they have a common substring - for example, if the whole string matches, except for the first letter. This would imply that some words have a "morphology", where the first letter can be either one of two, while the rest of the word is the same.
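A minimal sketch of that first step, assuming a grammatical class is available as a plain set of strings (the function name `first_letter_variants` is hypothetical). It knows nothing about upper or lower case; it only checks whether two strings agree everywhere except the first character.

```python
from itertools import combinations

def first_letter_variants(word_class):
    """Return pairs of strings that differ only in their first character."""
    pairs = []
    for a, b in combinations(sorted(word_class), 2):
        if a and b and a[0] != b[0] and a[1:] == b[1:]:
            pairs.append((a, b))
    return pairs

# Toy class, for illustration only.
print(first_letter_variants({"The", "the", "A", "a", "this"}))
```

Note that chance pairs such as "cat"/"hat" would also pass this test if both sat in the same class; the sketch makes no attempt to filter them out.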

The second step is to realize that there is a meta-morphology-rule, which states that there are many words, all of which have the property that they can begin with either one of two different initial letters. The correct choice of the initial letter depends on whether the preceding token was a semicolon, a quote, or the left-wall.

The third step is to realize that the meta-morphology-rule can be factored into approximately 26 different classes. That is, in principle, there are roughly 52-squared/2 = 1352 possible sets containing two (initial) letters (exactly 52*51/2 = 1326 if the two letters must differ). Of these, only 26 are seen: {A, a}, {B, b}, {C, c}, ... and one never ever sees {P, Q} or {Z, a}.
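The factoring in this third step can be sketched as a simple tally over the pairs found in step one. The function name `initial_letter_classes` and the input format (a list of string pairs) are assumptions for illustration. For reference, the exact number of unordered pairs of distinct letters drawn from 52 is 52*51/2 = 1326, close to the rough estimate above.

```python
from collections import Counter
from math import comb
import string

def initial_letter_classes(pairs):
    """Count how often each unordered set of initial letters occurs."""
    return Counter(frozenset((a[0], b[0])) for a, b in pairs)

# Unordered pairs of distinct letters from A-Z plus a-z.
possible = comb(len(string.ascii_letters), 2)

# Toy input, for illustration only.
observed = initial_letter_classes(
    [("The", "the"), ("This", "this"), ("A", "a")]
)
print(len(observed), "classes observed out of", possible, "possible")
```

On real data, the claim in the text amounts to saying that this counter ends up with about 26 heavily populated keys, each of the form {X, x}.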

As long as we write C code, and know in advance that we are dealing with capital letters, we can use pre-defined POSIX locales for capitalization. I'm trying to take two or three steps backwards, here. One is to treat capitalization as a kind of morphology, just like any other kind of morphology. The second is to create morphology classes: the pseudo-morpheme A is only substitutable by the pseudo-morpheme a. The third is that all of this should be somehow rule-driven and "generic" in some way.

The meta-meta-meta issue is that I want to expand the framework beyond just language written as UTF8 strings, to cover, more generally, language with associated intonation, affect, facial expressions, or "language" from other domains (biology, spatial relations, etc.).
