Morfessor file types

Binary model

The standard format for Morfessor 2.0 is a binary model, generated by pickling the BaselineModel object. This ensures that all training data, annotation data and weights are exactly the same as when the model was saved.
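The round-trip property described above can be illustrated with Python's standard pickle module. This is a minimal sketch only: model_state is a hypothetical stand-in dictionary, not the real BaselineModel, and the helper names dump_model and load_model are invented for this example.

```python
import pickle

# Hypothetical stand-in for a trained model's state. The real object is
# Morfessor's BaselineModel, whose internals are not reproduced here.
model_state = {
    "counts": {"kahvi": 39, "kakku": 10},
    "corpus_weight": 1.0,
}


def dump_model(state):
    """Serialize the state to bytes, as would be written to a binary model file."""
    return pickle.dumps(state, protocol=pickle.HIGHEST_PROTOCOL)


def load_model(blob):
    """Restore the state from bytes.

    Only unpickle data from a trusted source: unpickling can run arbitrary code.
    """
    return pickle.loads(blob)


restored = load_model(dump_model(model_state))
```

The restored object compares equal to the original, which is the sense in which a pickled model is "exactly the same as when the model was saved".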

Morfessor 1.0 style text model

Morfessor 2.0 also supports the text model files used in Morfessor 1.0. These files consist of one segmentation per line, preceded by a count, with the constructions separated by ‘ + ’.

Specification:

<int><space><CONSTRUCTION>[<space>+<space><CONSTRUCTION>]*

Example:

10 kahvi + kakku
5 kahvi + kilo + n
24 kahvi + kone + emme
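To make the specification concrete, the following sketch parses lines of this format with plain string operations. It is an illustration of the format rather than Morfessor's own reader, and the function name parse_segmentation_line is hypothetical.

```python
def parse_segmentation_line(line, separator=" + "):
    """Parse '<int> <CONSTRUCTION> + <CONSTRUCTION> ...' into (count, constructions)."""
    # The count is everything before the first space; the rest is the analysis.
    count_str, _, analysis = line.strip().partition(" ")
    return int(count_str), analysis.split(separator)


lines = [
    "10 kahvi + kakku",
    "5 kahvi + kilo + n",
    "24 kahvi + kone + emme",
]
parsed = [parse_segmentation_line(line) for line in lines]
```

Each entry of parsed is a pair of the count and the list of constructions, for example (10, ['kahvi', 'kakku']) for the first line.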

Text corpus file

A text corpus file is a free-format text file. Each line is split into compounds using the compound separator (default <space>). The compounds are then split into atoms using the atom separator. A compound can occur multiple times, and every occurrence is counted.

Example:

kahvikakku kahvikilon kahvikilon
kahvikoneemme kahvikakku
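The two-level splitting and the counting of repeated compounds can be sketched as follows. The function name count_compounds is hypothetical, and the behaviour for atom_separator=None (each character becomes one atom) is an assumption made for this sketch, since the text above does not state the default atom separator.

```python
from collections import Counter


def count_compounds(lines, compound_separator=" ", atom_separator=None):
    """Count compounds in free-format text; each compound is stored as a tuple of atoms."""
    counts = Counter()
    for line in lines:
        for compound in line.strip().split(compound_separator):
            if not compound:
                continue  # skip empty strings produced by repeated separators
            if atom_separator is None:
                # Assumption for this sketch: without an atom separator,
                # every character is treated as one atom.
                atoms = tuple(compound)
            else:
                atoms = tuple(compound.split(atom_separator))
            counts[atoms] += 1
    return counts


corpus = [
    "kahvikakku kahvikilon kahvikilon",
    "kahvikoneemme kahvikakku",
]
counts = count_compounds(corpus)
```

Here kahvikakku and kahvikilon each end up with a count of two, and kahvikoneemme with a count of one.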

Word list file

A word list corpus file contains one compound per line, optionally preceded by a count. If the same word occurs in multiple entries, their counts are summed. If no count is given, a count of one is assumed for that entry.

Specification:

[<int><space>]<COMPOUND>

Example 1:

10 kahvikakku
5 kahvikilon
24 kahvikoneemme

Example 2:

kahvikakku
kahvikilon
kahvikoneemme
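The optional count and the summing of repeated entries can be sketched like this. The function name read_word_list is hypothetical, and the sketch assumes that a leading run of digits followed by a space is always a count; it is not Morfessor's own reader.

```python
from collections import Counter


def read_word_list(lines):
    """Read '[<int> ]<COMPOUND>' lines and sum the counts of repeated compounds."""
    counts = Counter()
    for line in lines:
        entry = line.strip()
        if not entry:
            continue  # ignore blank lines
        parts = entry.split(" ", 1)
        if len(parts) == 2 and parts[0].isdigit():
            count, compound = int(parts[0]), parts[1]
        else:
            # No explicit count: this entry counts once.
            count, compound = 1, entry
        counts[compound] += count
    return counts


word_counts = read_word_list([
    "10 kahvikakku",
    "5 kahvikilon",
    "kahvikakku",
])
```

In this input kahvikakku appears once with a count of 10 and once without a count, so its total is 11.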

Annotation file

Each line of an annotation file contains one compound followed by one or more annotations of that compound. The separators between the annotations (default ‘, ’) and between the constructions (default ‘ ’) are configurable.

Specification:

<compound> <analysis1construction1>[ <analysis1constructionN>]*[, <analysis2construction1>[ <analysis2constructionN>]*]*

Example:

kahvikakku kahvi kakku, kahvi kak ku
kahvikilon kahvi kilon
kahvikoneemme kahvi konee mme, kah vi ko nee mme
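A line in this format can be taken apart in two steps: first split off the compound, then split the remainder into analyses and each analysis into constructions. The sketch below uses the default separators from the text. The function name parse_annotation_line is hypothetical, and splitting the compound off at the first whitespace is an assumption of this sketch.

```python
def parse_annotation_line(line, analysis_sep=", ", construction_sep=" "):
    """Parse '<compound> <analysis>[, <analysis>]*' into (compound, list of analyses)."""
    # Assumption for this sketch: the compound ends at the first whitespace.
    compound, rest = line.strip().split(None, 1)
    analyses = [
        analysis.split(construction_sep)
        for analysis in rest.split(analysis_sep)
    ]
    return compound, analyses


compound, analyses = parse_annotation_line("kahvikakku kahvi kakku, kahvi kak ku")

# Sanity check: joining the constructions of each analysis gives back the compound.
consistent = all("".join(analysis) == compound for analysis in analyses)
```

For the line above, analyses holds two alternatives, ['kahvi', 'kakku'] and ['kahvi', 'kak', 'ku'], both of which concatenate back to the compound.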