Python library interface to Morfessor

Morfessor 2.0 contains a library interface so that it can be integrated into other Python applications. The public members are documented below and should remain relatively stable between Morfessor versions. Private members are documented in the code and can change at any time between releases.

The classes are documented below.

IO class

class morfessor.io.MorfessorIO(encoding=None, construction_separator=' + ', comment_start='#', compound_separator='\s+', atom_separator=None, lowercase=False)

Definition for all input and output files. Also handles all encoding issues.

The only state this class has is the separators used in the data. Therefore, the same class instance can be used for processing multiple files.

format_constructions(constructions, csep=None, atom_sep=None)

Return a formatted string for a list of constructions.

read_annotations_file(file_name, construction_separator=' ', analysis_sep=', ')

Read an annotations file.

Each line has the format: <compound> <constr1> <constr2>… <constrN>, <constr1>…<constrN>, …

Yield tuples (compound, list(analyses)).
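
For example, a minimal sketch of loading annotations into a dictionary (the file name annotations.txt and its contents are hypothetical):

import morfessor

io = morfessor.MorfessorIO()
# Each line of the hypothetical file pairs a compound with one or more analyses:
# looking look ing, looki ng
annotations = dict(io.read_annotations_file('annotations.txt'))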

read_any_model(file_name)

Read a file that is either a binary model or a Morfessor 1.0 style model segmentation. This method cannot be used on standard input, as the data might need to be read multiple times.

static read_binary_file(file_name)

Read a pickled object from a file.

read_binary_model_file(file_name)

Read a pickled model from file.

read_corpus_file(file_name)

Read one corpus file.

For each compound, yield (1, compound_atoms). After each line, yield (0, ()).
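
A minimal sketch of consuming this interface, skipping the line-boundary markers (corpus.txt is a hypothetical file name):

import morfessor

io = morfessor.MorfessorIO()
for count, atoms in io.read_corpus_file('corpus.txt'):
    if count == 0:
        continue  # (0, ()) marks the end of a line, not a compound
    print(count, atoms)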

read_corpus_files(file_names)

Read one or more corpus files.

For each compound found, yield (1, compound_atoms).

read_corpus_list_file(file_name)

Read a corpus list file.

Each line has the format: <count> <compound>

Yield tuples (count, compound_atoms) for each compound.
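
Such a list file pairs naturally with BaselineModel.load_data (documented below); a sketch, with wordcounts.txt as a hypothetical file name:

import morfessor

io = morfessor.MorfessorIO()
# Each line of the hypothetical file is "<count> <compound>", e.g. "15 unsupervised"
model = morfessor.BaselineModel()
model.load_data(io.read_corpus_list_file('wordcounts.txt'))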

read_corpus_list_files(file_names)

Read one or more corpus list files.

For each compound found, yield (count, compound_atoms).

read_parameter_file(file_name)

Read learned or estimated parameters from a file.

read_segmentation_file(file_name, has_counts=True, **kwargs)

Read segmentation file.

File format: <count> <construction1><sep><construction2><sep>…<constructionN>

static write_binary_file(file_name, obj)

Pickle an object into a file.

write_binary_model_file(file_name, model)

Pickle a model to a file.

write_lexicon_file(file_name, lexicon)

Write all constructions and their counts to a lexicon file.

write_parameter_file(file_name, params)

Write learned or estimated parameters to a file.

write_segmentation_file(file_name, segmentations, **kwargs)

Write segmentation file.

File format: <count> <construction1><sep><construction2><sep>…<constructionN>
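
A sketch of writing out the segmentations of a trained model, assuming the output of BaselineModel.get_segmentations (documented below) is accepted directly as the segmentations argument (file names are hypothetical):

import morfessor

io = morfessor.MorfessorIO()
model = io.read_binary_model_file('model.bin')
io.write_segmentation_file('model.segm', model.get_segmentations())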

Model classes

class morfessor.baseline.AnnotatedCorpusEncoding(corpus_coding, weight=None, penalty=-9999.9)

Encoding the cost of an Annotated Corpus.

In this encoding constructions that are missing are penalized.

get_cost()

Return the cost of the annotated corpus.

set_constructions(constructions)

Method for re-initializing the constructions. The counts of the constructions must still be set with a call to set_count.

set_count(construction, count)

Set an initial count for each construction. Missing constructions are penalized.

update_count(construction, old_count, new_count)

Update the counts in the Encoding, setting (or removing) a penalty for missing constructions.

update_weight()

Update the weight of the Encoding by taking the ratio of the corpus boundaries and annotated boundaries.

class morfessor.baseline.AnnotationCorpusWeight(devel_set, threshold=0.01)

Class for using development annotations to update the corpus weight during batch training.

update(model, epoch)

Tune the model's corpus weight based on the precision and recall of the development data, trying to keep them equal.

class morfessor.baseline.BaselineModel(forcesplit_list=None, corpusweight=None, use_skips=False, nosplit_re=None)

Morfessor Baseline model class.

Implements training of and segmenting with a Morfessor model. The model is completely agnostic as to whether it is used with lists of strings (finding phrases in sentences) or strings of characters (finding morphs in words).

forward_logprob(compound)

Find log-probability of a compound using the forward algorithm.

Parameters:
  • compound – compound to process

Returns the (negative) log-probability of the compound. If the probability is zero, returns a number that is larger than the value defined by the penalty attribute of the model object.
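
A minimal usage sketch (model.bin is a hypothetical file name):

import morfessor

io = morfessor.MorfessorIO()
model = io.read_binary_model_file('model.bin')
# Negative log-probability of the compound, summing over all segmentations
print(model.forward_logprob('unsupervised'))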

get_compounds()

Return the compound types stored by the model.

get_constructions()

Return a list of the present constructions and their counts.

get_cost()

Return current model encoding cost.

get_segmentations()

Retrieve segmentations for all compounds encoded by the model.

load_data(data, freqthreshold=1, count_modifier=None, init_rand_split=None)

Load data to initialize the model for batch training.

Parameters:
  • data – iterator of (count, compound_atoms) tuples
  • freqthreshold – discard compounds that occur fewer than the given number of times in the corpus (default 1)
  • count_modifier – function for adjusting the counts of each compound
  • init_rand_split – if given, randomly split the compounds, using init_rand_split as the probability for each possible split

Adds the compounds in the corpus to the model lexicon. Returns the total cost.

load_segmentations(segmentations)

Load model from existing segmentations.

The argument should be an iterator providing a count, a compound, and its segmentation.
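
A sketch of initializing a model from a previously written segmentation file, assuming (as the morfessor command-line tools do) that the output of MorfessorIO.read_segmentation_file can be passed directly to this method:

import morfessor

io = morfessor.MorfessorIO()
model = morfessor.BaselineModel()
model.load_segmentations(io.read_segmentation_file('model.segm'))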

make_segment_only()

Reduce the size of this model by removing all non-morphs from the analyses. After calling this method it is no longer possible to call any method that would change the state of the model; attempting to do so raises an exception.
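
A sketch of shrinking a trained model before saving it for segmentation-only use (file names are hypothetical):

import morfessor

io = morfessor.MorfessorIO()
model = io.read_binary_model_file('model.bin')
model.make_segment_only()
io.write_binary_model_file('model.segmonly.bin', model)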

segment(compound)

Segment the compound by looking it up in the model analyses.

Raises KeyError if compound is not present in the training data. For segmenting new words, use viterbi_segment(compound).
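
A common pattern is therefore to fall back to viterbi_segment for compounds not seen in training; a sketch (model.bin is a hypothetical file name):

import morfessor

io = morfessor.MorfessorIO()
model = io.read_binary_model_file('model.bin')

def segment_any(word):
    try:
        return model.segment(word)             # known compound: stored analysis
    except KeyError:
        return model.viterbi_segment(word)[0]  # unseen compound: Viterbi search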

static segmentation_to_splitloc(constructions)

Return a list of split locations for a segmented compound.

set_annotations(annotations, annotatedcorpusweight=None)

Prepare model for semi-supervised learning with given annotations.

tokens

Return the number of construction tokens.

train_batch(algorithm='recursive', algorithm_params=(), finish_threshold=0.005, max_epochs=None)

Train the model in batch fashion.

The model is trained with the data already loaded into the model (by using an existing model or calling one of the load_ methods).

In each iteration (epoch) all compounds in the training data are optimized once, in a random order. If applicable, corpus weight, annotation cost, and random split counters are recalculated after each iteration.

Parameters:
  • algorithm – string in (‘recursive’, ‘viterbi’) that indicates the splitting algorithm used.
  • algorithm_params – parameters passed to the splitting algorithm.
  • finish_threshold – the stopping threshold. Training stops when the improvement of the last iteration is smaller than finish_threshold * #boundaries
  • max_epochs – maximum number of epochs to train
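
A minimal batch-training sketch (training_data is a hypothetical corpus file):

import morfessor

io = morfessor.MorfessorIO()
model = morfessor.BaselineModel()
model.load_data(io.read_corpus_file('training_data'))
model.train_batch(algorithm='recursive', max_epochs=10)
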
train_online(data, count_modifier=None, epoch_interval=10000, algorithm='recursive', algorithm_params=(), init_rand_split=None, max_epochs=None)

Train the model in online fashion.

The model is trained with the data provided in the data argument. For example, the data could come from a generator linked to standard input, for live monitoring of the splitting.

All compounds from data are only optimized once. After online training, batch training could be used for further optimization.

Epochs are defined as a fixed number of compounds. After each epoch (as in batch training), the annotation cost and random split counters are recalculated if applicable.

Parameters:
  • data – iterator of (_, compound_atoms) tuples. The first element is ignored, as every occurrence of the compound is taken with count 1
  • count_modifier – function for adjusting the counts of each compound
  • epoch_interval – number of compounds to process before starting a new epoch
  • algorithm – string in (‘recursive’, ‘viterbi’) that indicates the splitting algorithm used.
  • algorithm_params – parameters passed to the splitting algorithm.
  • init_rand_split – probability of splitting a compound at any given position when initializing the model. None or 0 means no random splitting.
  • max_epochs – maximum number of epochs to train
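
A minimal online-training sketch (training_data is a hypothetical corpus file):

import morfessor

io = morfessor.MorfessorIO()
model = morfessor.BaselineModel()
# read_corpus_files yields (1, compound_atoms) tuples, as train_online expects
model.train_online(io.read_corpus_files(['training_data']), epoch_interval=5000)
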
types

Return the number of construction types.

viterbi_nbest(compound, n, addcount=1.0, maxlen=30)

Find top-n optimal segmentations using the Viterbi algorithm.

Parameters:
  • compound – compound to be segmented
  • n – how many segmentations to return
  • addcount – constant for additive smoothing (0 = no smoothing)
  • maxlen – maximum length for the constructions

If additive smoothing is applied, new complex construction types can be selected during the search. Without smoothing, only new single-atom constructions can be selected.

Returns the n most probable segmentations and their log-probabilities.

viterbi_segment(compound, addcount=1.0, maxlen=30)

Find optimal segmentation using the Viterbi algorithm.

Parameters:
  • compound – compound to be segmented
  • addcount – constant for additive smoothing (0 = no smoothing)
  • maxlen – maximum length for the constructions

If additive smoothing is applied, new complex construction types can be selected during the search. Without smoothing, only new single-atom constructions can be selected.

Returns the most probable segmentation and its log-probability.
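
A usage sketch (model.bin is a hypothetical file name; the loop assumes viterbi_nbest returns a list of (segmentation, log-probability) pairs):

import morfessor

io = morfessor.MorfessorIO()
model = io.read_binary_model_file('model.bin')

segmentation, logp = model.viterbi_segment('uncarefully')
print(segmentation, logp)

# Top-3 analyses of the same compound
for seg, cost in model.viterbi_nbest('uncarefully', 3):
    print(seg, cost)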

class morfessor.baseline.ConstrNode(rcount, count, splitloc)

count

Alias for field number 1

rcount

Alias for field number 0

splitloc

Alias for field number 2

class morfessor.baseline.CorpusEncoding(lexicon_encoding, weight=1.0)

Encoding the cost of the corpus.

The basic difference from a normal encoding is that the number of types is not stored directly but fetched from the lexicon encoding. Also, the cost function does not contain any permutation cost.

frequency_distribution_cost()

Calculate -log[(M - 1)! (N - M)! / (N - 1)!] for M types and N tokens.
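
A standalone sketch of this quantity using log-gamma for numerical stability; this only illustrates the formula (natural logarithm assumed) and is not Morfessor's internal implementation:

import math

def frequency_distribution_cost(M, N):
    # -log[(M-1)! (N-M)! / (N-1)!], using k! = gamma(k+1) and lgamma
    return -(math.lgamma(M) + math.lgamma(N - M + 1) - math.lgamma(N))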

get_cost()

Override of the Encoding get_cost function; a corpus does not have a permutation cost.

types

Return the number of types in the corpus, which is the same as the number of boundaries in the lexicon + 1.

class morfessor.baseline.Encoding(weight=1.0)

Base class for calculating the entropy (encoding length) of a corpus or lexicon.

Commonly subclassed to redefine specific methods.

frequency_distribution_cost()

Calculate -log[(u - 1)! (v - u)! / (v - 1)!]

where v is the number of tokens plus boundaries and u is the number of types.

get_cost()

Calculate the cost for encoding the corpus/lexicon.

permutations_cost()

The permutations cost for the encoding.

types

Define the number of types as 0. types is made a property method to allow easy redefinition in subclasses.

update_count(construction, old_count, new_count)

Update the counts in the encoding.

class morfessor.baseline.LexiconEncoding

Class for calculating the encoding cost of the lexicon.

add(construction)

Add a construction to the lexicon, automatically updating the counts of its atoms.

get_codelength(construction)

Return an approximate codelength for new construction.

remove(construction)

Remove a construction from the lexicon, automatically updating the counts of its atoms.

types

Return the number of different atoms in the lexicon + 1 for the compound-end token.

Evaluation classes

class morfessor.evaluation.EvaluationConfig(num_samples, sample_size)

num_samples

Alias for field number 0

sample_size

Alias for field number 1

class morfessor.evaluation.MorfessorEvaluation(reference_annotations)

Evaluate one model on one test set. The basic procedure is to create, in a stable manner, a number of samples and to evaluate them independently. The stable selection of samples makes it possible to use the resulting values for pairwise statistical significance testing.

reference_annotations is a standard annotation dictionary: {compound: ([annotation1], …)}.

evaluate_model(model, configuration=EvaluationConfig(num_samples=10, sample_size=1000), meta_data=None)

Get the predictions for the test samples from the model and perform the evaluation.

The meta_data object should preferably contain at least the key 'name'.

evaluate_segmentation(segmentation, configuration=EvaluationConfig(num_samples=10, sample_size=1000), meta_data=None)

Method for evaluating an existing segmentation.

get_samples(configuration=EvaluationConfig(num_samples=10, sample_size=1000))

Get a list of samples. A sample is a list of compounds.

This method is stable, so each time it is called with a specific test set and configuration it will return the same samples. The method also caches the samples in the _samples variable.
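
A sketch of evaluating a trained model with a custom sampling configuration (the model and gold_std files are hypothetical):

import morfessor
from morfessor.evaluation import EvaluationConfig

io = morfessor.MorfessorIO()
model = io.read_binary_model_file('model.bin')
ev = morfessor.MorfessorEvaluation(io.read_annotations_file('gold_std'))
result = ev.evaluate_model(model, configuration=EvaluationConfig(num_samples=5, sample_size=500))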

class morfessor.evaluation.MorfessorEvaluationResult(meta_data=None)

A MorfessorEvaluationResult is returned by a MorfessorEvaluation object. Its purpose is to store the evaluation data and provide convenient formatting options.

Each MorfessorEvaluationResult contains the data of one evaluation (which can have multiple samples).

add_data_point(precision, recall, f_score, sample_size)

Method used by MorfessorEvaluation to add the results of a single sample to the object.

format(format_string)

Format this object. The format string can refer to any of the result variables, e.g. fscore_avg or precision_values, as well as to any item from the metadata.
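
For instance, assuming str.format-style named fields (fscore_avg is the only field name documented here; others follow the same pattern):

# 'result' is a MorfessorEvaluationResult, e.g. from evaluate_model above
print(result.format('average F-score: {fscore_avg}'))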

class morfessor.evaluation.WilcoxonSignedRank

Class for doing statistical significance testing with the Wilcoxon signed-rank test.

It implements the Pratt method for handling zero-differences and applies a 0.5 continuity correction for the z-statistic.

static print_table(results)

Nicely format a results table as returned by significance_test.

significance_test(evaluations, val_property='fscore_values', name_property='name')

Takes a set of evaluations (which should have the same test configuration) and calculates the p-values for the Wilcoxon signed-rank test.

Returns a dictionary with (name1,name2) keys and p-values as values.

Code examples for using the library interface

Segmenting new data using an existing model

import morfessor

io = morfessor.MorfessorIO()

model = io.read_binary_model_file('model.bin')

words = ['words', 'segmenting', 'morfessor', 'unsupervised']

for word in words:
    print(model.viterbi_segment(word))

Testing type vs token models

import math
import morfessor

io = morfessor.MorfessorIO()

train_data = list(io.read_corpus_file('training_data'))

model_types = morfessor.BaselineModel()
model_logtokens = morfessor.BaselineModel()
model_tokens = morfessor.BaselineModel()

model_types.load_data(train_data, count_modifier=lambda x: 1)
def log_func(x):
    return int(round(math.log(x + 1, 2)))
model_logtokens.load_data(train_data, count_modifier=log_func)
model_tokens.load_data(train_data)

models = [model_types, model_logtokens, model_tokens]

for model in models:
    model.train_batch()

goldstd_data = io.read_annotations_file('gold_std')
ev = morfessor.MorfessorEvaluation(goldstd_data)
results = [ev.evaluate_model(m) for m in models]

wsr = morfessor.WilcoxonSignedRank()
r = wsr.significance_test(results)
wsr.print_table(r)

The equivalent of this on the command line would be:

morfessor-train -s model_types -d ones training_data
morfessor-train -s model_logtokens -d log training_data
morfessor-train -s model_tokens training_data

morfessor-evaluate gold_std model_types model_logtokens model_tokens

Testing different amounts of supervision data