Getting started

A framework for assisting with transliteration and character sets in Python.

PyLeLemmatize is a Python package for lemmatizing characters. It provides a simple and efficient way to reduce large character sets to simpler ones.

Installation

Install from PyPI

To install PyLeLemmatize from PyPI:

pip install pylelemmatize

To install for development, see the Development Installation section.

Python Usage

Simple letter lemmatization

import pylelemmatize as ll

greek_poly_string = "Καὶ ὅτε ἤνοιξεν τὴν σφραγῖδα τὴν ἑβδόμην, ἐγένετο σιγὴ ἐν τῷ οὐρανῷ ὡς ἡμιώριον."

print(f"Polytonic   : {greek_poly_string}")
print(f"Modern Greek: {ll.llemmatize(greek_poly_string, ll.charsets.iso_8859_7)}")
print(f"ASCII       : {ll.llemmatize(greek_poly_string, ll.charsets.ascii)}")

Output:

Polytonic   : Καὶ ὅτε ἤνοιξεν τὴν σφραγῖδα τὴν ἑβδόμην, ἐγένετο σιγὴ ἐν τῷ οὐρανῷ ὡς ἡμιώριον.
Modern Greek: Καί ότε ήνοιξεν τήν σφραγίδα τήν έβδόμην, έγένετο σιγή έν τώ ούρανώ ώς ήμιώριον.
ASCII       : Kai ote enoixen ten spragida ten ebdomen, egeneto sige en to ourano os emiorion.
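
As a rough intuition for what mapping text onto a smaller character set involves, the sketch below folds each character to its base letter with the standard library's unicodedata module and keeps it only if the target alphabet contains it. This is not pylelemmatize's algorithm; the names fold_char and fold_to_alphabet, the "?" placeholder and the plain_greek alphabet are assumptions made for illustration.

```python
import unicodedata


def fold_char(ch: str) -> str:
    """Strip combining marks from a single character (e.g. 'ὅ' -> 'ο')."""
    decomposed = unicodedata.normalize("NFD", ch)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return base or ch


def fold_to_alphabet(text: str, alphabet: str, unknown: str = "?") -> str:
    """Map every character onto `alphabet`, trying the base letter as a fallback."""
    allowed = set(alphabet)
    out = []
    # Work on precomposed characters so each accented letter is one code point.
    for ch in unicodedata.normalize("NFC", text):
        if ch in allowed:
            out.append(ch)
        elif fold_char(ch) in allowed:
            out.append(fold_char(ch))
        else:
            out.append(unknown)
    return "".join(out)


# Hypothetical target alphabet: unaccented lowercase Greek plus the space.
plain_greek = "αβγδεζηθικλμνξοπρστυφχψως "
print(fold_to_alphabet("τὴν σφραγῖδα", plain_greek))
```

Characters with no counterpart in the target alphabet come out as the placeholder, which is where a real alignment between alphabets has to do better than simple accent stripping.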

Efficient letter lemmatization

Creating automatic llemmatizers is expensive: O(|input_alphabet| x |output_alphabet|). Once created, they run equally fast regardless of alphabet size. The following IPython code snippet demonstrates the cost of creating versus applying llemmatizers.

import pylelemmatize as ll

greek_poly_string = "Καὶ ὅτε ἤνοιξεν τὴν σφραγῖδα τὴν ἑβδόμην, ἐγένετο σιγὴ ἐν τῷ οὐρανῷ ὡς ἡμιώριον."

print("Creating autoaligned llemmatizers O(|src_alphabet|x|dst_alphabet|)")
print("Medium llemmatizer: |34|x|186|")
%timeit polytonic2modern_greek = ll.llemmatizer(greek_poly_string, ll.charsets.iso_8859_7)
polytonic2modern_greek = ll.llemmatizer(greek_poly_string, ll.charsets.iso_8859_7)

print("Large llemmatizer: |100|x|3549|")
%timeit mes2ascii = ll.llemmatizer(ll.charsets.mes3a, ll.charsets.ascii)
mes2ascii = ll.llemmatizer(ll.charsets.mes3a, ll.charsets.ascii)

print("\nApplying the medium and large llemmatizers on strings:")
for inp_str in [greek_poly_string, greek_poly_string * 1000, greek_poly_string * 1000000]:
    modern_greek_str =  polytonic2modern_greek(inp_str)
    print(f"\nString size: {len(inp_str)}")
    %timeit modern_greek_str =  polytonic2modern_greek(inp_str)
    modern_greek_str =  polytonic2modern_greek(inp_str)
    %timeit modern_greek_str =  mes2ascii(inp_str)

Output:

Creating autoaligned llemmatizers O(|src_alphabet|x|dst_alphabet|)
Medium llemmatizer: |34|x|186|
1.97 s ± 18.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Large llemmatizer: |100|x|3549|
46.2 s ± 1 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
Applying the medium and large llemmatizers on strings:

String size: 80
6.06 μs ± 48.1 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
5.94 μs ± 65 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

String size: 80000
361 μs ± 6.79 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
397 μs ± 3.3 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

String size: 80000000
499 ms ± 984 μs per loop (mean ± std. dev. of 7 runs, 1 loop each)
521 ms ± 13.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
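
The pattern behind these numbers (pay a one-off cost to build a lookup table, then apply it at a cost that depends only on the string length) can be reproduced outside IPython with the standard library. The sketch below is an analogy built on str.maketrans and timeit, not pylelemmatize's implementation; build_table and its base-letter matching rule are assumptions.

```python
import timeit
import unicodedata


def build_table(src_alphabet: str, dst_alphabet: str, unknown: str = "?") -> dict:
    """Build a str.translate table once; this is the one-off step."""
    dst = set(dst_alphabet)
    mapping = {}
    for ch in set(src_alphabet):
        # Assumed matching rule: keep the character, else try its base letter.
        base = unicodedata.normalize("NFD", ch)[0]
        mapping[ch] = ch if ch in dst else base if base in dst else unknown
    return str.maketrans(mapping)


# Normalize to precomposed form so every accented letter is one code point.
text = unicodedata.normalize("NFC", "τὴν σφραγῖδα τὴν ἑβδόμην")
table = build_table(text, "αβγδεζηθικλμνξοπρστυφχψως ")

# Applying the table is a per-character lookup, whatever the alphabet sizes were.
print(text.translate(table))
for repeat in (1, 1000):
    sample = text * repeat
    seconds = timeit.timeit(lambda: sample.translate(table), number=100)
    print(f"{len(sample):>6} chars: {seconds / 100 * 1e6:.1f} µs per call")
```

Timings vary by machine, so none are quoted here; the point is only that the table is built once and reused for every string.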

PHOC string embedding

Pyramidal Histogram of Characters (PHOC) embeddings have been implemented as a PyTorch layer.

import torch
import pylelemmatize as ll
phoc = ll.PHOC()
print(torch.norm(phoc("hello")-phoc("hell")))
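
For readers unfamiliar with PHOC, the sketch below shows the commonly used construction in plain Python: the word is split into 1, 2, 3, ... equal regions, each region gets a binary histogram over the alphabet, and a character counts for a region when at least half of its span falls inside it. It is not the library's PyTorch layer; the name phoc_vector, the lowercase alphabet and the three levels are assumptions, and ll.PHOC may use different defaults.

```python
import math
import string


def phoc_vector(word, alphabet=string.ascii_lowercase, levels=(1, 2, 3)):
    """Binary PHOC: one character histogram per region, for every pyramid level."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    n = len(word)
    vector = []
    for level in levels:
        for region in range(level):
            hist = [0.0] * len(alphabet)
            for pos, ch in enumerate(word):
                if ch not in index:
                    continue
                # Work in units of 1 / (n * level) so the test stays in integers:
                # the character spans [pos*level, (pos+1)*level],
                # the region spans [region*n, (region+1)*n].
                overlap = min((pos + 1) * level, (region + 1) * n) - max(pos * level, region * n)
                if 2 * overlap >= level:  # at least half of the character is inside
                    hist[index[ch]] = 1.0
            vector.extend(hist)
    return vector


hello, hell = phoc_vector("hello"), phoc_vector("hell")
print(len(hello))              # len(alphabet) * sum(levels)
print(math.dist(hello, hell))  # Euclidean distance between the two embeddings
```

Because the vector has a fixed length regardless of word length, words of different sizes can be compared directly, which is what the norm of the difference in the snippet above relies on.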

Command Line Invocation

Demapping

RNNs that reverse character mappings can be trained and used from the command line without writing any code.

Setup

mkdir -p tmp/models
wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt -O ./tmp/tinyshakespeare.txt
cat ./tmp/tinyshakespeare.txt |shuf  --random-source ./tmp/tinyshakespeare.txt > ./tmp/tinyshakespeare_shuf.txt
head -n 1000 ./tmp/tinyshakespeare_shuf.txt > ./tmp/tinyshakespeare_test.txt
tail -n +1001 ./tmp/tinyshakespeare_shuf.txt > ./tmp/tinyshakespeare_exper.txt

Train a demapper

A GPU is used automatically if one is found.

ll_train_one2one -corpus_files ./tmp/tinyshakespeare_exper.txt -output_model_path ./tmp/models/toy_model.pt -input_alphabet '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!",.:;? ' -output_alphabet '0abcdfghjklmnpqrstvwxz.' -nb_epochs 3

If a model has been trained for fewer than nb_epochs epochs, training resumes.

ll_train_one2one -corpus_files ./tmp/tinyshakespeare_exper.txt -output_model_path ./tmp/models/toy_model.pt -input_alphabet '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!",.:;? ' -output_alphabet '0abcdfghjklmnpqrstvwxz.' -nb_epochs 50

Use a demapper

A demapper can be used on streams or files.

echo 'a da nat knaw what ta saa. bat a knaw what ta thank.' |ll_infer_one2one -model_path ./tmp/models/toy_model.pt

Output:

I do not know what to say, but a know what to think,
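
The input piped into ll_infer_one2one above looks like English whose vowels have all been folded onto "a". A demapper learns to undo this kind of many-to-one character mapping. The sketch below produces such degraded text with a hand-written mapping; the rule set (vowels and y to a, digits to 0, punctuation to ., lowercasing) is an assumption chosen to reproduce that sentence, not the mapping ll_train_one2one derives from -input_alphabet and -output_alphabet.

```python
import string

# Hypothetical many-to-one mapping, hand-written for illustration only.
VOWELS = "aeiouy"


def build_lossy_table() -> dict:
    """Fold letters, digits and punctuation onto a much smaller alphabet."""
    mapping = {}
    for ch in string.ascii_letters:
        lower = ch.lower()
        mapping[ch] = "a" if lower in VOWELS else lower
    for ch in string.digits:
        mapping[ch] = "0"
    for ch in '!",.:;?':
        mapping[ch] = "."
    return str.maketrans(mapping)


LOSSY = build_lossy_table()


def degrade(text: str) -> str:
    """Apply the lossy mapping; unmapped characters (e.g. spaces) pass through."""
    return text.translate(LOSSY)


original = "I do not know what to say, but I know what to think."
print(degrade(original))
```

Several originals collapse onto the same degraded string ("bat", "but" and "bit" all become "bat"), so reversing the mapping needs context, which is why a sequence model is used.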

Evaluate Merges

Evaluate the character error rate (CER) introduced by merging multiple symbols into a single one.

ll_evaluate_merges -h # print the CLI help text
ll_evaluate_merges -corpus_glob  './sample_data/wienocist_charter_1/wienocist_charter_1*' -merges '[("u", "v"),  ("U", "V")]'

Note that the merge CER is not symmetric at all!

# The following gives a CER of 0.0591
ll_evaluate_merges -corpus_glob  './sample_data/wienocist_charter_1/wienocist_charter_1*' -merges '[("I", "J"), ("i", "j")]'
# While the following gives a CER of 0.0007
ll_evaluate_merges -corpus_glob  './sample_data/wienocist_charter_1/wienocist_charter_1*' -merges '[("J", "I"), ("j", "i")]'
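
The asymmetry follows from symbol frequencies: replacing a common symbol with a rare one changes many characters, while the reverse changes few. The sketch below illustrates this on a toy string. It assumes each pair reads (source, target), meaning the first symbol is replaced by the second, and it counts position-wise differences, which is a simplification of an edit-distance-based CER; merge_cer is a hypothetical helper, not part of the CLI.

```python
def merge_cer(text: str, merges: list) -> float:
    """Fraction of characters changed when each (source, target) pair is applied."""
    # All pairs are applied simultaneously through one translation table.
    table = str.maketrans({src: dst for src, dst in merges})
    merged = text.translate(table)
    # A merge substitutes one character for another, so lengths stay equal
    # and the changed positions can be counted directly.
    changed = sum(a != b for a, b in zip(text, merged))
    return changed / len(text) if text else 0.0


sample = "iiiiiiiij."  # toy corpus: 'i' is common, 'j' is rare
print(merge_cer(sample, [("i", "j")]))  # 0.8
print(merge_cer(sample, [("j", "i")]))  # 0.1
```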

Extract corpus alphabet

ll_extract_corpus_alphabet -h # print the CLI help text
ll_extract_corpus_alphabet -corpus_glob './sample_data/wienocist_charter_1/wienocist_charter_1*'
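
Conceptually, a corpus alphabet is the set of distinct characters found in the matched files. The sketch below shows that idea with the standard library on two temporary files; corpus_alphabet is a hypothetical helper, and the actual output format and options of ll_extract_corpus_alphabet are not assumed here.

```python
import glob
import tempfile
from collections import Counter
from pathlib import Path


def corpus_alphabet(pattern: str, encoding: str = "utf-8") -> Counter:
    """Count every character in the text files matching a glob pattern."""
    counts = Counter()
    for path in sorted(glob.glob(pattern)):
        if Path(path).is_file():
            counts.update(Path(path).read_text(encoding=encoding))
    return counts


# Toy corpus written to a temporary directory so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "page_1.txt").write_text("Anno domini", encoding="utf-8")
    Path(tmp, "page_2.txt").write_text("millesimo", encoding="utf-8")
    counts = corpus_alphabet(f"{tmp}/page_*.txt")

alphabet = "".join(sorted(counts))
print(repr(alphabet))
```

Keeping the counts rather than only the set makes it easy to spot rare symbols that may be worth merging.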

Test corpus on alphabets

ll_test_corpus_on_alphabets -h # print the CLI help text
ll_test_corpus_on_alphabets -corpus_glob './sample_data/wienocist_charter_1/wienocist_charter_1*' -alphabets 'mufibmp,ascii,mes1,iso_8859_2' -verbose
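
One plausible reading of testing a corpus against alphabets is measuring how much of the text each alphabet can represent. The sketch below computes that coverage using Python's built-in codecs as stand-ins for alphabets. This is an assumption about the idea, not a description of what ll_test_corpus_on_alphabets reports; only ascii and iso_8859_2 from the command above have standard-library codec equivalents, and alphabet_coverage is a hypothetical helper.

```python
def alphabet_coverage(text: str, codec: str) -> float:
    """Fraction of characters in `text` that `codec` can encode."""
    if not text:
        return 1.0
    encodable = 0
    for ch in text:
        try:
            ch.encode(codec)
            encodable += 1
        except UnicodeEncodeError:
            pass  # character lies outside this alphabet
    return encodable / len(text)


sample = "V\u00e1clav IV."  # 'á' is outside ASCII but inside ISO 8859-2
for codec in ("ascii", "iso8859_2", "iso8859_7"):
    print(f"{codec:>10}: {alphabet_coverage(sample, codec):.2f}")
```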

Development

Development Installation

To extend pylelemmatize, install it from GitHub.

git clone git@github.com:anguelos/pylelemmatize.git
cd pylelemmatize
pip install -r requirements
pip install -r ./docs/requirements.txt
pip install -e .

This will install pylelemmatize on your system in development mode.

Testing

Running the unit tests

pytest --cov ./src/pylelemmatize/ ./test/pytest/

Running shell script tests

This runs all bash scripts with -h, essentially checking syntax and imports.

./test/test_shell_scripts.sh