pylelemmatize

Public API (summary)

Classes

class pylelemmatize.AbstractLemmatizer(unicode_normalization='Dense', unknown_chr='�')

Bases: ABC

Abstract base class for lemmatizers that map characters from a source alphabet to a destination alphabet.

Parameters:
  • unicode_normalization (Literal['Dense', 'Composite', None])

  • unknown_chr (str)

src_alphabet_str

The source alphabet string.

Type:

str

dst_alphabet_str

The destination alphabet string.

Type:

str

unknown_chr

The character used for unknown mappings. Default is “�”.

Type:

str

normalize_unicode

Function to normalize Unicode strings.

Type:

Callable[[str], str]

classmethod fast_alphabet_extraction(text)
Return type:

str

Parameters:

text (str)
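
The entry above gives only the signature. As a rough illustration of what extracting an alphabet from a text can mean, the sketch below collects the distinct characters of a string in code-point order. The helper name `extract_alphabet` is hypothetical and the ordering is an assumption; the library's actual implementation and output order may differ.

```python
def extract_alphabet(text: str) -> str:
    """Return the distinct characters of `text`, sorted by code point.

    Hypothetical stand-in, not the library's implementation.
    """
    return "".join(sorted(set(text)))


print(extract_alphabet("banana"))  # abn
```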

property unicode_normalization: Literal['Dense', 'Composite', None]
abstractmethod __call__(text)

Convert text to the alphabet representation.

Return type:

str

Parameters:

text (str)
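
A minimal sketch of the idea behind this call, assuming the conversion is a per-character lookup in which anything without a mapping falls back to unknown_chr. The function `map_text` and the sample mapping are hypothetical; concrete subclasses may implement the lookup differently.

```python
from typing import Dict


def map_text(text: str, mapping: Dict[str, str], unknown_chr: str = "\ufffd") -> str:
    """Map each character through `mapping`; unmapped characters become `unknown_chr`."""
    return "".join(mapping.get(ch, unknown_chr) for ch in text)


# Illustrative mapping: long s and upper-case A collapse onto plain lower-case letters.
mapping = {"ſ": "s", "s": "s", "a": "a", "A": "a"}
print(map_text("Aſs", mapping))  # ass
print(map_text("Az", mapping))   # 'z' has no mapping, so it becomes U+FFFD
```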

abstract property src_alphabet_str: str
abstract property dst_alphabet_str: str
property unknown_chr: str
property alphabet_tsv: str
property mapping_tsv: str
get_unigram(text)
Return type:

Tuple[ndarray, ndarray, ndarray]

Parameters:

text (str)

get_cer(pred, true)
Return type:

float

Parameters:
  • pred (str)

  • true (str)

get_encoding_information_loss(text)
Return type:

float

Parameters:

text (str)

class pylelemmatize.LemmatizerBMP(mapping_dict={}, unknown_chr='�', unicode_normalization='Dense')

Bases: GenericLemmatizer

Parameters:
  • mapping_dict (Dict[str, str])

  • unknown_chr (str)

  • unicode_normalization (Literal['Dense', 'Composite', None])

static alphabet_in_bmp(alphabet)

Check if all characters in the given alphabet are within the BMP (Basic Multilingual Plane).

Parameters:

alphabet (Optional[str]) – A string containing the alphabet to check. If None, the method returns True.

Returns:

True if all characters are within the BMP, False otherwise.

Return type:

bool
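
The check described above can be sketched without the library: a character lies in the BMP when its code point is at most U+FFFF. The function name `in_bmp` is hypothetical, and the sketch only mirrors the documented behaviour (including the None case), not the actual implementation.

```python
from typing import Optional


def in_bmp(alphabet: Optional[str]) -> bool:
    """True if every character has a code point <= 0xFFFF; None counts as True."""
    if alphabet is None:
        return True
    return all(ord(ch) <= 0xFFFF for ch in alphabet)


print(in_bmp("abc\u00e9"))     # True
print(in_bmp("a\U0001F600"))   # False: U+1F600 lies outside the BMP
```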

__call__(text)

Transform the input text using the lemmatizer.

Parameters:

text (str) – The input text to transform.

Returns:

The transformed text.

Return type:

str

str_to_intlabel_seq(text)

Convert a string to a sequence of integer labels.

Parameters:

text (str) – The input string to convert.

Returns:

A NumPy array of integer labels representing the input string.

Return type:

np.ndarray

intlabel_seq_to_str(dense_np_text)

Convert a sequence of integer labels back to a string.

Parameters:

dense_np_text (np.ndarray) – A NumPy array of integer labels to convert.

Returns:

The reconstructed string.

Return type:

str
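
The round trip between strings and integer labels can be pictured as follows. This is a sketch under stated assumptions: label 0 is reserved for the unknown character, the remaining labels follow the order of the destination alphabet, and plain lists stand in for NumPy arrays. All helper names are hypothetical.

```python
from typing import Dict, List, Tuple


def build_label_tables(dst_alphabet: str, unknown_chr: str = "\ufffd") -> Tuple[List[str], Dict[str, int]]:
    """Return (label -> char list, char -> label dict); label 0 is the unknown character."""
    chars = [unknown_chr] + [c for c in dst_alphabet if c != unknown_chr]
    return chars, {c: i for i, c in enumerate(chars)}


def str_to_labels(text: str, char_to_label: Dict[str, int]) -> List[int]:
    """Characters outside the alphabet map to label 0."""
    return [char_to_label.get(c, 0) for c in text]


def labels_to_str(labels: List[int], chars: List[str]) -> str:
    return "".join(chars[i] for i in labels)


chars, char_to_label = build_label_tables("abc")
labels = str_to_labels("cabx", char_to_label)
print(labels)  # [3, 1, 2, 0]
print(labels_to_str(labels, chars) == "cab\ufffd")  # True
```

Note that the round trip is lossy for characters outside the alphabet: they all come back as the unknown character.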

get_unigram(text)

Compute unigram statistics for the input text.

Parameters:

text (str) – The input text to analyze.

Returns:

  • values : np.ndarray Unique integer labels in the text.

  • counts : np.ndarray Counts of each unique label.

  • labels : np.ndarray Mapping of integer labels to their corresponding characters.

Return type:

Tuple[np.ndarray, np.ndarray, np.ndarray]
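
To make the three return values concrete, here is a sketch that computes the same kind of statistics with the standard library. It assumes that a character's label is its index in the alphabet and that characters outside the alphabet are skipped; lists stand in for NumPy arrays, and the function name `unigram` is hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def unigram(text: str, alphabet: str) -> Tuple[List[int], List[int], List[str]]:
    """Return (values, counts, labels) for `text` over `alphabet`."""
    char_to_label = {c: i for i, c in enumerate(alphabet)}
    counter = Counter(char_to_label[c] for c in text if c in char_to_label)
    values = sorted(counter)                # unique labels that occur in the text
    counts = [counter[v] for v in values]   # how often each of them occurs
    labels = list(alphabet)                 # label -> character lookup
    return values, counts, labels


values, counts, labels = unigram("abca", "abcd")
print(values)  # [0, 1, 2]
print(counts)  # [2, 1, 1]
print(labels)  # ['a', 'b', 'c', 'd']
```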

str_to_onehot(text, time_first=True)

Convert a string to a one-hot encoded representation.

Parameters:
  • text (str) – The input string to convert.

  • time_first (bool, optional) – If True, the output array will have shape (T, C), where T is the length of the string and C is the number of unique characters. If False, the output will have shape (C, T). Defaults to True.

Returns:

A one-hot encoded NumPy array representing the input string.

Return type:

np.ndarray

onehot_to_str(onehot, time_first=True)

Convert a one-hot encoded representation back to a string.

Parameters:
  • onehot (np.ndarray) – A one-hot encoded NumPy array to convert.

  • time_first (bool, optional) – If True, the input array is expected to have shape (T, C). If False, it is expected to have shape (C, T). Defaults to True.

Returns:

The reconstructed string.

Return type:

str
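
The effect of time_first on the array layout can be illustrated with nested lists in place of NumPy arrays. The sketch assumes every character of the text is in the alphabet and that a character's class index is its position in the alphabet; the functions `encode_onehot` and `decode_onehot` are hypothetical.

```python
from typing import List


def encode_onehot(text: str, alphabet: str, time_first: bool = True) -> List[List[int]]:
    """Return a (T, C) matrix, or its (C, T) transpose when time_first is False."""
    index = {c: i for i, c in enumerate(alphabet)}
    rows = [[1 if index[c] == k else 0 for k in range(len(alphabet))] for c in text]
    return rows if time_first else [list(col) for col in zip(*rows)]


def decode_onehot(onehot: List[List[int]], alphabet: str, time_first: bool = True) -> str:
    """Invert encode_onehot by locating the single 1 in each time step."""
    rows = onehot if time_first else [list(col) for col in zip(*onehot)]
    return "".join(alphabet[row.index(1)] for row in rows)


tc = encode_onehot("cab", "abc")                    # shape (T, C)
ct = encode_onehot("cab", "abc", time_first=False)  # shape (C, T)
print(tc)  # [[0, 0, 1], [1, 0, 0], [0, 1, 0]]
print(ct)  # [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
print(decode_onehot(ct, "abc", time_first=False))   # cab
```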

property dst_alphabet_str: str

Get the destination alphabet as a string.

Returns:

The destination alphabet string.

Return type:

str

property src_alphabet_str: str

Get the source alphabet as a string.

Returns:

The source alphabet string.

Return type:

str

class pylelemmatize.GenericLemmatizer(mapping_dict={}, unknown_chr='�', unicode_normalization='Dense')

Bases: AbstractLemmatizer

Parameters:
  • mapping_dict (Dict[str, str])

  • unknown_chr (str)

  • unicode_normalization (Literal['Dense', 'Composite', None])

classmethod from_alphabet_mapping(src_alphabet_str, dst_alphabet_str=None, unknown_chr='�', override_map=None, min_similarity=0.25, verbose=0)
Return type:

GenericLemmatizer

Parameters:
  • src_alphabet_str (str)

  • dst_alphabet_str (str | None)

  • unknown_chr (str)

  • override_map (Dict[str, str] | None)

  • min_similarity (float)

  • verbose (int)

copy_removing_unused_inputs(txt)
Return type:

Any

Parameters:

txt (str)

__len__()

Return the size of the destination alphabet.

Return type:

int

__call__(text)

Convert text to the alphabet representation.

Return type:

str

Parameters:

text (str)

property src_alphabet_str: str
property dst_alphabet_str: str
class pylelemmatize.Seq2SeqDs(text_blocks, input_mapper=None, output_mapper=None, min_input_seqlen=50, min_output_seqlen=50, one2one_mapping=None, crop_to_seqlen=None, input_is_onehot=False, output_is_onehot=False)

Bases: object

Parameters:
  • text_blocks (Tuple[List[str], List[str]])

  • input_mapper (LemmatizerBMP | None)

  • output_mapper (LemmatizerBMP | None)

  • min_input_seqlen (int)

  • min_output_seqlen (int)

  • one2one_mapping (bool | None)

  • crop_to_seqlen (int | None)

  • input_is_onehot (bool)

  • output_is_onehot (bool)

static load_icdar2019_parallel_txt_corpus(input_paths, max_insertions, min_length, max_length)
Return type:

List[Tuple[List[str], List[str]]]

Parameters:
  • input_paths (str | List[str])

  • max_insertions (int)

  • min_length (int)

  • max_length (int)

static load_parallel_txt_corpus(input_glob, output_glob, check_integrity='cleanup')
Return type:

List[Tuple[List[str], List[str]]]

Parameters:
  • input_glob (str | List[str])

  • output_glob (str | List[str])

  • check_integrity (Literal['cleanup', 'raise', 'ignore'])

static from_parallel_txt_corpus(input_glob, output_glob, **kwargs)
Return type:

Seq2SeqDs

Parameters:
  • input_glob (str | List[str])

  • output_glob (str | List[str])

static create_selfsupervised_ds(corpus, mapper, mapped_is_input=True, add_all_occuring_to_input=True, **kwargs)
Return type:

Seq2SeqDs

Parameters:
  • corpus (List[str])

  • mapper (LemmatizerBMP)

  • mapped_is_input (bool)

  • add_all_occuring_to_input (bool)

shuffle()
Return type:

None

split(train_ratio=0.8, shuffle=True)
Return type:

Tuple[Seq2SeqDs, Seq2SeqDs]

Parameters:
  • train_ratio (float)

  • shuffle (bool)
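
The split entry documents only its signature. A ratio-based split is conventionally done as in the sketch below, which optionally shuffles the samples and then cuts the list at train_ratio. The function `split_items` and its `seed` argument are hypothetical; how the library rounds the cut point or seeds its shuffle is not stated here.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_items(items: Sequence[T], train_ratio: float = 0.8,
                shuffle: bool = True, seed: int = 0) -> Tuple[List[T], List[T]]:
    """Return (train, validation) parts of `items`."""
    pool = list(items)
    if shuffle:
        random.Random(seed).shuffle(pool)  # local RNG keeps the split reproducible
    cut = int(len(pool) * train_ratio)
    return pool[:cut], pool[cut:]


train, valid = split_items(range(10), train_ratio=0.8, shuffle=False)
print(train)  # [0, 1, 2, 3, 4, 5, 6, 7]
print(valid)  # [8, 9]
```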

compute_ds_CER(use_editdistance=False)

Compute the Character Error Rate (CER) of the dataset.

Return type:

float

Parameters:

use_editdistance (bool)
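
For orientation, CER is commonly defined as the edit distance between prediction and reference divided by the reference length. The sketch below computes a corpus-level CER by summing errors and reference characters over all pairs. Whether the library aggregates in this way, and what use_editdistance changes, is not stated in this entry, so treat the aggregation as an assumption; both function names are hypothetical.

```python
from typing import Iterable, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance using two rolling rows of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # drop a character of `a`
                           cur[j - 1] + 1,             # add a character of `b`
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


def corpus_cer(pairs: Iterable[Tuple[str, str]]) -> float:
    """Total edit distance over total reference length for (pred, true) pairs."""
    pairs = list(pairs)
    errors = sum(levenshtein(pred, true) for pred, true in pairs)
    ref_chars = sum(len(true) for _, true in pairs)
    return errors / ref_chars if ref_chars else 0.0


print(corpus_cer([("kitten", "sitting"), ("abc", "abc")]))  # 0.3
```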

render_sample(n=0, include_alphabet=False)
Return type:

str

Parameters:
  • n (int)

  • include_alphabet (bool)

class pylelemmatize.CharConfusionMatrix(alphabet)

Bases: object

Parameters:

alphabet (LemmatizerBMP | str)

static edit_distance(s1, s2)

Compute the Levenshtein edit distance between two sequences.

This function calculates the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one sequence into the other. It also returns the dynamic programming (DP) matrix used to compute the distance.

Parameters:
  • s1 (np.ndarray) – The first sequence as a NumPy array.

  • s2 (np.ndarray) – The second sequence as a NumPy array.

Returns:

  • distance (int) – The Levenshtein edit distance between s1 and s2.

  • dp (np.ndarray) – The DP matrix used to compute the distance, where dp[i, j] represents the edit distance between the first i characters of s1 and the first j characters of s2.

Return type:

Tuple[int, ndarray]

Examples

>>> import numpy as np
>>> from pylelemmatize import CharConfusionMatrix
>>> s1 = np.array(['a', 'b', 'c'])
>>> s2 = np.array(['a', 'c', 'd'])
>>> distance, dp = CharConfusionMatrix.edit_distance(s1, s2)
>>> distance
2
>>> dp
array([[0, 1, 2, 3],
       [1, 0, 1, 2],
       [2, 1, 1, 2],
       [3, 2, 1, 2]])
backtrace_ed_matrix(input_seq, gt_seq, dp)

Backtrace the edit distance matrix to compute the alignment path, operation types, ground truth substitutions, and confusion matrix.

Parameters:
  • input_seq (np.ndarray) – The input sequence represented as an array of indices.

  • gt_seq (np.ndarray) – The ground truth sequence represented as an array of indices.

  • dp (np.ndarray) – The dynamic programming matrix containing the edit distances.

Returns:

  • path (np.ndarray) – The alignment path as an array of (input_index, gt_index) pairs.

  • operation_type (np.ndarray) – The sequence of operation types: 0 = match, 1 = substitution, 2 = deletion, 3 = insertion.

  • gt_sub_input (np.ndarray) – The ground truth sequence with substitutions applied.

  • cm (np.ndarray) – The confusion matrix representing the counts of matches, substitutions, insertions, and deletions. The matrix has dimensions (len(alphabet), len(alphabet)), where the first row/column represents insertions/deletions.

Return type:

Tuple[ndarray, ndarray, ndarray, ndarray]
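
The relationship between the DP matrix and the backtrace can be sketched in plain Python. The code below fills a Levenshtein matrix and then walks from the bottom-right cell back to the origin, emitting one operation per step with the codes listed above (0 match, 1 substitution, 2 deletion, 3 insertion). It is an illustration, not the library's code: which side counts as "deleted" versus "inserted", the preference for diagonal moves on ties, and the omission of the confusion-matrix update are all assumptions.

```python
from typing import List, Sequence, Tuple

MATCH, SUBSTITUTION, DELETION, INSERTION = 0, 1, 2, 3


def edit_distance_matrix(src: Sequence, dst: Sequence) -> List[List[int]]:
    """dp[i][j] = edit distance between src[:i] and dst[:j]."""
    dp = [[0] * (len(dst) + 1) for _ in range(len(src) + 1)]
    for i in range(len(src) + 1):
        dp[i][0] = i
    for j in range(len(dst) + 1):
        dp[0][j] = j
    for i in range(1, len(src) + 1):
        for j in range(1, len(dst) + 1):
            cost = 0 if src[i - 1] == dst[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + cost, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp


def backtrace(src: Sequence, dst: Sequence, dp: List[List[int]]) -> List[Tuple[int, int, int]]:
    """Return (src_index, dst_index, operation) steps in forward order."""
    i, j = len(src), len(dst)
    steps = []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if src[i - 1] == dst[j - 1] else 1
            if dp[i][j] == dp[i - 1][j - 1] + cost:  # diagonal move preferred on ties
                steps.append((i - 1, j - 1, MATCH if cost == 0 else SUBSTITUTION))
                i, j = i - 1, j - 1
                continue
        if i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            steps.append((i - 1, j, DELETION))       # src symbol with no counterpart
            i -= 1
        else:
            steps.append((i, j - 1, INSERTION))      # dst symbol with no counterpart
            j -= 1
    steps.reverse()
    return steps


dp = edit_distance_matrix("abc", "acd")
steps = backtrace("abc", "acd", dp)
print(dp[-1][-1])  # 2
print(steps)       # [(0, 0, 0), (1, 1, 1), (2, 2, 1)]
```

The number of non-match steps always equals dp[-1][-1], which is a quick sanity check for any backtrace implementation.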

ingest_textline_observation(pred_line, gt_line)

Processes a pair of predicted and ground truth text lines, computes the edit distance, and updates the confusion matrix.

This method performs the following steps:

  1. Converts the predicted and ground truth text lines into dense integer label sequences.

  2. Computes the edit distance and dynamic programming matrix between the two sequences.

  3. Performs a backtrace on the edit distance matrix to generate the ground truth substitution input and updates the confusion matrix.

  4. Returns the ground truth substitution input and the computed edit distance.

Parameters:
  • pred_line (str) – The predicted text line as a string.

  • gt_line (str) – The ground truth text line as a string.

Returns:

A tuple containing the ground truth substitution input as a string and the edit distance between the predicted and ground truth text lines.

Return type:

Tuple[str, int]

generate_random_substitution_sequences(seq)

Generate random substitution sequences based on a conditional probability matrix.

This method generates a sequence of random substitutions for the input sequence seq using the confusion matrix as conditional probability. Each output symbol is sampled from the conditional probabilities of the corresponding input symbol.

Parameters:

seq (np.ndarray) – Input sequence represented as a NumPy array of integers. Each integer corresponds to a symbol in the vocabulary.

Returns:

A NumPy array of the same shape as seq, where each element is a randomly substituted symbol based on the conditional probability matrix.

Return type:

np.ndarray

Examples

>>> import numpy as np
>>> cm = np.array([[0, 0.5, 0.5],
...                [0, 0.7, 0.3],
...                [0, 0.4, 0.6]])
>>> seq = np.array([1, 2, 1])
>>> augmenter = SubstitutionAugmenter(cm)
>>> out = augmenter.generate_random_substitution_sequences(seq)
>>> out.shape == seq.shape
True
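
The sampling step itself can be sketched with the standard library. Each input symbol selects a row of a conditional probability table, and the output symbol is drawn from that row. The orientation (rows indexed by input symbol, columns by output symbol), the function name `sample_substitutions`, and the explicit `rng` argument are assumptions made for this illustration.

```python
import random
from typing import List, Sequence


def sample_substitutions(seq: Sequence[int], cond_prob: Sequence[Sequence[float]],
                         rng: random.Random) -> List[int]:
    """Draw one output symbol per input symbol from cond_prob[input_symbol]."""
    symbols = range(len(cond_prob))
    return [rng.choices(symbols, weights=cond_prob[s], k=1)[0] for s in seq]


# cond_prob[i][j]: probability of emitting symbol j given input symbol i.
cond_prob = [
    [1.0, 0.0, 0.0],
    [0.0, 0.7, 0.3],
    [0.0, 0.4, 0.6],
]
out = sample_substitutions([1, 2, 1], cond_prob, random.Random(0))
print(len(out) == 3)                 # True
print(all(s in (1, 2) for s in out))  # True: symbol 0 has zero weight in rows 1 and 2
```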
get_self_supervision_textline(input_line)
Return type:

str

Parameters:

input_line (str)

save(file_path)
Parameters:

file_path (str | Path)

static load(file_path)
Return type:

CharConfusionMatrix

Parameters:

file_path (str | Path)

get_matrix()
Return type:

ndarray

distort_np_sequence(input_seq)
Return type:

ndarray

Parameters:

input_seq (ndarray)

distort_pt_sequence(input_seq)
Return type:

Tensor

Parameters:

input_seq (Tensor)

distort_string(input_str)
Return type:

str

Parameters:

input_str (str)

__call__(seq)

Call self as a function.

Return type:

Union[ndarray, Tensor, str]

Parameters:

seq (ndarray | Tensor | str)

class pylelemmatize.DemapperLSTM(input_mapper, output_mapper, hidden_sizes=[128, 128, 128], dropouts=0.0, directions=0, output_to_input_mapping=None)

Bases: Module

Parameters:
  • input_mapper (str | LemmatizerBMP)

  • output_mapper (str | LemmatizerBMP)

  • hidden_sizes (List[int])

  • dropouts (List[float] | float)

  • directions (Literal[-1, 0, 1] | List[Literal[-1, 0, 1]])

  • output_to_input_mapping (Dict[str, str] | None)

property input_size: int
property output_size: int
forward(bt_x)

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

Return type:

Tensor

Parameters:

bt_x (Tensor)

infer_str(src_str, device=None, return_confidence=False)
Return type:

str

Parameters:
  • src_str (str)

  • device (device | None)

  • return_confidence (bool)

is_compatible(other)
Return type:

bool

Parameters:

other (Any)

property hidden_sizes: List[int]
property dropout_list: List[float]
property epoch: int
save(path, args=None)
Parameters:
  • path (str)

  • args (Any | None)

classmethod resume(path, input_alphabet_str=None, output_alphabet_str=None, hidden_sizes=[128, 128, 128], dropouts=[0.1, 0.1, 0.1], resume_best_weights=False)
Return type:

DemapperLSTM

Parameters:
  • path (str)

  • input_alphabet_str (str | LemmatizerBMP | None)

  • output_alphabet_str (str | LemmatizerBMP | None)

  • hidden_sizes (List[int])

  • dropouts (List[float])

  • resume_best_weights (bool)

get_one2one_train_objects(lr)

Return the optimizer and criterion for training.

Return type:

Tuple[Optimizer, Module]

validate_one2one_epoch(valid_ds, criterion=None, batch_size=1, progress=True)
Return type:

Tuple[float, float]

Parameters:
  • valid_ds (Seq2SeqDs)

  • criterion (Module | None)

  • batch_size (int)

  • progress (bool)

train_one2one_epoch(train_ds, criterion, optimizer, batch_size=1, pseudo_batch_size=1, progress=True)
Return type:

Tuple[float, float]

Parameters:
  • train_ds (Seq2SeqDs)

  • criterion (Module)

  • optimizer (Optimizer)

  • batch_size (int)

  • pseudo_batch_size (int)

  • progress (bool)

Functions

pylelemmatize.char_similarity(a, b, symmetric=True)

Compute similarity score between two characters based on multiple heuristics.

Return type:

float

Parameters:
  • a (str)

  • b (str)

  • symmetric (bool)
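
The entry does not say which heuristics are combined. Purely as an illustration of what a heuristic character similarity can look like, the sketch below scores identity, case-insensitive equality, and equality after stripping combining marks. The function names, the choice of heuristics, and the score values are all invented for this example and should not be read as the behaviour of char_similarity.

```python
import unicodedata


def base_letter(ch: str) -> str:
    """Strip combining marks, e.g. U+00E9 (e with acute) becomes plain 'e'."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def toy_char_similarity(a: str, b: str) -> float:
    """Hypothetical score in [0, 1]; higher means more similar."""
    if a == b:
        return 1.0
    if a.casefold() == b.casefold():
        return 0.75   # differs only in case
    if base_letter(a).casefold() == base_letter(b).casefold():
        return 0.5    # differs only in diacritics (and possibly case)
    return 0.0


print(toy_char_similarity("a", "a"))        # 1.0
print(toy_char_similarity("a", "A"))        # 0.75
print(toy_char_similarity("\u00e9", "e"))   # 0.5
print(toy_char_similarity("a", "b"))        # 0.0
```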

pylelemmatize.fast_cer(pred, true)
Return type:

float

Parameters:
  • pred (str)

  • true (str)

pylelemmatize.fast_numpy_to_str(np_arr)
Return type:

str

Parameters:

np_arr (ndarray)

pylelemmatize.fast_str_to_numpy(s, dtype=numpy.uint16)
Return type:

ndarray

Parameters:

s (str)
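
The default dtype of uint16 suggests that strings are treated as sequences of 16-bit code units. One way to picture the two conversions, using the standard library's array module in place of NumPy, is to reinterpret the UTF-16 encoding of the string as unsigned 16-bit integers. This is an assumption about the representation, not a description of the library's code, and both helper names are hypothetical.

```python
import sys
from array import array

# Encode in the machine's native byte order so the raw bytes can be reinterpreted directly.
_CODEC = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"


def str_to_uint16(s: str) -> array:
    """One unsigned 16-bit value per UTF-16 code unit."""
    buf = array("H")
    buf.frombytes(s.encode(_CODEC))
    return buf


def uint16_to_str(buf: array) -> str:
    return buf.tobytes().decode(_CODEC)


codes = str_to_uint16("A\u00e9")
print(list(codes))           # [65, 233]
print(uint16_to_str(codes) == "A\u00e9")  # True
```

For BMP characters each code unit equals the code point; characters outside the BMP would occupy two units (a surrogate pair) in this representation.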

pylelemmatize.print_err(txt='Hello', correct=None, confidence=None, file=None)

Print text to stderr with color coding based on correctness and confidence.

Each character in the input text is colorized using ANSI escape codes. The foreground color is green for correct characters and red for incorrect ones. The background color interpolates from black (high confidence) to white (low confidence).

Parameters:
  • txt (str, optional) – The text to be printed. Defaults to “Hello”.

  • correct (list of bool, optional) – A list indicating whether each character in txt is correct (True) or incorrect (False). If None, all characters are assumed to be correct. Defaults to None.

  • confidence (list of float, optional) – A list of confidence values (between 0.0 and 1.0) for each character in txt. A value of 1.0 corresponds to high confidence (black background), and 0.0 corresponds to low confidence (white background). If None, all characters are assigned a confidence of 1.0. Defaults to None.

  • file (file-like object, optional) – A file-like object to which the output will be written. If None, the output is printed to the standard error. Defaults to None.

Return type:

str

Notes

This function uses ANSI escape codes for colorization, which may not be supported in all terminal environments.

Examples

>>> print_err("Test", correct=[True, False, True, True], confidence=[1.0, 0.5, 0.8, 1.0])
(Outputs colorized text to the terminal)
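
The colour scheme described above can be reproduced with ANSI escape sequences. The sketch below builds the escaped string per character: green or red foreground depending on correctness, and a grey background whose brightness grows as confidence drops. It uses 24-bit colour codes, which is an assumption; the library may emit different codes, and the function name `colorize` is hypothetical.

```python
import sys
from typing import List, Optional


def colorize(txt: str, correct: Optional[List[bool]] = None,
             confidence: Optional[List[float]] = None) -> str:
    """Return `txt` wrapped in ANSI colour codes, one pair of codes per character."""
    if correct is None:
        correct = [True] * len(txt)
    if confidence is None:
        confidence = [1.0] * len(txt)
    parts = []
    for ch, ok, conf in zip(txt, correct, confidence):
        fg = "0;255;0" if ok else "255;0;0"   # green if correct, red otherwise
        level = round(255 * (1.0 - conf))     # 1.0 -> black, 0.0 -> white
        parts.append(f"\x1b[38;2;{fg}m\x1b[48;2;{level};{level};{level}m{ch}")
    return "".join(parts) + "\x1b[0m"         # reset attributes at the end


print(colorize("Test", [True, False, True, True], [1.0, 0.5, 0.8, 1.0]), file=sys.stderr)
```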
pylelemmatize.extract_transcription_from_page_xml(xml_content, line_separator='\n', linesegment_separator='\t', ignore_deleted=True)

Extracts transcription from a PAGE XML document string.

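Parameters:
  • xml_content (str) – The PAGE XML content as a string.

  • line_separator (str) – The string placed between lines. Defaults to '\n'.

  • linesegment_separator (str) – The string used to join the segments of a <TextLine>. Defaults to '\t'.

  • ignore_deleted (bool) – If True, text within <del> tags is ignored. Defaults to True.

Returns:

The full transcription, with the segments of each <TextLine> joined by linesegment_separator and lines separated by line_separator.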

Return type:

str
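
For readers unfamiliar with PAGE XML, the sketch below shows the basic extraction with the standard library: it visits every TextLine element and reads the Unicode text of its line-level TextEquiv. It is a simplified illustration that ignores line segments and <del> handling, so it does not replicate the function documented above; the name `page_xml_to_text` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Drop the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def page_xml_to_text(xml_content: str, line_separator: str = "\n") -> str:
    """Join the line-level transcription of every TextLine in document order."""
    root = ET.fromstring(xml_content)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        # Look only at direct children so word-level TextEquiv elements are not picked up.
        for child in elem:
            if local_name(child.tag) == "TextEquiv":
                for uni in child:
                    if local_name(uni.tag) == "Unicode":
                        lines.append(uni.text or "")
                break
    return line_separator.join(lines)


sample = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page>
    <TextRegion>
      <TextLine><TextEquiv><Unicode>first line</Unicode></TextEquiv></TextLine>
      <TextLine><TextEquiv><Unicode>second line</Unicode></TextEquiv></TextLine>
    </TextRegion>
  </Page>
</PcGts>"""
print(page_xml_to_text(sample))
```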