pylelemmatize
Public API (summary)
Classes
- class pylelemmatize.AbstractLemmatizer(unicode_normalization='Dense', unknown_chr='�')[source]
Bases:
ABC
Abstract base class for lemmatizers that map characters from a source alphabet to a destination alphabet.
- Parameters:
unicode_normalization (Literal['Dense', 'Composite', None])
unknown_chr (str)
- src_alphabet_str
The source alphabet string.
- Type:
str
- dst_alphabet_str
The destination alphabet string.
- Type:
str
- unknown_chr
The character used for unknown mappings. Default is “�”.
- Type:
str
- normalize_unicode
Function to normalize Unicode strings.
- Type:
Callable[[str], str]
- classmethod fast_alphabet_extraction(text)[source]
- Return type:
str
- Parameters:
text (str)
- property unicode_normalization: Literal['Dense', 'Composite', None]
- abstractmethod __call__(text)[source]
Convert text to the alphabet representation.
- Return type:
str
- Parameters:
text (str)
- abstract property src_alphabet_str: str
- abstract property dst_alphabet_str: str
- property unknown_chr: str
- property alphabet_tsv: str
- property mapping_tsv: str
- get_unigram(text)[source]
- Return type:
Tuple[ndarray, ndarray, ndarray]
- Parameters:
text (str)
- get_cer(pred, true)[source]
- Return type:
float
- Parameters:
pred (str)
true (str)
- get_encoding_information_loss(text)[source]
- Return type:
float
- Parameters:
text (str)
- class pylelemmatize.LemmatizerBMP(mapping_dict={}, unknown_chr='�', unicode_normalization='Dense')[source]
Bases:
GenericLemmatizer
- Parameters:
mapping_dict (Dict[str, str])
unknown_chr (str)
unicode_normalization (Literal['Dense', 'Composite', None])
- static alphabet_in_bmp(alphabet)[source]
Check if all characters in the given alphabet are within the BMP (Basic Multilingual Plane).
- Parameters:
alphabet (Optional[str]) – A string containing the alphabet to check. If None, the method returns True.
- Returns:
True if all characters are within the BMP, False otherwise.
- Return type:
bool
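The BMP covers the code points U+0000 through U+FFFF, so the check described above can be sketched in a few lines. This is a minimal stand-alone illustration, not the library's implementation; the function name `alphabet_in_bmp_sketch` is hypothetical.

```python
def alphabet_in_bmp_sketch(alphabet):
    """Return True if every character lies in the BMP (U+0000..U+FFFF).

    Mirrors the documented behaviour that None yields True.
    """
    if alphabet is None:
        return True
    return all(ord(ch) <= 0xFFFF for ch in alphabet)


# Accented Latin letters are inside the BMP.
print(alphabet_in_bmp_sketch("abcäöü"))
# U+1F600 (an emoji) lies in a supplementary plane, outside the BMP.
print(alphabet_in_bmp_sketch("a\U0001F600"))
```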
- __call__(text)[source]
Transform the input text using the lemmatizer.
- Parameters:
text (str) – The input text to transform.
- Returns:
The transformed text.
- Return type:
str
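As a rough illustration of character-level mapping with a fallback symbol, the sketch below applies a plain dictionary to each character. It assumes that unmapped characters are replaced by `unknown_chr`; the helper `apply_mapping` and the demo mapping are hypothetical, and Unicode normalization is left out.

```python
def apply_mapping(text, mapping, unknown_chr="\ufffd"):
    """Map each character through `mapping`; unmapped characters become `unknown_chr`."""
    return "".join(mapping.get(ch, unknown_chr) for ch in text)


# Hypothetical mapping that folds two accented letters onto ASCII.
demo_mapping = {"a": "a", "b": "b", "c": "c", "á": "a", "ç": "c"}

mapped = apply_mapping("abçáz", demo_mapping)
print(mapped)  # "z" has no entry, so it is replaced by the unknown character
```

Because every input character yields exactly one output character, the output has the same length as the input.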
- str_to_intlabel_seq(text)[source]
Convert a string to a sequence of integer labels.
- Parameters:
text (str) – The input string to convert.
- Returns:
A NumPy array of integer labels representing the input string.
- Return type:
np.ndarray
- intlabel_seq_to_str(dense_np_text)[source]
Convert a sequence of integer labels back to a string.
- Parameters:
dense_np_text (np.ndarray) – A NumPy array of integer labels to convert.
- Returns:
The reconstructed string.
- Return type:
str
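The round trip between strings and integer labels can be sketched without NumPy as follows. The labelling scheme is an assumption made for illustration only: label 0 stands for the unknown character and labels 1..N follow the order of the alphabet string. Plain lists stand in for NumPy arrays, and both helper names are hypothetical.

```python
def str_to_labels(text, alphabet, unknown_label=0):
    """Encode each character as its 1-based position in `alphabet` (0 = unknown)."""
    index = {ch: i + 1 for i, ch in enumerate(alphabet)}
    return [index.get(ch, unknown_label) for ch in text]


def labels_to_str(labels, alphabet, unknown_chr="\ufffd"):
    """Decode labels back to characters; label 0 decodes to `unknown_chr`."""
    table = unknown_chr + alphabet
    return "".join(table[label] for label in labels)


alphabet = "abc"
labels = str_to_labels("cabx", alphabet)
restored = labels_to_str(labels, alphabet)
print(labels)
print(restored)
```

The round trip is lossless only for characters that are part of the alphabet; anything else comes back as the unknown character.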
- get_unigram(text)[source]
Compute unigram statistics for the input text.
- Parameters:
text (str) – The input text to analyze.
- Returns:
values : np.ndarray Unique integer labels in the text.
counts : np.ndarray Counts of each unique label.
labels : np.ndarray Mapping of integer labels to their corresponding characters.
- Return type:
Tuple[np.ndarray, np.ndarray, np.ndarray]
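The three returned arrays can be pictured with a small stdlib sketch. It assumes, for illustration only, that labels are assigned in sorted character order; lists stand in for NumPy arrays and `unigram_sketch` is a hypothetical name.

```python
from collections import Counter


def unigram_sketch(text):
    """Return (values, counts, labels) for the characters occurring in `text`.

    labels[i] is the character that integer label i stands for,
    values lists the labels that occur, counts gives their frequencies.
    """
    counter = Counter(text)
    labels = sorted(counter)                  # label i <-> labels[i]
    values = list(range(len(labels)))         # every label occurs at least once here
    counts = [counter[ch] for ch in labels]
    return values, counts, labels


values, counts, labels = unigram_sketch("banana")
for value, count in zip(values, counts):
    print(value, repr(labels[value]), count)
```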
- str_to_onehot(text, time_first=True)[source]
Convert a string to a one-hot encoded representation.
- Parameters:
text (str) – The input string to convert.
time_first (bool, optional) – If True, the output array will have shape (T, C), where T is the length of the string and C is the number of unique characters. If False, the output will have shape (C, T). Defaults to True.
- Returns:
A one-hot encoded NumPy array representing the input string.
- Return type:
np.ndarray
- onehot_to_str(onehot, time_first=True)[source]
Convert a one-hot encoded representation back to a string.
- Parameters:
onehot (np.ndarray) – A one-hot encoded NumPy array to convert.
time_first (bool, optional) – If True, the input array is expected to have shape (T, C). If False, it is expected to have shape (C, T). Defaults to True.
- Returns:
The reconstructed string.
- Return type:
str
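The two layouts selected by `time_first` differ only by a transpose. The sketch below shows this with nested lists standing in for NumPy arrays and integer labels standing in for characters; `to_onehot` and `from_onehot` are hypothetical helpers, not library functions.

```python
def to_onehot(labels, num_classes, time_first=True):
    """Encode integer labels as one-hot rows: (T, C) if time_first, else (C, T)."""
    rows = [[1 if c == label else 0 for c in range(num_classes)] for label in labels]
    if time_first:
        return rows
    return [list(col) for col in zip(*rows)]  # transpose to (C, T)


def from_onehot(onehot, time_first=True):
    """Decode a one-hot array back to integer labels via a per-step argmax."""
    rows = onehot if time_first else [list(col) for col in zip(*onehot)]
    return [row.index(max(row)) for row in rows]


labels = [2, 0, 1, 2]
tc = to_onehot(labels, num_classes=3)                    # shape (T, C)
ct = to_onehot(labels, num_classes=3, time_first=False)  # shape (C, T)
print(len(tc), len(tc[0]))
print(len(ct), len(ct[0]))
```

The same `time_first` value has to be used for encoding and decoding, otherwise time steps and classes are swapped.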
- property dst_alphabet_str: str
Get the destination alphabet as a string.
- Returns:
The destination alphabet string.
- Return type:
str
- property src_alphabet_str: str
Get the source alphabet as a string.
- Returns:
The source alphabet string.
- Return type:
str
- class pylelemmatize.GenericLemmatizer(mapping_dict={}, unknown_chr='�', unicode_normalization='Dense')[source]
Bases:
AbstractLemmatizer
- Parameters:
unknown_chr (str)
unicode_normalization (Literal['Dense', 'Composed', None])
- classmethod from_alphabet_mapping(src_alphabet_str, dst_alphabet_str=None, unknown_chr='�', override_map=None, min_similarity=0.25, verbose=0)[source]
- Return type:
- Parameters:
src_alphabet_str (str)
dst_alphabet_str (str | None)
unknown_chr (str)
override_map (Dict[str, str] | None)
min_similarity (float)
verbose (int)
- copy_removing_unused_inputs(txt)[source]
- Return type:
Any
- Parameters:
txt (str)
- len()[source]
Return the size of the destination alphabet.
- Return type:
int
- __call__(text)[source]
Convert text to the alphabet representation.
- Return type:
str
- Parameters:
text (str)
- property src_alphabet_str: str
- property dst_alphabet_str: str
- class pylelemmatize.Seq2SeqDs(text_blocks, input_mapper=None, output_mapper=None, min_input_seqlen=50, min_output_seqlen=50, one2one_mapping=None, crop_to_seqlen=None, input_is_onehot=False, output_is_onehot=False)[source]
Bases:
object
- Parameters:
text_blocks (Tuple[List[str], List[str]])
input_mapper (LemmatizerBMP | None)
output_mapper (LemmatizerBMP | None)
min_input_seqlen (int)
min_output_seqlen (int)
one2one_mapping (bool | None)
crop_to_seqlen (int | None)
input_is_onehot (bool)
output_is_onehot (bool)
- static load_icdar2019_parallel_txt_corpus(input_paths, max_insertions, min_length, max_length)[source]
- Return type:
List[Tuple[List[str], List[str]]]
- Parameters:
input_paths (str | List[str])
max_insertions (int)
min_length (int)
max_length (int)
- static load_parallel_txt_corpus(input_glob, output_glob, check_integrity='cleanup')[source]
- Return type:
List[Tuple[List[str], List[str]]]
- Parameters:
input_glob (str | List[str])
output_glob (str | List[str])
check_integrity (Literal['cleanup', 'raise', 'ignore'])
- static from_parallel_txt_corpus(input_glob, output_glob, **kwargs)[source]
- Return type:
- Parameters:
input_glob (str | List[str])
output_glob (str | List[str])
- static create_selfsupervised_ds(corpus, mapper, mapped_is_input=True, add_all_occuring_to_input=True, **kwargs)[source]
- Return type:
- Parameters:
corpus (List[str])
mapper (LemmatizerBMP)
mapped_is_input (bool)
add_all_occuring_to_input (bool)
- shuffle()[source]
- Return type:
None
- split(train_ratio=0.8, shuffle=True)[source]
- compute_ds_CER(use_editdistance=False)[source]
Compute the Character Error Rate (CER) of the dataset.
- Return type:
float
- Parameters:
use_editdistance (bool)
- render_sample(n=0, include_alphabet=False)[source]
- Return type:
str
- Parameters:
n (int)
include_alphabet (bool)
- class pylelemmatize.CharConfusionMatrix(alphabet)[source]
Bases:
object
- Parameters:
alphabet (LemmatizerBMP | str)
- static edit_distance(s1, s2)[source]
Compute the Levenshtein edit distance between two sequences.
This function calculates the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one sequence into the other. It also returns the dynamic programming (DP) matrix used to compute the distance.
- Parameters:
s1 (np.ndarray) – The first sequence as a NumPy array.
s2 (np.ndarray) – The second sequence as a NumPy array.
- Returns:
The Levenshtein edit distance between s1 and s2, and the DP matrix used to compute it, where dp[i, j] is the edit distance between the first i characters of s1 and the first j characters of s2.
- Return type:
Tuple[int, ndarray]
Examples
>>> import numpy as np
>>> from pylelemmatize import CharConfusionMatrix
>>> s1 = np.array(['a', 'b', 'c'])
>>> s2 = np.array(['a', 'c', 'd'])
>>> distance, dp = CharConfusionMatrix.edit_distance(s1, s2)
>>> distance
2
>>> dp
array([[0, 1, 2, 3],
       [1, 0, 1, 2],
       [2, 1, 1, 2],
       [3, 2, 1, 2]])
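For readers who want to see how such a DP matrix is filled, here is a minimal pure-Python sketch of the standard Levenshtein recurrence. It uses nested lists instead of NumPy arrays and is not the library's implementation; `edit_distance_sketch` is a hypothetical name.

```python
def edit_distance_sketch(s1, s2):
    """Return (distance, dp), where dp[i][j] is the distance between s1[:i] and s2[:j]."""
    rows, cols = len(s1) + 1, len(s2) + 1
    dp = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dp[i][0] = i          # turning s1[:i] into "" takes i deletions
    for j in range(cols):
        dp[0][j] = j          # turning "" into s2[:j] takes j insertions
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # match or substitution
            )
    return dp[-1][-1], dp


distance, dp = edit_distance_sketch("kitten", "sitting")
print(distance)
```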
- backtrace_ed_matrix(input_seq, gt_seq, dp)[source]
Backtrace the edit distance matrix to compute the alignment path, operation types, ground truth substitutions, and confusion matrix.
- Parameters:
input_seq (np.ndarray) – The input sequence represented as an array of indices.
gt_seq (np.ndarray) – The ground truth sequence represented as an array of indices.
dp (np.ndarray) – The dynamic programming matrix containing the edit distances.
- Returns:
path (np.ndarray) – The alignment path as an array of (input_index, gt_index) pairs.
operation_type (np.ndarray) – The sequence of operation types: 0 = match, 1 = substitution, 2 = deletion, 3 = insertion.
gt_sub_input (np.ndarray) – The ground truth sequence with substitutions applied.
cm (np.ndarray) – The confusion matrix representing the counts of matches, substitutions, insertions, and deletions. The matrix has dimensions (len(alphabet), len(alphabet)), where the first row/column represents insertions/deletions.
- Return type:
Tuple[ndarray, ndarray, ndarray, ndarray]
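A backtrace of this kind walks from the bottom-right corner of the DP matrix back to the origin and records one operation per step. The sketch below returns only the operation codes listed above. Two things are assumptions of this sketch rather than documented behaviour: a deletion is taken to mean an input symbol with no ground truth counterpart, and ties are broken in favour of the diagonal move. All function names are hypothetical.

```python
MATCH, SUBSTITUTION, DELETION, INSERTION = 0, 1, 2, 3


def levenshtein_matrix(a, b):
    """Fill the standard Levenshtein DP matrix for sequences a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i
    for j in range(len(b) + 1):
        dp[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost)
    return dp


def backtrace_sketch(input_seq, gt_seq, dp):
    """Recover one optimal sequence of operation codes from the DP matrix."""
    i, j = len(input_seq), len(gt_seq)
    ops = []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if input_seq[i - 1] == gt_seq[j - 1] else 1
            if dp[i][j] == dp[i - 1][j - 1] + cost:
                ops.append(MATCH if cost == 0 else SUBSTITUTION)
                i, j = i - 1, j - 1
                continue
        if i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append(DELETION)   # input symbol without ground truth counterpart
            i -= 1
        else:
            ops.append(INSERTION)  # ground truth symbol missing from the input
            j -= 1
    ops.reverse()
    return ops


dp = levenshtein_matrix("abc", "acd")
ops = backtrace_sketch("abc", "acd", dp)
print(ops)
```

Whatever tie-breaking rule is used, the number of non-match operations on the recovered path equals the edit distance in the bottom-right cell.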
- ingest_textline_observation(pred_line, gt_line)[source]
Processes a pair of predicted and ground truth text lines, computes the edit distance, and updates the confusion matrix.
This method performs the following steps:
1. Converts the predicted and ground truth text lines into dense integer label sequences.
2. Computes the edit distance and dynamic programming matrix between the two sequences.
3. Performs a backtrace on the edit distance matrix to generate the ground truth substitution input and updates the confusion matrix.
4. Returns the ground truth substitution input and the computed edit distance.
- Parameters:
pred_line (str) – The predicted text line as a string.
gt_line (str) – The ground truth text line as a string.
- Returns:
A tuple containing the ground truth substitution input as a string and the edit distance between the predicted and ground truth text lines.
- Return type:
Tuple[str, int]
- generate_random_substitution_sequences(seq)[source]
Generate random substitution sequences based on a conditional probability matrix.
This method generates a sequence of random substitutions for the input sequence seq, using the confusion matrix as conditional probabilities. Each output symbol is sampled from the conditional probabilities of the corresponding input symbol.
- Parameters:
seq (np.ndarray) – Input sequence represented as a NumPy array of integers. Each integer corresponds to a symbol in the vocabulary.
- Returns:
A NumPy array of the same shape as seq, where each element is a randomly substituted symbol based on the conditional probability matrix.
- Return type:
np.ndarray
Examples
>>> import numpy as np
>>> cm = np.array([[0, 0.5, 0.5],
...                [0, 0.7, 0.3],
...                [0, 0.4, 0.6]])
>>> seq = np.array([1, 2, 1])
>>> augmenter = SubstitutionAugmenter(cm)
>>> out = augmenter.generate_random_substitution_sequences(seq)
>>> out.shape == seq.shape
True
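The sampling idea can be sketched with the standard library alone. The sketch assumes that row i of the matrix holds the (unnormalized) probabilities of each output symbol given input symbol i; that orientation, the fallback for all-zero rows, and the name `sample_substitutions` are assumptions made for illustration.

```python
import random


def sample_substitutions(seq, cm, rng=None):
    """Replace each symbol in `seq` by a symbol drawn from its row of `cm`."""
    rng = rng or random.Random()
    num_symbols = len(cm)
    out = []
    for symbol in seq:
        weights = cm[symbol]
        if sum(weights) == 0:
            out.append(symbol)  # no statistics for this symbol: keep it unchanged
        else:
            out.append(rng.choices(range(num_symbols), weights=weights, k=1)[0])
    return out


cm = [[0, 0.5, 0.5],
      [0, 0.7, 0.3],
      [0, 0.4, 0.6]]
seq = [1, 2, 1]
out = sample_substitutions(seq, cm, rng=random.Random(0))
print(len(out) == len(seq))
```

Passing a seeded `random.Random` instance keeps the augmentation reproducible across runs.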
- get_self_supervision_textline(input_line)[source]
- Return type:
str
- Parameters:
input_line (str)
- save(file_path)[source]
- Parameters:
file_path (str | Path)
- static load(file_path)[source]
- Return type:
- Parameters:
file_path (str | Path)
- get_matrix()[source]
- Return type:
ndarray
- distort_np_sequence(input_seq)[source]
- Return type:
ndarray
- Parameters:
input_seq (ndarray)
- distort_pt_sequence(input_seq)[source]
- Return type:
Tensor
- Parameters:
input_seq (Tensor)
- distort_string(input_str)[source]
- Return type:
str
- Parameters:
input_str (str)
- __call__(seq)[source]
Call self as a function.
- Return type:
Union[ndarray, Tensor, str]
- Parameters:
seq (ndarray | Tensor | str)
- class pylelemmatize.DemapperLSTM(input_mapper, output_mapper, hidden_sizes=[128, 128, 128], dropouts=0.0, directions=0, output_to_input_mapping=None)[source]
Bases:
Module
- Parameters:
input_mapper (str | LemmatizerBMP)
output_mapper (str | LemmatizerBMP)
hidden_sizes (List[int])
dropouts (List[float] | float)
directions (Literal[-1, 0, 1] | List[Literal[-1, 0, 1]])
output_to_input_mapping (Dict[str, str] | None)
- property input_size: int
- property output_size: int
- forward(bt_x)[source]
Define the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
- Return type:
Tensor
- Parameters:
bt_x (Tensor)
- infer_str(src_str, device=None, return_confidence=False)[source]
- Return type:
str
- Parameters:
src_str (str)
device (device | None)
return_confidence (bool)
- is_compatible(other)[source]
- Return type:
bool
- Parameters:
other (Any)
- property hidden_sizes: List[int]
- property dropout_list: List[float]
- property epoch: int
- save(path, args=None)[source]
- Parameters:
path (str)
args (Any | None)
- classmethod resume(path, input_alphabet_str=None, output_alphabet_str=None, hidden_sizes=[128, 128, 128], dropouts=[0.1, 0.1, 0.1], resume_best_weights=False)[source]
- Return type:
- Parameters:
path (str)
input_alphabet_str (str | LemmatizerBMP | None)
output_alphabet_str (str | LemmatizerBMP | None)
hidden_sizes (List[int])
dropouts (List[float])
resume_best_weights (bool)
- get_one2one_train_objects(lr)[source]
Return the optimizer and criterion for training.
- Return type:
Tuple[Optimizer,Module]
Functions
- pylelemmatize.char_similarity(a, b, symmetric=True)[source]
Compute similarity score between two characters based on multiple heuristics.
- Return type:
float
- Parameters:
a (str)
b (str)
symmetric (bool)
- pylelemmatize.fast_cer(pred, true)[source]
- Return type:
float
- Parameters:
pred (str)
true (str)
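Character error rate is commonly defined as the Levenshtein distance between prediction and reference divided by the reference length. The sketch below follows that common definition; whether `fast_cer` normalizes in exactly this way is an assumption, and `cer_sketch` is a hypothetical name.

```python
def cer_sketch(pred, true):
    """Levenshtein distance between pred and true, divided by len(true)."""
    previous = list(range(len(true) + 1))  # distances from the empty prefix of pred
    for i, p in enumerate(pred, start=1):
        current = [i]
        for j, t in enumerate(true, start=1):
            cost = 0 if p == t else 1
            current.append(min(previous[j] + 1,          # drop a predicted character
                               current[j - 1] + 1,       # add a missing character
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    distance = previous[-1]
    return distance / max(len(true), 1)  # guard against an empty reference


print(cer_sketch("abxd", "abcd"))
```

Keeping only two rows of the DP matrix is enough here because just the final distance is needed, not the alignment.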
- pylelemmatize.fast_numpy_to_str(np_arr)[source]
- Return type:
str
- Parameters:
np_arr (ndarray)
- pylelemmatize.fast_str_to_numpy(s, dtype=<class 'numpy.uint16'>)[source]
- Return type:
ndarray
- Parameters:
s (str)
- pylelemmatize.print_err(txt='Hello', correct=None, confidence=None, file=None)[source]
Print text to stderr with color coding based on correctness and confidence.
Each character in the input text is colorized using ANSI escape codes. The foreground color is green for correct characters and red for incorrect ones. The background color interpolates from black (high confidence) to white (low confidence).
- Parameters:
txt (str, optional) – The text to be printed. Defaults to “Hello”.
correct (list of bool, optional) – A list indicating whether each character in txt is correct (True) or incorrect (False). If None, all characters are assumed to be correct. Defaults to None.
confidence (list of float, optional) – A list of confidence values (between 0.0 and 1.0) for each character in txt. A value of 1.0 corresponds to high confidence (black background), and 0.0 corresponds to low confidence (white background). If None, all characters are assigned a confidence of 1.0. Defaults to None.
file (file-like object, optional) – A file-like object to which the output will be written. If None, the output is printed to the standard error. Defaults to None.
- Return type:
str
Notes
This function uses ANSI escape codes for colorization, which may not be supported in all terminal environments.
Examples
>>> print_err("Test", correct=[True, False, True, True], confidence=[1.0, 0.5, 0.8, 1.0])
(Outputs colorized text to the terminal)
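The colour scheme described above can be reproduced with 24-bit ANSI SGR sequences. The sketch below is a stand-alone illustration, not the library's code: the helper `colorize`, the exact RGB values, and the linear gray interpolation are assumptions.

```python
import sys


def colorize(txt, correct=None, confidence=None):
    """Wrap each character in ANSI codes: green/red foreground, gray-scale background."""
    correct = [True] * len(txt) if correct is None else correct
    confidence = [1.0] * len(txt) if confidence is None else confidence
    pieces = []
    for ch, ok, conf in zip(txt, correct, confidence):
        fg = "38;2;0;255;0" if ok else "38;2;255;0;0"
        conf = min(max(conf, 0.0), 1.0)       # clamp to [0, 1]
        gray = round(255 * (1.0 - conf))      # 1.0 -> black, 0.0 -> white
        bg = f"48;2;{gray};{gray};{gray}"
        pieces.append(f"\x1b[{fg};{bg}m{ch}")
    return "".join(pieces) + "\x1b[0m"        # reset attributes at the end


colored = colorize("Test", correct=[True, False, True, True],
                   confidence=[1.0, 0.5, 0.8, 1.0])
print(colored, file=sys.stderr)
```

Terminals without true-colour support may render these sequences incorrectly or print them verbatim.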
- pylelemmatize.extract_transcription_from_page_xml(xml_content, line_separator='\n', linesegment_separator='\t', ignore_deleted=True)[source]
Extracts transcription from a PAGE XML document string.
- Parameters:
xml_content (str) – The PAGE XML content as a string.
ignore_deleted (bool) – If True, text within <del> tags will be ignored.
- Returns:
The full transcription. With the default separators, the parts of each <TextLine> are joined by tabs and lines are separated by newlines.
- Return type:
str
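To make the input format concrete, the sketch below pulls line-level text out of a small PAGE XML string with the standard library. It is a simplified illustration under stated assumptions: it reads only the first TextEquiv/Unicode child of each TextLine, matches elements by local name so that any PAGE schema version works, and does not handle `<del>` tags or line segments. The helpers `local_name` and `extract_lines` are hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def extract_lines(xml_content, line_separator="\n"):
    """Join the line-level transcription of every TextLine element."""
    root = ET.fromstring(xml_content)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        # Look only at direct children so word-level TextEquiv entries are skipped.
        for child in element:
            if local_name(child.tag) == "TextEquiv":
                unicode_el = next(
                    (g for g in child if local_name(g.tag) == "Unicode"), None
                )
                if unicode_el is not None:
                    lines.append(unicode_el.text or "")
                break
    return line_separator.join(lines)


sample = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page>
    <TextRegion id="r1">
      <TextLine id="l1"><TextEquiv><Unicode>first line</Unicode></TextEquiv></TextLine>
      <TextLine id="l2"><TextEquiv><Unicode>second line</Unicode></TextEquiv></TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

text = extract_lines(sample)
print(text)
```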