pylelemmatize.CharConfusionMatrix

class pylelemmatize.CharConfusionMatrix(alphabet)[source]

Bases: object

Parameters:

alphabet (LemmatizerBMP | str)

__init__(alphabet)[source]
Parameters:

alphabet (LemmatizerBMP | str)

Methods

__init__(alphabet)

backtrace_ed_matrix(input_seq, gt_seq, dp)

Backtraces the edit distance matrix to compute the alignment path, operation types, ground truth substitutions, and confusion matrix.

distort_np_sequence(input_seq)

distort_pt_sequence(input_seq)

distort_string(input_str)

edit_distance(s1, s2)

Compute the Levenshtein edit distance between two sequences.

generate_random_substitution_sequences(seq)

Generate random substitution sequences based on a conditional probability matrix.

get_matrix()

get_self_supervision_textline(input_line)

ingest_textline_observation(pred_line, gt_line)

Processes a pair of predicted and ground truth text lines, computes the edit distance, and updates the confusion matrix.

load(file_path)

save(file_path)

static edit_distance(s1, s2)[source]

Compute the Levenshtein edit distance between two sequences.

This function calculates the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one sequence into the other. It also returns the dynamic programming (DP) matrix used to compute the distance.

Parameters:
  • s1 (np.ndarray) – The first sequence as a NumPy array.

  • s2 (np.ndarray) – The second sequence as a NumPy array.

Returns:

  • distance (int) – The Levenshtein edit distance between s1 and s2.

  • dp (np.ndarray) – The DP matrix used to compute the distance, where dp[i, j] represents the edit distance between the first i characters of s1 and the first j characters of s2.

Return type:

Tuple[int, ndarray]

Examples

>>> import numpy as np
>>> from pylelemmatize import CharConfusionMatrix
>>> s1 = np.array(['a', 'b', 'c'])
>>> s2 = np.array(['a', 'c', 'd'])
>>> distance, dp = CharConfusionMatrix.edit_distance(s1, s2)
>>> distance
2
>>> dp
array([[0, 1, 2, 3],
       [1, 0, 1, 2],
       [2, 1, 1, 2],
       [3, 2, 1, 2]])
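
The recurrence behind the DP matrix can be sketched independently of the library. The function below is a minimal standalone illustration, not the library's implementation; the name levenshtein_with_matrix is hypothetical, and the inline labels "deletion" and "insertion" are a naming assumption of the sketch.

```python
import numpy as np

def levenshtein_with_matrix(s1, s2):
    """Return (distance, dp) for two 1-D sequences (illustrative sketch)."""
    n, m = len(s1), len(s2)
    dp = np.zeros((n + 1, m + 1), dtype=int)
    dp[:, 0] = np.arange(n + 1)  # turning a prefix of s1 into the empty sequence
    dp[0, :] = np.arange(m + 1)  # turning the empty sequence into a prefix of s2
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i, j] = min(
                dp[i - 1, j] + 1,         # drop s1[i-1] (called deletion here)
                dp[i, j - 1] + 1,         # add s2[j-1] (called insertion here)
                dp[i - 1, j - 1] + cost,  # match or substitution
            )
    return int(dp[n, m]), dp

distance, dp = levenshtein_with_matrix(np.array(list("abc")), np.array(list("acd")))
print(distance)
print(dp)
```

The bottom-right cell dp[len(s1), len(s2)] holds the final distance, and every other cell holds the distance between the corresponding prefixes.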
backtrace_ed_matrix(input_seq, gt_seq, dp)[source]

Backtraces the edit distance matrix to compute the alignment path, operation types, ground truth substitutions, and confusion matrix.

Parameters:
  • input_seq (np.ndarray) – The input sequence represented as an array of indices.

  • gt_seq (np.ndarray) – The ground truth sequence represented as an array of indices.

  • dp (np.ndarray) – The dynamic programming matrix containing the edit distances.

Return type:

Tuple[ndarray, ndarray, ndarray, ndarray]

Returns:

  • path (np.ndarray) – The alignment path as an array of (input_index, gt_index) pairs.

  • operation_type (np.ndarray) – The sequence of operation types:

      - 0: Match
      - 1: Substitution
      - 2: Deletion
      - 3: Insertion

  • gt_sub_input (np.ndarray) – The ground truth sequence with substitutions applied.

  • cm (np.ndarray) – The confusion matrix representing the counts of matches, substitutions, insertions, and deletions. The matrix has dimensions (len(alphabet), len(alphabet)), where the first row/column represents insertions/deletions.
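
A backtrace over an edit distance matrix can be sketched as follows. This is a minimal standalone illustration that recovers only the path and the operation codes; it is not the library's implementation. The function name backtrace is hypothetical. Two further assumptions of the sketch: a step that consumes only an input symbol counts as a deletion and a step that consumes only a ground truth symbol counts as an insertion, and the missing side of such a step is recorded as -1 in the path.

```python
import numpy as np

# Operation codes as listed in the Returns section.
MATCH, SUBSTITUTION, DELETION, INSERTION = 0, 1, 2, 3

def backtrace(input_seq, gt_seq, dp):
    """Walk dp from the bottom-right cell back to (0, 0); return (path, ops)."""
    i, j = len(input_seq), len(gt_seq)
    path, ops = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if input_seq[i - 1] == gt_seq[j - 1] else 1
            if dp[i, j] == dp[i - 1, j - 1] + cost:
                # Diagonal step: both sequences advance.
                path.append((i - 1, j - 1))
                ops.append(MATCH if cost == 0 else SUBSTITUTION)
                i, j = i - 1, j - 1
                continue
        if i > 0 and dp[i, j] == dp[i - 1, j] + 1:
            # Vertical step: only the input advances (assumed: deletion).
            path.append((i - 1, -1))
            ops.append(DELETION)
            i -= 1
        else:
            # Horizontal step: only the ground truth advances (assumed: insertion).
            path.append((-1, j - 1))
            ops.append(INSERTION)
            j -= 1
    # Steps were collected end-to-start, so reverse them.
    return np.array(path[::-1]), np.array(ops[::-1])

# DP matrix for input [1, 2, 3] against ground truth [1, 3, 4].
dp = np.array([[0, 1, 2, 3],
               [1, 0, 1, 2],
               [2, 1, 1, 2],
               [3, 2, 1, 2]])
inp = np.array([1, 2, 3])
gt_full = np.array([1, 3, 4])

path_a, ops_a = backtrace(inp, gt_full, dp)
# The first three columns of dp are the matrix for ground truth [1, 3].
path_b, ops_b = backtrace(inp, gt_full[:2], dp[:, :3])
print(ops_a, ops_b)
```

The sketch prefers diagonal steps when several predecessors are equally cheap, so ties are resolved in favour of matches and substitutions; other tie-breaking orders yield different but equally short alignments.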

ingest_textline_observation(pred_line, gt_line)[source]

Processes a pair of predicted and ground truth text lines, computes the edit distance, and updates the confusion matrix.

This method performs the following steps:

  1. Converts the predicted and ground truth text lines into dense integer label sequences.
  2. Computes the edit distance and dynamic programming matrix between the two sequences.
  3. Performs a backtrace on the edit distance matrix to generate the ground truth substitution input and updates the confusion matrix.
  4. Returns the ground truth substitution input and the computed edit distance.

Parameters:
  • pred_line (str) – The predicted text line as a string.

  • gt_line (str) – The ground truth text line as a string.

Returns:

A tuple containing:

  - The ground truth substitution input as a string.
  - The edit distance between the predicted and ground truth text lines.

Return type:

Tuple[str, int]
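
The label conversion and the confusion matrix update can be illustrated in isolation. The sketch below is not the library's code: the alphabet, the helper names encode and update_confusion, the choice of label 0 for "no symbol", and the row/column orientation (rows for ground truth, columns for prediction) are all assumptions made for illustration. The alignment is written by hand here rather than derived from a backtrace.

```python
import numpy as np

ALPHABET = "abcd"  # hypothetical alphabet; label 0 is reserved for "no symbol"
CHAR_TO_LABEL = {ch: idx + 1 for idx, ch in enumerate(ALPHABET)}

def encode(line):
    """Map a string to dense integer labels (every character must be in ALPHABET)."""
    return np.array([CHAR_TO_LABEL[ch] for ch in line], dtype=np.int64)

def update_confusion(cm, aligned_pairs):
    """Add one count per aligned (pred_label, gt_label) pair, in place."""
    for pred_label, gt_label in aligned_pairs:
        cm[gt_label, pred_label] += 1  # assumed orientation: rows = ground truth
    return cm

pred = encode("abc")
gt = encode("ac")

# Hand-written alignment of "abc" against "ac": a/a, b/(nothing), c/c.
aligned = [(pred[0], gt[0]), (pred[1], 0), (pred[2], gt[1])]

cm = np.zeros((len(ALPHABET) + 1, len(ALPHABET) + 1), dtype=np.int64)
update_confusion(cm, aligned)
print(cm)
```

Repeating the update over many (prediction, ground truth) pairs accumulates the counts that later serve as conditional probabilities.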

generate_random_substitution_sequences(seq)[source]

Generate random substitution sequences based on a conditional probability matrix.

This method generates a randomly substituted copy of the input sequence seq, treating the confusion matrix as a table of conditional probabilities. Each output symbol is sampled from the conditional distribution of the corresponding input symbol.

Parameters:

seq (np.ndarray) – Input sequence represented as a NumPy array of integers. Each integer corresponds to a symbol in the vocabulary.

Returns:

A NumPy array of the same shape as seq, where each element is a randomly substituted symbol based on the conditional probability matrix.

Return type:

np.ndarray

Examples

>>> import numpy as np
>>> cm = np.array([[0, 0.5, 0.5],
...                [0, 0.7, 0.3],
...                [0, 0.4, 0.6]])
>>> seq = np.array([1, 2, 1])
>>> augmenter = SubstitutionAugmenter(cm)
>>> out = augmenter.generate_random_substitution_sequences(seq)
>>> out.shape == seq.shape
True
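
Sampling from a conditional probability matrix can be sketched with NumPy alone. This is a standalone illustration rather than the library's implementation; the function name sample_substitutions is hypothetical, and the sketch assumes that row s of the matrix holds the distribution over output symbols for input symbol s and that every row used has a nonzero sum.

```python
import numpy as np

def sample_substitutions(seq, cm, rng):
    """Sample one output symbol per input symbol from the rows of cm."""
    # Normalise each row so it sums to 1 (assumes no all-zero rows).
    probs = cm / cm.sum(axis=1, keepdims=True)
    n_symbols = probs.shape[1]
    return np.array([rng.choice(n_symbols, p=probs[s]) for s in seq])

cm = np.array([[0.0, 0.5, 0.5],
               [0.0, 0.7, 0.3],
               [0.0, 0.4, 0.6]])
seq = np.array([1, 2, 1])
rng = np.random.default_rng(0)  # seeded generator for reproducible sampling
out = sample_substitutions(seq, cm, rng)
print(out.shape == seq.shape)
```

Because column 0 carries zero probability in every row of this matrix, symbol 0 can never appear in the output.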
get_self_supervision_textline(input_line)[source]
Return type:

str

Parameters:

input_line (str)

save(file_path)[source]
Parameters:

file_path (str | Path)

static load(file_path)[source]
Return type:

CharConfusionMatrix

Parameters:

file_path (str | Path)

get_matrix()[source]
Return type:

ndarray

distort_np_sequence(input_seq)[source]
Return type:

ndarray

Parameters:

input_seq (ndarray)

distort_pt_sequence(input_seq)[source]
Return type:

Tensor

Parameters:

input_seq (Tensor)

distort_string(input_str)[source]
Return type:

str

Parameters:

input_str (str)

__call__(seq)[source]

Call self as a function.

Return type:

Union[ndarray, Tensor, str]

Parameters:

seq (ndarray | Tensor | str)