pylelemmatize.CharConfusionMatrix
- class pylelemmatize.CharConfusionMatrix(alphabet)
Bases: object
- Parameters:
alphabet (LemmatizerBMP | str)
- __init__(alphabet)
- Parameters:
alphabet (LemmatizerBMP | str)
Methods
__init__(alphabet)
backtrace_ed_matrix(input_seq, gt_seq, dp) – Backtraces the edit distance matrix to compute the alignment path, operation types, ground truth substitutions, and confusion matrix.
distort_np_sequence(input_seq)
distort_pt_sequence(input_seq)
distort_string(input_str)
edit_distance(s1, s2) – Compute the Levenshtein edit distance between two sequences.
generate_random_substitution_sequences(seq) – Generate random substitution sequences based on a conditional probability matrix.
get_self_supervision_textline(input_line)
ingest_textline_observation(pred_line, gt_line) – Processes a pair of predicted and ground truth text lines, computes the edit distance, and updates the confusion matrix.
load(file_path)
save(file_path)
- static edit_distance(s1, s2)
Compute the Levenshtein edit distance between two sequences.
This function calculates the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one sequence into the other. It also returns the dynamic programming (DP) matrix used to compute the distance.
- Parameters:
s1 (np.ndarray) – The first sequence as a NumPy array.
s2 (np.ndarray) – The second sequence as a NumPy array.
- Returns:
A tuple containing the Levenshtein edit distance between s1 and s2, and the DP matrix used to compute the distance, where dp[i, j] represents the edit distance between the first i characters of s1 and the first j characters of s2.
- Return type:
Tuple[int, ndarray]
Examples
>>> import numpy as np
>>> from pylelemmatize import CharConfusionMatrix
>>> s1 = np.array(['a', 'b', 'c'])
>>> s2 = np.array(['a', 'c', 'd'])
>>> distance, dp = CharConfusionMatrix.edit_distance(s1, s2)
>>> distance
2
>>> dp
array([[0, 1, 2, 3],
       [1, 0, 1, 2],
       [2, 1, 1, 2],
       [3, 2, 1, 2]])
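The recurrence behind the DP matrix can be sketched in plain NumPy. This is a minimal illustration that assumes unit costs for insertion, deletion, and substitution; the function name levenshtein_dp is hypothetical and is not part of pylelemmatize, whose implementation may differ.

```python
import numpy as np

def levenshtein_dp(s1, s2):
    """Return (distance, dp) for two 1-D sequences, assuming unit edit costs."""
    n, m = len(s1), len(s2)
    dp = np.zeros((n + 1, m + 1), dtype=int)
    dp[:, 0] = np.arange(n + 1)  # turning a prefix of s1 into the empty sequence
    dp[0, :] = np.arange(m + 1)  # building a prefix of s2 from the empty sequence
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i, j] = min(
                dp[i - 1, j] + 1,         # drop s1[i - 1]
                dp[i, j - 1] + 1,         # add s2[j - 1]
                dp[i - 1, j - 1] + cost,  # match or substitute
            )
    return int(dp[n, m]), dp

distance, dp = levenshtein_dp(np.array(list("abc")), np.array(list("acd")))
# dp[3, 2] is 1: turning "abc" into "ac" needs a single deletion.
```

The bottom-right cell dp[len(s1), len(s2)] holds the distance for the full sequences; every other cell holds the distance for a pair of prefixes.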
- backtrace_ed_matrix(input_seq, gt_seq, dp)
Backtraces the edit distance matrix to compute the alignment path, operation types, ground truth substitutions, and confusion matrix.
- Parameters:
input_seq (np.ndarray) – The input sequence represented as an array of indices.
gt_seq (np.ndarray) – The ground truth sequence represented as an array of indices.
dp (np.ndarray) – The dynamic programming matrix containing the edit distances.
- Returns:
path (np.ndarray) – The alignment path as an array of (input_index, gt_index) pairs.
operation_type (np.ndarray) – The sequence of operation types: 0 = match, 1 = substitution, 2 = deletion, 3 = insertion.
gt_sub_input (np.ndarray) – The ground truth sequence with substitutions applied.
cm (np.ndarray) – The confusion matrix representing the counts of matches, substitutions, insertions, and deletions. The matrix has dimensions (len(alphabet), len(alphabet)), where the first row/column represents insertions/deletions.
- Return type:
Tuple[ndarray, ndarray, ndarray, ndarray]
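A backtrace over a unit-cost DP matrix can be sketched as follows. This is an illustration only: it returns just the path and the operation codes, it assumes that "deletion" means a symbol present in the input but absent from the ground truth (and "insertion" the reverse), and it prefers diagonal moves on ties. The function name backtrace and these conventions are assumptions, not the library's documented behavior.

```python
import numpy as np

MATCH, SUB, DEL, INS = 0, 1, 2, 3  # operation codes as listed above

def backtrace(input_seq, gt_seq, dp):
    """Walk from dp[-1, -1] back to dp[0, 0], recording one optimal alignment."""
    i, j = len(input_seq), len(gt_seq)
    path, ops = [], []
    while i > 0 or j > 0:
        diag = i > 0 and j > 0
        if diag and input_seq[i - 1] == gt_seq[j - 1] and dp[i, j] == dp[i - 1, j - 1]:
            op, i, j = MATCH, i - 1, j - 1
        elif diag and dp[i, j] == dp[i - 1, j - 1] + 1:
            op, i, j = SUB, i - 1, j - 1
        elif i > 0 and dp[i, j] == dp[i - 1, j] + 1:
            op, i, j = DEL, i - 1, j   # assumed: input symbol has no gt partner
        else:
            op, i, j = INS, i, j - 1   # assumed: gt symbol has no input partner
        path.append((i, j))
        ops.append(op)
    # Steps were collected end-to-start, so reverse them.
    return np.array(path[::-1]), np.array(ops[::-1])

input_seq = np.array([0, 1, 2])  # e.g. labels for "abc"
gt_seq = np.array([0, 2, 3])     # e.g. labels for "acd"
dp = np.array([[0, 1, 2, 3],
               [1, 0, 1, 2],
               [2, 1, 1, 2],
               [3, 2, 1, 2]])
path, ops = backtrace(input_seq, gt_seq, dp)
```

The number of non-match operations on the recovered path equals dp[-1, -1]. Several alignments can be optimal at once; a different tie-breaking rule would return a different but equally short path.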
- ingest_textline_observation(pred_line, gt_line)
Processes a pair of predicted and ground truth text lines, computes the edit distance, and updates the confusion matrix.
This method performs the following steps:
1. Converts the predicted and ground truth text lines into dense integer label sequences.
2. Computes the edit distance and dynamic programming matrix between the two sequences.
3. Performs a backtrace on the edit distance matrix to generate the ground truth substitution input and updates the confusion matrix.
4. Returns the ground truth substitution input and the computed edit distance.
- Parameters:
pred_line (str) – The predicted text line as a string.
gt_line (str) – The ground truth text line as a string.
- Returns:
A tuple containing:
- The ground truth substitution input as a string.
- The edit distance between the predicted and ground truth text lines.
- Return type:
Tuple[str, int]
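The encoding step and the confusion-matrix update described above can be illustrated with a small stand-alone sketch. Everything here is an assumption made for illustration: the five-entry symbol table, the reservation of label 0 for "no symbol", the (ground truth, prediction) orientation of the matrix, and the helper names encode and update_confusion. None of it is taken from the pylelemmatize source, and the alignment is written out by hand rather than computed.

```python
import numpy as np

# Hypothetical symbol table: label 0 is reserved for "no symbol" (a gap).
symbols = ["", "a", "b", "c", "d"]
char_to_label = {ch: i for i, ch in enumerate(symbols) if ch}

def encode(line):
    """Map each character to its dense label; unknown characters map to 0."""
    return np.array([char_to_label.get(ch, 0) for ch in line], dtype=int)

def update_confusion(cm, aligned_pairs):
    """Add one count per aligned (gt_label, pred_label) pair; 0 marks a gap."""
    for gt_label, pred_label in aligned_pairs:
        cm[gt_label, pred_label] += 1
    return cm

pred = encode("abc")  # labels 1, 2, 3
gt = encode("acd")    # labels 1, 3, 4

# One optimal alignment of "abc" against "acd", written out by hand:
# a matches a, b has no partner, c matches c, d has no partner.
aligned_pairs = [(gt[0], pred[0]), (0, pred[1]), (gt[1], pred[2]), (gt[2], 0)]
cm = update_confusion(np.zeros((len(symbols), len(symbols)), dtype=int), aligned_pairs)
```

With this layout, counts on the diagonal are matches, counts in row 0 or column 0 are unpaired symbols, and all other cells are substitutions.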
- generate_random_substitution_sequences(seq)
Generate random substitution sequences based on a conditional probability matrix.
This method generates a sequence of random substitutions for the input sequence seq, using the confusion matrix as the conditional probability matrix. Each output symbol is sampled from the conditional probabilities of the corresponding input symbol.
- Parameters:
seq (np.ndarray) – Input sequence represented as a NumPy array of integers. Each integer corresponds to a symbol in the vocabulary.
- Returns:
A NumPy array of the same shape as seq, where each element is a randomly substituted symbol based on the conditional probability matrix.
- Return type:
np.ndarray
Examples
>>> import numpy as np
>>> cm = np.array([[0, 0.5, 0.5],
...                [0, 0.7, 0.3],
...                [0, 0.4, 0.6]])
>>> seq = np.array([1, 2, 1])
>>> augmenter = SubstitutionAugmenter(cm)
>>> out = augmenter.generate_random_substitution_sequences(seq)
>>> out.shape == seq.shape
True
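Sampling from a row-conditional matrix can be sketched independently of the class. The sketch below assumes that row s of the matrix holds the (possibly unnormalized) weights of the output symbols given input symbol s, and that every row used has a positive sum. The function name sample_substitutions and the explicit rng argument are hypothetical.

```python
import numpy as np

def sample_substitutions(seq, cm, rng):
    """Draw one output symbol per input symbol from the matching row of cm."""
    # Normalize each row so it is a valid probability distribution.
    probs = cm / cm.sum(axis=1, keepdims=True)
    n_symbols = cm.shape[1]
    return np.array([rng.choice(n_symbols, p=probs[s]) for s in seq])

cm = np.array([[0, 0.5, 0.5],
               [0, 0.7, 0.3],
               [0, 0.4, 0.6]])
seq = np.array([1, 2, 1])
rng = np.random.default_rng(0)  # seeded for repeatable draws
out = sample_substitutions(seq, cm, rng)
```

Because column 0 has zero weight in every row of this matrix, the sampled output never contains symbol 0, and the output always has the same shape as seq.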