pylelemmatize.LemmatizerBMP
- class pylelemmatize.LemmatizerBMP(mapping_dict={}, unknown_chr='�', unicode_normalization='Dense')[source]
Bases: GenericLemmatizer
- Parameters:
mapping_dict (Dict[str, str])
unknown_chr (str)
unicode_normalization (Literal['Dense', 'Composite', None])
- __init__(mapping_dict={}, unknown_chr='�', unicode_normalization='Dense')[source]
Initialize the LemmatizerBMP instance.
- Parameters:
mapping_dict (Union[Dict[str, str], str], optional) – A dictionary mapping source characters to destination characters. If a string is provided, it is converted into a dictionary in which each character maps to itself. Defaults to an empty dictionary.
unknown_chr (str, optional) – The character to use for unknown mappings. Defaults to “�”.
unicode_normalization (Literal["Dense", "Composite", None], optional) – The type of Unicode normalization to apply. Defaults to “Dense”.
  - “Dense”: Use dense Unicode normalization.
  - “Composite”: Use composite Unicode normalization.
  - None: No Unicode normalization is applied.
Notes
This constructor initializes the mapping dictionary, sets up Unicode normalization, and creates internal mappings for efficient character transformations.
Methods
__init__([mapping_dict, unknown_chr, ...]) – Initialize the LemmatizerBMP instance.
alphabet_in_bmp(alphabet) – Check if all characters in the given alphabet are within the BMP (Basic Multilingual Plane).
copy_removing_unused_inputs(txt)
fast_alphabet_extraction(text)
from_alphabet_mapping(src_alphabet_str[, ...])
get_cer(pred, true)
get_encoding_information_loss(text)
get_unigram(text) – Compute unigram statistics for the input text.
intlabel_seq_to_str(dense_np_text) – Convert a sequence of integer labels back to a string.
len() – Return the size of the destination alphabet.
onehot_to_str(onehot[, time_first]) – Convert a one-hot encoded representation back to a string.
str_to_intlabel_seq(text) – Convert a string to a sequence of integer labels.
str_to_onehot(text[, time_first]) – Convert a string to a one-hot encoded representation.
Attributes
alphabet_tsv – Get the destination alphabet as a string.
mapping_tsv – Get the source alphabet as a string.
unicode_normalization
unknown_chr
- static alphabet_in_bmp(alphabet)[source]
Check if all characters in the given alphabet are within the BMP (Basic Multilingual Plane).
- Parameters:
alphabet (Optional[str]) – A string containing the alphabet to check. If None, the method returns True.
- Returns:
True if all characters are within the BMP, False otherwise.
- Return type:
bool
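The check described above can be approximated in plain Python. This is a minimal sketch, not the library's implementation: the helper name `is_in_bmp` is hypothetical, and it assumes that "within the BMP" means a code point no greater than U+FFFF.

```python
from typing import Optional


def is_in_bmp(alphabet: Optional[str]) -> bool:
    """Return True if every character lies in U+0000..U+FFFF (hypothetical helper)."""
    if alphabet is None:
        # Mirrors the documented behaviour: None is treated as valid.
        return True
    return all(ord(ch) <= 0xFFFF for ch in alphabet)


print(is_in_bmp("abcäß"))          # True: all code points are below U+FFFF
print(is_in_bmp("abc\U0001F600"))  # False: U+1F600 lies outside the BMP
print(is_in_bmp(None))             # True
```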
- __call__(text)[source]
Transform the input text using the lemmatizer.
- Parameters:
text (str) – The input text to transform.
- Returns:
The transformed text.
- Return type:
str
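To illustrate the kind of transformation described here, the following sketch maps each character through a dictionary and substitutes a placeholder for anything unmapped. It is an illustration only: `apply_mapping` is a hypothetical helper, it assumes unmapped characters are replaced by the unknown character, and it skips Unicode normalization entirely.

```python
from typing import Dict


def apply_mapping(text: str, mapping: Dict[str, str], unknown_chr: str = "\uFFFD") -> str:
    """Map each character through `mapping`; unmapped characters become `unknown_chr`."""
    return "".join(mapping.get(ch, unknown_chr) for ch in text)


# Hypothetical mapping that folds an accented letter onto its base letter.
mapping = {"a": "a", "á": "a", "b": "b"}

print(apply_mapping("ábc", mapping))  # 'ab\ufffd' ('c' is not in the mapping)
```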
- str_to_intlabel_seq(text)[source]
Convert a string to a sequence of integer labels.
- Parameters:
text (str) – The input string to convert.
- Returns:
A NumPy array of integer labels representing the input string.
- Return type:
np.ndarray
- intlabel_seq_to_str(dense_np_text)[source]
Convert a sequence of integer labels back to a string.
- Parameters:
dense_np_text (np.ndarray) – A NumPy array of integer labels to convert.
- Returns:
The reconstructed string.
- Return type:
str
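The round trip between strings and integer labels can be sketched with plain lists. This is not the library's code: the names `encode` and `decode` are hypothetical, the sketch returns Python lists rather than NumPy arrays, and reserving label 0 for the unknown character is an assumption made for the example.

```python
from typing import List

UNKNOWN = "\uFFFD"
# Hypothetical destination alphabet; index 0 is reserved for the unknown character.
dst_alphabet = UNKNOWN + "abc"
char_to_label = {ch: i for i, ch in enumerate(dst_alphabet)}


def encode(text: str) -> List[int]:
    """Convert a string to integer labels; characters outside the alphabet map to 0."""
    return [char_to_label.get(ch, 0) for ch in text]


def decode(labels: List[int]) -> str:
    """Convert integer labels back to a string."""
    return "".join(dst_alphabet[i] for i in labels)


labels = encode("cabx")
print(labels)          # [3, 1, 2, 0]
print(decode(labels))  # 'cab\ufffd' ('x' was not in the alphabet)
```

Note that the round trip is lossy for characters outside the alphabet, since they all collapse onto the same label.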
- get_unigram(text)[source]
Compute unigram statistics for the input text.
- Parameters:
text (str) – The input text to analyze.
- Returns:
values (np.ndarray) – Unique integer labels in the text.
counts (np.ndarray) – Counts of each unique label.
labels (np.ndarray) – Mapping of integer labels to their corresponding characters.
- Return type:
Tuple[np.ndarray, np.ndarray, np.ndarray]
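The three return values can be pictured with a small pure-Python sketch. It only approximates the documented output: `unigram` is a hypothetical helper, it returns lists instead of NumPy arrays, and the alphabet with label 0 reserved for unknown characters is assumed for the example.

```python
from collections import Counter
from typing import List, Tuple

# Hypothetical destination alphabet; index 0 stands for unknown characters.
dst_alphabet = "\uFFFDabc"
char_to_label = {ch: i for i, ch in enumerate(dst_alphabet)}


def unigram(text: str) -> Tuple[List[int], List[int], List[str]]:
    """Return (unique labels, their counts, their characters) for `text`."""
    labels = [char_to_label.get(ch, 0) for ch in text]
    counted = Counter(labels)
    values = sorted(counted)                    # unique labels in ascending order
    counts = [counted[v] for v in values]       # occurrences of each label
    chars = [dst_alphabet[v] for v in values]   # character behind each label
    return values, counts, chars


values, counts, chars = unigram("abcab")
print(values)  # [1, 2, 3]
print(counts)  # [2, 2, 1]
print(chars)   # ['a', 'b', 'c']
```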
- str_to_onehot(text, time_first=True)[source]
Convert a string to a one-hot encoded representation.
- Parameters:
text (str) – The input string to convert.
time_first (bool, optional) – If True, the output array will have shape (T, C), where T is the length of the string and C is the number of unique characters. If False, the output will have shape (C, T). Defaults to True.
- Returns:
A one-hot encoded NumPy array representing the input string.
- Return type:
np.ndarray
- onehot_to_str(onehot, time_first=True)[source]
Convert a one-hot encoded representation back to a string.
- Parameters:
onehot (np.ndarray) – A one-hot encoded NumPy array to convert.
time_first (bool, optional) – If True, the input array is expected to have shape (T, C). If False, it is expected to have shape (C, T). Defaults to True.
- Returns:
The reconstructed string.
- Return type:
str
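The effect of `time_first` on the array layout can be shown with nested lists. This is a sketch under stated assumptions rather than the library's implementation: `to_onehot` and `from_onehot` are hypothetical helpers, the three-character alphabet is invented for the example, and nested lists stand in for NumPy arrays.

```python
from typing import List

alphabet = "abc"  # hypothetical destination alphabet, so C = 3


def to_onehot(text: str, time_first: bool = True) -> List[List[int]]:
    """Encode `text` as a (T, C) matrix, or (C, T) when time_first is False."""
    rows = [[1 if ch == a else 0 for a in alphabet] for ch in text]  # shape (T, C)
    if time_first:
        return rows
    return [list(col) for col in zip(*rows)]  # transpose to shape (C, T)


def from_onehot(onehot: List[List[int]], time_first: bool = True) -> str:
    """Decode a one-hot matrix back to a string by taking the argmax per time step."""
    rows = onehot if time_first else [list(col) for col in zip(*onehot)]
    return "".join(alphabet[row.index(max(row))] for row in rows)


print(to_onehot("ca"))                     # [[0, 0, 1], [1, 0, 0]]
print(to_onehot("ca", time_first=False))   # [[0, 1], [0, 0], [1, 0]]
print(from_onehot(to_onehot("ca", time_first=False), time_first=False))  # 'ca'
```

Both calls must agree on `time_first`; decoding a (C, T) matrix as if it were (T, C) would silently yield a different string.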
- property dst_alphabet_str: str
Get the destination alphabet as a string.
- Returns:
The destination alphabet string.
- Return type:
str
- property src_alphabet_str: str
Get the source alphabet as a string.
- Returns:
The source alphabet string.
- Return type:
str