pylelemmatize.LemmatizerBMP

class pylelemmatize.LemmatizerBMP(mapping_dict={}, unknown_chr='�', unicode_normalization='Dense')

Bases: GenericLemmatizer

Parameters:

mapping_dict (Dict[str, str])
unknown_chr (str)
unicode_normalization (Literal['Dense', 'Composite', None])

__init__(mapping_dict={}, unknown_chr='�', unicode_normalization='Dense')

Initialize the LemmatizerBMP instance.

Parameters:

mapping_dict (Union[Dict[str, str], str], optional) – A dictionary mapping source characters to destination characters. If a string is provided instead, it is converted into a dictionary in which each character maps to itself. Defaults to an empty dictionary.
unknown_chr (str, optional) – The character to use for unknown mappings. Defaults to “�”.
unicode_normalization (Literal["Dense", "Composite", None], optional) – The type of Unicode normalization to apply. Defaults to “Dense”.
- “Dense”: Use dense Unicode normalization.
- “Composite”: Use composite Unicode normalization.
- None: No Unicode normalization is applied.

Notes

This constructor initializes the mapping dictionary, sets up Unicode normalization, and creates internal mappings for efficient character transformations.
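
The source does not say which Unicode normalization forms “Dense” and “Composite” correspond to. As background only, the sketch below uses the standard library to show the difference between a composed form (NFC) and a decomposed form (NFD); treating these as the forms behind the two options is an assumption, not something this page confirms.

```python
import unicodedata

# The letter "é" written two ways: one precomposed code point,
# and "e" followed by a combining acute accent.
precomposed = "\u00e9"
decomposed = "e\u0301"

# NFC composes characters where possible; NFD decomposes them.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)

print(len(nfc), len(nfd))                      # 1 2
print(nfc == precomposed, nfd == decomposed)   # True True
```

Because the two spellings compare unequal as plain strings, a character mapping keyed on one form will miss the other unless the text is normalized first.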

Methods

`__init__`([mapping_dict, unknown_chr, ...])	Initialize the LemmatizerBMP instance.
`alphabet_in_bmp`(alphabet)	Check if all characters in the given alphabet are within the BMP (Basic Multilingual Plane).
`copy_removing_unused_inputs`(txt)
`fast_alphabet_extraction`(text)
`from_alphabet_mapping`(src_alphabet_str[, ...])
`get_cer`(pred, true)
`get_encoding_information_loss`(text)
`get_unigram`(text)	Compute unigram statistics for the input text.
`intlabel_seq_to_str`(dense_np_text)	Convert a sequence of integer labels back to a string.
`len`()	Return the size of the destination alphabet.
`onehot_to_str`(onehot[, time_first])	Convert a one-hot encoded representation back to a string.
`str_to_intlabel_seq`(text)	Convert a string to a sequence of integer labels.
`str_to_onehot`(text[, time_first])	Convert a string to a one-hot encoded representation.

Attributes

`alphabet_tsv`
`dst_alphabet_str`	Get the destination alphabet as a string.
`mapping_tsv`
`src_alphabet_str`	Get the source alphabet as a string.
`unicode_normalization`
`unknown_chr`

static alphabet_in_bmp(alphabet)

Check if all characters in the given alphabet are within the BMP (Basic Multilingual Plane).

Parameters: alphabet (Optional[str]) – A string containing the alphabet to check. If None, the method returns True.
Returns: True if all characters are within the BMP, False otherwise.
Return type: bool
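
The check described above can be sketched in plain Python. The function name `alphabet_in_bmp_sketch` is hypothetical and this is not the library's implementation; it only relies on the fact that BMP code points lie in the range U+0000 to U+FFFF.

```python
from typing import Optional


def alphabet_in_bmp_sketch(alphabet: Optional[str]) -> bool:
    """Hypothetical stand-in for the documented check."""
    if alphabet is None:
        # The documentation states that None yields True.
        return True
    # Every BMP code point is at most U+FFFF.
    return all(ord(ch) <= 0xFFFF for ch in alphabet)


print(alphabet_in_bmp_sketch("abc\u00e9\u20ac"))   # True  (é and € are in the BMP)
print(alphabet_in_bmp_sketch("abc\U0001F600"))     # False (U+1F600 is outside the BMP)
print(alphabet_in_bmp_sketch(None))                # True
```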

__call__(text)

Transform the input text using the lemmatizer.

Parameters: text (str) – The input text to transform.
Returns: The transformed text.
Return type: str
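
As a rough illustration of a character-level transformation driven by a mapping dictionary, the sketch below maps each character through a dict and substitutes an unknown character for anything unmapped. The helper `transform_sketch` and the sample mapping are hypothetical; that the class handles unmapped characters exactly this way is an assumption based on the description of `unknown_chr`.

```python
from typing import Dict


def transform_sketch(text: str, mapping: Dict[str, str], unknown_chr: str = "\ufffd") -> str:
    """Map each character through `mapping`; unmapped characters become `unknown_chr`."""
    return "".join(mapping.get(ch, unknown_chr) for ch in text)


# Hypothetical mapping: fold "á" (U+00E1) onto "a", keep "a", "b" and space.
mapping = {"a": "a", "\u00e1": "a", "b": "b", " ": " "}

result = transform_sketch("\u00e1ba cab", mapping)
print(result)  # "aba " followed by U+FFFD and "ab", since "c" is not in the mapping
```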

str_to_intlabel_seq(text)

Convert a string to a sequence of integer labels.

Parameters: text (str) – The input string to convert.
Returns: A NumPy array of integer labels representing the input string.
Return type: np.ndarray

intlabel_seq_to_str(dense_np_text)

Convert a sequence of integer labels back to a string.

Parameters: dense_np_text (np.ndarray) – A NumPy array of integer labels to convert.
Returns: The reconstructed string.
Return type: str
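
The round trip between strings and integer labels can be sketched with NumPy as follows. The destination alphabet, the choice of index 0 for the unknown character, and the helper names are all assumptions made for illustration; the page does not specify how the class assigns labels.

```python
import numpy as np

# Hypothetical destination alphabet; index 0 is assumed to hold the unknown character.
dst_alphabet = "\ufffdabc"
char_to_label = {ch: i for i, ch in enumerate(dst_alphabet)}


def str_to_labels(text: str) -> np.ndarray:
    """Encode each character as its index in `dst_alphabet` (0 if unmapped)."""
    return np.array([char_to_label.get(ch, 0) for ch in text], dtype=np.int64)


def labels_to_str(labels: np.ndarray) -> str:
    """Decode integer labels back to characters."""
    return "".join(dst_alphabet[i] for i in labels)


labels = str_to_labels("cabx")
print(labels)  # [3 1 2 0]
decoded = labels_to_str(labels)  # "cab" followed by U+FFFD, because "x" is unmapped
```

The round trip is lossless only for characters present in the alphabet; anything else collapses to the unknown label.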

get_unigram(text)

Compute unigram statistics for the input text.

Parameters:

text (str) – The input text to analyze.

Returns:

values (np.ndarray) – Unique integer labels in the text.
counts (np.ndarray) – Counts of each unique label.
labels (np.ndarray) – Mapping of integer labels to their corresponding characters.

Return type:

Tuple[np.ndarray, np.ndarray, np.ndarray]
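
Unigram statistics of this shape can be computed with `np.unique`. The sketch below is not the library's implementation; the label sequence and the alphabet array are made up, and interpreting the third return value as an array of characters indexed by label is an assumption.

```python
import numpy as np

# Hypothetical alphabet indexed by integer label (index 0 assumed to be the unknown character).
alphabet_chars = np.array(list("\ufffdabc"))

# A made-up label sequence standing in for an encoded text.
label_seq = np.array([3, 1, 2, 1, 1, 3])

# Unique labels and how often each one occurs.
values, counts = np.unique(label_seq, return_counts=True)
print(values)  # [1 2 3]
print(counts)  # [3 1 2]

# Look up the character behind each unique label.
chars = alphabet_chars[values]
print(chars.tolist())  # ['a', 'b', 'c']
```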

str_to_onehot(text, time_first=True)

Convert a string to a one-hot encoded representation.

Parameters:

text (str) – The input string to convert.
time_first (bool, optional) – If True, the output array will have shape (T, C), where T is the length of the string and C is the number of unique characters. If False, the output will have shape (C, T). Defaults to True.

Returns:

A one-hot encoded NumPy array representing the input string.

Return type:

np.ndarray

onehot_to_str(onehot, time_first=True)

Convert a one-hot encoded representation back to a string.

Parameters:

onehot (np.ndarray) – A one-hot encoded NumPy array to convert.
time_first (bool, optional) – If True, the input array is expected to have shape (T, C). If False, it is expected to have shape (C, T). Defaults to True.

Returns:

The reconstructed string.

Return type:

str
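
The two array layouts selected by `time_first` can be illustrated with NumPy. This is a sketch under assumed values (4 classes and a made-up label sequence), not the library's code: it builds a (T, C) one-hot array from integer labels, transposes it to (C, T), and recovers the labels from each layout with `argmax`.

```python
import numpy as np

num_classes = 4                      # C: assumed alphabet size
labels = np.array([3, 1, 2, 0, 1])   # T = 5 made-up integer labels

# Row i of the identity matrix is the one-hot vector for label i.
onehot_tc = np.eye(num_classes, dtype=np.uint8)[labels]   # shape (T, C)
onehot_ct = onehot_tc.T                                    # shape (C, T)

print(onehot_tc.shape, onehot_ct.shape)  # (5, 4) (4, 5)

# Decoding: take the argmax along the class axis of each layout.
recovered_tc = onehot_tc.argmax(axis=1)
recovered_ct = onehot_ct.argmax(axis=0)
print(recovered_tc)  # [3 1 2 0 1]
```

The only difference between the layouts is which axis holds the classes, so the decoder needs to know the layout to pick the right axis.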

property dst_alphabet_str: str

Get the destination alphabet as a string.

Returns: The destination alphabet string.
Return type: str

property src_alphabet_str: str

Get the source alphabet as a string.

Returns: The source alphabet string.
Return type: str