pylelemmatize.LemmatizerBMP

class pylelemmatize.LemmatizerBMP(mapping_dict={}, unknown_chr='�', unicode_normalization='Dense')

Bases: GenericLemmatizer

Parameters:
  • mapping_dict (Dict[str, str])

  • unknown_chr (str)

  • unicode_normalization (Literal['Dense', 'Composite', None])

__init__(mapping_dict={}, unknown_chr='�', unicode_normalization='Dense')

Initialize the LemmatizerBMP instance.

Parameters:
  • mapping_dict (Union[Dict[str, str], str], optional) – A dictionary mapping source characters to destination characters. If a string is passed instead, it is converted into a dictionary in which each character maps to itself. Defaults to an empty dictionary.

  • unknown_chr (str, optional) – The character to use for unknown mappings. Defaults to “�”.

  • unicode_normalization (Literal["Dense", "Composite", None], optional) – The type of Unicode normalization to apply. Defaults to “Dense”.

      • “Dense”: Use dense Unicode normalization.

      • “Composite”: Use composite Unicode normalization.

      • None: No Unicode normalization is applied.

Notes

This constructor initializes the mapping dictionary, sets up Unicode normalization, and creates internal mappings for efficient character transformations.
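This page does not say which Unicode normalization forms “Dense” and “Composite” correspond to. As background only, the sketch below uses the standard library's unicodedata module (not pylelemmatize) to show why normalization matters for a character-level mapping: the same visible character can be one code point or two, so a mapping keyed on single characters sees different input depending on the form.

```python
import unicodedata

composed = "\u00e9"      # 'é' as a single precomposed code point
decomposed = "e\u0301"   # 'e' followed by a combining acute accent

# The two strings render the same but are different code point sequences.
same_without_normalization = composed == decomposed

# NFC composes where possible; NFD decomposes into base + combining marks.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)

print(same_without_normalization, len(nfc), len(nfd))
```

Whether the class applies one of these standard forms, or something else, is an assumption that should be checked against the source.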

Methods

__init__([mapping_dict, unknown_chr, ...])

Initialize the LemmatizerBMP instance.

alphabet_in_bmp(alphabet)

Check if all characters in the given alphabet are within the BMP (Basic Multilingual Plane).

copy_removing_unused_inputs(txt)

fast_alphabet_extraction(text)

from_alphabet_mapping(src_alphabet_str[, ...])

get_cer(pred, true)

get_encoding_information_loss(text)

get_unigram(text)

Compute unigram statistics for the input text.

intlabel_seq_to_str(dense_np_text)

Convert a sequence of integer labels back to a string.

len()

Return the size of the destination alphabet.

onehot_to_str(onehot[, time_first])

Convert a one-hot encoded representation back to a string.

str_to_intlabel_seq(text)

Convert a string to a sequence of integer labels.

str_to_onehot(text[, time_first])

Convert a string to a one-hot encoded representation.

Attributes

alphabet_tsv

dst_alphabet_str

Get the destination alphabet as a string.

mapping_tsv

src_alphabet_str

Get the source alphabet as a string.

unicode_normalization

unknown_chr

static alphabet_in_bmp(alphabet)

Check if all characters in the given alphabet are within the BMP (Basic Multilingual Plane).

Parameters:

alphabet (Optional[str]) – A string containing the alphabet to check. If None, the method returns True.

Returns:

True if all characters are within the BMP, False otherwise.

Return type:

bool
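The check described above can be sketched in plain Python. The BMP covers code points U+0000 through U+FFFF, so the test reduces to a comparison on ord(). The function name alphabet_in_bmp_sketch is hypothetical and this is not the library's implementation, only an illustration of the documented behaviour, including the None case.

```python
from typing import Optional


def alphabet_in_bmp_sketch(alphabet: Optional[str]) -> bool:
    """Return True if every character lies in the BMP (U+0000..U+FFFF)."""
    if alphabet is None:
        # Documented behaviour: None is treated as trivially inside the BMP.
        return True
    return all(ord(ch) <= 0xFFFF for ch in alphabet)


print(alphabet_in_bmp_sketch("abc\u00e9"))       # Latin letters, all in the BMP
print(alphabet_in_bmp_sketch("a\U0001F600"))     # contains U+1F600, outside the BMP
```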

__call__(text)

Transform the input text using the lemmatizer.

Parameters:

text (str) – The input text to transform.

Returns:

The transformed text.

Return type:

str
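As a rough illustration of a character-level transform, the sketch below maps each input character through a dictionary and falls back to an unknown character when no mapping exists. The helper apply_mapping_sketch and the sample mapping are hypothetical; that the real class handles unmapped characters in exactly this way is an assumption, and the Unicode normalization step is left out.

```python
def apply_mapping_sketch(text: str, mapping: dict, unknown_chr: str = "\ufffd") -> str:
    """Map every character through `mapping`; unmapped characters become `unknown_chr`."""
    return "".join(mapping.get(ch, unknown_chr) for ch in text)


# Hypothetical mapping that folds upper case onto lower case.
mapping = {"a": "a", "A": "a", "b": "b", "B": "b"}
result = apply_mapping_sketch("AbBa!", mapping)
print(result)
```

Here "!" has no entry in the mapping, so it is replaced by the unknown character.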

str_to_intlabel_seq(text)

Convert a string to a sequence of integer labels.

Parameters:

text (str) – The input string to convert.

Returns:

A NumPy array of integer labels representing the input string.

Return type:

np.ndarray

intlabel_seq_to_str(dense_np_text)

Convert a sequence of integer labels back to a string.

Parameters:

dense_np_text (np.ndarray) – A NumPy array of integer labels to convert.

Returns:

The reconstructed string.

Return type:

str
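The encode/decode pair can be pictured as a round trip through a lookup table. The sketch below uses plain lists rather than NumPy arrays and assumes, purely for illustration, that label 0 is reserved for the unknown character; the page does not state how the library assigns labels.

```python
unknown_chr = "\ufffd"
dst_alphabet = "abc"  # hypothetical destination alphabet

# Assumption: index 0 is the unknown character, followed by the alphabet.
label_chars = [unknown_chr] + list(dst_alphabet)
char_to_label = {ch: i for i, ch in enumerate(label_chars)}


def encode_sketch(text: str) -> list:
    """String -> list of integer labels (0 for characters outside the alphabet)."""
    return [char_to_label.get(ch, 0) for ch in text]


def decode_sketch(labels: list) -> str:
    """List of integer labels -> string."""
    return "".join(label_chars[i] for i in labels)


encoded = encode_sketch("cab?")
decoded = decode_sketch(encoded)
print(encoded, decoded)
```

The round trip is lossless only for characters in the alphabet; "?" comes back as the unknown character.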

get_unigram(text)

Compute unigram statistics for the input text.

Parameters:

text (str) – The input text to analyze.

Returns:

  • values (np.ndarray) – Unique integer labels in the text.

  • counts (np.ndarray) – Counts of each unique label.

  • labels (np.ndarray) – Mapping of integer labels to their corresponding characters.

Return type:

Tuple[np.ndarray, np.ndarray, np.ndarray]
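To make the three return values concrete, here is a standard-library sketch of unigram counting over an integer label sequence. It returns lists instead of NumPy arrays, and the interpretation of labels as "the character for each unique value" is an assumption; the names unigram_sketch and label_chars are hypothetical.

```python
from collections import Counter


def unigram_sketch(int_labels: list, label_chars: list):
    """Return (unique labels, their counts, their characters), sorted by label."""
    counter = Counter(int_labels)
    values = sorted(counter)
    counts = [counter[v] for v in values]
    labels = [label_chars[v] for v in values]
    return values, counts, labels


# Hypothetical label table: 0 is the unknown character, then 'a', 'b', 'c'.
label_chars = ["\ufffd", "a", "b", "c"]
values, counts, labels = unigram_sketch([3, 1, 2, 1, 1], label_chars)
print(values, counts, labels)
```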

str_to_onehot(text, time_first=True)

Convert a string to a one-hot encoded representation.

Parameters:
  • text (str) – The input string to convert.

  • time_first (bool, optional) – If True, the output array will have shape (T, C), where T is the length of the string and C is the number of unique characters. If False, the output will have shape (C, T). Defaults to True.

Returns:

A one-hot encoded NumPy array representing the input string.

Return type:

np.ndarray
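The effect of time_first on the output layout can be sketched with nested lists. The example starts from integer labels rather than a string and takes the number of classes as a parameter; both are simplifications, and one_hot_sketch is a hypothetical helper, not a library function.

```python
def one_hot_sketch(int_labels: list, num_classes: int, time_first: bool = True) -> list:
    """Build a one-hot matrix of shape (T, C), or (C, T) when time_first is False."""
    rows = [[1 if c == label else 0 for c in range(num_classes)] for label in int_labels]
    if time_first:
        return rows
    # Transpose (T, C) -> (C, T).
    return [list(col) for col in zip(*rows)]


tc = one_hot_sketch([2, 0, 1], num_classes=3)                     # shape (T, C)
ct = one_hot_sketch([2, 0, 1], num_classes=3, time_first=False)   # shape (C, T)
print(tc)
print(ct)
```

In the (T, C) layout each row is one character position; in the (C, T) layout each row is one class.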

onehot_to_str(onehot, time_first=True)

Convert a one-hot encoded representation back to a string.

Parameters:
  • onehot (np.ndarray) – A one-hot encoded NumPy array to convert.

  • time_first (bool, optional) – If True, the input array is expected to have shape (T, C). If False, it is expected to have shape (C, T). Defaults to True.

Returns:

The reconstructed string.

Return type:

str
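Decoding a one-hot matrix amounts to taking the index of the largest entry at each time step and looking up its character. The sketch below does this with nested lists; the label table and helper name are hypothetical, and using an argmax (rather than some other rule) is an assumption about how the library decodes.

```python
def one_hot_to_labels_sketch(onehot: list, time_first: bool = True) -> list:
    """Return the argmax class index for each time step."""
    # Bring the matrix into (T, C) layout before taking the per-row argmax.
    rows = onehot if time_first else [list(col) for col in zip(*onehot)]
    return [max(range(len(row)), key=row.__getitem__) for row in rows]


label_chars = ["\ufffd", "a", "b"]  # hypothetical: 0 = unknown, 1 = 'a', 2 = 'b'

onehot_tc = [[0, 1, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0]]   # shape (T, C)
onehot_ct = [[0, 0, 0, 1], [1, 0, 1, 0], [0, 1, 0, 0]]     # same data, shape (C, T)

text_tc = "".join(label_chars[i] for i in one_hot_to_labels_sketch(onehot_tc))
text_ct = "".join(label_chars[i] for i in one_hot_to_labels_sketch(onehot_ct, time_first=False))
print(text_tc, text_ct)
```

Both layouts decode to the same string as long as time_first matches the layout of the array passed in.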

property dst_alphabet_str: str

Get the destination alphabet as a string.

Returns:

The destination alphabet string.

Return type:

str

property src_alphabet_str: str

Get the source alphabet as a string.

Returns:

The source alphabet string.

Return type:

str
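As a way to picture the two alphabet properties, the sketch below derives a source and a destination alphabet from a hypothetical mapping dictionary. Sorting by code point, and leaving the unknown character out of the destination alphabet, are assumptions; the page does not specify the order or exact contents of either string.

```python
# Hypothetical mapping that folds upper case onto lower case.
mapping = {"A": "a", "a": "a", "B": "b", "b": "b"}

# Source alphabet: every character the mapping accepts as input.
src_alphabet = "".join(sorted(mapping))

# Destination alphabet: the distinct characters the mapping can produce.
dst_alphabet = "".join(sorted(set(mapping.values())))

print(src_alphabet, dst_alphabet)
```

A many-to-one mapping like this one yields a destination alphabet smaller than the source alphabet.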