pylelemmatize.Seq2SeqDs

class pylelemmatize.Seq2SeqDs(text_blocks, input_mapper=None, output_mapper=None, min_input_seqlen=50, min_output_seqlen=50, one2one_mapping=None, crop_to_seqlen=None, input_is_onehot=False, output_is_onehot=False)

Bases: object

Parameters:
  • text_blocks (Tuple[List[str], List[str]])

  • input_mapper (LemmatizerBMP | None)

  • output_mapper (LemmatizerBMP | None)

  • min_input_seqlen (int)

  • min_output_seqlen (int)

  • one2one_mapping (bool | None)

  • crop_to_seqlen (int | None)

  • input_is_onehot (bool)

  • output_is_onehot (bool)

__init__(text_blocks, input_mapper=None, output_mapper=None, min_input_seqlen=50, min_output_seqlen=50, one2one_mapping=None, crop_to_seqlen=None, input_is_onehot=False, output_is_onehot=False)

Methods

  • __init__(text_blocks[, input_mapper, ...])

  • compute_ds_CER([use_editdistance]): Compute the Character Error Rate (CER) of the dataset.

  • create_selfsupervised_ds(corpus, mapper[, ...])

  • from_parallel_txt_corpus(input_glob, ...)

  • load_icdar2019_parallel_txt_corpus(...)

  • load_parallel_txt_corpus(input_glob, output_glob)

  • render_sample([n, include_alphabet])

  • shuffle()

  • split([train_ratio, shuffle])

static load_icdar2019_parallel_txt_corpus(input_paths, max_insertions, min_length, max_length)
Return type:

List[Tuple[List[str], List[str]]]

Parameters:
  • input_paths (str | List[str])

  • max_insertions (int)

  • min_length (int)

  • max_length (int)

static load_parallel_txt_corpus(input_glob, output_glob, check_integrity='cleanup')
Return type:

List[Tuple[List[str], List[str]]]

Parameters:
  • input_glob (str | List[str])

  • output_glob (str | List[str])

  • check_integrity (Literal['cleanup', 'raise', 'ignore'])
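
The reference above does not say how files are paired or what each check_integrity mode does. The sketch below is one plausible reading, not the library's implementation: it pairs the sorted matches of two glob patterns and returns (input_lines, output_lines) tuples, matching the documented return type. The helper name pair_parallel_files is hypothetical, and so is the interpretation of the modes: 'raise' fails on a pair whose line counts differ, 'cleanup' drops such a pair, and 'ignore' keeps it. It accepts only a single glob string, not a list.

```python
import glob
import tempfile
from pathlib import Path
from typing import List, Tuple


def pair_parallel_files(input_glob: str, output_glob: str,
                        check_integrity: str = "cleanup"
                        ) -> List[Tuple[List[str], List[str]]]:
    """Hypothetical helper: pair files by sorted order and read their lines."""
    input_paths = sorted(glob.glob(input_glob))
    output_paths = sorted(glob.glob(output_glob))
    if len(input_paths) != len(output_paths):
        raise ValueError("the two globs matched a different number of files")

    pairs = []
    for in_path, out_path in zip(input_paths, output_paths):
        in_lines = Path(in_path).read_text(encoding="utf-8").splitlines()
        out_lines = Path(out_path).read_text(encoding="utf-8").splitlines()
        if len(in_lines) != len(out_lines):
            # ASSUMED semantics of the three modes; the docs do not define them.
            if check_integrity == "raise":
                raise ValueError(f"line count mismatch: {in_path} vs {out_path}")
            if check_integrity == "cleanup":
                continue  # drop the inconsistent pair
            # "ignore": fall through and keep the pair as it is
        pairs.append((in_lines, out_lines))
    return pairs


# Small demonstration on temporary files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.in.txt").write_text("l1\nl2\n", encoding="utf-8")
    (root / "a.out.txt").write_text("L1\nL2\n", encoding="utf-8")
    (root / "b.in.txt").write_text("x\ny\n", encoding="utf-8")
    (root / "b.out.txt").write_text("X\n", encoding="utf-8")  # one line short
    in_pattern = str(root / "*.in.txt")
    out_pattern = str(root / "*.out.txt")
    kept = pair_parallel_files(in_pattern, out_pattern)
    everything = pair_parallel_files(in_pattern, out_pattern,
                                     check_integrity="ignore")
```

In this sketch, kept holds only the consistent a pair, while everything also retains the mismatched b pair.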

static from_parallel_txt_corpus(input_glob, output_glob, **kwargs)
Return type:

Seq2SeqDs

Parameters:
  • input_glob (str | List[str])

  • output_glob (str | List[str])

static create_selfsupervised_ds(corpus, mapper, mapped_is_input=True, add_all_occuring_to_input=True, **kwargs)
Return type:

Seq2SeqDs

Parameters:
  • corpus (List[str])

  • mapper (LemmatizerBMP)

  • mapped_is_input (bool)

  • add_all_occuring_to_input (bool)

shuffle()
Return type:

None

split(train_ratio=0.8, shuffle=True)
Return type:

Tuple[Seq2SeqDs, Seq2SeqDs]

Parameters:
  • train_ratio (float)

  • shuffle (bool)
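
As a rough illustration of what a ratio-based split with optional shuffling typically does, here is a minimal sketch on a plain list. It does not use Seq2SeqDs. The function name split_samples, the seed parameter, and the choice to truncate the cut index with int() are assumptions made for the example; the library may round or shuffle differently.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_samples(samples: Sequence[T], train_ratio: float = 0.8,
                  shuffle: bool = True, seed: int = 0
                  ) -> Tuple[List[T], List[T]]:
    """Hypothetical helper: return (train, held_out) lists."""
    if not 0.0 <= train_ratio <= 1.0:
        raise ValueError("train_ratio must be between 0 and 1")
    items = list(samples)  # copy so the caller's data is left untouched
    if shuffle:
        random.Random(seed).shuffle(items)  # seeded for reproducibility
    cut = int(len(items) * train_ratio)  # ASSUMED rounding: truncate
    return items[:cut], items[cut:]


train, held_out = split_samples(list(range(10)), train_ratio=0.8, shuffle=False)
print(len(train), len(held_out))  # 8 2
```

With shuffle=False the order is preserved, so the first 80% of the samples form the training part and the rest are held out.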

compute_ds_CER(use_editdistance=False)

Compute the Character Error Rate (CER) of the dataset.

Return type:

float

Parameters:
  • use_editdistance (bool)
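
Character Error Rate is commonly defined as the edit distance between a hypothesis and its reference, divided by the length of the reference. The sketch below computes that quantity over a list of string pairs with a pure-Python Levenshtein distance. It is not the library's implementation: the function names are hypothetical, treating the first string of each pair as the hypothesis is an assumption, and so is aggregating over the whole corpus rather than averaging per sample. It also makes no claim about what use_editdistance switches.

```python
from typing import List, Sequence, Tuple


def levenshtein(a: Sequence, b: Sequence) -> int:
    """Edit distance with unit-cost insertion, deletion and substitution."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete item_a
                current[j - 1] + 1,                    # insert item_b
                previous[j - 1] + (item_a != item_b),  # substitute or match
            ))
        previous = current
    return previous[-1]


def corpus_cer(pairs: List[Tuple[str, str]]) -> float:
    """Total edit distance divided by total reference length.

    Each pair is (hypothesis, reference); this ordering is an assumption.
    """
    total_edits = sum(levenshtein(hyp, ref) for hyp, ref in pairs)
    total_ref_chars = sum(len(ref) for _, ref in pairs)
    return total_edits / total_ref_chars if total_ref_chars else 0.0


pairs = [("kitten", "sitting"), ("abc", "abc")]
print(corpus_cer(pairs))  # 0.3
```

Here the first pair needs three edits and the second none, and the references total ten characters, which gives 3 / 10.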

render_sample(n=0, include_alphabet=False)
Return type:

str

Parameters:
  • n (int)

  • include_alphabet (bool)