pylelemmatize.extract_transcription_from_page_xml

pylelemmatize.extract_transcription_from_page_xml(xml_content, line_separator='\n', linesegment_separator='\t', ignore_deleted=True)

Extracts transcription from a PAGE XML document string.

Parameters:
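  • xml_content (str) – The PAGE XML content as a string.

  • line_separator (str) – String placed between consecutive text lines in the output. Defaults to '\n'.

  • linesegment_separator (str) – String used to join the segments within a single <TextLine>. Defaults to '\t'.

  • ignore_deleted (bool) – If True, text inside <del> tags is omitted from the output. Defaults to True.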

Returns:

The full transcription, with the segments of each <TextLine> joined by linesegment_separator (a tab by default) and lines separated by line_separator (a newline by default).

Return type:

str
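
The effect of the two separators can be illustrated with a small stand-in written against the standard library. This is only a sketch under stated assumptions, not pylelemmatize's implementation: the helper name sketch_extract is hypothetical, it assumes the 2013-07-15 PAGE namespace, it treats each <Word> child of a <TextLine> as one line segment, and it does not model ignore_deleted at all.

```python
import xml.etree.ElementTree as ET

# Assumption: the document uses the 2013-07-15 PAGE schema namespace.
PAGE_NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}


def sketch_extract(xml_content, line_separator="\n", linesegment_separator="\t"):
    """Hypothetical stand-in for illustration; NOT the library's code."""
    root = ET.fromstring(xml_content)
    lines = []
    for text_line in root.iterfind(".//pc:TextLine", PAGE_NS):
        # Assumption: every <Word> child counts as one segment of the line.
        segments = [
            el.text or ""
            for el in text_line.iterfind("pc:Word/pc:TextEquiv/pc:Unicode", PAGE_NS)
        ]
        if not segments:
            # No <Word> children: fall back to the line-level transcription.
            el = text_line.find("pc:TextEquiv/pc:Unicode", PAGE_NS)
            segments = [el.text or ""] if el is not None else []
        lines.append(linesegment_separator.join(segments))
    return line_separator.join(lines)


SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page>
    <TextRegion>
      <TextLine>
        <Word><TextEquiv><Unicode>In</Unicode></TextEquiv></Word>
        <Word><TextEquiv><Unicode>nomine</Unicode></TextEquiv></Word>
      </TextLine>
      <TextLine>
        <TextEquiv><Unicode>domini amen</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

default_output = sketch_extract(SAMPLE)
custom_output = sketch_extract(SAMPLE, line_separator=" / ", linesegment_separator=" ")
print(repr(default_output))  # 'In\tnomine\ndomini amen'
print(repr(custom_output))   # 'In nomine / domini amen'
```

With the default separators the result has the shape described under Returns: tabs inside a line, newlines between lines. How the real function delimits segments and locates <del> markup may differ from this sketch.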