pylelemmatize.extract_transcription_from_page_xml
- pylelemmatize.extract_transcription_from_page_xml(xml_content, line_separator='\n', linesegment_separator='\t', ignore_deleted=True)
Extracts transcription from a PAGE XML document string.
- Parameters:
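xml_content (str) – The PAGE XML content as a string.
line_separator (str) – Separator placed between text lines. Defaults to '\n'.
linesegment_separator (str) – Separator placed between the segments of a single <TextLine>. Defaults to '\t'.
ignore_deleted (bool) – If True, text within <del> tags is ignored. Defaults to True.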
- Returns:
The full transcription. The segments of each <TextLine> are joined with linesegment_separator (a tab by default), and lines are separated by line_separator (a newline by default).
- Return type:
str
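The behaviour described above can be approximated with the standard library alone. The following is a minimal sketch, not the pylelemmatize implementation: the names sketch_extract, visible_text, local_name and SAMPLE are hypothetical. It assumes that every <Unicode> element below a <TextLine> counts as one line segment and that deleted text appears as <del> child elements inside <Unicode>. The real function may define segments and deletions differently.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Reduce a namespaced tag such as '{http://...}TextLine' to 'TextLine'."""
    return tag.rsplit("}", 1)[-1]


def visible_text(element, ignore_deleted=True):
    """Collect an element's text, optionally skipping <del> children."""
    parts = [element.text or ""]
    for child in element:
        if not (ignore_deleted and local_name(child.tag) == "del"):
            parts.append(visible_text(child, ignore_deleted))
        # The tail (text after the child) belongs to the parent either way.
        parts.append(child.tail or "")
    return "".join(parts)


def sketch_extract(xml_content, line_separator="\n",
                   linesegment_separator="\t", ignore_deleted=True):
    """Hypothetical stand-in mirroring the documented signature."""
    root = ET.fromstring(xml_content)
    lines = []
    for text_line in root.iter():
        if local_name(text_line.tag) != "TextLine":
            continue
        # Assumption: each <Unicode> below a <TextLine> is one segment.
        segments = [
            visible_text(node, ignore_deleted)
            for node in text_line.iter()
            if local_name(node.tag) == "Unicode"
        ]
        lines.append(linesegment_separator.join(segments))
    return line_separator.join(lines)


# Illustrative input only; real PAGE XML files carry coordinates and metadata.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page>
    <TextRegion>
      <TextLine><TextEquiv><Unicode>In nomine <del>dei</del>domini</Unicode></TextEquiv></TextLine>
      <TextLine>
        <Word><TextEquiv><Unicode>amen</Unicode></TextEquiv></Word>
        <Word><TextEquiv><Unicode>anno</Unicode></TextEquiv></Word>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

print(repr(sketch_extract(SAMPLE)))
print(repr(sketch_extract(SAMPLE, ignore_deleted=False)))
```

Tags are matched by local name so that the sketch works regardless of which PAGE schema version the namespace declares. With the library installed, the equivalent call would presumably be pylelemmatize.extract_transcription_from_page_xml(xml_content) with the same keyword arguments.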