PDF Formatting Issues

PDF script issues and how to handle them.

Occasionally, Pozotron cannot parse a PDF script correctly. If Pozotron’s pre-scan of your script finds possible formatting issues, it may return an error message suggesting that you upload a cleaner version of the script. If possible, get a version that hasn’t been compressed and that was converted to PDF with its native fonts embedded. A PDF can look fine to the eye, but if its content can’t be read as text, search won’t work inside the document, copying and pasting from the PDF into another document won’t work correctly, and Pozotron can’t read it as text either.

A quick formatting test

One way to check what Pozotron’s transcription will look like is to upload your script and then export the full text (using the ‘full text’ option in the export menu). That option exports an .html file of the raw text from Pozotron’s transcription. If any words haven’t transcribed properly, you’ll see that as you scan through the document. There is no charge for this test; it’s available from the main menu as a courtesy for anyone to use.

Font ligatures

Occasionally, PDFs have a font issue called ligatures, where two text characters touch each other. Without whitespace fully surrounding each character, Pozotron’s PDF parser cannot detect the characters separately. Ligatures are most often seen with the letter combinations “fi”, “ff” and “if”, but this depends on the font used in the script. When a ligature is present, Pozotron will show the two characters as an unreadable section of text (usually a blank space with a red underline) and add an annotation for an added word. To resolve this, follow the steps on our PDF OCR Error support page.
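In some PDFs, a ligature survives copy and paste as a single Unicode character rather than as two separate letters. If that is the case in your script, a minimal Python sketch like the one below can locate those characters in pasted text. It is only an illustration: find_ligatures is a hypothetical helper, it says nothing about how Pozotron’s own parser works, and it will not catch ligatures that exist only as glyph shapes with no text mapping.

```python
import unicodedata

# Unicode "Alphabetic Presentation Forms" ligatures: U+FB00 (ff) through U+FB06 (st).
LIGATURE_RANGE = range(0xFB00, 0xFB07)


def find_ligatures(text):
    """Return (index, ligature, expanded) for each ligature code point in text."""
    hits = []
    for index, char in enumerate(text):
        if ord(char) in LIGATURE_RANGE:
            # NFKC compatibility normalization expands a ligature into plain letters.
            hits.append((index, char, unicodedata.normalize("NFKC", char)))
    return hits


# Sample text containing the single-character "fi" and "ff" ligatures.
sample = "The \ufb01nal o\ufb00er"
for index, char, expanded in find_ligatures(sample):
    print(f"position {index}: {char!r} expands to {expanded!r}")
```

An empty result does not prove the script is free of ligature problems; it only means none were stored as these specific Unicode characters.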

Graphics in your script

For the most part, Pozotron simply skips over graphics in a PDF script. However, text that is presented as a graphic can cause a slight discrepancy. For example, if “Introduction”, “Chapter One” or “No Trespassing” is the header, headline or chapter name of a section but appears as an illustration or graphic rather than as actual letters, Pozotron will skip it because it isn’t readable as text. That’s why a one-button click always opens the original script, so you can double-check whenever the software thinks you’re adding those words to the project. Similarly, part of a script may be presented as handwriting (a handwritten letter or sign, for example); if that writing can’t be parsed by a text editor (like Word), Pozotron won’t pick it up either.

Copy and paste test for text formatting issues

If you’re not sure whether text is computer-readable, you can do a quick copy and paste test. Drag over all of the text on a PDF page and hit ‘copy’. Then, in Word or another text editor, hit ‘paste’. If the pasted text is not what you expect to see, there is a missing font, compression issue, or security issue in your PDF script. If a text editor can’t understand the content, Pozotron’s reader can’t either.
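For a long script, eyeballing every pasted page can be tedious. The Python sketch below shows one rough way to flag pasted text that looks like debris instead of words. The function names and the 5% cutoff are assumptions made for illustration, not part of Pozotron, and the check is deliberately crude: it only counts replacement, private-use, control and unassigned characters.

```python
import unicodedata


def suspicious_ratio(text):
    """Fraction of non-whitespace characters that look like extraction debris."""
    visible = [c for c in text if not c.isspace()]
    if not visible:
        return 1.0  # nothing pasted at all counts as unreadable
    bad = 0
    for c in visible:
        category = unicodedata.category(c)
        # Co = private use, Cc = control, Cn = unassigned.
        # U+FFFD is the Unicode replacement character.
        if c == "\ufffd" or category in ("Co", "Cc", "Cn"):
            bad += 1
    return bad / len(visible)


def looks_readable(text, threshold=0.05):
    """Hypothetical cutoff: readable if under 5% of characters look like debris."""
    return suspicious_ratio(text) < threshold


print(looks_readable("Chapter One. It was a quiet morning."))  # True
print(looks_readable("\ufffd\ufffd\ue000\ue001 \ufffd"))        # False
```

Text can pass this check and still be wrong (for example, real letters in the wrong order), so it complements the visual check and does not replace it.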

PDFs with columns

When PDFs are laid out in columns, Pozotron can have difficulty working out where the column breaks are, because it reads from left to right. That makes books with columns poor candidates for the digital accuracy check. Graphics interspersed with the columns make a clean transcript even less likely. One workaround is to duplicate each page with columns and then black out or redact one side at a time, which effectively produces a linear version of the script that Pozotron has no issue with.
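To see why columns scramble a left-to-right reader, consider this small Python sketch. The word positions are a made-up two-column layout and both functions are hypothetical; it illustrates the reading-order problem in general terms and is not a description of Pozotron’s parser.

```python
# Each word is (x, y, text): x grows to the right, y grows down the page.
# Hypothetical layout: two columns, two lines each.
words = [
    (0, 0, "The"), (1, 0, "storm"), (10, 0, "Morning"), (11, 0, "came"),
    (0, 1, "passed"), (1, 1, "overnight."), (10, 1, "bright"), (11, 1, "again."),
]


def read_across(words):
    """Naive order: top to bottom, left to right across the full page width."""
    ordered = sorted(words, key=lambda w: (w[1], w[0]))
    return " ".join(text for _, _, text in ordered)


def read_by_column(words, split_x):
    """Column-aware order: everything left of split_x first, then the rest."""
    left = [w for w in words if w[0] < split_x]
    right = [w for w in words if w[0] >= split_x]
    return read_across(left) + " " + read_across(right)


print(read_across(words))
# The storm Morning came passed overnight. bright again.
print(read_by_column(words, split_x=5))
# The storm passed overnight. Morning came bright again.
```

Blacking out one column at a time has the same effect as read_by_column: each pass sees only one column, so reading across the page can no longer interleave the two.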

PDFs with double page spreads

Some PDFs include two book pages on a single PDF page (a ‘double spread’). The solution is to duplicate the double spread pages and then crop out all of one side on the even pages and all of the other side on the odd pages. This can be done quite quickly in Adobe Acrobat and is an effective way to make single-spread scripts that work well with Pozotron.
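If you prefer to script the crop instead of doing it by hand, the arithmetic is simple. The Python sketch below only computes the two crop rectangles for one spread; split_spread is a hypothetical helper, it assumes the gutter sits exactly at the horizontal midpoint of the page, and applying the boxes to a real file is left to whichever PDF tool you use.

```python
def split_spread(left, bottom, right, top):
    """Return (left_page_box, right_page_box) for a two-page spread.

    Boxes are (left, bottom, right, top) in PDF points, the same order
    PDF page boxes use. Assumes the gutter is at the horizontal midpoint.
    """
    middle = (left + right) / 2
    return (left, bottom, middle, top), (middle, bottom, right, top)


# A landscape US Letter spread: 792 x 612 points (11 x 8.5 inches at 72 points per inch).
left_box, right_box = split_spread(0, 0, 792, 612)
print(left_box)   # (0, 0, 396.0, 612)
print(right_box)  # (396.0, 0, 792, 612)
```

Applying the first box to one copy of each page and the second box to the other copy mirrors the duplicate-and-crop steps described above.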

PDFs with captions and footnotes

Captions and footnotes are fine inside Pozotron if they are read in the order they appear in the book. Reading footnotes inline (bringing them up from the bottom of the page and reading them where the cue appears in the text) will cause issues with the matching analysis. Skipping footnotes or captions will also trigger misread detection. Your best bet for accurate reporting is to black out or redact, in advance, any footnotes and captions that the narrator won’t be reading, though ignoring the resulting errors on the annotation page also works fine for final reporting.
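The effect of a skipped footnote can be illustrated with a generic word-level comparison. The Python sketch below uses the standard library’s difflib as a stand-in; it is an assumption-laden illustration, not Pozotron’s matching analysis, and the sample sentences and the missing_words helper are invented for the example.

```python
import difflib


def missing_words(script_words, narrated_words):
    """Return runs of script words that have no counterpart in the narration."""
    matcher = difflib.SequenceMatcher(a=script_words, b=narrated_words, autojunk=False)
    missing = []
    for tag, a_start, a_end, _b_start, _b_end in matcher.get_opcodes():
        # "delete" and "replace" both mean script words were not matched as written.
        if tag in ("delete", "replace"):
            missing.append(" ".join(script_words[a_start:a_end]))
    return missing


# Script text with a footnote ("1 See appendix B.") that the narrator skips.
script = "The treaty was signed in 1848. 1 See appendix B. Peace followed.".split()
narrated = "The treaty was signed in 1848. Peace followed.".split()

print(missing_words(script, narrated))  # ['1 See appendix B.']
```

Redacting the footnote before upload removes those words from the script side of the comparison, so nothing is left to flag.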
