How to process PDF files in WorldServer

Article Number:000003148 | Last Updated:1/18/2017 10:05 AM

Scope/Environment

WorldServer

Symptoms/Context

When trying to segment a PDF file in my WorldServer environment, I receive an error saying that this file format is not supported. How can I process my PDF files in WorldServer?

Resolution

The PDF file format is not supported in WorldServer as a native file format. However, WorldServer provides an Automatic Action (AA) to convert PDF files to DOCX files within a normal workflow.
Note: This action requires that the FTS Server is installed and configured.

Starting with WorldServer 11.7, this AA is named Convert PDF to DOCX (LIB). In versions before 11.7, it is called Convert PDF to Word 2007-2010 (LIB). Note that Task History entries still refer to the old name, Convert PDF to Word 2007-2010 (LIB).

This Automatic Action converts PDF documents to Microsoft Word (DOCX) format. The converted documents can then be translated as DOCX files. Add the PDF conversion Automatic Action to a workflow before the Segment Asset step, so that it is executed as part of a project. The AA detects PDF files and converts them to DOCX files. It does not convert other file formats, so you can place it in any Workflow you are using: if the source file is not a PDF, the step is skipped. Here is an example of a simple Workflow containing the Convert PDF to Word 2007-2010 (LIB) Automatic Action:

[Image: example workflow containing the Convert PDF to Word 2007-2010 (LIB) Automatic Action]

By default, when the conversion from PDF to DOCX completes, the Automatic Action creates a new task that continues executing the workflow using the DOCX file as an input asset. The original task that performed the PDF conversion is marked as cancelled.

IMPORTANT: WorldServer does not convert documents from DOCX back to PDF format. This means that the target file will be in DOCX format. A conversion back to PDF will need to be done outside of WorldServer and through an external editor as needed.

The conversion runs on the server where the File Type Support (FTS) server service is installed. WorldServer versions up to and including 11.8.0 use Solid Converter. For more information about that tool, visit the Solid Documents Help Site. Starting with version 11.8.1 (FTS 11.8.0.48), the software used to convert PDF files changed from Solid to Sautin and Aspose. This is the same converter that Studio 2022 uses for its PDF file type. See also: PDF Assistant for Trados Studio 2022

Installation: The Convert PDF to DOCX (LIB) Automatic Action is distributed in autoaction_libraries.zip in the SDK, together with other supported libraries. The steps below assume that this library is already installed, so check this first. If this Automatic Action is not present in your environment, upload it by following the steps in this article: How to upload Automatic Actions from the SDK to WorldServer

For WorldServer versions starting from version 11.8.1:

As explained above, starting with version 11.8.1, the software used to convert PDF files changed from Solid to Sautin and Aspose. Sautin is used if the Alternative Processing setting is set to No (default). If it is set to Yes, Aspose is used. More details about the configuration can be found in this article: PDF Assistant for Trados Studio 2022
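The converter selection described above can be summarised in a short Python sketch. This is only an illustration of the rule stated in this article, not WorldServer code: the function names and the plain-string version format are assumptions made for the example.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted version string such as '11.8.1' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def converter_for(worldserver_version: str, alternative_processing: bool = False) -> str:
    """Return the converter name this article associates with a version and setting.

    Hypothetical helper: versions before 11.8.1 use Solid; from 11.8.1 onward,
    Sautin is used by default and Aspose when Alternative Processing is Yes.
    """
    if parse_version(worldserver_version) < (11, 8, 1):
        return "Solid"
    return "Aspose" if alternative_processing else "Sautin"
```

For example, converter_for("11.8.1") gives "Sautin", while converter_for("11.8.1", alternative_processing=True) gives "Aspose".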

When the "Spawn New Task" option is set to Yes (default), a new task will be spawned to continue processing the converted DOCX file. It should be set to "No" only when this action is standalone within the workflow and no further processing is required.

The "Output AIS Path" parameter is used to configure the desired output location. When empty (default), the converted docx file will be written to the same location as the source PDF file with only the extension changed from .pdf to .docx. To specify another location, enter a valid AIS folder. (Only file system-based AIS mounts are supported.)
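As a rough illustration of the default naming rule, the following Python sketch derives the expected DOCX path from a source PDF path. It is not part of WorldServer. The function name and the example AIS paths are hypothetical, and the assumption that the file name is kept when a custom output folder is given is ours, not something this article confirms.

```python
import posixpath
from typing import Optional


def expected_docx_path(source_path: str, output_ais_folder: Optional[str] = None) -> str:
    """Predict where the converted DOCX file is written (illustrative only)."""
    folder, name = posixpath.split(source_path)
    stem, ext = posixpath.splitext(name)
    if ext.lower() != ".pdf":
        raise ValueError("not a PDF file: " + source_path)
    if output_ais_folder:
        # Assumption: the original file name is kept in the custom folder.
        target_folder = output_ais_folder.rstrip("/") or "/"
    else:
        # Default: same location as the source, only the extension changes.
        target_folder = folder
    return posixpath.join(target_folder, stem + ".docx")
```

With the hypothetical path /FileSystem/Project/Source/manual.pdf and an empty Output AIS Path, the sketch returns /FileSystem/Project/Source/manual.docx.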

Use the Layout option to configure the converter behavior:
- Flowing: recover page layout, columns and graphics, and preserve text flow. This is the default setting.
- Continuous: detect layout and columns, but only recover formatting and graphics, and preserve text flow.
- Exact: recover the exact page presentation using text boxes in Microsoft Word.

The Alternative Processing option switches to another converter (better for non-Latin-based languages). The default setting is No. When set to Yes, the conversion uses Aspose instead of Sautin.

We would like to point out a few limitations and potential solutions:

•   If you use Asian languages or other non-Latin-based languages as source languages, we recommend setting the Alternative Processing option to Yes (better for non-Latin-based languages).
•   Support for scanned PDF documents using OCR (optical character recognition) is limited out of the box. If a PDF file contains merely a scanned picture of the underlying document, then the new technology will not be able to convert the document. If, on the other hand, the document is scanned but the text in it is selectable, then the technology will attempt to convert the characters within the document. You can test this in Adobe Reader, for example. If it's possible to select any text in the document, then the technology will attempt to convert it.
•   If you need more advanced support for scanned PDF documents, we recommend the following options:
o   If you use Microsoft Word, you can use its built-in PDF conversion - it accepts PDF files, including OCRed, for opening files and can save them out in Word .docx format which you can then process as normal.
o   Adobe Reader also has a built-in function to save PDF documents in Microsoft Word format, which can be purchased as a subscription.
o   Alternatively, consider purchasing a third-party solution, such as Abbyy Fine Reader or Readiris, that can convert OCR'ed PDF documents to Microsoft Word format. These options are available as perpetual licenses or on subscription.
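If you want to pre-check many files instead of opening each one in Adobe Reader, the "is any text selectable?" test can be approximated with a script. The Python sketch below is one possible approach and is not related to the converter used by WorldServer. It assumes the third-party pypdf package for text extraction, the helper names are hypothetical, and extracted text is only an approximation of what a PDF viewer lets you select.

```python
from typing import Iterable, List


def has_selectable_text(page_texts: Iterable[str], min_chars: int = 1) -> bool:
    """Return True if the pages contain at least min_chars non-whitespace characters."""
    total = sum(len("".join(text.split())) for text in page_texts)
    return total >= min_chars


def extract_page_texts(pdf_path: str) -> List[str]:
    """Extract the text layer of each page; requires 'pip install pypdf'."""
    from pypdf import PdfReader  # imported here so the check above works without pypdf

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]
```

A call such as has_selectable_text(extract_page_texts("manual.pdf")) that returns False suggests an image-only scan, which would need one of the OCR options listed above before conversion.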

For WorldServer versions before 11.8.1:

You can set several options for the Convert PDF to Word 2007-2010 (LIB) Automatic Action:

Image recovery
Default: Automatic Anchoring
Anchor the image to the nearest paragraph or the page using the automatically calculated offsets.

Other options:

Anchor to Paragraph
Anchor the images to the nearest paragraph
Anchor to Page
Anchor the images to the page using the automatically calculated offsets.
Remove Images
Remove all images that are not inline.

Headers and footers
Default: Recover as headers and/or footers - Detect headers and footers and add them as Word headers and footers in the output document.

Other options:

Place in the body of the document
Handle headers and footers as ordinary text and put them into the body of the output document.
Remove
Detect headers and footers but remove them from the output document.

Table detection
Detect Tables
Default = Yes - Recognize PDF tables and create Word tables from them.

Recognize PDF text
You can control the conversion of symbols when missing or incorrect font encoding is detected during the conversion to Word. Adjusting these options may help when the text or some symbols in the converted file look garbled. Note: PDF conversion runs approximately 15 times slower when PDF text recognition is enabled, especially with the Every character option, so avoid enabling it unless it is really needed.

Default: None
Do not apply optical text recovery.
Problem characters only
Let the application determine where optical text recovery is needed.
Every character
Apply optical text recovery to all text. If you need to use this feature, you may also need to increase the fts_proxy_filter_process_timeout setting in general.properties to more than 30 minutes, because WorldServer terminates long-running PDF conversions once the timeout is exceeded.
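Before changing the timeout, it can help to see what is currently configured. The Python sketch below reads the fts_proxy_filter_process_timeout entry from the text of a general.properties file. It is a simplified illustration: the parser handles only plain key=value lines, the function names are hypothetical, and the value is returned as a raw string because this article does not state the unit the setting uses.

```python
from typing import Dict, Optional

TIMEOUT_KEY = "fts_proxy_filter_process_timeout"


def read_properties(text: str) -> Dict[str, str]:
    """Parse simple key=value lines, skipping blank lines and # or ! comments."""
    props: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, separator, value = line.partition("=")
        if separator:
            props[key.strip()] = value.strip()
    return props


def current_timeout(properties_text: str) -> Optional[str]:
    """Return the configured timeout as written in the file, or None if it is not set."""
    return read_properties(properties_text).get(TIMEOUT_KEY)
```

To use it, pass in the contents of the file, for example current_timeout(open("general.properties").read()), with the path adjusted to your installation.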

Other arguments

Output AIS path
Default = Empty
Specify the desired output location of the converted DOCX file. When empty, the converted .docx file will be written to the same location as the source PDF file with only the extension changed from PDF to DOCX. To specify other locations enter a valid AIS folder.
Note: Only file system-based AIS mounts are supported.

Spawn new task
Default = Yes
Determine whether a new task should be spawned after the conversion from PDF to DOCX is completed. Set this option to No when you only require conversion from PDF to DOCX, with no further processing of the resulting DOCX file in the same workflow.

Return values
The AA returns DONE in all cases.

We would like to point out a few limitations and potential solutions:

•   Support for scanned PDF documents using OCR (optical character recognition) is limited out of the box. If a PDF file contains merely a scanned picture of the underlying document, then the new technology will not be able to convert the document. If, on the other hand, the document is scanned but the text in it is selectable, then the technology will attempt to convert the characters within the document. You can test this in Adobe Reader, for example. If it's possible to select any text in the document, then the technology will attempt to convert it.

•   If you need more advanced support for scanned PDF documents, we recommend the following options:
o   If you use Microsoft Word, you can use its built-in PDF conversion - it accepts PDF files, including OCRed, for opening files and can save them out in Word .docx format which you can then process as normal.
o   Adobe Reader also has a built-in function to save PDF documents in Microsoft Word format, which can be purchased as a subscription.
o   Alternatively, consider purchasing a third-party solution, such as Abbyy Fine Reader or Readiris, that can convert OCR'ed PDF documents to Microsoft Word format. These options are available as perpetual licenses or on subscription.

Reference

Relevant articles:

PDF Assistant for Trados Studio 2022
WorldServer: Auto error at 'Convert PDF to Word 2007-2010' - PDF file conversion failed with error 'Converter output end-'
