When trying to segment a PDF file in my WorldServer environment, I receive an error saying that this file format is not supported. How can I process my PDF files in WorldServer?
The PDF file format is not supported in WorldServer as a native file format.
WorldServer provides an automatic action (AA) to convert PDF files to DOCX files within a normal workflow.
The Convert PDF to Word 2007-2010 AA converts PDF documents to the Microsoft Word 2007-2010 (DOCX) file format.
The converted documents can then be translated as DOCX files. You can add the PDF conversion AA to a workflow before the Segment Asset step, where it is executed as part of a project. The AA detects PDF files and converts them to DOCX files. It does not convert other file formats, so you can place this Automatic Action in any workflow you are using: if the source file is not a PDF, the step is skipped. Here is an example of a simple workflow containing the Convert PDF to Word 2007-2010 (LIB) Automatic Action:
By default, when the conversion from PDF to DOCX completes, the AA creates a new task that continues executing the workflow using the DOCX file as the input asset. The original task that performed the PDF conversion is marked as canceled.
Note: WorldServer does not convert documents from DOCX to PDF so the target file will be in DOCX format. A conversion back to PDF will need to be done outside of WorldServer and through an external editor as needed.
The File Type Support (FTS) server performs the actual conversion. For more information about the conversion tool, visit the Solid Documents help site:
Note: Some information might be lost during the PDF to Word conversion. We recommend that you review the converted DOCX file for correctness. For example, you can add a Review human step to the workflow after the PDF to DOCX conversion step. This allows the reviewer to make any necessary corrections to the converted document before the translation process continues.
The Convert PDF to Word 2007-2010 Automatic Action is distributed in autoaction_libraries.zip in the SDK, together with other supported libraries. The steps below assume that this file has already been installed, so please verify this before you continue.
If this Automatic Action is not present in your environment, please upload it following the steps in this article: How to upload Automatic Actions from the SDK to WorldServer
We recommend uploading the sample workflow from the knowledge base as an example (the file is attached to this article).
1. Log in to WorldServer as Administrator.
2. Navigate to Management > Administration > Import Objects.
3. Browse to the location of the ConvertPDFSampleWorkflow.tml file.
4. Select the file and click Upload File to install the workflow.
PDF Converter Arguments
You can set several options for the PDF Converter AA.
Anchor the image to the nearest paragraph or to the page using the automatically calculated offsets.
Anchor to Paragraph
Anchor the images to the nearest paragraph.
Anchor to Page
Anchor the images to the page using the automatically calculated offsets.
Remove all images that are not inline.
Headers and footers
Recover as headers and/or footers
Detect headers and footers and add them as Word headers and footers in the output document.
Place in the body of the document
Handle headers and footers as ordinary text and put them into the body of the output document.
Detect headers and footers but remove them from the output document.
Default = Yes
Recognize PDF tables and create Word tables from them.
PDF text recognition
You can control the conversion of symbols when there is missing or incorrect font encoding detected during the conversion process to Word. Adjusting these options may help when the text or some symbols in the converted file look garbled.
Note: PDF conversion runs approximately 15 times slower when you enable PDF text recognition, especially if you use the Every character option. Avoid enabling PDF text recognition unless you need it.
Do not apply optical text recovery.
Problem characters only
Let the application determine where optical text recovery is needed.
Every character
Apply optical text recovery to all text. If you need to use this feature, you may also need to increase the fts_proxy_filter_process_timeout setting in general.properties to more than 30 minutes. WorldServer terminates long-running PDF conversions once the timeout is exceeded.
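As a rough planning aid for the slowdown and timeout described above, the arithmetic can be sketched in Python. This is only an illustration: the function names are hypothetical, the factor of 15 is the approximate figure from the note above, and the 30-minute value is assumed here as the timeout to compare against. The sketch works purely in minutes and says nothing about the unit in which fts_proxy_filter_process_timeout is stored.

```python
# Approximate slowdown with PDF text recognition enabled (figure from the note above).
SLOWDOWN_FACTOR = 15


def estimated_minutes_with_recognition(baseline_minutes: float) -> float:
    """Estimate conversion time with text recognition, given the time without it."""
    return baseline_minutes * SLOWDOWN_FACTOR


def exceeds_timeout(baseline_minutes: float, timeout_minutes: float = 30.0) -> bool:
    """Return True if the estimated conversion time is longer than the timeout.

    timeout_minutes defaults to 30 as an assumption for this sketch only.
    """
    return estimated_minutes_with_recognition(baseline_minutes) > timeout_minutes


if __name__ == "__main__":
    # A file that converts in 3 minutes without recognition is estimated at 45 minutes.
    print(estimated_minutes_with_recognition(3))
    print(exceeds_timeout(3))
```

Under these assumptions, a PDF that normally converts in 3 minutes would be estimated at about 45 minutes with text recognition enabled, which is longer than a 30-minute timeout.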
Output AIS path
Default = Empty
Specify the desired output location of the converted DOCX file. When empty, the converted DOCX file is written to the same location as the source PDF file, with only the extension changed from PDF to DOCX. To specify another location, enter a valid AIS folder.
Note: Only file system based AIS mounts are supported.
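The output location rule can be illustrated with a short Python sketch. It is not WorldServer code: the function name and the example paths are hypothetical, AIS paths are assumed to be slash-separated, and the sketch assumes that the file name is kept (with a .docx extension) when an output folder is specified.

```python
from pathlib import PurePosixPath
from typing import Optional


def converted_output_path(source_path: str, output_folder: Optional[str] = None) -> str:
    """Return the path where the converted DOCX file would be written.

    With no output folder, the DOCX file sits next to the source PDF and
    only the extension changes. With an output folder, this sketch assumes
    the same file name is used inside that folder.
    """
    source = PurePosixPath(source_path)
    docx_name = source.with_suffix(".docx").name
    folder = PurePosixPath(output_folder) if output_folder else source.parent
    return str(folder / docx_name)


if __name__ == "__main__":
    # Hypothetical AIS paths, for illustration only.
    print(converted_output_path("/FileSystem/Project/Source/manual.pdf"))
    print(converted_output_path("/FileSystem/Project/Source/manual.pdf",
                                "/FileSystem/Project/Converted"))
```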
Spawn new task
Default = Yes
Determine whether a new task should be spawned after the conversion from PDF to DOCX completes. Set this option to No when you only require conversion from PDF to DOCX, with no further processing of the resulting DOCX file in the same workflow.
Page Layout
Default = Flowing
This argument is not exposed in the AA and has no other options available at this time. The Flowing default recovers page layout, columns, formatting, graphics, and preserves text flow.
Note: The Page Layout argument does appear in the SDL Trados Studio implementation of the PDF converter.
The AA returns DONE in all cases.