When trying to segment a PDF file in my WorldServer environment, I receive an error saying that this file format is not supported. How can I process my PDF files in WorldServer? |
WorldServer does not support PDF as a native file format. However, it provides an Automatic Action (AA) that converts PDF files to DOCX files within a normal workflow.

Note: This action requires that the FTS Server is installed and configured.

Starting from WorldServer 11.7, this AA is named Convert PDF to DOCX (LIB). In versions before 11.7, it is called Convert PDF to Word 2007-2010 (LIB). Note, however, that Task History entries still refer to the old name, Convert PDF to Word 2007-2010 (LIB).

This Automatic Action converts PDF documents to Microsoft Word (DOCX) format. The converted documents can then be translated as DOCX files.

You can add the PDF conversion Automatic Action to a workflow before the Segment Asset step, where it is executed as part of a project. The AA detects PDF files and converts them to DOCX files. It does not convert other file formats, so you can place it in any workflow you are using: if the source file is not a PDF, the step is skipped.

Here is an example of a simple workflow containing the Convert PDF to Word 2007-2010 (LIB) Automatic Action:
[Workflow screenshot]

By default, when the conversion from PDF to DOCX completes, the Automatic Action creates a new task that continues executing the workflow with the DOCX file as the input asset. The original task that performed the PDF conversion is marked as cancelled.

IMPORTANT: WorldServer does not convert documents from DOCX back to PDF. The target file will therefore be in DOCX format. If a PDF is needed, the conversion back must be done outside of WorldServer, in an external editor.

The conversion happens on the server where the File Type Support (FTS) server service is installed. In WorldServer versions up to 11.8.0, the conversion tool was Solid Converter. For more information about that tool, visit the Solid Documents Help Site.
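The workflow behaviour described above (non-PDF assets are skipped; a PDF is converted, the original task is cancelled, and a new task continues with the DOCX) can be pictured with a minimal Python sketch. This is not WorldServer code: `StepResult` and `run_conversion_step` are hypothetical names, and detecting PDFs by a case-insensitive `.pdf` extension is an assumption made only for illustration.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass
class StepResult:
    """Hypothetical summary of what the conversion step did."""
    action: str                 # "skipped" or "converted"
    original_task_status: str   # "active" or "cancelled"
    next_asset: str             # asset the workflow continues with


def run_conversion_step(asset_path: str) -> StepResult:
    """Model the default behaviour of the PDF conversion step.

    Assumption: a PDF is recognised by its file extension.
    """
    path = PurePosixPath(asset_path)
    if path.suffix.lower() != ".pdf":
        # Not a PDF: the step is skipped and the same task carries on.
        return StepResult("skipped", "active", asset_path)
    # PDF: the original task is cancelled and a new task
    # continues the workflow with the converted DOCX asset.
    docx_path = str(path.with_suffix(".docx"))
    return StepResult("converted", "cancelled", docx_path)
```

Calling `run_conversion_step("/fs/project/guide.pdf")` reports a cancelled original task and `/fs/project/guide.docx` as the next asset, while a `.docx` input passes through unchanged.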
Starting from version 11.8.1 (FTS 11.8.0.48), the software used to convert PDF files changed from Solid to Sautin and Aspose. These are the same converters that Trados Studio 2022 uses for its PDF file type. See also: PDF Assistant for Trados Studio 2022

Installation:
The Convert PDF to DOCX (LIB) Automatic Action is distributed in autoaction_libraries.zip in the SDK, together with other supported libraries. The steps below assume that this file has been installed, so please check this first. If the Automatic Action is not present in your environment, upload it by following the steps in this article: How to upload Automatic Actions from the SDK to WorldServer

For WorldServer versions starting from version 11.8.1:
As explained above, starting from version 11.8.1, the software used to convert PDF files changed from Solid to Sautin and Aspose. Sautin is used if the Alternative Processing setting is No (default). If it is set to Yes, Aspose is used. More details about the configuration can be found in this article: PDF Assistant for Trados Studio 2022

When the Spawn New Task option is set to Yes (default), a new task is spawned to continue processing the converted DOCX file. Set it to No only when this action is standalone within the workflow and no further processing is required.

The Output AIS Path parameter configures the output location. When it is empty (default), the converted DOCX file is written to the same location as the source PDF file, with only the extension changed from .pdf to .docx. To specify another location, enter a valid AIS folder. (Only file system-based AIS mounts are supported.)

Use the Layout option to configure the converter behavior:
- Flowing: recover page layout, columns and graphics, and preserve text flow. This is the default setting.
- Continuous: detect layout and columns, but only recover formatting and graphics, and preserve text flow.
- Exact: recover the exact page presentation using text boxes in Microsoft Word.

The Alternative Processing option switches to another converter (better for non-Latin-based languages). The default setting is No. When it is set to Yes, the alternative converter is based on Aspose instead of Sautin.

We would like to point out a few limitations and potential solutions:
• If you use Asian languages or other non-Latin-based languages as source languages, we recommend setting Alternative Processing to Yes (better for non-Latin-based languages).
• Support for scanned PDF documents using OCR (optical character recognition) is limited out of the box. If a PDF file contains merely a scanned picture of the underlying document, the converter will not be able to convert it. If, on the other hand, the document is scanned but its text is selectable, the converter will attempt to convert the characters in the document. You can test this in Adobe Reader, for example: if it is possible to select any text in the document, the converter will attempt to convert it.
• If you need more advanced support for scanned PDF documents, we recommend the following options:
o If you use Microsoft Word, you can use its built-in PDF conversion. It can open PDF files, including OCR'ed ones, and save them in Word .docx format, which you can then process as normal.
o Adobe Reader also has a built-in function to save PDF documents in Microsoft Word format, which can be purchased as a subscription.
o Alternatively, consider purchasing a third-party solution, such as ABBYY FineReader or Readiris, that can convert OCR'ed PDF documents to Microsoft Word format. These options are available as perpetual licenses or on subscription.
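The option behaviour described above can be summarised in a short Python sketch. It is an illustration only, not WorldServer's API: `ConvertOptions`, `select_converter` and `output_path` are hypothetical names, the version is assumed to be passed as a tuple such as `(11, 8, 1)`, and keeping the source file name when a custom Output AIS Path is set is an assumption.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath

LAYOUTS = ("Flowing", "Continuous", "Exact")


@dataclass
class ConvertOptions:
    """Hypothetical mirror of the Automatic Action's parameters and defaults."""
    spawn_new_task: bool = True          # "Spawn New Task": Yes by default
    output_ais_path: str = ""            # empty: write next to the source PDF
    layout: str = "Flowing"              # Flowing (default), Continuous, Exact
    alternative_processing: bool = False  # No by default

    def __post_init__(self) -> None:
        if self.layout not in LAYOUTS:
            raise ValueError(f"Layout must be one of {LAYOUTS}")


def select_converter(version: tuple, alternative_processing: bool) -> str:
    """Return the converter named in the article for a given version."""
    if version < (11, 8, 1):
        return "Solid"  # versions up to 11.8.0
    return "Aspose" if alternative_processing else "Sautin"


def output_path(source_pdf: str, options: ConvertOptions) -> str:
    """Derive the DOCX location from the source PDF and Output AIS Path."""
    source = PurePosixPath(source_pdf)
    if not options.output_ais_path:
        # Default: same folder, only the extension changes.
        return str(source.with_suffix(".docx"))
    # Assumption: the file name is kept inside the configured AIS folder.
    return str(PurePosixPath(options.output_ais_path) / (source.stem + ".docx"))
```

With default options, `output_path("/FS/Client/source/manual.pdf", ConvertOptions())` yields `/FS/Client/source/manual.docx`, and `select_converter((11, 8, 1), True)` yields `Aspose`.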
You can set several options for the Convert PDF to Word 2007-2010 (LIB): |