Problems and solutions for translating PDF files
PDF files are one of the most widespread formats for displaying text content in documents. This is why it is often the only format we have for translating a contract, a brochure, a data sheet or a user manual. However, PDF files are only content exchange files. In fact, PDF stands for Portable Document Format.
The only purpose in life of a PDF file is that we can see it and share it without compatibility problems. It's important to understand this. A PDF file can be modified, but it is not actually intended to be modified.
Let's look at what we have to do to translate a PDF with the computer-assisted translation programs normally used by professional translators and translation agencies. The same steps apply if we want to use a machine translator such as Google Translate because we do not need a 100% reliable translation, or if we simply want to translate the old-fashioned way, overwriting the text but keeping the document format.
To make use of language technology, we need the PDF content in an editable format that is easy to handle. The most common approach is to convert the PDF to Word, as either a .doc or a .docx document.
This is going to sound a little bit stupid, but the best way to translate a PDF file is not to translate it. I mean that it is better to translate not the PDF file but the original, editable file from which the PDF was created.
For example, if that PDF file was created in Word, it is best to use the original Word file. If it was created with FrameMaker or InDesign, it is best to use those formats. If you have InDesign files and can forget about the PDF, you might be interested in our blog post "DTP/layout best practices for translating InDesign files".
However, it is quite possible that your company no longer knows where the original files are or who created them. Or, if you are a distributor or importer, it is possible that the manufacturer has only provided you with the PDF files to do the translation.
Even so, it is worth investing a little time in finding out whether someone still has the originals, or in insisting that the manufacturer send them to us. This will save us a lot of headaches during translation. When I say headaches, I mean the most common issues: time and money.
When we want to preserve the format of the original text, getting the originals will be even more important. PDF files are usually created in low-resolution versions, so they won't work if we need them for high quality professional printing.
On the other hand, when we convert the PDF files to Word for translation, we will see that the layout can change quite a lot. So it can be very laborious and expensive to reproduce 100% of the original layout.
In short, the best results in terms of quality and time/money will be achieved by working from the originals for translation. However, you probably wouldn't be reading this article if you had them, would you?
In general, we can easily find out which program was used to create the original PDF. If we open the PDF with Acrobat or Acrobat Reader and go to File > Properties, we can see the application used to create it. For a PDF created in Word, for example, the properties name Microsoft Word as the application.
We can also see who the author is and the date when it was created, which can give us a clue as to who to ask for the originals.
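The same properties can also be read with a short script. The following is only a minimal sketch using the Python standard library. It assumes that the PDF stores its document information uncompressed, as simple text strings such as /Creator (Microsoft Word). Many real files use compressed objects, hex strings or XMP metadata, and for those a dedicated PDF library is the better choice. The function name read_info_fields and the sample bytes are hypothetical and used for illustration only.

```python
import re

# Document-information keys that correspond to the properties shown by Acrobat.
INFO_KEYS = ("Creator", "Producer", "Author", "CreationDate")


def read_info_fields(pdf_bytes: bytes) -> dict:
    """Return information fields found as literal strings in raw PDF bytes.

    Assumption: the fields are stored uncompressed as simple literal strings.
    Nested or escaped parentheses, hex strings and compressed object streams
    are not handled by this sketch.
    """
    fields = {}
    for key in INFO_KEYS:
        pattern = rb"/" + key.encode("ascii") + rb"\s*\(([^()]*)\)"
        match = re.search(pattern, pdf_bytes)
        if match:
            fields[key] = match.group(1).decode("latin-1")
    return fields


# Hand-written fragment standing in for the information dictionary of a real file.
sample = (
    b"1 0 obj\n<< /Creator (Microsoft Word) /Producer (Example PDF Library)\n"
    b"/Author (J. Doe) /CreationDate (D:20200131120000) >>\nendobj\n"
)
info = read_info_fields(sample)
print(info["Creator"])
print(info["Author"])
```

For a real document, the bytes would come from reading the file in binary mode. An empty result only means that this simple approach found nothing, not that the PDF has no properties.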
As I said, the standard solution for translating PDF files used by translation companies is to go through Word. However, someone might think this doesn't make sense because PDF files can be edited.
Well, this is true to a certain extent. First, yes, they can be modified, but for that we will need to have the paid professional version of Acrobat. Most translators will only have the free version: Acrobat Reader. This free version is limited to a few functions.
Second, even if we use a translator who does have the professional version, changing the text in Acrobat will take much longer. We may be able to convince novice translators to work in this way at their usual rate.
However, more experienced professional translators are likely to charge us extra to work this way. In the worst case, they will simply turn down a translation project under such conditions.
These two problems can be solved by sending a Word file for translation. In addition, Word files will allow translators or translation agencies to use translation assistance tools. These tools create a database, known as a translation memory, of the translations performed by the professional translator.
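To give an idea of how such a database works, here is a minimal sketch in Python. It is a toy model, not the way any particular translation tool is implemented. The class name TranslationMemory, the 0.75 similarity threshold and the sample sentences are assumptions chosen for illustration. Similarity is measured with difflib from the standard library, whereas commercial tools use their own matching algorithms.

```python
from difflib import SequenceMatcher


class TranslationMemory:
    """Toy translation memory: source segment -> approved translation."""

    def __init__(self):
        self.entries = {}

    def add(self, source, target):
        """Store a translated segment."""
        self.entries[source] = target

    def lookup(self, segment, threshold=0.75):
        """Return (translation, similarity) for the best stored match.

        Returns None when no stored segment reaches the threshold.
        The threshold value is an illustrative assumption.
        """
        best_source, best_score = None, 0.0
        for source in self.entries:
            score = SequenceMatcher(None, segment, source).ratio()
            if score > best_score:
                best_source, best_score = source, score
        if best_source is None or best_score < threshold:
            return None
        return self.entries[best_source], best_score


# Sample data, invented for this example.
tm = TranslationMemory()
tm.add("Switch off the device before cleaning.",
       "Apague el aparato antes de limpiarlo.")

exact = tm.lookup("Switch off the device before cleaning.")
fuzzy = tm.lookup("Switch off the device before servicing.")
missing = tm.lookup("Completely unrelated text")
print(exact, fuzzy, missing)
```

An exact match can be reused as it is, a similar ("fuzzy") match gives the translator a starting point to edit, and a segment with no match has to be translated from scratch.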
These tools also allow you to analyse how much repeated text the documents to be translated have. This is especially important when it comes to translating technical manuals.
Technical manuals often include a lot of repeated text, both within the same manual and across manuals for similar products. Many translators and translation agencies, ourselves included, will agree to offer discounts based on the volume of repeated text.
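The idea behind this kind of analysis can be sketched in a few lines of Python. This is a simplified illustration that assumes the text has already been split into segments, and it counts only exact repetitions after normalising spacing and capitalisation. Real translation tools usually count words rather than segments and also report partial matches. The function name repetition_report and the sample sentences are hypothetical.

```python
from collections import Counter


def repetition_report(segments):
    """Count total, unique and repeated segments.

    Segments are compared after collapsing whitespace and lowercasing,
    so trivial formatting differences do not hide a repetition.
    """
    normalised = [" ".join(s.split()).lower() for s in segments if s.strip()]
    counts = Counter(normalised)
    total = len(normalised)
    unique = len(counts)
    repeated = total - unique  # every occurrence after the first one
    return {
        "total": total,
        "unique": unique,
        "repeated": repeated,
        "repetition_rate": repeated / total if total else 0.0,
    }


# Sample segments, invented for this example.
manual = [
    "Read these instructions carefully.",
    "Switch off the device before cleaning.",
    "Do not immerse the device in water.",
    "Switch off the device before cleaning.",
    "switch off the device  before cleaning.",
]
report = repetition_report(manual)
print(report)
```

In this sample, two of the five segments repeat an earlier one, so only three segments need to be translated from scratch.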
Converting a PDF to Word can be a very simple and efficient step. There are also cases where it will be a real headache. Next, we will explain how to create a Word file depending on the type of PDF we have.
Once we have made sure that we do not have the original files, there will be no option other than using the PDF files for translation. The best situation we can face is that the PDF files are editable.
When we say that they are editable, we mean that the text of the PDF can be easily modified. That is, the text will not be an image, as happens with scanned PDFs or vectorized text. We'll look at these cases later.
The best solution will be to convert the PDF into a Microsoft Word document. This file format is a standard today, which means that we can send it to any translator or translation company. Today all translation professionals have Word or a compatible program.
Word documents are also easily handled by translation programs. By translation programs I mean both computer-assisted translation tools (such as SDL Trados or memoQ) and machine translation tools (such as Google Translate).
The secret to getting a good PDF to Word conversion is the program we use. A whole range of OCR (optical character recognition) programs on the market produce very decent conversions. They have been available for years and are mature products.
In our translation agency's experience of working with PDF files, the best programs for this purpose are Adobe Acrobat, OmniPage and ABBYY FineReader. There are also other good programs. See other programs on this blog: Extract Text from Images and PDFs with Best OCR Software.
The best practice is to have several of these programs, if our budget allows for it. Depending on the document, sometimes Adobe Acrobat will give us an optimal result. Other times it will be OmniPage or ABBYY FineReader.
Once we have the converted document, we will have to review the layout and modify it if necessary. For example, the converter may have placed a paragraph mark in the middle of a sentence, splitting it in two. This type of formatting complicates the translation process and should be corrected before translation starts.
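Part of this clean-up can be automated. The sketch below, in Python, assumes the converted text has been exported as a list of paragraphs and joins any paragraph that follows one ending without sentence-final punctuation. The function name merge_split_paragraphs and the punctuation rule are assumptions for illustration. Headings and list items often have no final punctuation either, so the result still needs a human review.

```python
import re

# A paragraph is treated as complete when it ends with . ! ? or :
# optionally followed by closing brackets.
SENTENCE_END = re.compile(r"[.!?:][)\]]*$")


def merge_split_paragraphs(paragraphs):
    """Join a paragraph onto the previous one when that one ends mid-sentence."""
    merged = []
    for paragraph in paragraphs:
        text = paragraph.strip()
        if not text:
            continue  # skip empty paragraphs left by the converter
        if merged and not SENTENCE_END.search(merged[-1]):
            merged[-1] = merged[-1] + " " + text
        else:
            merged.append(text)
    return merged


# Sample converter output, invented for this example.
converted = [
    "Switch off the device and",
    "unplug it before cleaning.",
    "",
    "Do not immerse the device in water.",
]
cleaned = merge_split_paragraphs(converted)
for paragraph in cleaned:
    print(paragraph)
```

With whole sentences in each paragraph, translation tools can segment the text properly and recognise repeated sentences.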
The conversion of a scanned PDF does not differ from that of an editable PDF, except for the result we can expect. In general, scanned PDFs give poorer results.
If you work with OmniPage, you can help the program interpret your scanned document. You can basically tell OmniPage whether an area is a table, a text paragraph or an image. It also allows you to indicate the text orientation when it changes. These basic instructions can improve the quality of the text to be translated.
A problem that can be insurmountable is when the resolution of the scanned PDF is not sufficient to perform OCR. OCR programs need a minimum resolution to work. If we encounter this problem, we will have to ask for the documents to be rescanned at a higher resolution. If this is not possible, we can print the documents and scan them ourselves. This will sometimes be a valid workaround.
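Whether a scan is likely to be good enough can be estimated from the size of the scanned image in pixels and the physical size of the page. The Python sketch below is only a rough check. The 300 dpi threshold is an assumption based on a commonly recommended value, so check the documentation of your OCR program for its actual requirement. The function names and the pixel sizes are invented for illustration.

```python
# Assumption: 300 dpi is a commonly recommended minimum for OCR.
MIN_OCR_DPI = 300

# A4 page size in inches (210 x 297 mm).
A4_WIDTH_IN, A4_HEIGHT_IN = 8.27, 11.69


def effective_dpi(pixel_width, pixel_height, page_width_in, page_height_in):
    """Return the lower of the horizontal and vertical resolution in dpi."""
    return min(pixel_width / page_width_in, pixel_height / page_height_in)


def is_ocr_ready(dpi, minimum=MIN_OCR_DPI):
    """Return True when the resolution reaches the assumed minimum."""
    return dpi >= minimum


# Two hypothetical scans of an A4 page.
low = effective_dpi(1240, 1754, A4_WIDTH_IN, A4_HEIGHT_IN)
high = effective_dpi(2500, 3520, A4_WIDTH_IN, A4_HEIGHT_IN)

for label, dpi in (("low", low), ("high", high)):
    verdict = "OK for OCR" if is_ocr_ready(dpi) else "rescan at a higher resolution"
    print(f"{label}: about {dpi:.0f} dpi, {verdict}")
```

The first scan comes out at roughly 150 dpi and would need rescanning under this assumption, while the second is just above 300 dpi.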
A case similar to scanned PDFs is that of PDFs in which the text has been converted to vector graphics. Design programs such as Illustrator, InDesign or CorelDRAW include a function to vectorise the text, which removes the ability to edit and translate it. This is done to avoid having to send the source files. In general, the text in this type of PDF converts well.
There are many programs that use the PDF format to create documents from data in a database. An example of this might be a simple invoice or report generated from an ERP or CRM.
Safety data sheets are a typical example that we usually find in translation companies. Most product safety data sheets are generated from a management program that has all the necessary information.
Since these PDFs are not generated from a design that exists in Word or InDesign, they tend to produce more problematic conversions. Typical problems are text split across several text boxes, columns whose text does not follow a logical order, and sentences cut by paragraph marks.
Watermarks are usually the biggest problem we find when translating this type of document, since they are often put there precisely to prevent the document from being converted into an editable format.
In conclusion, the translation of PDF files is often a headache for translation agency project managers. There is a wide variety of possible problems, and sometimes managing them can be a major effort in parallel with translation. It is important that you, as a customer, are clear about the results you expect from the translation of a PDF document. On many occasions, reproducing the format of the original involves a lot of layout work that someone has to pay for.
José Gambín holds a 5-year degree in Biology from the University of Valencia (Spain) and a 4-year degree in Translation and Interpreting from the University of Granada (Spain). He has worked as a freelance translator, in-house translator, desktop publisher and project manager. A founding member of AbroadLlink since 2002, he currently works as Marketing and Sales Manager.