If you have ever requested a translation quote for documents written in Arabic, you are surely already familiar with the response from most translation project managers:
“Do you have the file in an editable format?”
Although it may not seem like it, we are very aware of how annoying this question can be, especially if you are one of those clients who always send all their documents in the original format.
However, you must also understand that for translation companies it is much easier, faster, and cheaper to process and prepare files sent in an editable format.
Despite everything, there will always be a document that has been scanned and converted to PDF. I sincerely believe this is the worst format to work with, but it does not mean that it is impossible.
In this blog, I will compare some of the programs that translation companies usually work with to manage and process this type of documentation written in Arabic. I have chosen Arabic because it is a widely translated language and not all programs are capable of working with this language. As a result, you have to delve a little deeper into the world of text extraction programs.

When working with files that, once digitised, become images in which the text cannot be selected, we are unable to use a simple text extraction program as we would with PDF files where the text can be selected. Here are two examples where you can see the difference:
- PDF in an editable format

- PDF in a non-editable format

The first text can be selected, which means that any text extraction program, whether free or not, can extract the text without any problem. In the second case, the PDF file only allows us to select an area of the document, but not the text itself. As a result, a text extraction program will not know how to recognise the characters present in the document.
You can try it with any text extraction program; they will all give you similar results.
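The difference between the two kinds of PDF can be checked mechanically. The following is a minimal sketch, assuming you have already run an ordinary (non-OCR) text extractor and collected one string per page. The function name `pages_needing_ocr` and the `min_chars` threshold are illustrative choices, not part of any particular program.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return the 1-based numbers of pages that look like scanned images.

    page_texts: list of strings, one per page, as returned by a plain
    (non-OCR) text extractor. A scanned page normally yields an empty
    or near-empty string because it has no selectable text layer.
    min_chars: hypothetical threshold; tune it for your own documents.
    """
    flagged = []
    for number, text in enumerate(page_texts, start=1):
        # Count only visible characters, ignoring spaces and line breaks.
        visible = sum(1 for ch in text if not ch.isspace())
        if visible < min_chars:
            flagged.append(number)
    return flagged


if __name__ == "__main__":
    sample = ["هذا نص عربي قابل للتحديد في ملف PDF", "", "  \n "]
    print(pages_needing_ocr(sample, min_chars=10))
```

Pages flagged this way are the ones that would have to go through an OCR program before a quote can be prepared.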
Of the many text extraction programs available on the market today, the ones that make a difference when converting non-editable PDF files are OCR extraction programs. OCR stands for “Optical Character Recognition”. As the name suggests, these programs not only read the selectable characters of a document, but can also detect text in a scanned document.
You will probably say: well, that’s that. But not so fast... No matter how good these programs are, they still leave much to be desired. It is true that they can give you an approximate idea of the workload, but it is not advisable to work with them when translating.
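As an illustration of how OCR output can give an approximate idea of the workload, here is a small sketch that counts the Arabic words in an extracted text. It assumes that the workload is estimated by word count, and it treats any whitespace-separated token containing at least one basic Arabic letter (U+0621 to U+064A) as a word. Both the function name and that definition of a word are assumptions made for this example.

```python
import re

# Basic Arabic letters; digits, Latin text and punctuation are not matched.
ARABIC_LETTER = re.compile(r"[\u0621-\u064A]")


def count_arabic_words(text):
    """Count whitespace-separated tokens that contain an Arabic letter."""
    return sum(1 for token in text.split() if ARABIC_LETTER.search(token))


if __name__ == "__main__":
    ocr_output = "مرحبا بكم في شركة الترجمة 2024 PDF"
    print(count_arabic_words(ocr_output))
```

Because OCR errors split or merge words, a count like this is only an estimate, which is consistent with using it for quoting but not for translating.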
Below, you can check the results obtained with different text extraction programs.

The first program I would like to introduce is Adobe Acrobat Pro. If you are used to working with a computer, it is a program that should not be missing from your toolkit, as it allows you not only to view any document in PDF format, but also to create, edit, organise pages, comment, fill in, sign, and correct it.
Furthermore, it also allows you to extract text and is quite practical. When I receive a PDF document and open it, it opens directly in this program, and extracting the text is only two more clicks away. So it is usually the first one I try.
Taking as a reference the non-editable document you have previously seen, this is how the conversion would look in Adobe Acrobat Pro:

Yes, yes, I'm not lying to you. A program this powerful still gives poor results when extracting Arabic text.
I think one of the reasons may be the fact that you cannot specify the language of the text; instead, the program “recognises” it automatically. When you can tell a program the language in advance, you make its job easier: it only has to search the character set of that particular language instead of the characters of every language it supports.
In any case, I do not recommend this program at all when extracting text in Arabic.
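One rough sanity check on any tool's output is to measure how much of it is actually written in Arabic script. The sketch below assumes that a failed recognition shows up as letters from another script, which will not always be the case; the function name `arabic_letter_share` is hypothetical.

```python
import unicodedata


def arabic_letter_share(text):
    """Return the fraction of letters in text that belong to the Arabic script.

    Only alphabetic characters are considered, so digits, punctuation,
    spaces and diacritics do not affect the result.
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    # Unicode names of Arabic-script letters contain the word "ARABIC".
    arabic = sum(1 for ch in letters if "ARABIC" in unicodedata.name(ch, ""))
    return arabic / len(letters)


if __name__ == "__main__":
    print(arabic_letter_share("مرحبا"))
    print(arabic_letter_share("اب ab"))
```

A value close to 1.0 only tells you the script is right; it says nothing about whether the words themselves were recognised correctly.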

The second application I usually try after failing to extract text with Adobe is OmniPage Ultimate.
Unlike Adobe, OmniPage does allow you to select the document's language with a simple right-click on the file. So what's the problem then? Arabic does not appear in the language list. Wolof and Zulu appear but not Arabic. In these cases, you can try using the “Detect language automatically” option and you will get this result:

As you can see, it is not what we are looking for either. However, I must admit that both Adobe and OmniPage work wonderfully in extracting text in other languages.

The third option to extract Arabic text and convert it into an editable format is Readiris 17.
It is a slightly more sophisticated program than Adobe and OmniPage for extracting text written from right to left, as is the case with Arabic. The program allows you to indicate on each page which section corresponds to text, which section to images, etc.
It is true that it takes a little more preparation time compared to other programs that do not offer this page selection option, but after seeing the result, it is evident that the effort is worth it:

It offers better results than the two previous programs, but for longer documents it still falls short. It tends to insert many paragraph breaks that do not appear in the original document, as well as other formatting inconsistencies that create quite a bit of layout work.
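Part of that layout work can be reduced with a simple clean-up pass. The following sketch merges lines back into paragraphs using a heuristic that is purely an assumption for this example: a paragraph is considered finished only when a line ends with sentence-final punctuation. It is not a feature of Readiris, and it will wrongly merge headings or list items that carry no final punctuation.

```python
# Characters treated as the end of a sentence, including the Arabic question mark.
SENTENCE_END = (".", "!", "?", "؟", ":")


def merge_spurious_breaks(text):
    """Rebuild paragraphs from OCR output that contains extra line breaks."""
    paragraphs = []
    current = ""
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip empty lines produced by the export
        # Glue the line onto the paragraph currently being built.
        current = f"{current} {line}" if current else line
        if current.endswith(SENTENCE_END):
            paragraphs.append(current)
            current = ""
    if current:
        paragraphs.append(current)  # keep any unfinished trailing text
    return paragraphs


if __name__ == "__main__":
    raw = "ذهب الولد\n\nإلى المدرسة.\nهل وصل\nمبكرا؟\nنعم"
    for paragraph in merge_spurious_breaks(raw):
        print(paragraph)
```

The result should still be checked against the scanned original, since the heuristic cannot know where the author really intended a new paragraph.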

The last program, and in my opinion the best one for extracting non-editable text in Arabic, is ABBYY FineReader.
Like all the others I have presented in this blog, it is a paid program. It allows you to indicate page by page which sections are text, which sections are images, and which sections are tables.
Depending on how accurate your indications are, the program will generate a more or less precise document. I didn't modify much, and the result was the following:

If we compare it with the original document, the two are almost identical:

When it comes to both quoting and translating this document, we will obtain much more accurate results than with any other program seen in this blog. In my opinion, ABBYY is the clear winner when it comes to extracting non-editable text in Arabic.
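If you want to put a number on “almost identical”, one option is to compare the OCR output with a manually checked transcription of a sample page. The sketch below uses Python's standard `difflib` module for this; the function names are hypothetical, and the similarity ratio is only a rough indicator, not a formal OCR accuracy metric.

```python
from difflib import SequenceMatcher


def normalise(text):
    """Collapse all runs of whitespace so layout differences are ignored."""
    return " ".join(text.split())


def ocr_similarity(reference, ocr_output):
    """Return a similarity ratio between 0.0 and 1.0 for two texts.

    reference: a manually verified transcription of the page.
    ocr_output: the text produced by the program being evaluated.
    """
    matcher = SequenceMatcher(
        None, normalise(reference), normalise(ocr_output), autojunk=False
    )
    return matcher.ratio()


if __name__ == "__main__":
    print(ocr_similarity("abcd", "abcf"))
```

Running the same sample page through each program and comparing the ratios gives a simple, repeatable way to back up a comparison like the one in this blog.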

As you have seen, not all programs that allow text extraction from non-editable format files offer the same results, at least when it comes to text in Arabic. I often work with all these programs, and unless it is a really problematic format with an uncommon language involved, they usually do not cause any problems. Quite the opposite!
I do not recommend the use of free text extractors, as you never know where your files might end up. If it is a non-confidential private document, this does not matter much, but I would avoid uploading important company files containing confidential information to these free-to-use sites.
I hope I have shown you a bit of the day-to-day life of translation project managers in their tireless battle against scanned PDF files. The next time you are sent a PDF to translate, first ask if your company still has the original format of the file. This way you will not only reduce costs, but the final format of the translation will also look much better. Not to mention the immense favour you do for project managers!