PDF TRANSLATION USING INFIX v7 - José Henrique Lamensdorf - translation - tradução


DIRECT PDF FILES TRANSLATION
USING INFIX v7


A while ago I published a step-by-step walkthrough here showing how to translate PDF files directly using version 6 of Infix.

That demonstration with Infix v6 is still here; however, it has been hidden. If you want to see it, please click here. As it happens, Infix has evolved considerably, and v7 has made the entire process much easier.

I will show here a quick overview of the current process.

For this purpose, I found a 562-page manual on the web, randomly extracted 7 pages from it, and combined them into a single PDF (click here to see it, if you have Acrobat Reader installed), taking it as the "original". It is worth noting that:
      • I never worked for the company that published this manual.
      • I wasn't hired to translate any part of this manual.
      • I didn't follow any style/terminology rules while translating, and I left some words untranslated on purpose.
      • I didn't check the translation for spelling or consistency.
      • The objective is merely to demonstrate a PDF file translation process.


FIRST STEP - FONTS

A PDF file usually has the fonts it uses embedded within itself. Since there may be many of them, only the characters of each font that are actually used in that PDF are embedded, to keep the final file size down.

Translating it with only the parts of the fonts that are embedded within a PDF would be like doing it with some keys missing from the keyboard, different ones in each font.

For instance, if a PDF file only contained the word republican using the Roman Gothic font, we'd only have 10 chars from this font embedded. If this were an OpenType font, we'd be saving a lot of disk space by leaving out all the letters with diacritics, numbers, graphic chars (@#$%, etc.) and punctuation marks, plus entire alphabets like Cyrillic, Hebrew, Arabic, Greek, and others that are built into such fonts.

If we were to translate republican into Portuguese or Spanish, it would be just a matter of adding an O at the end. Since this char is not embedded into the PDF, this O would be represented by a square, a space, an underscore, a period, or anything else, depending on the program used to view this PDF.
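The republican example can be sketched in a few lines of Python. This only illustrates the idea of a subsetted font as a limited set of characters; it is not how Infix or any PDF library works internally, and the function name missing_chars is made up for this sketch.

```python
def missing_chars(embedded_text: str, translated_text: str) -> list[str]:
    """Return the characters the translation needs that the subset lacks."""
    embedded = set(embedded_text)    # characters kept in the subsetted font
    needed = set(translated_text)    # characters the translation requires
    return sorted(needed - embedded)


# The source word only brings its own 10 letters along.
print(missing_chars("republican", "republicano"))  # ['o']

# Accented characters are never present in an English-only subset.
print(missing_chars("action", "ação"))             # ['ã', 'ç']
```

Any character returned by such a check is one that a viewer would have to draw as a square, a space, or some other substitute.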

In my case, I'm usually translating from English - which uses no accents - into diacritics-rich Portuguese. It is certain that the fonts embedded into a PDF in English won't include any of these accented characters or cedillas.

So the next step is to check the fonts I have installed on my system, as well as those I can obtain and install, to complete the character set to embed into the PDF. Infix provides me with this information:

Infix tells me this file uses 14 fonts. At first sight, they all seem to be variations on Arial, which everybody has.

Some companies use their proprietary fonts, to comply with their corporate identity. Two examples that occur to me now are General Electric and Rakuten. In such cases, it will be necessary to get these fonts.

However, let's take a look at all the fonts:
Apparently two fonts may cause trouble here.

We search for the first one, and Infix shows:
Apparently it was a slip by the original DTP operator. No problem in replacing it with Arial.

Regarding the other:
... the same case applies. It must have been a slip, so we can replace it with Arial.

Supposing I didn't have Wingdings 3, it's worth checking:
Since I won't have to translate "dingbats", and none of these will have accents or cedillas, I can use the chars from this font that are already embedded in the original PDF.

It is worth noting that I only showed one instance of each font. In practice I'd have to check all of them, though it is still possible to fix this while proofreading.


SECOND STEP - EXPORTING TEXT

I use Infix's Export feature to export all the text as a tagged XML file. In this process, Infix applies tags to the text within the PDF file, to enable later matching.

This file is saved, and the segmentation on the PDF is shown with color shades, like this:
Fortunately, this file is well organized: there are no shattered paragraphs and no loose pieces.

When there are, Infix has the tools to rebuild unduly fragmented text blocks.


THIRD STEP - TRANSLATION

It is time to translate the text in the XML file.

I translate between Portuguese (BR as target) and English (US as target). There are some specific areas where I don't translate technical material. Additionally, I speak and read Italian, French, and Spanish - however I don't translate them professionally.

This makes it viable for me to partner with fellow translators who:
a) work in these specific areas (medicine, biology, accounting, finance, sports) translating technical material;
b) translate into European Portuguese or British English;
c) translate in any language pair among English, Portuguese, Italian, French, and Spanish.

My colleague will translate the text in the XML file, and I'll take care of DTP on the PDF.

And how is the translation done?

This is an example of the XML file exported from this publication being translated with WordFast Classic:

I don't know why, but WordFast Classic - actually a Microsoft Word macro - handles XML files much faster than Word's own DOC and DOCX files.

Anyway, my partner will be able to do it using any CAT tool capable of handling XML files, probably including Trados, MemoQ, WordFast Pro, and many others.

All they have to do is send me the translated XML file, and I'll take it over from there.
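Since the import relies on the tags surviving translation, a quick structural comparison of the exported and translated files can catch a damaged file before it goes any further. The sketch below uses Python's standard xml.etree.ElementTree. The element names (doc, block, b) and the id attribute are hypothetical stand-ins, since the actual layout of the Infix export is not shown here.

```python
import xml.etree.ElementTree as ET


def tag_skeleton(xml_text: str) -> list:
    """List every element as (tag, sorted attributes), in document order."""
    root = ET.fromstring(xml_text)
    return [(el.tag, tuple(sorted(el.attrib.items()))) for el in root.iter()]


def same_structure(exported: str, translated: str) -> bool:
    """True when every tag and attribute survived and only the text changed."""
    return tag_skeleton(exported) == tag_skeleton(translated)


# Hypothetical miniature export; the real file has its own element names.
exported = ('<doc><block id="1">Open the cover.</block>'
            '<block id="2">Press <b>Start</b>.</block></doc>')
translated = ('<doc><block id="1">Abra a tampa.</block>'
              '<block id="2">Pressione <b>Start</b>.</block></doc>')
damaged = ('<doc><block id="1">Abra a tampa.</block>'
           '<block>Pressione Start.</block></doc>')

print(same_structure(exported, translated))  # True
print(same_structure(exported, damaged))     # False
```

A check this strict would also flag a translation that legitimately reorders inline tags within a sentence, so it is best treated as a first warning rather than a verdict.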


FOURTH STEP - IMPORTING

Once the XML has been translated, it's time to import the translation, each text block in its proper location. That's when the XML tags will do their job.

Of course, I must pair the fonts to those I have installed on my system:
This is a must, so that Infix can retrieve and embed all the missing characters (with diacritics) in the translated PDF.

There is another peculiarity: font names don't always match. For instance, Verdana Bold, included in the US Microsoft Office, is the same font that is named Verdana Negrito in the Brazilian Microsoft Office. So this match must be made at the outset of the import process.

Once it has been done, it's time to import.

I'll have 6 fonts replaced. If any char present in the translation is missing, that is, neither embedded in the PDF nor available in the replacement font, I'll see a warning from Infix, asking me which font should be used.
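The name mismatch can be pictured as a small lookup problem. The sketch below is only an illustration of that idea in Python, not something Infix exposes: the STYLE_WORDS table and the normalize_font_name function are hypothetical, "Negrito" comes from the Verdana example, and "Itálico" is an assumed extra entry.

```python
# Localized style words mapped to their English equivalents.
# "negrito" is taken from the Verdana example; "itálico" is an assumption.
STYLE_WORDS = {"negrito": "Bold", "itálico": "Italic"}


def normalize_font_name(name: str) -> str:
    """Rewrite localized style words so both naming schemes compare equal."""
    words = [STYLE_WORDS.get(word.lower(), word) for word in name.split()]
    return " ".join(words)


print(normalize_font_name("Verdana Negrito"))  # Verdana Bold
print(normalize_font_name("Verdana Negrito") == normalize_font_name("Verdana Bold"))  # True
```

In Infix itself this pairing is done by hand in the import dialog; the table above merely shows why two different names can point at the same font.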
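The rule in the previous paragraph amounts to a three-way decision per character. Here is a minimal Python sketch of that decision, assuming each font can be represented simply as a set of characters. The function name plan_glyphs and the sample character sets are invented for illustration and say nothing about how Infix implements it.

```python
def plan_glyphs(translation: str, embedded: set, replacement: set) -> dict:
    """Sort the translation's characters by where their glyph can come from."""
    plan = {"embedded": [], "replacement": [], "warning": []}
    for ch in sorted(set(translation) - {" "}):
        if ch in embedded:
            plan["embedded"].append(ch)      # already inside the PDF
        elif ch in replacement:
            plan["replacement"].append(ch)   # supplied by the paired font
        else:
            plan["warning"].append(ch)       # nobody has it: ask the user
    return plan


embedded = set("action")                      # subset found in the PDF
replacement = set("abcdefghijklmnopqrstuvwxyzãçéõ")  # toy installed font

print(plan_glyphs("ação →", embedded, replacement))
# {'embedded': ['a', 'o'], 'replacement': ['ã', 'ç'], 'warning': ['→']}
```

Only the characters that end up under "warning" would call for choosing yet another font.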

Furthermore, I've set Infix to:
a) adjust font size to the space available; and
b) reset any custom spacing between letters or words.

These are two things I'll adjust manually after importing.

Importing runs smoothly:

If you have Adobe Acrobat Reader installed, you may see the raw import result by clicking here.


FIFTH STEP - DTP ADJUSTMENTS

Of course, a number of adjustments are required. Let's see just a couple of them.
A. ORIGINAL
A. RAW TRANSLATION IMPORTED
A. AFTER ADJUSTMENTS

B. ORIGINAL
B. RAW TRANSLATION IMPORTED
B. AFTER ADJUSTMENTS

At first sight it may seem easy; however, it seldom is. Note that I didn't mention hyphenation, which may be necessary.

Furthermore, tables can be quite complex, and some figures may have embedded text which will have to be edited in another program.

Anyway, this process makes considerable time and cost savings possible, compared to translating in the original DTP file (e.g. InDesign) and then starting a reviewing loop, assuming that translation and DTP will be handled by different individuals.

If you or your company have PDF files to translate between English and Portuguese, you are welcome to contact me by clicking on the e-mail on the left.

And if you are a fellow translator, working in any language pair among EN-PT-IT-FR-ES, and have a live/editable PDF to translate, and you abhor doing DTP, you are invited to partner with me. You'll take care of translation alone, and I'll handle the DTP. Use the e-mail button on the left to contact me.

I don't do this kind of work in languages I don't understand, because I think it would be counterproductive.

