PDF TRANSLATION USING INFIX - José Henrique Lamensdorf - translation - tradução

PDF TRANSLATION USING INFIX


TRANSLATION OF A PDF FILE
A STEP-BY-STEP WALKTHROUGH


I am often asked about translating PDF files using Infix Pro. Below is a step-by-step report of a project I did, omitting minor details to save space.

The environment in this process was:

  • Windows XP pro

  • Infix Pro 6.2.8

  • Microsoft Word 2007

  • WordFast Classic 6.03t

Supplementary software used was:

  • Ulead PhotoImpact 5.0

  • Microsoft Excel 2007

  • Adobe Page Maker 6.52

(however their role could have been performed by other applications)

1. THE PROJECT

The project comprised translating one technical brochure, from English into Portuguese (BR).
Dimensions:

  • 21,928 words

  • 103,285 characters

  • 76 pages

while preserving/restoring the layout.

Before getting started: Translation is my business.
Contact details for the equipment itself in Brazil are shown here, on the right.


2. THE FIRST STEP: FONTS


A PDF file has all the fonts it uses "embedded" within itself. However as these may be numerous, in order to save on file size, only the characters used from each one in that PDF get embedded.

To illustrate, if a PDF contained only the word republican in the Roman Gothic font, we'd have only 10 chars of this font embedded there. If this were an OpenType font, we'd be saving a lot of disk space by not including all the accented letters, digits, signs (@#$%, etc.), and punctuation marks, and especially all the Cyrillic, Hebrew, Arabic, Greek, and other characters that such fonts include.

If we were to translate republican into Portuguese or Spanish, which involves merely adding an O at the end, this letter would not have been embedded, so that O would be represented by a small square, a blank space, a dot, anything, depending on the PDF reader program.

In the case at hand, I am translating from accent-less English into diacritic-rich Portuguese. This means that the fonts embedded in my source file have no accented chars, nor the C-cedilla.
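The subsetting problem can be sketched in a few lines of Python. This is only an illustration, not something Infix requires you to do: it assumes the embedded characters are exactly those used in the source text, and the function name `missing_chars` is made up for this example.

```python
def missing_chars(embedded_text: str, translated_text: str) -> list[str]:
    """Return the characters the translation needs that the subset font lacks."""
    embedded = set(embedded_text)      # glyphs present in the subset font (assumed)
    needed = set(translated_text)      # glyphs the translation will use
    return sorted(needed - embedded)


# "republican" embeds 10 distinct letters; the Portuguese word needs one more.
print(missing_chars("republican", "republicano"))
# An accent-less English word versus its accented Portuguese counterpart.
print(missing_chars("information", "informação"))
```

Any character returned by such a check has to come from a complete, installed copy of the font (or from a lookalike).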

So the next step is to check my system-installed fonts, as well as those I can get and install, to complete the set of font characters to embed in the PDF file. Infix provides me with this information (at right).

It's time to take stock of the situation. I have most of these fonts, except two: BankGothic and Segoe UI.

Since I couldn't find BankGothic, I checked where it was used. I queried Infix, on the screen at the right. Note some details:

  • Unchecked "Find" box; I want any text using this font.

  • I specified the font in "Any" size.

  • I'm asking for "All" occurrences.

  • I am searching "All" pages (in blue) in the document.

The result comes quickly: only 10 occurrences.

I select one of them for viewing.

This font is used only for the company logo, which obviously will not be translated.

Therefore I won't have to embed any new characters from it.

The other font I don't have is Segoe UI.

After some search, the Segoe UI I found had no accented chars, so it wouldn't work for Portuguese.

I searched for an equivalent font, a lookalike, and found Open Sans, quite similar, which I downloaded and installed.

After that, all I had to do was to have Infix find and replace each variant of Segoe UI with the corresponding variant of Open Sans.

Obviously some of these changes would cause the text to reflow, due to differences in width and spacing. As I would translate the entire text anyway, reflow was unavoidable.
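The variant-by-variant replacement amounts to a simple lookup table. The sketch below is hypothetical: the variant names are assumptions (check the names actually listed by Infix for your file), and `replacement_for` is an illustrative helper, not an Infix feature.

```python
# Hypothetical variant names; the real ones are whatever Infix lists for the PDF.
REPLACEMENTS = {
    "Segoe UI": "Open Sans",
    "Segoe UI Bold": "Open Sans Bold",
    "Segoe UI Italic": "Open Sans Italic",
    "Segoe UI Bold Italic": "Open Sans Bold Italic",
}


def replacement_for(font_name: str) -> str:
    """Return the planned lookalike, or the font itself when no swap is planned."""
    return REPLACEMENTS.get(font_name, font_name)


print(replacement_for("Segoe UI Bold"))   # swapped for its Open Sans counterpart
print(replacement_for("BankGothic"))      # logo font: left alone
```

Writing the table down first makes it easy to see whether every variant in use has a counterpart before starting the find-and-replace.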

At this time it's worth observing what happens (upon replacing fonts), so you'll know what will take place later.

Text within a PDF file is not structured in the way you are used to seeing in a word processor, like Word. A PDF file is based on a printing language called PostScript, whose sole objective is to make every element print exactly where it should be, regardless of the text structure.

To understand once and for all how this works, try to open a page from a PDF file on Photoshop. You'll see the text selected "floating in the air". If you want to have a printable page with this text, you need to use the command "Layer | Flatten image" on Photoshop, which will do exactly that: it will flatten all elements against a single flat page, including the background, which would be a blank sheet of paper. The conclusion is so blatantly obvious that it could elude us: A printer does not print the white on the paper!

For this reason, in a PDF file, each line of text ends exactly where it ends. It does not go beyond the [Enter] to the edge of the paper or the preset border, like in a text block on Word, or even InDesign/PageMaker.

Why is this important here? Because upon replacing fonts the text can only reflow to the point where each line ended in the original file.

It is worth mentioning that Infix's "View | Text | Field shading" option applies a different color to each paragraph, so you can see what it includes. Another option is to click anywhere in the text and press Ctrl+A; in this case, however, the black selected area may cover other overlapping text blocks.

To illustrate, please see the example below.

In the original file, the end of line is marked (by Infix) using dotted lines (the text box frame) indicated by the blue arrows. The font here is still Segoe UI.

After we replaced Segoe UI with Open Sans, which is apparently slightly wider, the text box frame (now shown by red arrows) caused some text to reflow.

Note that there was no "hard" end-of-line character (which Infix marks with the (pi) symbol) nor any "soft" end-of-line char (which Infix marks with ¿) that would have preserved the original line break once the word reflowed to the next line below.
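Why a slightly wider font makes text reflow can be shown with a minimal sketch. It makes a loud simplification: every character is assumed to have the same width, which is not true of Segoe UI or Open Sans, and the numbers are invented. The function `wrap` is illustrative, not how Infix lays out text.

```python
def wrap(words, box_width, char_width):
    """Greedy line filling: a word moves to the next line when it no longer fits."""
    lines, current = [], ""
    for word in words:
        candidate = word if not current else current + " " + word
        # Accept the word if the line still fits (or if the line is empty anyway).
        if not current or len(candidate) * char_width <= box_width:
            current = candidate
        else:
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    return lines


text = "the quick brown fox jumps".split()
print(wrap(text, box_width=95, char_width=10))   # original (narrower) font
print(wrap(text, box_width=95, char_width=11))   # slightly wider replacement font
```

With the same box width, the wider font breaks the lines in different places, because nothing but the box frame was holding the words on their original lines.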

If this publication had many tables, Infix's "Table Box" tool could be used, however it was not the case here.

It would also have been possible to add end-of-line chars manually, one at a time. However as the text will be translated, and probably will reflow again, we can opt to do all this later, only once.

Strategic decisions like this come up throughout the entire journey, so it's always worth considering what will be most efficient, to avoid unnecessary rework. PDF has its drawbacks: it is a final file, not necessarily intended to be edited as in a word processor.

3. EXPORTING TEXT (TO TRANSLATE)


The next step is exporting the text, to translate it outside Infix.

I selected "All pages" here, however it would have been possible to do it in parts.

There are two options: TXT (plain text) and simplified XML.

XML worked fine with WordFast on Word 2003. Most likely, it is still the best option for those who use Trados or DejaVu. In my case, WordFast on Word 2007, the best option is TXT. It will require a bit of additional labor, but it works.

Exporting and saving text with tags...
(This takes a while.)

When Infix finishes (saving the TXT file), it also wants to save a tagged PDF.

Be careful now! It is better to save under a different file name.

It will be a PDF apparently identical to the source file, however this one contains the tags that will welcome the translated text back in the proper places, sizes, formats, etc.

It is advisable not to fuss with this file until the translation is imported back to it.

4. PREPARING FOR TRANSLATION

For the time being, you may shut down Infix.

You have a TXT file to translate as you prefer... as long as you leave the tags untouched. It would work the same way if you had chosen to export to an XML file.

This TXT file on Word will look like this (I took some chunk from the middle):

Text between < and > is a tag. Tags are the locators that, upon importing the translated text, will put each piece in its right place, with the right size, color, and other features.


In an XML file, tags are different and properly marked; if the settings are correct, they will remain untouched.

However in a TXT file (our case at hand) we must protect them by all means.


WordFast offers the option to mark text highlighted with 25% gray as "untranslatable"; this is what works on Word 2007. On Word 2003 the late "marching red ants" could be used instead.

If you are using a different CAT tool, find out how "untranslatables" should be marked.

In the case of Word 2007, first we must set Highlight to 25% gray.

The Word feature that does it is "Search and Replace".

^? represents "any char". The picture here shows Word's window to search and replace < followed by any (single) character and followed by >, using the string <^?>.

We'll have to repeat this operation a few more times, since this only replaces strings with a single character between < and >.


We'll do it with <^?^?>, then <^?^?^?>, <^?^?^?^?> etc., until Word cannot find any more instances to replace. Our text (from above) will then look like this:
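For readers who prefer to locate the tags outside Word, the repeated searches above collapse into a single regular expression. This is a minimal sketch under one assumption: a tag is anything from "<" to the next ">", with no nesting. The sample string is invented and does not reproduce Infix's actual tag syntax.

```python
import re

# Assumption: a tag runs from "<" to the next ">", with no "<" or ">" inside it.
TAG = re.compile(r"<[^<>]+>")


def tag_spans(text: str) -> list[tuple[int, int]]:
    """Return (start, end) positions of every tag, i.e. what must be protected."""
    return [m.span() for m in TAG.finditer(text)]


sample = "<a1>Hello <b>world</b><a2>"   # made-up tags, for illustration only
print(TAG.findall(sample))
print(tag_spans(sample))
```

One pattern covers tags of any length, whereas the Word approach needs one pass per tag length.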

In addition to the tags, we must protect two other elements at both ends of the TXT file:

  • the text identifier (in the beginning)


  • the entire fonts list (at the very end)


This can be done manually, selecting each one in turn and highlighting it with 25% gray.

Obviously, the file now will have to be saved as a DOC or DOCX, since a TXT can contain only text.

5. TRANSLATING

Translation is done on Word, using WordFast Classic.

As I said before, this could be done with any CAT tool, since we are dealing with plain text.

The picture on the right shows translation under way.


After the translation work is finished, the file must be saved as TXT, before importing it back to the (tagged) PDF. This will automatically remove all gray highlighting.

After translation and CAT cleanup, with the file saved as TXT, it's time to go back to Infix Pro.
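Before importing, it may be worth checking that no tag was damaged during translation. The sketch below assumes, as a working hypothesis, that tags run from "<" to ">" and must appear unchanged and in the same order in both files; `tags_intact` is an illustrative helper, and the sample strings are invented.

```python
import re

TAG = re.compile(r"<[^<>]+>")   # assumption: tags run from "<" to the next ">"


def tags_intact(source_text: str, translated_text: str) -> bool:
    """True when both texts contain the same tags in the same order."""
    return TAG.findall(source_text) == TAG.findall(translated_text)


source = "<p1>Main switch<p2>Power supply"             # made-up tags
good = "<p1>Chave geral<p2>Fonte de alimentação"
bad = "<p1>Chave geral<p 2>Fonte de alimentação"       # a tag was altered
print(tags_intact(source, good), tags_intact(source, bad))
```

In practice the two strings would be read from the exported TXT and the translated TXT.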


6. IMPORTING TXT TO PDF

Let's import the TXT file, checking both options here:

  • Automatically fit text where needed: this lets Infix shrink font size and spacing as needed to make the text fit inside its text box. We'll fix that manually later.

  • Reset letter spacing wherever it differs from the standard: as we are replacing the entire text, any such changes may no longer be necessary, or may be necessary elsewhere. We can (and should) do this manually.
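The idea behind automatic fitting can be pictured as a simple proportion. This is a conceptual sketch only: I do not know how Infix actually computes the shrinkage, and `fit_scale` and the numbers are invented for illustration.

```python
def fit_scale(text_width: float, box_width: float) -> float:
    """Factor by which to shrink text so it fits the box; never enlarges."""
    if text_width <= 0:
        return 1.0
    return min(1.0, box_width / text_width)


# A translated line that swelled to 120 units must fit a 100-unit box.
scale = fit_scale(text_width=120, box_width=100)
print(round(10 * scale, 2))   # resulting size of a 10 pt font
```

A shrink like this makes the text fit but also makes it visibly smaller than its neighbors, which is why the fit is reviewed by hand later.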


If I hadn't replaced the Segoe UI fonts with the corresponding Open Sans before, I could still do it now.

By clicking on the Fonts button, I'd open this window:

I see the fonts list on the left, and I think I don't have to replace any of them.

So I close this window, and hit OK on the previous screen, to start importing.

Surprise!

I don't have the Verdana Bold font!

Infix shows me a list of the chars I miss in this font. I could use any font I considered proper. And if the font I chose didn't have all the required chars, Infix would warn me again.

The problem here is simple. This PDF came from overseas, and my Microsoft Office is a Brazilian edition. No matter how preposterous this may be, my Verdana Bold font is named Verdana Negrito!

I set the replacement, and importing proceeds without a hitch.

The screen on the right shows the process, which took about half an hour.

Just imagine how long it would take you to manually copy & paste some 3,000 paragraphs from the translation onto the PDF! Yet some people still do it.


7. LAYOUT ADJUSTMENTS


Now we have the translated brochure, however many things may have been misplaced, by text swelling or shrinking, as well as due to the lack of an [Enter] at the end of some lines.

It is worth highlighting here that Infix has two major tools to manipulate text boxes.

The text tool, with its I-beam cursor, is used to edit text as such. When used on a text box's handles, this tool reshapes the box area, increasing or decreasing its size in either dimension, without altering the shape of the characters. Text reflows automatically when a word no longer fits on a line, or when there is room on the line above for words from the next line to move up.

The object tool, with its arrow-shaped cursor, edits a text box as a graphic object. So this tool will stretch or compress the text proportionally, as if it were printed on a rubber band.

There are other less visual and more numerically controlled tools to achieve the same results.

Another feature worth mentioning is that Infix indicates, by means of a red square dot, when only part of the text within a box is being shown, i.e. when part of the text is hidden because it doesn't fit in.

I'll show below a few cases of layout adjustments. On the left, you'll see how the text appeared immediately after importing. On the right I'll show you how it looked after the adjustments. And in between them I'll explain what was done.

I take the chance here to demonstrate that in Infix, a mouse-over on any translated text will display the source text inside a yellow box. Please note that this will only occur when the translated file is opened with Infix. If you open the same file on, say, Acrobat Reader, it won't happen.

The issue here is that the phrase on the pink background became longer than its original, so it is now displaced as far as the edge of the page.

For the sole purpose of illustration, the first solution shown on the right is what we'd get from compressing the text leftwards from the right end, using the arrow-cursor tool.

The second solution, more viable considering the space available, was obtained with the tool, shifting the text box to the left, and then right-justifying the text.

This is a simple example of a very frequent occurrence. I am using this simple example to demonstrate what I said previously, that on a PDF file line breaks are often achieved by means of the text box border, and without an end-of-line char being present. This causes text to reflow in an unpredictable manner.

It was probably there before importing the translation, as it results from numbers in the Open Sans font being thinner than those in Segoe UI, which was replaced.

However there was no (pi) end-of-line char there, not even the ¿ char for "soft" end-of-line.

Here on the left we see the tab marks with the cursor on the first line: there was none.

Here we see exactly the same text box, however with the cursor on the second line (on the lines below, the result would be the same). We have the tab mark (hiding behind the left second-line delimiter), and we have a [Tab] on one line only, the "5020".

The solution is in adding end-of-line chars. Next, select the entire text in the box (Ctrl+A), and set the tabs. Finally, add [Tab]s where they are missing.

We can see the result on the pictures on the right, first with the Infix tools visible, and next how it will look on the finished PDF.

It is important to emphasize here that this is just one out of many solutions available, and this situation will recur in many other ways. Having the text, it is easy to shape it to fit wherever we want.

Here we have three very interesting situations, not all of them interrelated.

All the situations shown here are only a few examples of the adjustments required after having translated the PDF file. It is necessary to step through every page, and fix all these issues one by one.

1. As I remarked before, text in a PDF file is delimited within text blocks. This does not prevent a paragraph delimiter within one block from being offset (as shown by the arrow) if in the original there weren't words that went all the way to the end of the line. This causes the translated paragraph to preserve these delimiters, while maybe there are words on the next line below that could move up. As translation in this case (English > Portuguese) tends to make the text swell, we must use all the space available on every page to its max.

2. If you ever observed fonts lists, you'll have noticed that they usually have variants (aka styles): regular, italic, bold, bold italic, etc. Please note that there is no "underscore" variant (this is why I chose to say font variant, not style here), though underscore is usually featured among all the style options and/or buttons in most applications (including Infix). The result is that within a PDF the underscore turns into a loose line, independent from the text. In this case, the line must be deleted, and then the corresponding text has to be selected and underscored again by Infix.
It is worth knowing that Infix has features to fine-tune the underscore line's vertical position, its width, thickness and other specs.

3. Just like the text block delimiters may reduce the text line width, the text block itself is a delimiter. Watch how much space was wasted on the right side, in different quantities by two consecutive text boxes. It is possible to adjust them (and reflow text) to the graphic elements on that page (I put the light blue line there, for reference). This may be done on one box at a time, or boxes may be clustered and adjusted together.

This is the result:

Here is a simple alignment issue. Both title and subtitle should be centered on the page. Default is having text aligned flush to the left (red circle).

Among different ways to accomplish this, a quick solution is to use the tool to "stretch" the text boxes page-wide, and then center the text alignment (red circle).

Here we have two issues:

Red arrow - Text became too long to fit the table cell.

Blue arrow - The red square dot indicates that there is hidden text, which didn't fit inside the text box.

Solutions here are simple:

On the first line, we use the arrow-cursor tool to compress the text horizontally.

On the second, using the I-beam tool, we widen the text box and left-align the text.

This is another manifestation of the same issue.

It serves to illustrate the difference between the I-beam and arrow-cursor tools in Infix, the main lesson to be learned by anyone used to other DTP programs.

For illustration purposes only, this is the solution we'd get if we tried to solve the problem using the arrow-cursor tool.

The font gets flattened, smaller than the original, compromising the publication standard.

Reducing the spacing between both characters and words...

... and increasing the spacing between lines...

... we achieve the desired effect.

This is a common problem: non-editable text is part of an illustration.

In this case we use the [Object | Image | Extract to File] feature in Infix to export this image, taking care to use proper settings regarding size and resolution.

To edit it we have to, not necessarily in this order, remove the text by painting the area with the background color, translate the text manually, format it, and put it in its proper place.

For many years, my favorite tool to do this has been Ulead PhotoImpact, bought and later discontinued by Corel. The most popular choice is Adobe Photoshop. Depending on how the illustration was created, sometimes it is possible to edit the text directly using Adobe Illustrator. Some people do graphic editing with the ancient Windows Paint.

Anyway, this part of the job is beyond our scope here. A translator who lacks the skill or the software to do it may leave this part of the job to the client, or outsource it to a computer graphics editor.

The desired result is this:

One last issue worth highlighting here is the existence of a glossary in that brochure.

Not only did it require text reflow adjustments; the translated entries also had to be sorted alphabetically.

My option was to select the corresponding part of the translated TXT file and copy it to an Excel spreadsheet, where it took just a click to sort all entries alphabetically.

For those skilled in making neat tables with Excel, this could have been done right there.

Not my case. Knowing the table measurements, I saved the sorted table contents to a TXT file, and quickly rebuilt the table using PageMaker.
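The sort itself could also be done with a short script instead of a spreadsheet. The sketch below is an alternative to what I did, not a description of it: the glossary entries are invented, and the accent-stripping key is a simplification that treats accented letters as their base letters.

```python
import unicodedata


def sort_key(entry: str) -> str:
    """Strip diacritics and case so 'Água' sorts under A instead of after Z."""
    decomposed = unicodedata.normalize("NFD", entry)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return base.casefold()


entries = ["Zona", "Índice", "Água", "Cabo", "Ângulo"]   # made-up glossary terms
print(sorted(entries))                 # plain sort: accented initials end up last
print(sorted(entries, key=sort_key))   # accent-insensitive alphabetical order
```

A plain character-code sort pushes accented initials past Z, which is the main trap when sorting Portuguese entries.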

Next, I distilled that PageMaker file into a PDF.

First, I deleted the messed-up table from the final PDF. Then I opened this new PDF file (the one distilled from PageMaker) in another instance of Infix, and copied my new table into the translated file.

This table could have been generated using any suitable software. If that program doesn't have a PDF output option, it is possible, upon installing Infix, to include it as a printer, which will from then on appear in your printers list. All you'll have to do to create a PDF is to command your software to "print to the Infix printer".

This was an overview of the major techniques I used to translate a PDF file using Infix.

Of course, there are many other features and their corresponding techniques. Also, there are always several alternate ways of obtaining the same result. DTP work means relentlessly learning. Even after using PageMaker for 25+ years, I still find new ways to do things better and faster than before. It all depends on the challenge at hand.

If you are a translator, and DTP is not worthwhile for you, this is no problem at all. You may go on translating only. Find a partner who is clever with Infix, ask them to export the text for you, translate it in TXT or XML, watching the original on Acrobat Reader for your reference, and then send it back, so your partner will import your translation and fix the translated PDF layout.

Your partner doesn't have to be a translator, however the ideal is to have someone with basic knowledge of the languages they'll be working with. For instance, I only translate between English and Portuguese, however I could do DTP work in Italian, French, or Spanish - languages that I speak, but don't translate professionally.

And yet, it doesn't cost anything to try... the Infix demo is free, and 100% functional. It will only rubberstamp all PDF pages with a warning that it was done with a demo. When you register Infix, you'll be able to remove this stamp. They also have a pay-per-save option, without having to buy the license. So you may buy it only after it has worked for you.

Just one piece of advice: Read the entire manual (250 pages). No need to memorize it all; merely develop an awareness of the resources that are available for you to do whatever you want, because there are many ways to achieve the same result.

As a final suggestion, if you find all this too challenging (or boring), and your language pair matches those I am able to work with, use the e-mail button on the left to send me a message. Maybe I am the partner you need to translate PDF files.

