PDF files are constantly being created by businesses and non-profit organisations to show colleagues, customers and other interested parties what material has been written or drawn and what its layout will be like once it's printed. Basically, they are exact images of documents and can be viewed on computers running on various operating systems, not just Microsoft Windows.
PDFs can either be created from other electronic file formats, such as Word .docx files, or be generated by a scanner. Depending on the settings chosen in the software, the resulting PDF files may or may not be searchable. If they are, individual words can be found in them, either because the text was carried over directly from the original file or, in the case of scans, thanks to a processing step called optical character recognition, or OCR for short. It's usually quite easy to create an editable Word file from a searchable PDF; in Adobe Acrobat XI, for example, you just select these items in the 'File' menu to export the contents into a new Word document:
The 'tough nuts', in contrast, are the scanned images of paper documents we sometimes get sent, as it can take a lot of time and effort to turn these into a reasonable editable text that can then be typed over and translated. To do this, you will need to run OCR software on the file in question to try and turn the image of the document into a set of legible and, hopefully, correctly rendered words. This can work well, especially if you use high-quality programs such as Acrobat, ABBYY FineReader or Nuance OmniPage, which come with powerful character-recognition software. But things don't always go to plan, and the results of OCR'ing a scanned image can also be very disappointing, requiring copious editing – or even a completely different approach to creating a translatable file.
This is the situation you may also find yourself in if you ever get sent a PDF file that has been protected (i.e. 'secured') in some way – by a password, for example, meaning you can only open it or add comments to it if you enter the password first (provided you are authorised to do so). If you don't have the password, you won't have full rights to use and process the file. This also means you won't be able to copy its contents and paste them into a blank Word file for translation. And what then?
Asking the customer for the password may be the obvious answer here, but if they don't have it themselves and are unable (or unwilling) to get it, what else should you do? Well, there are various suggestions about this on the internet, some of which I've tried out, but have you ever thought of using a simple work-around with a printer? That may be a faster and simpler way of getting round the password-protection issue.
If you are able to print the file out (this may not be allowed, depending on what properties the PDF has been given – see the screenshot below on how to access these in Adobe Acrobat XI), then do so using the best resolution and clearest print you can. Scan the printout and create a brand-new, multi-page PDF from it yourself. Most types of scanner software will let you do this, including the three programs mentioned above (Acrobat, FineReader and OmniPage).
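Before reaching for the printer, you may want to confirm that the file really is protected. What follows is only a rough sketch in Python, not a proper check: it assumes that a protected PDF carries an /Encrypt entry somewhere in its raw data (as encrypted PDFs normally do in their trailer) and simply searches for that name. The function name looks_protected is my own invention for illustration, and the sketch says nothing about which actions, such as printing, are actually permitted; Acrobat's document properties remain the place to look for that.

```python
import re
from pathlib import Path


def looks_protected(pdf_bytes: bytes) -> bool:
    """Rough heuristic: True if the raw PDF data contains an /Encrypt entry.

    Assumption: encrypted PDFs reference an encryption dictionary via the
    name /Encrypt. This is a plain byte search, not a real PDF parser, so
    it can occasionally be fooled.
    """
    # Match /Encrypt only when it is not followed by another letter,
    # so that longer names such as /EncryptMetadata are not counted.
    return re.search(rb"/Encrypt(?![A-Za-z])", pdf_bytes) is not None


def file_looks_protected(path: str) -> bool:
    """Read a file from disk and apply the same heuristic to its contents."""
    return looks_protected(Path(path).read_bytes())
```

Calling file_looks_protected("contract.pdf") on a file of your own (the file name here is just a placeholder) returns True or False; treat the answer as a hint only.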
Before the scanner creates the new PDF file, check or adjust the settings so that the file is made searchable; the software will then OCR it (don't forget to tell it first which language it should recognise, though). Once you've got the file, check whether the quality of the text is okay, and if it is, export the contents into a new Word file. You should now have a Word document that is straightforward to translate. A little editing may be necessary, but not much (utilities like CodeZapper and TransTools Suite will help you tidy the file up if need be).
Thanks to my German colleague Ludger Giebel for mentioning this idea.
- My earlier post on converting PDFs into a translation-friendly format using Wordfast Anywhere
- My earlier post on Acrobat XI and Acrobat Reader
- Kilgray, the maker of memoQ, on converting PDFs using various tools, including their own CAT tool
- Eric le Carre on translating PDFs using various free tools