Unicode characters dropped in PDF files generated with iText and Flying Saucer

Flying Saucer is a very useful Java library that uses iText to convert HTML pages to PDF documents. Here is a nice tutorial on how to use Flying Saucer.

For the last few days I had been trying, unsuccessfully, to generate a report that contained non-Latin Unicode characters (in my case Greek, but I guess the same problem exists for other scripts as well, like Cyrillic, Armenian, etc.). The problem was that the Greek characters were silently omitted; they didn't show up in the document at all.

The code I was using was more or less like this (the conversion itself goes through iText's XMLWorkerHelper):

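import java.awt.Desktop;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.pdf.PdfWriter;
import com.itextpdf.tool.xml.XMLWorkerHelper;

public class Html2Pdf {

    public static void main(String[] args) throws DocumentException, IOException {
        File file = new File("output.pdf");
        Document document = new Document();
        PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(file));
        document.open();
        // Parse the XHTML input and write its content into the open PDF document
        XMLWorkerHelper.getInstance().parseXHtml(writer, document, new FileInputStream("input.html"));
        document.close();
        // Open the generated PDF in the default viewer
        Desktop.getDesktop().open(file);
    }
}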

And the input HTML file was something like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
    <body>
        <h1>Αρνάκι άσπρο και παχύ</h1>
    </body>
</html>

When I tried to convert this simple HTML to PDF, I got a blank page.

After many hours of troubleshooting, I finally discovered that, when no font is specified, the generated PDF falls back to a default font (probably Helvetica) with a very limited character set that does not include Greek.
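To see in advance which characters are at risk, a rough check like the one below can help. It is only a sketch, and it rests on an assumption that is not confirmed in this post: that the default font covers roughly the Windows-1252 (WinAnsi) range. The class name `UnsupportedCharFinder` and its method are hypothetical names made up for this illustration; only standard JDK classes are used.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.util.LinkedHashSet;
import java.util.Set;

public class UnsupportedCharFinder {

    // Assumption: the fallback font covers roughly the Windows-1252 (WinAnsi) range.
    private static final Charset WIN_ANSI = Charset.forName("windows-1252");

    /** Returns the distinct characters of the text that Windows-1252 cannot encode. */
    public static Set<Character> findUnsupported(String text) {
        CharsetEncoder encoder = WIN_ANSI.newEncoder();
        Set<Character> unsupported = new LinkedHashSet<>();
        for (char c : text.toCharArray()) {
            if (!encoder.canEncode(c)) {
                unsupported.add(c);
            }
        }
        return unsupported;
    }

    public static void main(String[] args) {
        // The first word of the sample heading, written with Unicode escapes
        // so that this source file stays ASCII-only.
        String greekWord = "\u0391\u03c1\u03bd\u03ac\u03ba\u03b9";
        System.out.println(findUnsupported("Report: " + greekWord));
    }
}
```

If the returned set is not empty, the text most likely needs a font that is chosen explicitly, which is what the fix below does.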

So I came up with this simple trick, which seems to solve the problem: I only had to make sure that all elements in my HTML file use a font that contains Greek glyphs, like Arial:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>

    <head>
        <style>
        * { font-family: Arial; }
        </style>
    </head>
    <body>
        <h1>Αρνάκι άσπρο και παχύ</h1>
    </body>
</html>

Arial is a pretty standard font, installed by default on most operating systems, and it covers a wide variety of alphabets (including Greek).
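When the HTML comes from a template or another system and can't be edited by hand, the same font rule could be injected in code just before the conversion. The following is a minimal sketch under that assumption; `FontStyleInjector` and `injectFontRule` are hypothetical names, and the plain string matching only copes with simple, lowercase markup like the sample above.

```java
public class FontStyleInjector {

    // The same rule as in the HTML above; Arial is the font this post settles on.
    private static final String FONT_RULE = "<style>* { font-family: Arial; }</style>";

    /**
     * Inserts the font rule right after the opening head tag, or creates a
     * head element before the body when the document has none.
     * Hypothetical helper: it does not parse HTML, it only matches strings.
     */
    public static String injectFontRule(String html) {
        int head = html.indexOf("<head>");
        if (head >= 0) {
            int insertAt = head + "<head>".length();
            return html.substring(0, insertAt) + FONT_RULE + html.substring(insertAt);
        }
        int body = html.indexOf("<body");
        if (body >= 0) {
            return html.substring(0, body) + "<head>" + FONT_RULE + "</head>" + html.substring(body);
        }
        // No recognizable structure: leave the input untouched.
        return html;
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>Title</h1></body></html>";
        System.out.println(injectFontRule(html));
    }
}
```

The resulting string can then be wrapped in a ByteArrayInputStream (encoded as UTF-8) and passed to the converter in place of the FileInputStream used earlier.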

I hope this helps…

 


