I was working on my site, Curriculum Vitae, and needed to generate PDFs and extract data from PDF files so that users can download their CV in PDF format.

I had the following requirements for the project:

  • Generate a PDF document from HTML. The generated PDF documents had to look just like the HTML templates I created.
  • Generate an image from HTML. Used for previews of your CV.
  • Extract an image from a PDF. I need one big image of the entire document, which I am going to use to attach documents to all generated CVs.

So how are we going to do this?

I was using a good old Python script I wrote a couple of years ago, but it just seems clunky:

import sys

from PySide.QtCore import *
from PySide.QtGui import *
from PySide.QtWebKit import *

app = QApplication(sys.argv)

web = QWebView()

# Load a remote URL, or read the HTML from a local file
if "http://" in sys.argv[1]:
    web.load(QUrl(sys.argv[1]))
else:
    f = open(sys.argv[1], 'rb')
    content = "".join(f.readlines())
    web.setHtml(content)

# Print to an A4 PDF file instead of a physical printer
printer = QPrinter()
printer.setPageSize(QPrinter.A4)
printer.setOutputFormat(QPrinter.PdfFormat)
printer.setFullPage(True)
printer.setOutputFileName(sys.argv[2])

def convertIt():
    # Render the loaded page into the PDF file, then quit
    web.print_(printer)

    print(sys.argv[2])

    QApplication.exit()

QObject.connect(web, SIGNAL("loadFinished(bool)"), convertIt)

sys.exit(app.exec_())

This met the requirements, and PySide is a brilliant and fast binding for Python, but I wanted a solution that did not have to write a temporary file, and I did not want to keep this old script afloat. It needs to retire some time.

Then I found WKHtmlToPDF. I was very excited: finally someone had created a binary that does what I had been doing with the PySide binding.

I checked the documentation and the binary even allows input and output over stdin and stdout! I could have done that with Python too, but it would have taken more of my time. This allowed me to create a PDF that I could easily write out to the client. Bingo!
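To make the stdin/stdout idea concrete, here is a rough sketch using Node's synchronous spawnSync. The helper name pipe_through is made up for this post, and the wkhtmltopdf call in the usage comment assumes the binary is on your PATH. Treat it as an illustration of the piping, not as the final wrapper.

```javascript
var spawnSync = require('child_process').spawnSync;

/**
 * Pipes `input` to a command on stdin and returns what it wrote to stdout.
 * Hypothetical helper for illustration, not part of wkhtmltopdf.
 */
function pipe_through(command, args, input) {
    var result = spawnSync(command, args, {
        input: input,               // written to the child's stdin
        maxBuffer: 64 * 1024 * 1024 // leave room for large documents
    });

    // The command could not be started at all (e.g. not installed)
    if (result.error) throw result.error;

    // The command ran but reported a failure
    if (result.status !== 0) {
        throw new Error(command + ' exited with ' + result.status + ': ' + result.stderr.toString());
    }

    return result.stdout; // a Buffer holding the binary output
}

// Usage, assuming wkhtmltopdf is installed:
// var pdf = pipe_through('wkhtmltopdf', ['-q', '-', '-'], '<h1>My CV</h1>');
```

No temporary file is involved: the HTML goes in on stdin and the PDF comes back as a Buffer.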

Required Actions before we can start

  • Download WKHtmlToPDF or WKHtmlToImage and remember to get the static build with the patched Qt.
  • Create an alias for wkhtmltopdf and/or wkhtmltoimage which points to the binaries.
  • Install GhostScript to generate images from PDF. Ensure the convert command (part of ImageMagick) is available.
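A quick way to verify the list above is to try starting each binary. This is only a sketch: is_available is a name I made up, and it checks nothing more than whether the command can be launched from the PATH. The exit code is ignored on purpose.

```javascript
var spawnSync = require('child_process').spawnSync;

/**
 * Returns true when `command` can be started at all.
 * Hypothetical helper; only a failure to launch counts as "missing".
 */
function is_available(command) {
    var result = spawnSync(command, ['--version'], { stdio: 'ignore' });

    // result.error is set when the binary could not be spawned (e.g. ENOENT)
    return !result.error;
}

['wkhtmltopdf', 'wkhtmltoimage', 'convert'].forEach(function (name) {
    if (!is_available(name)) console.warn(name + ' was not found on the PATH');
});
```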

How to do this in PHP

The site is mostly written in PHP, so I did a PHP wrapper first.

Generate a PDF from HTML

/**
* Returns the Binary Content of the Generated PDF from the HTML
* @author Johann du Toit
*/
function pdf_from_html($html) {
    $descriptorspec = array(
        0 => array('pipe', 'r'), // stdin
        1 => array('pipe', 'w'), // stdout
        2 => array('pipe', 'w'), // stderr
    );
    // "-" as input and output makes wkhtmltopdf read stdin and write stdout
    $process = proc_open('wkhtmltopdf -q - -', $descriptorspec, $pipes);

    // Send the HTML on stdin
    fwrite($pipes[0], $html);
    fclose($pipes[0]);

    // Read the outputs
    $contents = stream_get_contents($pipes[1]);
    $errors = stream_get_contents($pipes[2]);

    // Close every pipe before closing the process
    fclose($pipes[1]);
    fclose($pipes[2]);
    $return_value = proc_close($process);

    return $contents;
}

Generate an Image from HTML

/**  
* Returns the Binary Content of the Image from the HTML
* @author Johann du Toit
*/
function image_from_html($html) {  
    $descriptorspec = array(
        0 => array('pipe', 'r'), // stdin
        1 => array('pipe', 'w'), // stdout
        2 => array('pipe', 'w'), // stderr
    );
    $process = proc_open('wkhtmltoimage -q - -', $descriptorspec, $pipes);

    // Send the HTML on stdin
    fwrite($pipes[0], $html);
    fclose($pipes[0]);

    // Read the outputs
    $contents = stream_get_contents($pipes[1]);
    $errors = stream_get_contents($pipes[2]);

    fclose($pipes[1]);
    fclose($pipes[2]);
    $return_value = proc_close($process);

    return $contents;
}

Generate an Image of a Document from PDF

/**
* Returns the Binary Content of an Image Generated from a PDF
* @author Johann du Toit
*/
function image_from_pdf($pdf_path) {  
    $descriptorspec = array(
        0 => array('pipe', 'r'), // stdin
        1 => array('pipe', 'w'), // stdout
        2 => array('pipe', 'w'), // stderr
    );
    $process = proc_open('convert -density 350% -quality 85 -append pdf:- png:-', $descriptorspec, $pipes);

    // Send the PDF on stdin
    fwrite($pipes[0], file_get_contents($pdf_path));
    fclose($pipes[0]);

    // Read the outputs
    $contents = stream_get_contents($pipes[1]);
    $errors = stream_get_contents($pipes[2]);

    fclose($pipes[1]);
    fclose($pipes[2]);
    $return_value = proc_close($process);

    return $contents;
}

How to do this in NodeJS

Generate a PDF from HTML

  
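var spawn = require('child_process').spawn;

/**
* Returns the Binary Content of the PDF Generated from the HTML
* @author Johann du Toit
*/
function html_to_pdf(html, fn, err) {

    // "-" as input and output makes wkhtmltopdf read stdin and write stdout
    var child_process = spawn('wkhtmltopdf', ['-', '-']);

    // Output arrives in several chunks, so collect all of them
    var output = [], errors = [];

    // 'close' fires once the process has ended and its streams are done
    child_process.on('close', function (code) {
        if(code === 0) fn(Buffer.concat(output));
        else err(Buffer.concat(errors));
    });

    child_process.stdout.on('data', function (data) {
        output.push(data);
    });

    child_process.stderr.on('data', function (data) {
        errors.push(data);
    });

    child_process.stdin.write(html);
    child_process.stdin.end();
}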
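Since the whole point is letting users download their CV, here is a minimal sketch of writing the PDF out to the client. The name send_pdf is my own, `res` is assumed to behave like Node's http.ServerResponse, and the filename is assumed to contain no quotes or other special characters.

```javascript
/**
 * Writes a PDF buffer to an HTTP response as a file download.
 * Hypothetical helper; `res` only needs writeHead() and end().
 */
function send_pdf(res, pdf, filename) {
    res.writeHead(200, {
        'Content-Type': 'application/pdf',
        'Content-Length': pdf.length,
        // "attachment" asks the browser to download instead of display
        'Content-Disposition': 'attachment; filename="' + filename + '"'
    });
    res.end(pdf);
}

// Usage inside a request handler, assuming html_to_pdf from above:
// html_to_pdf(html, function (pdf) { send_pdf(res, pdf, 'cv.pdf'); }, function (e) { /* handle error */ });
```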

Generate an Image from HTML

var spawn = require('child_process').spawn;

/**
* Returns an Image created from the HTML given to the method.
* @author Johann du Toit
*/
function html_to_image(html, fn, err) {

    var child_process = spawn('wkhtmltoimage', ['-', '-']);

    // Output arrives in several chunks, so collect all of them
    var output = [], errors = [];

    // 'close' fires once the process has ended and its streams are done
    child_process.on('close', function (code) {
        if(code === 0) fn(Buffer.concat(output));
        else err(Buffer.concat(errors));
    });

    child_process.stdout.on('data', function (data) {
        output.push(data);
    });

    child_process.stderr.on('data', function (data) {
        errors.push(data);
    });

    child_process.stdin.write(html);
    child_process.stdin.end();
}

Generate an Image of a Document from PDF

var spawn = require('child_process').spawn;

/**
* Returns the Binary Content of an Image Generated from a PDF
* @author Johann du Toit
*/
function pdf_to_image(pdf_content, fn, err) {

    // -append stacks all pages into one tall image
    var child_process = spawn('convert', ['-density', '350%', '-quality', '85', '-append', 'pdf:-', 'png:-']);

    // Output arrives in several chunks, so collect all of them
    var output = [], errors = [];

    // 'close' fires once the process has ended and its streams are done
    child_process.on('close', function (code) {
        if(code === 0) fn(Buffer.concat(output));
        else err(Buffer.concat(errors));
    });

    child_process.stdout.on('data', function (data) {
        output.push(data);
    });

    child_process.stderr.on('data', function (data) {
        errors.push(data);
    });

    child_process.stdin.write(pdf_content);
    child_process.stdin.end();
}
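A cheap sanity check before handing anything to a user is to look at the first bytes of the result. PDF files start with "%PDF-" and PNG files start with a fixed 8-byte signature. The function names below are my own, and this is only a sketch that checks the header, not the whole file.

```javascript
// The fixed 8-byte signature at the start of every PNG file
var PNG_SIGNATURE = Buffer.from([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]);

/**
 * Returns true when `data` is a Buffer that starts with the PDF header.
 */
function looks_like_pdf(data) {
    return Buffer.isBuffer(data) && data.subarray(0, 5).toString('ascii') === '%PDF-';
}

/**
 * Returns true when `data` is a Buffer that starts with the PNG signature.
 */
function looks_like_png(data) {
    return Buffer.isBuffer(data) && data.subarray(0, 8).equals(PNG_SIGNATURE);
}

// Usage with the functions above:
// pdf_to_image(pdf, function (png) { if (!looks_like_png(png)) { /* reject */ } }, function (e) { /* handle error */ });
```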

And that's it

There you have it: generating PDF documents and images from various inputs in PHP and NodeJS. Nothing advanced, but always good to have in your toolbox.

Have a better way? Let me know!