Converting an HTML website directly to PDF is quite complex, so I propose doing it in stages:
- website to XHTML
- use XSLT script to convert XHTML to XSL-FO
- use Apache FOP to convert XSL-FO to PDF
I also considered loading the website into Word or OpenOffice and then converting to XSL-FO, although this would require more manual intervention. The native OpenOffice format is XML-based, so it could be converted to XSL-FO using an XSLT script.
Website HTML to XHTML
This feature requires Kontent to read each HTML file it manages (every HTML file under the current directory) and convert its content as described here:
http://www22.brinkster.com/beeandnee/techzone/articles/htmltoxhtml.asp
It then writes the converted file to disk as XHTML.
The program already has code that reads HTML (using SAX) to get the title for the index generator, so this would require a new tab, similar to the others, that works as described here.
If handling CSS proves too hard, it could be left for a future stage.
If a <p> is found, the program needs to check for a closing </p>; if none is found, one must be inserted before the next formatting tag (for example the next <p> or a heading).
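The paragraph rule above could be sketched roughly as follows. This is only an illustration, not Kontent's actual code: the class name ParagraphCloser, the method closeParagraphs, and the list of tags treated as ending a paragraph are all assumptions, and the regex-based scan ignores complications such as tags inside comments or scripts.

```java
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: inserts a missing </p> before the next block-level tag.
final class ParagraphCloser {
    // Assumed list of tags that end an open paragraph; illustrative, not exhaustive.
    private static final Set<String> BLOCK_TAGS = Set.of(
        "p", "h1", "h2", "h3", "h4", "h5", "h6", "div", "ul", "ol",
        "table", "pre", "blockquote", "hr", "body", "html");

    // Matches an opening or closing tag and captures "/" (if any) and the tag name.
    private static final Pattern TAG =
        Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>");

    static String closeParagraphs(String html) {
        StringBuilder out = new StringBuilder();
        boolean paragraphOpen = false;
        int last = 0;
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            out.append(html, last, m.start());          // text before this tag
            boolean closing = !m.group(1).isEmpty();
            String name = m.group(2).toLowerCase(Locale.ROOT);
            if (name.equals("p") && closing) {
                paragraphOpen = false;                   // author closed it explicitly
            } else if (paragraphOpen && BLOCK_TAGS.contains(name)) {
                out.append("</p>");                      // close the dangling paragraph
                paragraphOpen = false;
            }
            if (name.equals("p") && !closing) {
                paragraphOpen = true;
            }
            out.append(m.group());
            last = m.end();
        }
        out.append(html, last, html.length());
        if (paragraphOpen) {
            out.append("</p>");                          // paragraph still open at end of input
        }
        return out.toString();
    }
}
```

For example, closeParagraphs("<p>one<p>two</p><h1>t</h1>") returns "<p>one</p><p>two</p><h1>t</h1>". The sketch leaves tag case untouched, so lowercasing element names for XHTML would be a separate step.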
The new files should be put in a new directory tree with the same structure as the original files. All HTML files (.html and .htm) are converted to XHTML; other files, such as .gif and .jpeg, are copied into the new directory tree unchanged.
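A minimal sketch of this mirroring step, using java.nio.file, might look like the following. The class name TreeMirror, the .xhtml output extension, and the converter parameter are assumptions made for illustration; the sketch also assumes the target tree lies outside the source tree and that the pages are UTF-8 encoded.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Locale;
import java.util.function.UnaryOperator;
import java.util.stream.Stream;

// Hypothetical helper: mirrors a site into a new tree, converting HTML on the way.
final class TreeMirror {
    static boolean isHtml(Path file) {
        String lower = file.getFileName().toString().toLowerCase(Locale.ROOT);
        return lower.endsWith(".html") || lower.endsWith(".htm");
    }

    // Maps a source file to its place in the target tree.
    // HTML files get an assumed ".xhtml" extension; other files keep their name.
    static Path targetFor(Path sourceRoot, Path targetRoot, Path file) {
        Path relative = sourceRoot.relativize(file);
        if (isHtml(file)) {
            String name = file.getFileName().toString();
            String base = name.substring(0, name.lastIndexOf('.'));
            relative = relative.resolveSibling(base + ".xhtml");
        }
        return targetRoot.resolve(relative);
    }

    // Walks the source tree (parents before children), recreating directories,
    // converting HTML files with the supplied converter and copying the rest.
    static void mirror(Path sourceRoot, Path targetRoot,
                       UnaryOperator<String> converter) throws IOException {
        try (Stream<Path> paths = Files.walk(sourceRoot)) {
            for (Path p : (Iterable<Path>) paths::iterator) {
                if (Files.isDirectory(p)) {
                    Files.createDirectories(targetRoot.resolve(sourceRoot.relativize(p)));
                } else if (isHtml(p)) {
                    // readString assumes UTF-8; older pages may need another charset.
                    String converted = converter.apply(Files.readString(p));
                    Files.writeString(targetFor(sourceRoot, targetRoot, p), converted);
                } else {
                    Files.copy(p, targetFor(sourceRoot, targetRoot, p),
                               StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }
    }
}
```

With these assumptions, site/a/index.html would be written to out/a/index.xhtml, while site/img/logo.gif would be copied to out/img/logo.gif. Passing the converter as a parameter keeps the directory walk independent of how the HTML is actually cleaned up.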
node.convertToXhtml
nodeDir.convertToXhtml
nodeHTML.convertToXhtml
XHTML to XSL-FO
Here is an XSLT script to do the conversion: http://www.antennahouse.com/XSLsample/XSLsample.htm
This article explains the issues: http://www-106.ibm.com/developerworks/library/x-xslfo2app/
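To show how the XSLT stage could be driven from Java, here is a rough sketch using the standard javax.xml.transform API. The embedded stylesheet is a toy written for this example (it only handles h1 and p) and is not the Antenna House script linked above; the class name XhtmlToFo and method toFo are likewise assumptions. In practice the real stylesheet would be loaded from a file instead.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: applies an XSLT stylesheet to XHTML to produce XSL-FO.
final class XhtmlToFo {
    // Toy stylesheet for illustration only: one page master, headings and paragraphs.
    static final String STYLESHEET = String.join("\n",
        "<xsl:stylesheet version='1.0'",
        "    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'",
        "    xmlns:fo='http://www.w3.org/1999/XSL/Format'",
        "    xmlns:h='http://www.w3.org/1999/xhtml'",
        "    exclude-result-prefixes='h'>",
        "  <xsl:template match='/h:html'>",
        "    <fo:root>",
        "      <fo:layout-master-set>",
        "        <fo:simple-page-master master-name='page'>",
        "          <fo:region-body margin='2cm'/>",
        "        </fo:simple-page-master>",
        "      </fo:layout-master-set>",
        "      <fo:page-sequence master-reference='page'>",
        "        <fo:flow flow-name='xsl-region-body'>",
        "          <xsl:apply-templates select='h:body/*'/>",
        "        </fo:flow>",
        "      </fo:page-sequence>",
        "    </fo:root>",
        "  </xsl:template>",
        "  <xsl:template match='h:h1'>",
        "    <fo:block font-size='18pt' font-weight='bold'><xsl:value-of select='.'/></fo:block>",
        "  </xsl:template>",
        "  <xsl:template match='h:p'>",
        "    <fo:block space-after='6pt'><xsl:value-of select='.'/></fo:block>",
        "  </xsl:template>",
        "</xsl:stylesheet>");

    // Transforms a well-formed XHTML string into an XSL-FO string.
    static String toFo(String xhtml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xhtml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            // Input that is not well-formed XML, or a broken stylesheet, ends up here.
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html xmlns='http://www.w3.org/1999/xhtml'>"
            + "<body><h1>Title</h1><p>Hello</p></body></html>";
        System.out.println(toFo(xhtml));
    }
}
```

The transform only works on well-formed XML, which is why the HTML to XHTML stage has to come first. The resulting XSL-FO string is what would then be handed to Apache FOP.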
XSL-FO to PDF
See http://xml.apache.org/fop/