This section of our create site will guide you in analysing the contents of your documents, for example by creating concordance files and word frequency lists.

For our books we use pdfbox, a PDF-to-text conversion program written in Java, to convert PDF files to text files. We then use a simple shell script to turn each text file into a word frequency list.
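As an illustration of the second step, here is a minimal Python sketch of building a word frequency list from text that has already been extracted. It is not the shell script mentioned above, only an equivalent idea. The function names and the rule for what counts as a word (runs of letters, with internal apostrophes allowed) are assumptions made for this example.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of ASCII letters, optionally with
# internal apostrophes (e.g. "don't"). Adjust for other languages.
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")


def word_frequencies(text):
    """Count how often each word occurs, ignoring case."""
    return Counter(WORD_RE.findall(text.lower()))


def format_frequency_list(counts):
    """Return one 'count<TAB>word' line per word, most frequent first.

    Words with the same count are listed alphabetically.
    """
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return "\n".join(f"{count}\t{word}" for word, count in ordered)
```

To use it, read the converted text file into a string, pass it to `word_frequencies`, and write the result of `format_frequency_list` to a new file. The tab-separated output can be opened directly in a spreadsheet program.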

If the word frequency list is then imported into OpenOffice/LibreOffice (OO/LO) or any other suitable spreadsheet program, you can calculate totals and percentages from it.
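If you would rather not use a spreadsheet, the same sums can be done in a few lines of Python. This is only a sketch: the function name `word_percentages` is made up for this example, and it expects a mapping of words to counts such as a `Counter`.

```python
from collections import Counter


def word_percentages(counts):
    """Return each word's share of the total word count, as a percentage."""
    total = sum(counts.values())
    if total == 0:
        # An empty list has no meaningful percentages.
        return {}
    return {word: 100.0 * n / total for word, n in counts.items()}


# Small made-up frequency list for illustration only.
example = Counter({"the": 3, "cat": 1})
shares = word_percentages(example)
```

Here `shares["the"]` is the percentage of all words in the example that are "the".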

We also like to use this list to look at the words that occur only once, because these are the most likely places to find spelling mistakes. When a word looks odd, you can search for it in the text file, identify the relevant page, and then go to that page in your Scribus (PDF) file. Ideally you should correct spelling errors in the manuscript file before importing it into Scribus, because the production versions of Scribus do not have a simple-to-use built-in spell-checker, though you can use the aspell program from the command line to check the Scribus .SLA file. If an error is found later, after you have finished with your manuscript files, you will have to edit the text in Scribus by hand.
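The check for single-use words described above could be sketched in Python as follows. The function names are hypothetical, and the sketch reports line numbers in the text file rather than page numbers, since how pages are marked in the converted text is not assumed here. You would still work out the page from the surrounding text.

```python
import re
from collections import Counter

# Assumption: same simple definition of a "word" as a run of letters,
# optionally with internal apostrophes.
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")


def single_use_words(text):
    """Return, in alphabetical order, the words that occur exactly once."""
    counts = Counter(WORD_RE.findall(text.lower()))
    return sorted(word for word, n in counts.items() if n == 1)


def locate_word(text, word):
    """Return (line_number, line) pairs for lines containing the word.

    Matching is case-insensitive and on whole words only.
    Line numbers start at 1.
    """
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if pattern.search(line)
    ]
```

Scanning the output of `single_use_words` by eye and then calling `locate_word` on anything that looks wrong gives you the line to search for in the text file.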

In truth, it is much easier to correct all the spelling errors in the manuscript before importing the text.


