
Adding OCR layers to your Zotero library PDF items for Metadata extraction and indexing

Zotero is a cross-platform literature manager that is able to sync to a remote server and across multiple user devices. There are many alternatives available, each with strengths and weaknesses, but I am currently using Zotero to manage my literature because it is free and works with WebDAV for additional free storage.

In this article I will describe why optical character recognition (OCR) is important for Zotero and suggest a way to add OCR to existing items in a Zotero library. However, the method actually works for any collection of PDF files on your computer!

The reason for OCR in Zotero

Zotero has a nice “Retrieve Metadata for PDF” feature that automatically scans a PDF file for metadata and then uses it to search for matching bibliography information from Google Scholar. The PDF is then nested under a parent item that is (usually) properly indexed in the internal Zotero SQLite database.

In this case Zotero found matches for most of the items. The one with the red cross appears not to be a journal article or book, but some other random (non-public) document that at some time was imported into my library.

However, if the PDF does not contain an OCR layer this feature does not work. This is often the case for older journal articles, or PDFs that were scanned from a hard copy.

If you manage a large literature library then you might have many non-OCR files in there, which are not properly indexed. Manually creating a parent item for each of them is laborious. The only practical approach is to add an OCR layer to the PDF files.

Example errors when Zotero is unable to find an OCR layer in a PDF document during attempted metadata retrieval.

Adding OCR to PDF files

There are a number of commercial, free and open source options for adding OCR to PDF files. Most famous of these is the Adobe Acrobat reader, which at the time of writing requires a monthly subscription to an “Edit” feature extension to unlock the OCR capabilities. If you have this available to you, please go ahead and use it.

If you prefer a free option there are a few available, but I had most success with ocrmypdf, written by James R. Barlow and released under the GNU GPLv3 license.

The following steps should help you get started with ocrmypdf and use it to fix those annoying OCR problems in Zotero.

Installing ocrmypdf

Linux

I am using Ubuntu 18.04.1 LTS. Before using apt-get to install ocrmypdf, it was necessary to allow additional software to be installed.

Ubuntu software repository options.

Then you should be able to do:

sudo apt-get install ocrmypdf

For more installation information please visit the project page.

General usage

On the command line terminal you can simply provide ocrmypdf with an input PDF file and the desired output file.

ocrmypdf input.pdf output.pdf

If successful, this creates a new file called output.pdf, which is a modified version of the original. The new file should hopefully contain an OCR text layer!

Usage with Zotero

My aim here is to describe a method for parsing through a large Zotero library, checking for files without an OCR layer and then adding one on the fly. We will eventually write a bash script to control this, but first I will explain how the individual steps in the script work.

One-liner for a single file

In this crude example I have created a new folder called /home/simon/Zotero/ocr/. The Zotero storage folder is in /home/simon/Zotero/storage, or simply ../storage/

The following allows you to find a file using a search string; here, for example, the filename ends with “ - kittel.pdf.pdf”. I am assuming there is only one file that matches this search string!

INPUT=`find ../storage/ -name "* - kittel.pdf.pdf"`; ocrmypdf "$INPUT" output.pdf

You can then check output.pdf manually and, if it looks fine, replace the original with it.
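The single-match assumption can be checked explicitly before anything is passed to ocrmypdf. The following is a minimal sketch, not part of the workflow described in this article; the function name find_single_match is hypothetical. It prints the matching path only when exactly one file matches and fails otherwise.

```shell
# Print the one file under DIR whose name matches PATTERN.
# Returns a non-zero status when there are zero or several matches.
# (find_single_match is a hypothetical helper name, not an existing tool.)
find_single_match() {
    local dir pattern count
    dir=$1
    pattern=$2
    # Emit one dot per match, so unusual characters in file names
    # cannot skew the count.
    count=$(( $(find "$dir" -type f -name "$pattern" -exec printf '.' \; | wc -c) ))
    if [ "$count" -ne 1 ]; then
        echo "Expected exactly 1 match, found $count" >&2
        return 1
    fi
    find "$dir" -type f -name "$pattern" -print
}

# Possible usage (requires ocrmypdf, so it is left commented out here):
# INPUT=$(find_single_match ../storage/ "* - kittel.pdf.pdf") && ocrmypdf "$INPUT" output.pdf
```

Because the function fails when the search string is ambiguous, the && in the usage line prevents ocrmypdf from running on an unexpected file.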

If you are feeling really brave, and wish to do this on the fly, without checking the new PDF file first, you can automatically replace the input file directly in the Zotero storage folder.

INPUT=`find ../storage/ -name "* - kittel.pdf.pdf"`; ocrmypdf "$INPUT" output.pdf && mv output.pdf "$INPUT"

I did not have any issues with this approach as ocrmypdf doesn’t seem to be destructive, but care should be taken when automatically replacing or deleting files! In the final script below I take a few rudimentary precautions in that sense!
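One extra precaution when replacing files automatically is to keep a copy of the original. The sketch below is only an illustration under stated assumptions: the function name replace_keeping_backup and the .bak suffix are hypothetical, and it assumes that a leftover .bak file next to the PDF in the storage folder is acceptable until you have checked the result and deleted it.

```shell
# Replace ORIGINAL with CANDIDATE, keeping the old file as ORIGINAL.bak.
# Does nothing when CANDIDATE is missing or empty.
# (Hypothetical helper; the .bak naming is an assumption.)
replace_keeping_backup() {
    local original candidate
    original=$1
    candidate=$2
    if [ ! -s "$candidate" ]; then
        echo "No usable output file: $candidate" >&2
        return 1
    fi
    # Copy first, so the original still exists if the move fails.
    cp -p -- "$original" "$original.bak" || return 1
    mv -- "$candidate" "$original"
}

# Possible usage (requires ocrmypdf, so it is left commented out here):
# ocrmypdf "$INPUT" output.pdf && replace_keeping_backup "$INPUT" output.pdf
```

If the new PDF turns out to be broken, moving the .bak file back over it restores the previous state.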

Check if a file has OCR

We will do this on the bash command line, using pdffonts to check whether a PDF document contains OCR text. pdffonts lists the fonts used in a file. A scanned PDF without any text layer uses no fonts, so nothing is listed, while a file with an OCR layer will list one or more fonts.

For example, this file does have an OCR layer:

$ pdffonts output2.pdf
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
JKMERI+GlyphLessFont                 CID TrueType      Identity-H       yes yes yes      9  0
QMAGQI+GlyphLessFont                 CID TrueType      Identity-H       yes yes yes     20  0
MEMQSK+GlyphLessFont                 CID TrueType      Identity-H       yes yes yes     38  0

So does this one:

pdffonts Unknown\ -\ Unknown\ -\ Chapter\ 28\ –\ Magnetic\ Fields\ Sources\ Goals\ for\ Chapter\ 28.pdf 
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
ABCDEE+Calibri                       TrueType          WinAnsi          yes yes no       5  0
ABCDEE+Calibri                       CID TrueType      Identity-H       yes yes yes      7  0
Times New Roman                      CID TrueType      Identity-H       yes no  yes     14  0
Times New Roman                      TrueType          WinAnsi          no  no  no      19  0
Times New Roman,Bold                 TrueType          WinAnsi          no  no  no      30  0
Symbol                               CID TrueType      Identity-H       yes no  yes     32  0
Times New Roman,Italic               TrueType          WinAnsi          no  no  no      55  0
ABCDEE+Trebuchet MS                  CID TrueType      Identity-H       yes yes yes     73  0
ABCDEE+Trebuchet MS                  TrueType          WinAnsi          yes yes no      78  0

When there is no OCR there will be no embedded fonts found:

$ pdffonts Book3.pdf
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------

Basically, we could determine whether the file has OCR by counting the lines of the output, or we can grep the output and count occurrences of the string “Type”. Here we found 7 fonts:

$ pdffonts input.pdf | grep "Type" | wc -l
7
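The other option mentioned above, counting the lines of the output, can be sketched as follows. This is a minimal example that assumes pdffonts always prints exactly two header lines (the column names and the dashed rule), as in the listings shown here; the function name count_font_rows is hypothetical.

```shell
# Count the font rows in pdffonts output read from standard input.
# Assumes the first two lines are the column header and the dashed rule.
count_font_rows() {
    awk 'NR > 2 && NF > 0 { n++ } END { print n + 0 }'
}

# Example with a captured listing that contains a single font row:
printf '%s\n' \
    'name                                 type              encoding         emb sub uni object ID' \
    '------------------------------------ ----------------- ---------------- --- --- --- ---------' \
    'Helvetica-Bold                       Type 1            WinAnsi          no  no  no       5  0' |
    count_font_rows
```

With a real file this would be called as pdffonts input.pdf | count_font_rows. Counting rows this way does not depend on which text appears in the type column.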

A slight problem with this approach is that some downloaded papers have no OCR layer for the actual, useful text, but the publisher’s website adds a text layer for a watermark on download. In this case a font will be found:

Paper downloaded from AIP website has a useless OCR layer added just to watermark the download event. This is an attempt to limit piracy but doesn’t help the user at all.

$ pdffonts "$INPUT"
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
Helvetica-Bold                       Type 1            WinAnsi          no  no  no       5  0

Perhaps we can change our search rule to require more than one embedded font, but I have not yet ruled out the possibility that genuine, useful OCR might mean only one PDF font is present in the file.
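A possible alternative to counting fonts, which is not used or tested in this article, is to measure how much text can actually be extracted from the file. The sketch below assumes pdftotext is available (on Ubuntu it ships in the same poppler-utils package as pdffonts). The function name has_substantial_text and the default threshold of 200 characters are hypothetical and arbitrary; a watermark repeated on every page of a long paper could still exceed it.

```shell
# Succeed when the text on standard input contains at least MIN_CHARS
# non-whitespace characters. The default of 200 is an arbitrary,
# illustrative threshold, not a tested value.
has_substantial_text() {
    local min_chars chars
    min_chars=${1:-200}
    chars=$(( $(tr -d '[:space:]' | wc -c) ))
    [ "$chars" -ge "$min_chars" ]
}

# Possible usage (requires pdftotext, so it is left commented out here):
# if pdftotext "$INPUT" - | has_substantial_text 200; then
#     echo "Text layer looks usable"
# else
#     echo "Little or no text found, OCR is probably needed"
# fi
```

The trailing "-" tells pdftotext to write the extracted text to standard output instead of a file.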

A script

Pulling it all together, the following prototype bash script will find a list of files that match a specific string, then loop through them checking if they are missing OCR. If they are, it will call ocrmypdf on them. By removing any existing temp.pdf file before calling ocrmypdf we eliminate some of the risk of replacing our original files with a stale or corrupted output file. That being said, please use it with caution and at your own risk!

#!/bin/bash

find ../storage/ -type f -name "Unknown - Unknown - 1*.pdf.pdf" -print0 |
while IFS= read -r -d '' file; do

    # Remove any existing temp.pdf, so that a stale output from a previous
    # iteration can never be moved over the current file after an error.
    rm -f ./temp.pdf

    printf '%s\n' "$file"

    # Count the fonts that pdffonts lists for this file.
    num=$(pdffonts "$file" | grep -c "Type")

    if [ "$num" -lt 1 ]; then
        echo "File has no OCR so let's do it..."

        # Use this with extreme caution: the original is only replaced if
        # ocrmypdf exited successfully and temp.pdf was actually created.
        if ocrmypdf "$file" ./temp.pdf && [ -f ./temp.pdf ]; then
            mv ./temp.pdf "$file"
        else
            echo "OCR was unsuccessful so not updating file!"
        fi

    fi

done
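Before letting a script replace anything, it can be useful to list the files it would touch. The following dry-run sketch only prints candidates and never modifies a file. The function name list_files_without_fonts and the FONT_TOOL variable are hypothetical; FONT_TOOL defaults to pdffonts and exists only so that the function can be tried with a stand-in command. It assumes the two-line pdffonts header shown earlier.

```shell
# Dry run: print every file under DIR matching PATTERN for which the font
# tool lists no fonts. Nothing is modified.
# (Hypothetical helper; FONT_TOOL is an illustrative override variable.)
list_files_without_fonts() {
    local dir pattern tool
    dir=${1:-../storage/}
    pattern=${2:-'*.pdf'}
    tool=${FONT_TOOL:-pdffonts}
    # Without the tool every file would look font-free, so stop early.
    command -v "$tool" >/dev/null 2>&1 || { echo "$tool not found" >&2; return 1; }
    find "$dir" -type f -name "$pattern" -exec sh -c '
        tool=$1
        shift
        for file in "$@"; do
            # Rows after the two header lines are fonts.
            fonts=$(( $("$tool" "$file" 2>/dev/null | awk "NR > 2 && NF > 0" | wc -l) ))
            if [ "$fonts" -eq 0 ]; then
                printf "%s\n" "$file"
            fi
        done
    ' sh "$tool" {} +
}

# Possible usage (requires pdffonts, so it is left commented out here):
# list_files_without_fonts ../storage/ "Unknown - Unknown - 1*.pdf.pdf"
```

A file that the font tool cannot read at all also produces no font rows, so it will appear in the list too; such files are worth checking by hand.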

Once files have had OCR added, you can call the Metadata extraction tool in Zotero.

Summing up

In my experience, ocrmypdf did an excellent job of adding OCR to journal articles. Sometimes Zotero was unable to identify enough of the metadata from the newly added text to get a positive match from Google Scholar, but this is mostly an issue with Zotero, not the OCR. Even then, a parent item with at least the authors and title was created, making indexing far better than it was before. When I checked, a lot of other items in my library also did not have the full metadata included in the parent item, so it is always better to import a full, proper bibliography entry from an external source than to rely on Zotero to extract metadata. However, if this is not possible, say you imported a load of PDF files given to you by a friend or colleague, the ability to add OCR is a massive help.

 




ZotFile for syncing PDF articles from Zotero to my eReader

I use Zotero to manage my literature collection, including all the associated PDF attachments. It really made my life easier when I set up the WebDAV file sync on Box. However, until now the only way to sync files to my Onyx Boox M96 eReader was by connecting a USB cable and copying them manually to the device. Since Zotero stores the files in cryptically named individual folders, it is hard to do this manually in an organised manner and involves lots of clicking. Today I am going to find a better way.

This is what we want! Our technical documents on a large-screen eInk device that allows annotation. Great for proofreading!


I searched for an Android version of Zotero, hoping that I could just sync to the extra device. Aside from the fact that the article files would take up too much of my internal flash storage space, there is actually no official Zotero for Android; only a discontinued third-party project called Zotero Reader and a paid option called Zandy, which had mixed reviews on the Play Store.

But then there is ZotFile, which can be installed on your PC as a Firefox add-on or as an extension to the stand-alone Zotero, and gives additional functionality. ZotFile is able to copy the article attachment (PDF) to a location on your PC or Mac, for example a Dropbox folder, that is set up to sync with an external device. It can also extract any annotations to the PDF that you create on the device. My eReader runs Android so I have a number of options for the software that does the syncing, including Dropbox, Google Drive or even BitTorrent Sync. I am jumping between Windows and Ubuntu on my main PC and I had issues doing this with BTSync in a single shared folder before, but as long as the collection of files does not grow too big I can keep separate sync folders in Win7 and Ubuntu to avoid any issues. Using BTSync means that you have a collection of devices syncing to each other but your files are not stored in the cloud on someone else’s servers. All transfers are encrypted so in principle it is secure.

OK, here are the steps I took.

1. Install ZotFile as a Firefox Add-on (to work with the Zotero Firefox Add-on).

2. Create a new folder somewhere on the PC that will be the sync folder. In Ubuntu I chose “/home/username/Documents/ZotFileSync” and in Windows I chose “C:/Users/Documents/ZotFileSync” (or whatever). The important thing is that it is not the same folder on a shared drive, something that caused problems for me in the past.

3. In Firefox, find the ZotFile preferences. In the second tab, check the box to “Use ZotFile to send and get files from tablet”. Give it the sync “Base” folder path. I also chose to create subfolders and save a copy of annotated files with the suffix “_annotated”.

Setting up ZotFile

The ZotFile preferences can be found inside Zotero by going to Tools > Add-ons > ZotFile > Options.

4. Use BTSync or Dropbox or whatever you like to sync this folder to the cloud.

5. Set up BTSync or Dropbox or whatever on your tablet/eReader to sync the files to/from the cloud. You might need to install a file manager app in order to create a new sync folder on your device. I used ES File Explorer, which incidentally seems to work quite well on an eInk screen.

BTSync on eInk!

BitTorrent sync in the Play store on my Onyx Boox M96 eReader.

Now, in Zotero (on your PC) you can right-click an item, Manage Attachments and Send to Tablet (see image below). It should shortly appear on your device, as long as it is connected to the internet and syncing with the cloud. Just like magic!

Sending an article to your sync folder.

Sending an article to your handheld device from Zotero is easy with ZotFile, even though it is simply copying it to a folder that is synced by separate software.

UPDATE:

There are a few (device-specific) issues with this solution.

1) BitTorrent Sync is not running when I boot the eReader. I have to manually start it. It also forgets to sync my shared folder automatically so I have to remind it every time I run the app. I hope that the option to run apps on startup will be included on a later firmware update.

2) The Onyx Boox M96 (Booxtor edition) only scans the internal storage for new books when you boot the device, with no option to manually scan from inside the OS. That means that once the synced files appear in the btsync shared folder the device library doesn’t show them until the next reboot. Again, I hope a manual “scan for new books” will be added to the library app on a later OS update. You could alternatively use a 3rd party reading app that manages its own database of books on the device.

These two issues combined make the process of transferring documents to the eReader a little less automatic than I would like. However, I have eliminated the need to attach the eReader to my PC via USB cables. Let’s see if the arbiters of the device will tweak the software for us in the future.

At the moment I don’t seem to be able to export my annotated PDF files, so I have not tested the annotation extraction feature of ZotFile yet.




Syncing Zotero files with WebDAV from Box

It’s hard to stay organized when you work on multiple computers with multiple operating systems. My main notebook is a dual-boot Ubuntu/Win7 machine where I have a shared partition for work files. I sync my work folder with my Ubuntu tower PC via BitTorrent Sync. This has now been working well for some time, with two drawbacks. First, the syncing happens under Ubuntu only: if BTSync under Windows also tries to sync the same shared folder it causes problems, so I avoided doing so. Second, if you start with two identical copies of the folder on the two PCs, it still wants to sync all of the files one way over the network. Not good when the folder is 100 GB large! (Update: this may have been solved in the latest version of the software.)

Anyway, for the last half year I have been back using Zotero as my bibliography manager. It makes importing citations and the associated full-text PDFs from a journal website relatively easy, although there are some difficulties with importing PDFs from a local drive, particularly when an old file has no OCR layer. Then I sometimes have to input the metadata manually.

So my Zotero folder (data and files) went over the 300 MB free size limit for the online Zotero storage. That means that my library on Windows and my library on Ubuntu (I cannot share it locally, it seems, due to formatting and access rights issues – need two copies for now!) are no longer synced. I managed to get access to some papers under Windows and now I want to use them in Ubuntu and they are not there. It could easily turn into a big mess of files and folders and I want to keep it somehow automagically synced using Zotero.

So finally I turned to the WebDAV option in Zotero. This lets you keep the data (i.e. library metadata and collections) synced on the Zotero server and the files (i.e. PDF and other attachments) on a separate server. So, how can you get a WebDAV server?

I got a free personal storage account with Box (10 GB) and in the Zotero preferences I am able to set up a server, that is, https://dav.box.com/dav/zotero/

Don’t forget to create a folder named “zotero” in your Box account first! Also, you can now purge your files from storage using the Zotero web interface under storage preferences. This frees up all of that space that was previously clogged up with attachments.

There you go, you now have 10 GB of free storage for your Zotero library!