Category: Software

Adding OCR layers to your Zotero library PDF items for metadata extraction and indexing

Zotero is a cross-platform literature manager that is able to sync to a remote server and across multiple user devices. There are many alternatives available, each with strengths and weaknesses, but I am currently using Zotero to manage my literature because it is free and works with WebDAV for additional free storage.

In this article I will describe why optical character recognition (OCR) is important for Zotero and suggest a way to add OCR to existing items in a Zotero library. However, the method actually works for any collection of PDF files on your computer!

The reason for OCR in Zotero

Zotero has a nice “Retrieve Metadata for PDF” feature that automatically scans a PDF file for metadata and then uses it to search for matching bibliography information from Google Scholar. The PDF is then nested under a parent item that is (usually) properly indexed in the internal Zotero SQLite database.

In this case Zotero found matches for most of the items. The one marked with a red cross appears not to be a journal article or book, but some other random (non-public) document that was imported into my library at some point.

However, if the PDF does not contain an OCR layer, this feature does not work. This is often the case for older journal articles, or for PDFs that were scanned from a hard copy.

If you manage a large literature library then you might have many non-OCR files in there, which are not properly indexed. Manually creating a parent item for each of them is laborious. The only practical approach is to add an OCR layer to the PDF files.

Example errors when Zotero is unable to find an OCR layer in a PDF document during attempted metadata retrieval.

Adding OCR to PDF files

There are a number of commercial, free and open-source options for adding OCR to PDF files. The most famous of these is Adobe Acrobat Reader, which at the time of writing requires a monthly subscription to an “Edit” feature extension to unlock its OCR capabilities. If you have this available to you, please go ahead and use it.

If you prefer a free option there are a few available, but I had most success with ocrmypdf, written by James R. Barlow and released under the GNU GPLv3 license.

The following steps should help you get started with ocrmypdf and use it to fix those annoying OCR problems in Zotero.

Installing ocrmypdf

Linux

I am using Ubuntu 18.04.1 LTS. Before using apt-get to install ocrmypdf, it was necessary to allow additional software to be installed:

Ubuntu software repository options.

Then you should be able to do:

sudo apt-get install ocrmypdf

For more installation information please visit the project page.

General usage

On the command line terminal you can simply provide ocrmypdf with an input PDF file and the desired output file.

ocrmypdf input.pdf output.pdf

If successful, this creates a new file called output.pdf, which is a modified version of the original. The new file should hopefully contain an OCR text layer!
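
If you would rather drive this from a script, ocrmypdf also ships a Python API in recent versions (which require Python 3). A minimal sketch, assuming a pip-installed ocrmypdf; the skip_text option tells it to leave pages that already contain text alone:

import ocrmypdf

# Add an OCR text layer to input.pdf and write the result to output.pdf.
# skip_text avoids errors on pages that already have a text layer.
ocrmypdf.ocr('input.pdf', 'output.pdf', skip_text=True)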

Usage with Zotero

My aim here is to describe a method for parsing through a large Zotero library, checking for files without an OCR layer and then adding one on the fly. We will eventually write a bash script to control this, but first I will explain how the individual steps in the script work.

One-liner for a single file

In this crude example I have created a new folder called /home/simon/Zotero/ocr/ to work in. The Zotero storage folder is in /home/simon/Zotero/storage, or simply ../storage/ relative to the new folder.

The following allows you to find a file using a search string; in this example the filename ends with “ - kittel.pdf.pdf”. I am assuming there is only one file that matches this search string!

INPUT=`find ../storage/ -name "* - kittel.pdf.pdf"`; ocrmypdf "$INPUT" output.pdf

You can then check output.pdf manually before replacing the original.

If you are feeling really brave and wish to do this on the fly, without checking the new PDF file first, you can automatically replace the input file directly in the Zotero storage folder.

INPUT=`find ../storage/ -name "* - kittel.pdf.pdf"`; ocrmypdf "$INPUT" output.pdf; mv output.pdf "$INPUT"

I did not have any issues with this approach, as ocrmypdf does not seem to be destructive, but care should be taken when automatically replacing or deleting files! In the final script below I take a few rudimentary precautions to that end.

Check if a file has OCR

We are going to approach this method on the bash command line. To that end we can use pdffonts to check whether a PDF document contains OCR. pdffonts lists the fonts used in a PDF file. A PDF without any OCR text will contain zero fonts, while a file with an OCR layer will have one or more.

For example, this file does have an OCR layer:

$ pdffonts output2.pdf
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
JKMERI+GlyphLessFont                 CID TrueType      Identity-H       yes yes yes      9  0
QMAGQI+GlyphLessFont                 CID TrueType      Identity-H       yes yes yes     20  0
MEMQSK+GlyphLessFont                 CID TrueType      Identity-H       yes yes yes     38  0

So does this one:

pdffonts Unknown\ -\ Unknown\ -\ Chapter\ 28\ –\ Magnetic\ Fields\ Sources\ Goals\ for\ Chapter\ 28.pdf 
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
ABCDEE+Calibri                       TrueType          WinAnsi          yes yes no       5  0
ABCDEE+Calibri                       CID TrueType      Identity-H       yes yes yes      7  0
Times New Roman                      CID TrueType      Identity-H       yes no  yes     14  0
Times New Roman                      TrueType          WinAnsi          no  no  no      19  0
Times New Roman,Bold                 TrueType          WinAnsi          no  no  no      30  0
Symbol                               CID TrueType      Identity-H       yes no  yes     32  0
Times New Roman,Italic               TrueType          WinAnsi          no  no  no      55  0
ABCDEE+Trebuchet MS                  CID TrueType      Identity-H       yes yes yes     73  0
ABCDEE+Trebuchet MS                  TrueType          WinAnsi          yes yes no      78  0

When there is no OCR there will be no embedded fonts found:

$ pdffonts Book3.pdf
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------

Basically, we can determine whether the file has OCR by counting the lines of output, or we can grep the output and count occurrences of the string “Type”, which appears on every font line. Here we found 7 fonts:

$ pdffonts input.pdf | grep "Type" | wc -l
7

A slight problem with this approach is that some downloaded papers have no OCR layer for the actual, useful text, but the publisher’s website adds a text layer for a watermark when the file is downloaded. In this case a font will be found:

A paper downloaded from the AIP website has a useless OCR layer added just to watermark the download event. This is an attempt to limit piracy but doesn’t help the user at all.

$ pdffonts "$INPUT"
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
Helvetica-Bold                       Type 1            WinAnsi          no  no  no       5  0

Perhaps we could change our search rule to require more than one font, but I have not yet ruled out the possibility that genuine, useful OCR might mean only one PDF font is present in the file.
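
If you want to experiment with that stricter rule, here is a sketch of the check in Python. The helper names are my own; the function just shells out to pdffonts and counts the data rows after the two header lines:

import subprocess

def count_pdf_fonts(path):
    # pdffonts prints a two-line header followed by one line per font.
    out = subprocess.check_output(['pdffonts', path])
    return max(len(out.decode('utf-8', 'replace').splitlines()) - 2, 0)

def needs_ocr(path, min_fonts=1):
    # Raise min_fonts to 2 to treat a lone watermark font as "no real OCR",
    # bearing in mind the caveat above about genuine single-font files.
    return count_pdf_fonts(path) < min_fonts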

A script

Pulling it all together, the following prototype bash script will find a list of files that match a specific string, then loop through them, checking whether each is missing OCR. If it is, the script calls ocrmypdf on it. By removing any existing temp.pdf file before calling ocrmypdf, we eliminate some of the risk of replacing our original files with a corrupted output file. That being said, please use it with caution and at your own risk!

#!/bin/bash

find ../storage/ -type f -name "Unknown - Unknown - 1*.pdf.pdf" -print0 | 
while IFS= read -r -d '' file; do

    # Remove any existing temp.pdf. This should stop a PDF from being overwritten with the previous loop's output, in case of an error.
    if [ -f ./temp.pdf ]; then
        rm ./temp.pdf
    fi

    printf '%s\n' "$file"
    num=`pdffonts "$file" | grep "Type" | wc -l`
    
    if [ "$num" -lt 1 ]
    then
        echo "File has no OCR so let's do it..."
        
        ocrmypdf "$file" ./temp.pdf
        
        # Use this with extreme caution: This should only execute if temp.pdf was successfully created. 
        if [ -f ./temp.pdf ]; then
            mv ./temp.pdf "$file"
        else
            echo "OCR was unsuccessful so not updating file! "
        fi
    
    fi
    
done

Once files have had OCR added, you can call the “Retrieve Metadata for PDF” tool in Zotero.

Summing up

In my experience, ocrmypdf did an excellent job of adding OCR to journal articles. Sometimes Zotero was unable to identify enough of the metadata from the newly added text to get a positive match from Google Scholar, but this is mostly an issue with Zotero, not the OCR, and a parent item with at least the authors and title was created, making indexing far better than it was before. When I checked, a lot of other items in my library also did not have the full metadata included in the parent item, so it is always better to import a full, proper bibliography entry from an external source than to rely on Zotero to extract metadata. However, if this is not possible, say you imported a load of PDF files given to you by a friend or colleague, the ability to add OCR is a massive help.





ASCII plotting on the command line terminal with eplot

If you want to plot something on the terminal in ASCII you can use “eplot”.

eplot itself is a Ruby script that acts as a frontend for gnuplot; it can be downloaded from the project’s GitHub page. It makes it easier to pipe numbers into gnuplot, which can otherwise be a bit of a hassle. It also has a “dumb terminal” mode, which allows us to plot using ASCII. Plotting like this provides a way to quickly check data files without requiring any X window system, which might not be available when logging in remotely over the terminal.

If your computer already has gnuplot and the Ruby runtime installed then the following should work.

Example:

cat model.ht | tail -n +2 | awk '{print $1,$2}' | eplot-master/eplot -d -r [0:5]

The -d option chooses ASCII “dumb terminal” mode and -r allows us to set the x-axis range.

Some example output:


Now, of course, if you want to do the same for a remote file, that is possible over ssh:

ssh user@remote.server cat /somepath/model.ht | eplot-master/eplot -d

There are obviously many more options to play around with but I hope this gives a brief introduction to some of the capabilities available.




Fetching, wrangling and visualising sunrise and sunset data using Python

Previously I showed how it was possible to obtain sunrise and sunset times for a whole year at any location on Earth, from a public source.

This time I am going to explain how to fetch that data, clean it up and create graphical visualizations like the one below, all using Python. A Jupyter Notebook is available on GitHub. Such data might even be useful in, for example, simulating solar power generation.

Sunlight hours for Derry vs Greenwich

Fetching the data

Looking at the URL string of the returned data query on the USNO website we can see that it contains all of the required values for longitude and latitude, as well as the place name label that we defined ourselves:

http://aa.usno.navy.mil/cgi-bin/aa_rstablew.pl?ID=AA&year=2018&task=0&place=Greenwich+%28UK%29&lon_sign=-1&lon_deg=0&lon_min=00&lat_sign=1&lat_deg=51&lat_min=28&tz=&tz_sign=-1

We can try replacing those values and expect to retrieve any data set without going through the web interface. This makes it really easy to do what is known as web scraping, i.e. fetching the remote data table for any location automatically from our Python script. All we have to do is supply a label and the coordinate values, and we can crudely modify the original URL to get the desired result.

import urllib

# label, longitude and latitude are defined earlier in the notebook, e.g.
# label = "Greenwich+%28UK%29", longitude = ("0", "00"), latitude = ("51", "28");
# inFile is the local filename to save to.
url = "http://aa.usno.navy.mil/cgi-bin/aa_rstablew.pl?ID=AA&year=2018&task=0&place=%s&lon_sign=-1&lon_deg=%s&lon_min=%s&lat_sign=1&lat_deg=%s&lat_min=%s&tz=&tz_sign=-1" % (label, longitude[0], longitude[1], latitude[0], latitude[1])
urllib.urlretrieve(url, inFile)

When you fetch in this way, an HTML file is returned, containing some HTML tags and some CSS code. Luckily, the data we want is in plain text, starting and ending at consistent line numbers each time, so we can read in the file, slice out the lines that we want and then re-save the local file with only the trimmed data.

# Dirty and dumb way to get rid of the HTML and CSS before the data table:
f = open(inFile, 'r')
# readlines() gives a list of all the lines in the file, each as an un-split string.
# Crucially, the line endings remain intact, so the local file will be identical to the original raw text!
html = f.readlines()
f.close()
raw = html[25:66]  # Slice only the lines that we need. This is hardcoded but was the best method I found.

# Create/overwrite the local file:
with open(inFile, "w") as text_file:
    for line in raw:
        text_file.write(line)

Every time you run this code it fetches from the USNO website. If you plan to run this code many times for the same geographic location, it is worth keeping the local file and skipping the download when it already exists, to give their web server a break!
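
A minimal sketch of that caching, reusing the url and inFile variables from above:

import os

# Only fetch from the USNO server if we don't already have a local copy:
if not os.path.isfile(inFile):
    urllib.urlretrieve(url, inFile)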

Data wrangling

Now that we have a local data file, let’s read it into Python. Not so fast! It took me a while to find the best method here. Normally, as a scientist, I work with tabular data containing floats, not a mix of data types, so I often use numpy.loadtxt() as my go-to method for reading data files. However, if we try this we get an error:

import numpy as np

A = np.loadtxt(inFile, skiprows=9)  # Gives "ValueError: Wrong number of columns at line 38"

Numpy’s loadtxt() function does not like it when rows in a file contain different numbers of columns.

Let’s try a pandas-based approach instead.

import pandas as pd

df = pd.read_csv(inFile, skiprows=range(9), header=None, infer_datetime_format=True)
print df, df.shape

This almost works, but we get a collection of rows where each row is a long string. It has not split the row strings into columns yet!

                                                    0
0   01  0805 1602  0738 1649  0645 1740  0536 1833...
1   02  0805 1603  0737 1651  0643 1742  0533 1835...
2   03  0805 1604  0735 1653  0641 1744  0531 1836...
3   04  0805 1605  0734 1655  0639 1745  0529 1838...
4   05  0805 1606  0732 1657  0637 1747  0527 1840...
5   06  0804 1608  0730 1659  0635 1749  0524 1841...
6   07  0804 1609  0729 1700  0632 1751  0522 1843...
7   08  0803 1610  0727 1702  0630 1752  0520 1845...
8   09  0803 1612  0725 1704  0628 1754  0518 1846...
9   10  0802 1613  0723 1706  0626 1756  0516 1848...
10  11  0802 1614  0721 1708  0623 1758  0513 1850...
11  12  0801 1616  0720 1710  0621 1759  0511 1851...
12  13  0800 1617  0718 1711  0619 1801  0509 1853...
13  14  0800 1619  0716 1713  0617 1803  0507 1855...
14  15  0759 1620  0714 1715  0614 1804  0505 1856...
15  16  0758 1622  0712 1717  0612 1806  0503 1858...
16  17  0757 1624  0710 1719  0610 1808  0500 1900...
17  18  0756 1625  0708 1720  0608 1810  0458 1901...
18  19  0755 1627  0706 1722  0605 1811  0456 1903...
19  20  0754 1628  0704 1724  0603 1813  0454 1905...
20  21  0753 1630  0702 1726  0601 1815  0452 1906...
21  22  0752 1632  0700 1728  0559 1816  0450 1908...
22  23  0751 1634  0658 1729  0556 1818  0448 1910...
23  24  0749 1635  0656 1731  0554 1820  0446 1911...
24  25  0748 1637  0654 1733  0552 1821  0444 1913...
25  26  0747 1639  0652 1735  0549 1823  0442 1915...
26  27  0746 1641  0650 1737  0547 1825  0440 1916...
27  28  0744 1642  0648 1738  0545 1826  0438 1918...
28  29  0743 1644             0543 1828  0436 1920...
29  30  0741 1646             0540 1830  0434 1921...
30  31  0740 1648             0538 1831           ...
(31, 1)

In order to manipulate the data we would like pandas to split it correctly into rows and columns; in this case the fields are separated by whitespace.

df = pd.read_csv(inFile, skiprows=range(9), header=None, delim_whitespace=True, infer_datetime_format=True)

This does not produce the desired result either, because the columns get shifted to the left to fill in the empty gaps of the shorter months, e.g. now February has 31 days!

    0    1     2    3     4    5     6    7     8    9    ...       15  \
0    1  805  1602  738  1649  645  1740  536  1833  432   ...    424.0   
1    2  805  1603  737  1651  643  1742  533  1835  430   ...    425.0   
2    3  805  1604  735  1653  641  1744  531  1836  429   ...    427.0   
3    4  805  1605  734  1655  639  1745  529  1838  427   ...    428.0   
4    5  805  1606  732  1657  637  1747  527  1840  425   ...    430.0   
5    6  804  1608  730  1659  635  1749  524  1841  423   ...    431.0   
6    7  804  1609  729  1700  632  1751  522  1843  421   ...    433.0   
7    8  803  1610  727  1702  630  1752  520  1845  420   ...    434.0   
8    9  803  1612  725  1704  628  1754  518  1846  418   ...    436.0   
9   10  802  1613  723  1706  626  1756  516  1848  416   ...    438.0   
10  11  802  1614  721  1708  623  1758  513  1850  415   ...    439.0   
11  12  801  1616  720  1710  621  1759  511  1851  413   ...    441.0   
12  13  800  1617  718  1711  619  1801  509  1853  412   ...    442.0   
13  14  800  1619  716  1713  617  1803  507  1855  410   ...    444.0   
14  15  759  1620  714  1715  614  1804  505  1856  409   ...    445.0   
15  16  758  1622  712  1717  612  1806  503  1858  407   ...    447.0   
16  17  757  1624  710  1719  610  1808  500  1900  406   ...    449.0   
17  18  756  1625  708  1720  608  1810  458  1901  404   ...    450.0   
18  19  755  1627  706  1722  605  1811  456  1903  403   ...    452.0   
19  20  754  1628  704  1724  603  1813  454  1905  402   ...    453.0   
20  21  753  1630  702  1726  601  1815  452  1906  400   ...    455.0   
21  22  752  1632  700  1728  559  1816  450  1908  359   ...    457.0   
22  23  751  1634  658  1729  556  1818  448  1910  358   ...    458.0   
23  24  749  1635  656  1731  554  1820  446  1911  357   ...    500.0   
24  25  748  1637  654  1733  552  1821  444  1913  356   ...    501.0   
25  26  747  1639  652  1735  549  1823  442  1915  354   ...    503.0   
26  27  746  1641  650  1737  547  1825  440  1916  353   ...    504.0   
27  28  744  1642  648  1738  545  1826  438  1918  352   ...    506.0   
28  29  743  1644  543  1828  436  1920  351  2004  346   ...    557.0   
29  30  741  1646  540  1830  434  1921  351  2005  347   ...    559.0   
30  31  740  1648  538  1831  350  2006  422  1950  511   ...      NaN   

        16     17      18     19      20     21      22     23      24  
0   1948.0  512.0  1847.0  600.0  1738.0  653.0  1633.0  743.0  1555.0  
1   1946.0  514.0  1844.0  602.0  1736.0  655.0  1631.0  744.0  1554.0  
2   1945.0  516.0  1842.0  604.0  1733.0  657.0  1630.0  746.0  1554.0  
3   1943.0  517.0  1840.0  605.0  1731.0  658.0  1628.0  747.0  1553.0  
4   1941.0  519.0  1838.0  607.0  1729.0  700.0  1626.0  748.0  1553.0  
5   1939.0  520.0  1835.0  609.0  1727.0  702.0  1625.0  750.0  1552.0  
6   1938.0  522.0  1833.0  610.0  1724.0  704.0  1623.0  751.0  1552.0  
7   1936.0  524.0  1831.0  612.0  1722.0  706.0  1621.0  752.0  1552.0  
8   1934.0  525.0  1828.0  614.0  1720.0  707.0  1620.0  753.0  1551.0  
9   1932.0  527.0  1826.0  615.0  1718.0  709.0  1618.0  754.0  1551.0  
10  1930.0  528.0  1824.0  617.0  1716.0  711.0  1617.0  755.0  1551.0  
11  1928.0  530.0  1822.0  619.0  1713.0  712.0  1615.0  756.0  1551.0  
12  1926.0  532.0  1819.0  620.0  1711.0  714.0  1614.0  757.0  1551.0  
13  1924.0  533.0  1817.0  622.0  1709.0  716.0  1612.0  758.0  1551.0  
14  1923.0  535.0  1815.0  624.0  1707.0  718.0  1611.0  759.0  1551.0  
15  1921.0  536.0  1812.0  625.0  1705.0  719.0  1610.0  800.0  1551.0  
16  1918.0  538.0  1810.0  627.0  1703.0  721.0  1608.0  800.0  1552.0  
17  1916.0  539.0  1808.0  629.0  1701.0  723.0  1607.0  801.0  1552.0  
18  1914.0  541.0  1806.0  631.0  1659.0  724.0  1606.0  802.0  1552.0  
19  1912.0  543.0  1803.0  632.0  1657.0  726.0  1605.0  802.0  1553.0  
20  1910.0  544.0  1801.0  634.0  1655.0  728.0  1603.0  803.0  1553.0  
21  1908.0  546.0  1759.0  636.0  1652.0  729.0  1602.0  803.0  1554.0  
22  1906.0  547.0  1756.0  637.0  1650.0  731.0  1601.0  804.0  1554.0  
23  1904.0  549.0  1754.0  639.0  1648.0  733.0  1600.0  804.0  1555.0  
24  1902.0  551.0  1752.0  641.0  1647.0  734.0  1559.0  805.0  1555.0  
25  1900.0  552.0  1749.0  643.0  1645.0  736.0  1559.0  805.0  1556.0  
26  1858.0  554.0  1747.0  644.0  1643.0  737.0  1558.0  805.0  1557.0  
27  1855.0  556.0  1745.0  646.0  1641.0  739.0  1557.0  805.0  1558.0  
28  1743.0  648.0  1639.0  740.0  1556.0  805.0  1559.0    NaN     NaN  
29  1740.0  650.0  1637.0  742.0  1555.0  806.0  1600.0    NaN     NaN  
30     NaN    NaN     NaN    NaN     NaN    NaN     NaN    NaN     NaN  

[31 rows x 25 columns]

Since the tabular data has regular, repeating column widths, we can use the pandas fixed-width file function read_fwf(). Here is an example for two columns:

eg = pd.read_fwf(inFile, colspecs=[(4,8), (9,13)], skiprows=9, header=None)

We can adapt this approach to get all of the columns. Not every column width is identical, but the sunrise/sunset pairs repeat every 11 characters, which we can exploit. We can straight away loop through the relevant columns for sunrise and sunset and store them in separate pandas dataframes.

sunrise = pd.read_fwf(inFile, colspecs=[(i, i+4) for i in range(4,132,11)], skiprows=9, header=None).values.T.flatten()
sunset  = pd.read_fwf(inFile, colspecs=[(i, i+4) for i in range(9,137,11)], skiprows=9, header=None).values.T.flatten()
# Remove all NaNs from the flattened lists to leave just the days that exist: https://stackoverflow.com/questions/11620914/removing-nan-values-from-an-array
sunrise = sunrise[~np.isnan(sunrise)].astype('int').tolist()
sunset = sunset[~np.isnan(sunset)].astype('int').tolist()

I have also transposed and flattened the data, then removed any NaN values and converted to lists, so that we have a list of 365 sunrise times and a list of 365 sunset times. Notice that all of the methods presented here automatically account for leap years, which contain 366 days!
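
As a quick sanity check on the wrangling, the two lists should be the same length, with one entry per day of the year:

# 365 entries normally, 366 in a leap year:
assert len(sunrise) == len(sunset)
print(len(sunrise))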

Now we want to convert our integer clock times (e.g. 805 means 08:05) into time objects. We can do this using Python’s datetime module, but we will need to give it integer values for hours, minutes and seconds.

import datetime

srise = [datetime.time(int(str(a)[:-2]), int(str(a)[-2:]), 0) for a in sunrise]
sset = [datetime.time(int(str(a)[:-2]), int(str(a)[-2:]), 0) for a in sunset]

Creating a date list

Our method for extracting the sunrise and sunset times didn’t preserve the dates on which those values occurred, but we can easily recreate them. Pandas can create a period list using the date_range() function, and this can be converted to datetimes.

periodlist = pd.date_range(start='%s-01-01' % year, end='%s-12-31' % year, freq='D')
print periodlist

datelist = [pd.to_datetime(date, format='%m-%d').date() for date in periodlist]
print "Datelist:\n", datelist[:10]

This datelist will allow us to plot our sunrise and sunset times against the day of the year easily.

Visualization

Now we can simply invoke matplotlib, creating line plots for the sunrise and sunset data series and filling in between them using the fill_between() method.

import matplotlib.pyplot as plt

# Create a nice plot:
plt.close()
plt.figure()

# Plot the sunrise and sunset series as lineplots:
plt.plot(datelist, srise, label='sunrise')
plt.plot(datelist, sset, label='sunset')

# Let's put a fill between the two line plots:
plt.fill_between(datelist, srise, sset, facecolor='yellow', alpha=0.2, interpolate=True)

plt.title('Sunlight hours')
plt.xlabel('date')

# Make sure the y axis covers the (almost) full range of 24 hours:
plt.ylim([datetime.time.min, datetime.time.max]) 
# The above y limits do not force the y axis to run a full 24 hours, so force manual axes tick values.
# Give a range of values in seconds and matplotlib is clever enough to work it out:
plt.yticks(range(0, 60*60*25, 2*60*60))

plt.legend()
plt.savefig('working.png', dpi=90)
plt.show()


Data for Greenwich, UK, visualized from USNO data.

Final code

In my final code I added some additional features, such as a location dictionary that allows me to compare daylight times for different locations on a single plot. My final code can be found in a Jupyter Notebook, which is available on GitHub. Here is a comparison between Greenwich, UK and Derry, UK.

Sunlight hours for Derry vs Greenwich

As we can see, in winter the two locations have identical sunset times but the sunrise in Derry is significantly later. In summer the sunrise times are similar, but Derry has a much later sunset. The days in Derry are really short in winter and really long in summer, as the locals will tell you! As one of the most north-westerly cities in the UK, Derry is considerably further north and further west than Greenwich (London), but it is still in the same time zone. Note that I have not included daylight saving time in my code; times are all in GMT.

Map showing Derry and Greenwich, UK.




How to get Sunrise and Sunset times for data analysis


Today is 2nd January and the days are getting longer again. I was thinking about sunrise and sunset times and wondering if I should do some data analysis and plotting of these to visualize how they change over the course of a year. Of course, the first step is to get a suitable data set.

You can get a data set of sunrise and sunset times from this page on the US Naval Observatory (USNO) website. Sunrise and sunset times depend on your specific location on the globe, so you have to specify the location for your data set. Using the “Form B – Locations Worldwide” section on the page, it is possible to enter precise coordinates for any location in terms of latitude and longitude. You can also do it for a timezone, but this is very broad and not precise enough for my liking.

Getting your coordinates

You can get your exact coordinates from the website of the National Geospatial-Intelligence Agency (NGA). You could also do the same on OpenStreetMap, by right-clicking on a location and choosing “show address” (or on Google Maps), but the NGA system gives more precise information about the type of coordinate system used. FYI, OpenStreetMap gives the coordinate that corresponds to “Latitude North” and “Longitude East”; if your Longitude East value is negative, this simply corresponds to the Longitude West value made positive. However, the USNO form wants the coordinates in degrees and whole minutes, not as a decimal fraction of degrees, so if you don’t want to convert units I recommend using the NGA site.
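
If you do start from a decimal coordinate, the conversion is simple. A quick sketch with a made-up helper name, truncating to the whole minutes that the form expects:

# Convert a decimal coordinate into whole degrees and minutes:
def decimal_to_deg_min(coord):
    degrees = int(abs(coord))
    minutes = int((abs(coord) - degrees) * 60)
    return degrees, minutes

print(decimal_to_deg_min(51.4779))  # Greenwich: (51, 28)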

Downloading the sunset and sunrise data

Once you know the coordinates, you can enter them into the USNO page and click “Compute table” for the year you specified in the “year” field. Here are the settings I used to get the data for Greenwich, London:

USNO sunrise webform

These are the settings I used in the USNO “Sun or Moon Rise/Set Table for One Year” form to get data for Greenwich, UK.

You should now see a text-based data table with sunrise and sunset columns for each month of that year, with a row for each day of the month. Go ahead and save it as a text file on your local drive; you can then use pandas or another tool of your choice to wrangle the data into whatever format you want. I will make a further post at a later date with some suggestions. Bye for now and good luck!

USNO sunrise data file

This is the USNO sunrise data file for Greenwich, UK.





Configuring KNIME to work with Python 2.7.x on Windows

Apparently it is tricky to get Python integration working in the KNIME Analytics Platform. If you read the official guide too quickly, you can miss some critical information at the bottom of the page. I was getting an error complaining that the google.protobuf library was missing, even though I thought I had everything installed correctly:

Library google.protobuf is missing, required minimum version is 2.5.0

Here is the correct sequence of actions to fix this problem and get Python integration working in KNIME. It worked for me, so I hope it works for you too!

  1. If you don’t already have it on your system, download and install Anaconda Python. I use the Python 2.7 version of Anaconda. I recommend installing Anaconda in the folder “C:\Anaconda2”.
  2. Create a new conda environment by entering the following at the Windows command prompt (or PowerShell):
    conda create -y -n py27_knime python=2.7 pandas jedi protobuf
  3. Create a new Windows batch file with the following contents:
    @REM Adapt the directory in the PATH to your system
    @SET PATH=C:\Anaconda2\Scripts;%PATH%
    @CALL activate py27_knime || ECHO Activating py27_knime failed
    @python %*

    Of course, you might have to change the path if you installed Anaconda2 somewhere other than “C:\Anaconda2”. Save the batch file somewhere, like your user directory or the KNIME workspace folder. I named my file “py27.bat” and placed it in the knime-workspace folder.
    Just in case anyone reading this is confused: a Windows batch file is just a text file saved with the file extension “.bat”. You can create it in any text editor by creating a new empty text file, pasting the above four lines of text into it and saving the file as “py27.bat”.

  4. If you haven’t already, download and install KNIME Analytics Platform. At the time of writing, the latest version is 3.4.0, so that’s the one I used. You might as well get the installer with all the free extensions included, presuming you have enough disk space for it and your internet connection is decent enough. This will save having to install further packages later, although to be fair KNIME seems to do a fairly good job at installing packages on the fly (after asking you first) whenever you load a workspace that needs them.

    Downloading KNIME

  5. Start KNIME. Go to File > Preferences, then KNIME > Python scripting. In the “path to the local python executable” field, paste the full path to your batch file, e.g. “C:\Users\yourname\knime-workspace\py27.bat”. To be future-proofed, also do this under KNIME > Python (Labs).
    KNIME now calls our batch file instead of calling the system-wide Python executable directly. The batch file activates the “py27_knime” environment before calling Python. This makes sure that the additional libraries required by KNIME are available.
    I guess you could also get your usual Python environment working with KNIME by installing additional packages, but let’s just do what the knowledgeable guys at KNIME have suggested this time. 🙂

    Python scripting preferences for KNIME Analytics Platform in Windows

  6. Now restart KNIME. Try opening or creating a workflow with a node that uses some Python code. It should work now!
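
If it does not, a quick way to see which Python environment KNIME is actually getting is to run the batch file yourself with a tiny probe script (a sketch; check_env.py is a name I made up):

# check_env.py: print the interpreter and protobuf version exposed by
# the py27.bat wrapper. Run it as:  py27.bat check_env.py
import sys
import google.protobuf

print(sys.executable)
print(google.protobuf.__version__)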