
Fixing a USB Flash drive that has been “corrupted” by balenaEtcher.

Recently I used balenaEtcher to create a bootable Linux drive. It is a nice Windows program that simplifies the process of creating Linux disks and is recommended by some Linux distro vendors.

After the program successfully writes a bootable Linux USB drive, the drive no longer appears correctly in Windows. That’s because a bootable USB contains certain drive partitions with filesystem types that are not visible to the Windows operating system.

This is not really a problem until you want to remove Linux from the USB drive and start using it, once again, as a normal drive for transferring files etc. Windows is unable to see the partitions correctly and is therefore unable to remove them, or format the drive correctly.

The only way I found to “repair” my pen drive was by using a computer that was already running Linux, in my case Ubuntu.

I used GParted, which was already installed on my Ubuntu machine, to remove all partitions on the USB drive, which for me appeared as /dev/sdb1.

You should take care that you are modifying the correct drive location!
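One way to double-check the device name before deleting anything is to list the disks that the kernel flags as removable. The following is a minimal sketch, assuming lsblk from a reasonably recent util-linux (one that supports the -J JSON option) is available. The helper names removable_disks and list_removable are my own illustration and not part of any tool.

```python
import json
import subprocess


def removable_disks(lsblk_json):
    """Return (device path, size) for every whole disk flagged as removable."""
    devices = json.loads(lsblk_json).get("blockdevices", [])
    found = []
    for dev in devices:
        # Older lsblk versions report rm as the string "1", newer ones as true.
        is_removable = dev.get("rm") in (True, "1")
        if dev.get("type") == "disk" and is_removable:
            found.append(("/dev/" + dev["name"], dev.get("size")))
    return found


def list_removable():
    """Ask lsblk for name, removable flag, size and type, then filter the result."""
    out = subprocess.run(
        ["lsblk", "-J", "-o", "NAME,RM,SIZE,TYPE"],
        capture_output=True, text=True, check=True,
    ).stdout
    return removable_disks(out)
```

Calling list_removable() with the USB drive plugged in, and again with it unplugged, should make it obvious which entry is the pen drive. The reported size is a useful second check.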

Once I deleted any existing partitions I created a new partition from the unallocated space, with an NTFS filesystem that is visible in Windows as well as Linux. For good measure I thoroughly formatted the drive a second time using the option in the File Explorer menu system.

Now when I check my drive in Windows it appears as it should: one large partition with (almost) the whole capacity of the USB available.

A lot of people have written on the balenaEtcher forums about this problem, blaming the software for destroying their USB drives. This is not the case as the drives can be fixed, but it would be nice if balena addressed this issue by providing some kind of tool for restoring a drive to its original state.




Adding OCR layers to your Zotero library PDF items for Metadata extraction and indexing

Zotero is a cross-platform literature manager that is able to sync to a remote server and across multiple user devices. There are many alternatives available, each with strengths and weaknesses, but I am currently using Zotero to manage my literature because it is free and works with WebDAV for additional free storage.

In this article I will describe why optical character recognition (OCR) is important for Zotero and suggest a way to add OCR to existing items in a Zotero library. However, the method actually works for any collection of PDF files on your computer!

The reason for OCR in Zotero

Zotero has a nice “Retrieve Metadata for PDF” feature that automatically scans a PDF file for metadata and then uses it to search for matching bibliography information from Google Scholar. The PDF is then nested under a parent item that is (usually) properly indexed in the internal Zotero SQLite database.

In this case Zotero found matches for most of the items. The one with the red cross appears not to be a journal article or book, but some other random (non-public) document that at some time was imported into my library.

However, if the PDF does not contain an OCR layer this feature does not work. This is often the case for older journal articles, or PDFs that were scanned from a hard copy.

If you manage a large literature library then you might have many non-OCR files in there, which are not properly indexed. Manually creating a parent item for each of them is laborious. The only practical approach is to add an OCR layer to the PDF files.

Example errors when Zotero is unable to find an OCR layer in a PDF document during attempted metadata retrieval.

Adding OCR to PDF files

There are a number of commercial, free and open source options for adding OCR to PDF files. Most famous of these is the Adobe Acrobat reader, which at the time of writing requires a monthly subscription to an “Edit” feature extension to unlock the OCR capabilities. If you have this available to you, please go ahead and use it.

If you prefer a free option there are a few available, but I had most success with ocrmypdf, written by James R. Barlow and released under the GNU GPLv3 license.

The following steps should help you get started with ocrmypdf and use it to fix those annoying OCR problems in Zotero.

Installing ocrmypdf

Linux

I am using Ubuntu 18.04.1 LTS. Before using apt-get to install ocrmypdf, it was necessary to allow additional software to be installed:

Ubuntu software repository options.

Then you should be able to do:

sudo apt-get install ocrmypdf

For more installation information please visit the project page.

General usage

On the command line terminal you can simply provide ocrmypdf with an input PDF file and the desired output file.

ocrmypdf input.pdf output.pdf

If successful, this creates a new file called output.pdf, which is a modified version of the original. The new file should hopefully contain an OCR text layer!

Usage with Zotero

My aim here is to describe a method for parsing through a large Zotero library, checking for files without an OCR layer and then adding one on the fly. We will eventually write a bash script to control this, but first I will explain how the individual steps in the script work.

One-liner for a single file

In this crude example I have created a new folder called /home/simon/Zotero/ocr/. The Zotero storage folder is in /home/simon/Zotero/storage, or simply ../storage/

The following allows you to find a file using a search string; here, for example, the filename ends with " - kittel.pdf.pdf". I am assuming there is only one file that matches this search string!

INPUT=$(find ../storage/ -name "* - kittel.pdf.pdf"); ocrmypdf "$INPUT" output.pdf

You can then check output.pdf manually and, if it looks right, replace the original with it.

If you are feeling really brave, and wish to do this on the fly, without checking the new PDF file first, you can automatically replace the input file directly in the Zotero storage folder.

INPUT=$(find ../storage/ -name "* - kittel.pdf.pdf"); ocrmypdf "$INPUT" output.pdf && mv output.pdf "$INPUT"

I did not have any issues with this approach as ocrmypdf doesn’t seem to be destructive, but care should be taken when automatically replacing or deleting files! In the final script below I take a few rudimentary precautions in that sense!

Check if a file has OCR

We will do this on the bash command line, using pdffonts to check whether a PDF document contains OCR text. pdffonts lists the fonts used in the file. A PDF without any text layer lists no fonts at all, while a file with an OCR layer lists one or more.

For example, this file does have an OCR layer:

$ pdffonts output2.pdf
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
JKMERI+GlyphLessFont                 CID TrueType      Identity-H       yes yes yes      9  0
QMAGQI+GlyphLessFont                 CID TrueType      Identity-H       yes yes yes     20  0
MEMQSK+GlyphLessFont                 CID TrueType      Identity-H       yes yes yes     38  0

So does this one:

$ pdffonts Unknown\ -\ Unknown\ -\ Chapter\ 28\ –\ Magnetic\ Fields\ Sources\ Goals\ for\ Chapter\ 28.pdf
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
ABCDEE+Calibri                       TrueType          WinAnsi          yes yes no       5  0
ABCDEE+Calibri                       CID TrueType      Identity-H       yes yes yes      7  0
Times New Roman                      CID TrueType      Identity-H       yes no  yes     14  0
Times New Roman                      TrueType          WinAnsi          no  no  no      19  0
Times New Roman,Bold                 TrueType          WinAnsi          no  no  no      30  0
Symbol                               CID TrueType      Identity-H       yes no  yes     32  0
Times New Roman,Italic               TrueType          WinAnsi          no  no  no      55  0
ABCDEE+Trebuchet MS                  CID TrueType      Identity-H       yes yes yes     73  0
ABCDEE+Trebuchet MS                  TrueType          WinAnsi          yes yes no      78  0

When there is no OCR there will be no embedded fonts found:

$ pdffonts Book3.pdf
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------

Basically, we could determine whether the file has OCR by counting the lines of the output, or we can grep the output and count occurrences of the string “Type”. Here we found 7 fonts:

$ pdffonts input.pdf | grep "Type" | wc -l
7
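The other option mentioned above, counting the lines of the output, can be sketched in Python as follows. This is only an illustration under the assumption that pdffonts always prints exactly two header lines (the column names and the dashed separator), as it does in the examples shown here. The function names are hypothetical.

```python
import subprocess


def count_fonts(pdffonts_output):
    """Count font rows in pdffonts output, i.e. everything after the two header lines."""
    lines = [ln for ln in pdffonts_output.splitlines() if ln.strip()]
    # Row 0 holds the column names, row 1 the dashed separator.
    return max(len(lines) - 2, 0)


def fonts_in_file(pdf_path):
    """Run pdffonts on a file and return the number of fonts it reports."""
    result = subprocess.run(
        ["pdffonts", pdf_path], capture_output=True, text=True, check=True
    )
    return count_fonts(result.stdout)
```

Unlike the grep approach, this does not depend on the word “Type” appearing in the type column, at the cost of depending on the header layout instead.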

A slight problem with this approach is that some downloaded papers have no OCR layer for the actual, useful text, yet the publisher’s website adds a text layer for a watermark on download. In this case a font will be found:

Paper downloaded from AIP website has a useless OCR layer added just to watermark the download event. This is an attempt to limit piracy but doesn’t help the user at all.

$ pdffonts "$INPUT"
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
Helvetica-Bold                       Type 1            WinAnsi          no  no  no       5  0

Perhaps we can change our search rule to require more than one embedded font, but I have not yet ruled out the possibility that genuine, useful OCR might mean only one PDF font is present in the file.
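A different heuristic, which is not part of the workflow described here and which I offer only as an untested idea, is to look at how much text can actually be extracted rather than at the fonts. The sketch below assumes pdftotext (shipped in the same poppler-utils package as pdffonts) is installed. The threshold of 50 words is an arbitrary assumption that would need tuning, since a watermark repeated on every page of a long paper could exceed it.

```python
import subprocess


def looks_like_real_text(text, min_words=50):
    """Heuristic: treat the text layer as useful if it holds at least min_words words."""
    return len(text.split()) >= min_words


def pdf_has_text_layer(pdf_path, min_words=50):
    """Extract text with pdftotext and apply the word-count heuristic."""
    # Passing "-" as the output file makes pdftotext write to stdout.
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return looks_like_real_text(result.stdout, min_words)
```

A file carrying only a short watermark would then be treated the same as a file with no text layer at all, and would be passed on to ocrmypdf.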

A script

Pulling it all together, the following prototype bash script will find a list of files that match a specific string, then loop through them checking if they are missing OCR. If they are, it will call ocrmypdf on them. By removing any existing temp.pdf file before calling ocrmypdf we eliminate some of the risk of replacing our original files with a stale or corrupted output file. That being said, please use it with caution and at your own risk!

#!/bin/bash

find ../storage/ -type f -name "Unknown - Unknown - 1*.pdf.pdf" -print0 |
while IFS= read -r -d '' file; do

    # Remove any leftover temp.pdf so that a stale output from a previous
    # iteration can never be moved over the current file if ocrmypdf fails.
    rm -f ./temp.pdf

    printf '%s\n' "$file"
    # Count the font rows reported by pdffonts (the header lines do not contain "Type").
    num=$(pdffonts "$file" | grep -c "Type")

    if [ "$num" -lt 1 ]; then
        echo "File has no OCR so let's do it..."

        # Use this with extreme caution: the original is only replaced if
        # ocrmypdf exits successfully and temp.pdf was actually created.
        if ocrmypdf "$file" ./temp.pdf && [ -f ./temp.pdf ]; then
            mv ./temp.pdf "$file"
        else
            echo "OCR was unsuccessful so not updating file!"
        fi

    fi

done
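If overwriting files in the Zotero storage folder feels too risky, the replacement step could also keep a copy of the original. The following Python sketch shows one way to do that. It is an assumption-laden illustration rather than part of the script above: the function name and the “.bak” suffix are my own choices, and the backup files would need to be cleaned up by hand afterwards.

```python
import shutil
from pathlib import Path


def replace_with_backup(original, candidate, backup_suffix=".bak"):
    """Swap candidate in for original, keeping a backup copy of the original.

    Returns True if the swap happened, False if candidate was missing or empty.
    """
    original, candidate = Path(original), Path(candidate)
    if not candidate.is_file() or candidate.stat().st_size == 0:
        return False
    backup = original.with_name(original.name + backup_suffix)
    shutil.copy2(original, backup)              # keep the untouched original
    shutil.move(str(candidate), str(original))  # then overwrite it with the OCR output
    return True
```

For example, replace_with_backup("paper.pdf", "temp.pdf") would leave the new file at paper.pdf and the old one at paper.pdf.bak.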

Once files have had OCR added, you can call the Metadata extraction tool in Zotero.

Summing up

In my experience, ocrmypdf did an excellent job of adding OCR to journal articles. Sometimes Zotero was unable to identify enough of the metadata from the newly added text to get a positive match from Google Scholar, but this is mostly an issue with Zotero, not the OCR. Even then, a parent item with at least the authors and title was created, making indexing far better than it was before. When I checked, a lot of other items in my library also did not have the full metadata included in the parent item, so it is always better to import a full, proper bibliography entry from an external source than to rely on Zotero to extract metadata. However, if this is not possible, say you imported a load of PDF files given to you by a friend or colleague, the ability to add OCR is a massive help.

 




ASCII plotting on the command line terminal with eplot

If you want to plot something on the terminal in ASCII you can use “eplot”.

eplot itself is a Ruby script that acts as a frontend for gnuplot. eplot can be downloaded from the project’s GitHub page. It makes it easier to pipe numbers into gnuplot, which can otherwise be a bit of a hassle. It also has a dumb terminal mode which allows us to plot using ASCII. Plotting like this provides a way to quickly check data files without requiring the X Window System, which might not be available when logging in remotely over the terminal.

If your computer already has gnuplot and the Ruby runtime installed then the following should work.

Example:

cat model.ht | tail -n +2 | awk '{print $1,$2}' | eplot-master/eplot -d -r '[0:5]'

The -d option chooses ASCII “dumb terminal” mode and -r allows us to set the x axis range.
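For readers less used to shell pipelines, the tail and awk stages of the example simply drop the first (header) line and keep the first two columns. A rough Python equivalent is sketched below. It assumes the file holds whitespace-separated numeric columns after a single header line, which is my guess at the layout of model.ht, and the function name is hypothetical.

```python
def first_two_columns(lines, skip=1):
    """Roughly mimic tail -n +2 | awk '{print $1,$2}' for numeric data."""
    pairs = []
    for line in lines[skip:]:
        fields = line.split()
        # Ignore blank or short lines instead of printing empty fields as awk would.
        if len(fields) >= 2:
            pairs.append((float(fields[0]), float(fields[1])))
    return pairs
```

Given the lines of a file, it returns a list of (x, y) tuples, which is the same data that eplot receives on its standard input in the example.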

Some example output:

 

Now, of course if you want to do the same for a remote file that is possible over ssh.

ssh user@remote.server cat /somepath/model.ht | eplot-master/eplot -d

There are obviously many more options to play around with but I hope this gives a brief introduction to some of the capabilities available.




Successfully clearing ports in Salome (Code ASTER)

Building a geometry in the Salome graphical user interface (GUI).

How Salome tracks ports

When Salome is starting up, it checks for free ports on your system using a few built-in Python scripts. Then when you close Salome those ports should be freed up again for the next one. This has a number of uses, but one reason is to stop multiple instances of Salome trying to use the same port at once.

Those Python scripts keep track of the port numbers that are currently in use by storing the numbers in some configuration files (*.cfg) that are saved on your system. When Salome exits, those configuration files should be updated to recognize that the current port is being freed up again.

A possible problem with port tracking

Sometimes, however, those configuration files do not get updated. For example, if you are running Salome using a script in batch mode you can include a command to kill Salome properly, giving the correct port number. I have found in the past that this method has not been very reliable and so the configuration file keeps being updated with port numbers that are in use, but those numbers are never removed from the “in use” list even if they have actually been freed up on the system.

If you do a lot of scripting in Salome, you will find that it can crash frequently while you are writing and testing your scripts. When it does, the ports in use are often not released and so remain marked as “being used” in the port log file.

The result is that after a while a maximum number of ports is reached and Salome thinks that there are no ports free, so it will not start successfully, giving the following error message:

 

RuntimeError:

Can’t find a free port to launch omniNames

Try to kill the running servers and then launch SALOME again.

 

Your first instinct might be to check for running Salome instances. There may or may not be lots of Salome processes running on your system; in this post I am going to assume that Salome has closed properly. You can check whether Salome is running with:

ps -x | grep salome

 

Salome provides some Python scripts that should kill any running instances in a well-behaved way. For example, killSalome.py kills all of the instances running on your system, so you should use it with care:

~/bin/SALOME-8.2.0-UB16.04/BINARIES-UB16.04/KERNEL/bin/salome/killSalome.py

But if you already know the port number of a specific instance, killSalomeWithPort.py can be invoked to kill just that one, without affecting other instances that are currently running:

~/bin/SALOME-8.2.0-UB16.04/BINARIES-UB16.04/KERNEL/bin/salome/killSalomeWithPort.py 21116

As a last resort, you can kill all processes in your system that mention salome:

ps x | grep salome | awk '{print $1}' | xargs -n1 kill

OK, so now the ports should be freed up, right? Well maybe not! Your problem might indeed be that Salome is not updating the port config files correctly. Just killing the processes does not help because the next instance of Salome you launch will still check those files and think that there are no free ports. If this describes your current situation, don’t worry! I will now explain how to fix it.
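To convince yourself that a port listed as busy really is free at the operating-system level, you can simply try to bind to it. The sketch below is a generic socket check and is not how Salome itself tracks ports. The function name is my own.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if we can bind to (host, port), i.e. nothing else is holding it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True
```

For example, if port_is_free(2888) returns True while Salome still refuses to start, the stale entries in the config files are the more likely culprit.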

Clearing the port config files

For my version of Salome (see footnotes), hidden configuration files were being created in a number of locations. One example is the .omniORB_PortManager.cfg file, which in my case is located in my home directory at ~/.omniORB_PortManager.cfg. However, deleting this file did not solve the problem.

I then searched through my home drive for all *.cfg files, but none of the ones that came back were related to ports.

$ find . -name "*.cfg"
./bin/SALOME-8.2.0-UB16.04/BINARIES-UB16.04/KERNEL/share/salome/resources/kernel/channel.cfg
./bin/SALOME-8.2.0-UB16.04/BINARIES-UB16.04/SMESH/share/salome/plugins/smesh/padder.cfg
… plus a load of non-salome-related stuff…

This would also have found any “.omniORB_*_2888.cfg” (where 2888 is the port number) as mentioned here but those did not show up. There is a USERS directory within my salome installation directory structure at ~/bin/SALOME-8.2.0-UB16.04/BINARIES-UB16.04/SALOME/USERS, however it is empty and so does not contain any such .cfg files.

~/bin/SALOME-8.2.0-UB16.04/BINARIES-UB16.04/SALOME/USERS$ ls -a
.  ..

Finally, I found that for my installation Salome was using /tmp to store these hidden *.cfg files. The /tmp/ directory (and maybe other directories – see below) contained the following files:
  • .omniORB_PortManager.cfg
    • Stores a list of the busy ports
  • .omniORB_PortManager.lock
    • Locks the .omniORB_PortManager.cfg from being edited? I’m not sure exactly what it locks.
  • .omniORB_<username>_<hostname>_<port>.cfg
    • Should be deleted each time but if Salome is not doing this you will have many of these files with different port numbers.
  • .omniORB_<username>_last.cfg
    • I guess this probably stores the last port that was used, although I deleted it already before confirming this hunch.

On one of my systems (an HPC cluster) Salome was not storing the .omniORB_PortManager.cfg file in /tmp/. Instead it was located in /home/<username>/bin/salome/appli_V7_6_0/USERS

You can check which path is being used by looking in /home/<username>/bin/salome/appli_V7_6_0/bin/salome/PortManager.py. In there is a variable named “omniorbUserPath”, which is obtained from an environment variable that I could not see. Nonetheless, I modified PortManager.py to print this variable to screen, which told me that it was looking for /home/<username>/bin/salome/appli_V7_6_0/USERS/.omniORB_PortManager.cfg. Believe me, this was very frustrating to identify as I really thought I had deleted all necessary files, only for Salome to continue not finding a free port!

You can delete all of these files, and now when you run Salome it will start fresh, creating new files as it needs. Problem fixed! But…
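The clean-up can be sketched in Python as follows. This is a minimal illustration that assumes the files follow the naming patterns listed above (.omniORB_*.cfg and .omniORB_*.lock). The function names are my own, and the directory to pass in (/tmp, the USERS folder, or somewhere else) depends on your installation.

```python
from pathlib import Path


def find_omniorb_files(directory):
    """List the hidden omniORB port files found directly inside directory."""
    base = Path(directory)
    matches = set(base.glob(".omniORB_*.cfg")) | set(base.glob(".omniORB_*.lock"))
    return sorted(p for p in matches if p.is_file())


def clear_omniorb_files(directory, dry_run=True):
    """Delete the files found above; with dry_run=True they are only reported."""
    files = find_omniorb_files(directory)
    if not dry_run:
        for path in files:
            path.unlink()
    return files
```

Running clear_omniorb_files("/tmp") first with the default dry_run=True shows what would be removed. Make sure no Salome instance is running before deleting anything for real.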

Stop it happening again

The above fix will only help if we don’t cause the problem again. If you are creating many models or running many simulations from a controller script you do not want to keep reaching a hard limit of consecutive salome calls you can make, only to have to manually delete the omniORB config files again. What we really want is to make sure that Salome will update the config files correctly in future.

In the past I tried many times to use killSalomeWithPort.py. I did this by running salome with the --ns-port-log argument and providing a log file to store the port number. In outline, the controller script did the following:

import subprocess
# Launch Salome in batch mode and have it write its port number to a log file.
subprocess.call('<salome_distro>/salome --ns-port-log="somefolder/salomePort.log" -t -b script.py', shell=True)
with open('somefolder/salomePort.log', 'r') as port_file:
    killPort = int(port_file.readline())
# Kill only the instance that was using that port.
subprocess.call('<salome_distro>/bin/salome/killSalomeWithPort.py %s' % killPort, shell=True)

For some reason I could never get this to work successfully. I always ended up building a call to killSalome.py in my script, which kills all running Salome instances and meant that I had to build models consecutively, never in parallel. It also meant that if I had a script running I could not really use the Salome GUI because it too could be killed at any moment!

Here is the correct way to do it, which I only recently discovered through some trawling of the web (unfortunately I can no longer find the page where I saw it and so I can’t give credit to the author).
import os
# Place this at the end of your Salome script, where the salome module is already imported.
if not salome.sg.hasDesktop():
    from killSalomeWithPort import killMyPort
    killMyPort(os.getenv('NSPORT'))

The important part is inside the “if” clause. killSalomeWithPort contains a function called killMyPort and the current port used in our Salome instance is stored in an environment variable named “NSPORT”. So by passing that port number to the function we can kill Salome cleanly!

salome.sg.hasDesktop() simply returns True if we are in the Salome GUI, in which case we would not want our script to kill Salome for us. We only want that to happen when we are running in batch mode.

I’m wondering why I never found this solution before, as it would have saved me a lot of frustration, but there you go, that’s life! Now I am passing it on to you, have fun!

Footnotes

  • I am using the Salome version 8.2.0 for Ubuntu 16.04 x64 precompiled binaries. Different versions have different file structures and so your binary folder path might be different. If you search on the command line for e.g. runSalome.py, you should be able to identify where your salome binaries and Python scripts are located.
  • A lot of this info was gleaned from the Salome user forum, particularly from this 2015 post: http://www.salome-platform.org/forum/forum_10/519093933
  • In Linux, hidden files have a full stop (US: “period”) in front of their filename. To see them when listing a directory use “ls -a”.
  • For more information about why Salome needs to use ports, check out the Salome FAQ.



How to get up-to-date Python packages without bothering your cluster admin

If you have ever been stuck as a user on an out-of-date cluster without root access it can be frustrating to ask the admin guy to install packages for you. Even if they respond, by the time they get round to it you might have moved onto something else. The moment could be gone.

Luckily, as far as Python is concerned, the pyenv project allows users to install their own local Python version or even assign different versions to different directories/projects.

Image: Python natalensis (Southern African Python), Reptilia Plate 9, from Sir Andrew Smith, Illustrations of the zoology of South Africa, Reptilia. Smith, Elder, and Co., London 1840. Public domain image.

João Moreira has written a great beginner’s guide on the Amaral Lab homepage in order to get started. I now have the latest version of Python 2 (v2.7.12) installed along with essential packages like Scipy and Pandas, which I added using pip.

Installation of pyenv is easy.

curl -L https://raw.githubusercontent.com/yyuu/pyenv-installer/master/bin/pyenv-installer | bash 

Different versions of Python can then be installed with

pyenv install 3.4.0

Switching your global Python version is then as simple as typing

pyenv global 3.4.0
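A quick way to confirm that the switch took effect is to ask the interpreter itself which version it is and where it lives. The small sketch below uses only the standard library. The function name is my own.

```python
import sys


def interpreter_summary():
    """Return the running interpreter's version and executable path."""
    version = ".".join(str(n) for n in sys.version_info[:3])
    return "Python {} at {}".format(version, sys.executable)
```

Printing interpreter_summary() from a python session started after the pyenv global command should show the version you selected, with a path inside pyenv’s own directory (by default under ~/.pyenv) rather than the system location.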

From first impressions I can say I highly recommend pyenv, and will continue to learn about it over the coming days through using it. Please refer to João’s excellent post for more details.