Plotting multivariate data with Matplotlib/Pylab: Edgar Anderson’s Iris flower data set

The problem of how to visualize multivariate data sets is something I often face in my work. When using numerical optimization we might have a single objective function and multiple design variables that can be represented by columnar data in the form {x1, x2, x3, … xn, y} a.k.a. NXY. With design spaces of more than a few dimensions it is difficult to visualize them in order to estimate the relationship between each independent variable and the objective, or perform a sensitivity study.

JensG / Pixabay

While perusing recent work in and tools for visualizing such data I stumbled across some nice examples of multivariate data plotting using a famous data set known as the “Iris data set”, also known as Fisher’s Iris data set or Edgar Anderson’s Iris flower data set. It contains data from 50 flowers each of three different flower species, collected in the Gaspé Peninsula. This set is not in the NXY form typical of optimization routines, but instead each flower has a number of parameters measured and tabulated; namely sepal length, sepal width, petal length and petal width. In other words there is no resultant Y data that is a function of the design space vector. Instead, it is interesting to plot relationships between the measured parameters to determine if they correlate with each other.

A quick internet search brings up a number of examples where the set has been plotted as a gridded set of subplots, using various software tools. For example, Mike Bostock’s blog post demonstrating his D3.js package, and the version on the wikipedia page.

I decided to try and code a Matplotlib script to generate a similar gridded multiplot from the data set. I did so within a Jupyter Notebook (formerly known as iPython Notebook) running Python 2.7. The data was imported using Pandas and made use of Matplotlib’s Pyplot module. Pandas was used to import the data but it could have been done in a number of different ways; it is just that Pandas is designed to work with csv files containing a mix of types.

The resulting image can be seen below.

Iris flower data set visualization using Matplotlib/pyplot.

Fisher’s Iris data set sometimes known as Anderson’s Iris data set, visualization by Simon Bance using Matplotlib/Pyplot. A multivariate data set introduced by Ronald Fisher in 1936 from data collected by Edgar Anderson on Iris flowers in the Gaspé Peninsula.

Here is the script:


"""
https://en.wikipedia.org/wiki/Iris_flower_data_set
A script for plotting multivariate tabular data as gridded scatter plots.
"""
import os
import pandas as pd
import matplotlib.pyplot as plt

inFile = r'iris.dat'

# Check if data file exists:
if not os.path.exists(inFile): sys.exit("File %s does not exist" % inFile)

rootFolder = os.path.dirname(os.path.abspath(inFile))

# Read in the data file
df = pd.read_csv(inFile, delimiter="\t")
headers = list(df.columns.values)
df.head(5) # Prints first n lines to check if we loaded the data file as expected.

# We also have n=4 distinct species in the Species column and I will
# list the species names so we can distinguish them later for plotting:
species = list(df.Species.unique()) # normal python list, thank you very much!
print type(species)

# Here we specify how many columns prepend and append the columns that we want to use.
# For Dakota this would include the objective function(s) column(s) appended to the end.
num_precols = 0
num_obj_fn = 1

# Work out the number of dimensions in each design vector:
num_dims = df.shape[1] - num_obj_fn # We know that there are 3 additional columns (and hope that it stays consistent in future)!
print "Our design vector has %s dimensions: %s" % (num_dims, headers[num_precols:-1])
gridshape = (num_dims, num_dims)
num_plots = num_dims**2
print "Our multivariate grid will therefore be of shape", gridshape, "with a total of", num_plots, "plots"

# Plot the data in a grid of subplots.
fig = plt.figure(figsize=(12, 12))

# Iterate over the correct number of plots.
n = 1

# Create an empty 2D list to store created axes. This alows us to edit them somehow.
axes = [[False for i in range(num_dims)] for j in range(num_dims)]

for j in range(num_dims):
for i in range(num_dims):

# e.g. plt.subplot(nx, ny, plotnumber)
ax = fig.add_subplot(num_dims, num_dims, n) # Plot numbering in this case starts from 1 not zero (MATLAB style indexing)!

# Choose your list of colours
colors = ['red', 'green', 'blue']

for index, s in enumerate(species):

# x axis: For each in the species list look at all rows with that value in the Species column.
# Use the ith column of that subset as the x series.
# y axis: Likewisem, but use the jth column.

if i != j:
ax.scatter(df.where(df['Species'] == s).ix[:,i], df.where(df['Species'] == s).ix[:,j], color=colors[index], label=s)
else:
# Put the variable name on the i=j subplots:
ax.text(0.25, 0.5, headers[i])
pass

# Set axis labels:
ax.set_xlabel(headers[i])
ax.set_ylabel(headers[j])

# Hide axes for all but the plots on the edge:
if j < num_dims - 1: ax.xaxis.set_visible(False) if i > 0:
ax.yaxis.set_visible(False)

if i == 1 and j == 0:
ax.legend(bbox_to_anchor=(3.5, 1), loc=2, borderaxespad=0., title="Species name:")

# Add this axis to the list.
axes[j][i] = ax

n += 1

plt.subplots_adjust(left=0.1, right=0.85, top=0.85, bottom=0.1)

plt.savefig("%s/iris.png" % rootFolder, dpi=300)
plt.show()

Further so-called “classic data sets” are listed at https://en.wikipedia.org/wiki/Data_set#Classic_data_sets.




The new default colormap for matplotlib is called “viridis” and it’s great!

It’s probably not news to anyone in data visualization that the most-used “jet” colormap (sic) (sometimes referred to as “rainbow”) is a bad choice for many reasons.

  • Doesn’t work when printed black & white
  • Doesn’t work well for colourblind people
  • Not linear in colour space, so it’s hard to estimate numerical values from the resulting image

The Matlab team recently developed a new colormap called “parula” but amazingly because Matlab is commercially-licensed software no-one else is allowed to use it!
The guys at Matplotlib have therefore developed their own version, based on the principles of colour theory (covered in my own BSc lecture courses on Visualization 🙂 ) that is actually an improvement on parula. The new Matplotlib default colormap is named “viridis” and you can learn all about it in the following lecture from the SciPy 2015 conference (YouTube ):

Viridis will be the new default colour map from Matplotlib 2.0 onwards, but users of v1.5.1 can also choose to use it using the cmap=plt.cm.viridis command.
I don’t know about you, but I like it a lot and will start using it immediately!




[PDF] “Grain-size dependent demagnetizing factors in permanent magnets” reprint update

The reprint of our Journal of Applied Physics (JAP) paper “Grain-size dependent demagnetizing factors in permanent magnets” has been updated since the old version was not being discovered by the Google Scholar crawler.

There is also now a version on arXiv. I hope that Google Scholar will now correctly index the paper so that it’s easier for people to find!

The full, correct reference for the paper is:

S. Bance, B. Seebacher, T. Schrefl, L. Exl, M. Winklhofer, G. Hrkac, G. Zimanyi, T. Shoji, M. Yano, N. Sakuma, M. Ito, A. Kato and A. Manabe, “Grain-size dependent demagnetizing factors in permanent magnets”, J. Appl. Phys. 116, 233903 (2014); http://dx.doi.org/10.1063/1.4904854

 




Equations in Gmail with the “TeX for Gmail” Chrome extension

Science via email

One thing scientists and engineers have to do daily is discuss collaborative work via email exchanges. This often includes the need to share and discuss mathematical equations and to represent variables with subscripts and superscripts or special characters; something that is tricky when you are emailing in plain text.

WikiImages / Pixabay

Of course it is possible to work around this problem! Email was invented by scientists, and for decades they have been communicating in this manner, using various conventions to convey the correct information using plaintext. However, if you are a Gmail user there is a nice extension that will make your equations look proper good.

Tex for GmailGmail-logo

TeX for Gmail is a Chrome browser extension that checks a Gmail email that you are writing for LaTeX markup and converts the markup to a visually prettier equation, using one of two modes. In Simple Math mode, subscripts and superscripts are correctly formatted but the current font is maintained and text remains ediatble. In Rich Math mode, the equation is rendered into TeX and replaced by an embedded image.  The email recipient doesn’t need the extension installed on their browser in order to read your nice equations!

Example

Original markup:

$E = mc^{2}$

Simple Math mode:

E = mc2

Rich Math mode:
E = mc^{2}

Issues

One problem; once the extension has converted my markup to formatted text, I cannot get the markup back. So editing a small mistake usually means re-doing all the curly brackets and other stuff that a TeX equation requires. The only workaround seems to be to stay vigilant and use Undo (Ctrl-z), but this doesn’t work when you notice a mistake in an equation that you wrote a while ago. One improvement could be the option to restore any equation to the original markup.

Conclusions

Overall, a great little tool to improve the clarity of science and maths communications over email. With a few small improvements it could be even better but it is already very usable.




HP Spectre x360 keyboard turning off; problem solved!

I recently bought a new notebook; the HP Spectre x360, a 13” convertible ultra-slim notebook PC. It is a really nice piece of hardware, chiselled from solid aluminium with great battery life, decent performance and a pretty usable keyboard.

When the laptop is folded back it automatically converts to tablet mode, wherein the keyboard is automatically turned off so that the keys are not accidentally pressed while using the touchscreen interface.

But since I bought it I have found that, on booting up the device in laptop mode, with it sitting open on my desk, the keyboard is mistakenly turned off. No amount of opening and closing the screen will trick the keyboard into coming on, and it is necessary to log in using the touchscreen interface and on-screen keyboard. After lots of opening and closing it seemed that eventually the keyboard would come back on but there seemed to be no logic to the problem at all.

This post on the official HP forum describes a similar problem. Unfortunately the HP rep simply assumes a hardware fault in the hinges and advises the owner to send it back for repair. However, once my keyboard is on, the connection stays on even as I adjust the screen angle, so it seems a hinge connection failure is not the issue here.

Finally, after much confusion, I found the answer: When the laptop is rotated sideways it attempts to switch to tablet mode, even when I have not adjusted the screen angle. The sensing it done by an orientation sensor in the base of the laptop, not in the hinges! I have found that, to turn the keyboard back on and to go to laptop mode I can tilt the whole laptop towards me slightly, and voila! So far it has worked every time. I wish that there had been some way for me to find this answer out earlier!