Tag: Python

Plotting multivariate data with Matplotlib/Pylab: Edgar Anderson’s Iris flower data set

The problem of how to visualize multivariate data sets is something I often face in my work. When using numerical optimization we might have a single objective function and multiple design variables that can be represented by columnar data in the form {x1, x2, x3, … xn, y} a.k.a. NXY. With design spaces of more than a few dimensions it is difficult to visualize them in order to estimate the relationship between each independent variable and the objective, or perform a sensitivity study.

JensG / Pixabay

While perusing recent work in and tools for visualizing such data I stumbled across some nice examples of multivariate data plotting using a famous data set known as the “Iris data set”, also known as Fisher’s Iris data set or Edgar Anderson’s Iris flower data set. It contains data from 50 flowers each of three different flower species, collected in the Gaspé Peninsula. This set is not in the NXY form typical of optimization routines, but instead each flower has a number of parameters measured and tabulated; namely sepal length, sepal width, petal length and petal width. In other words there is no resultant Y data that is a function of the design space vector. Instead, it is interesting to plot relationships between the measured parameters to determine if they correlate with each other.

A quick internet search brings up a number of examples where the set has been plotted as a gridded set of subplots, using various software tools. For example, Mike Bostock’s blog post demonstrating his D3.js package, and the version on the wikipedia page.

I decided to try and code a Matplotlib script to generate a similar gridded multiplot from the data set. I did so within a Jupyter Notebook (formerly known as iPython Notebook) running Python 2.7. The data was imported using Pandas and made use of Matplotlib’s Pyplot module. Pandas was used to import the data but it could have been done in a number of different ways; it is just that Pandas is designed to work with csv files containing a mix of types.

The resulting image can be seen below.

Iris flower data set visualization using Matplotlib/pyplot.

Fisher’s Iris data set sometimes known as Anderson’s Iris data set, visualization by Simon Bance using Matplotlib/Pyplot. A multivariate data set introduced by Ronald Fisher in 1936 from data collected by Edgar Anderson on Iris flowers in the Gaspé Peninsula.

Here is the script:


"""
https://en.wikipedia.org/wiki/Iris_flower_data_set
A script for plotting multivariate tabular data as gridded scatter plots.
"""
import os
import pandas as pd
import matplotlib.pyplot as plt

inFile = r'iris.dat'

# Check if data file exists:
if not os.path.exists(inFile): sys.exit("File %s does not exist" % inFile)

rootFolder = os.path.dirname(os.path.abspath(inFile))

# Read in the data file
df = pd.read_csv(inFile, delimiter="\t")
headers = list(df.columns.values)
df.head(5) # Prints first n lines to check if we loaded the data file as expected.

# We also have n=4 distinct species in the Species column and I will
# list the species names so we can distinguish them later for plotting:
species = list(df.Species.unique()) # normal python list, thank you very much!
print type(species)

# Here we specify how many columns prepend and append the columns that we want to use.
# For Dakota this would include the objective function(s) column(s) appended to the end.
num_precols = 0
num_obj_fn = 1

# Work out the number of dimensions in each design vector:
num_dims = df.shape[1] - num_obj_fn # We know that there are 3 additional columns (and hope that it stays consistent in future)!
print "Our design vector has %s dimensions: %s" % (num_dims, headers[num_precols:-1])
gridshape = (num_dims, num_dims)
num_plots = num_dims**2
print "Our multivariate grid will therefore be of shape", gridshape, "with a total of", num_plots, "plots"

# Plot the data in a grid of subplots.
fig = plt.figure(figsize=(12, 12))

# Iterate over the correct number of plots.
n = 1

# Create an empty 2D list to store created axes. This alows us to edit them somehow.
axes = [[False for i in range(num_dims)] for j in range(num_dims)]

for j in range(num_dims):
for i in range(num_dims):

# e.g. plt.subplot(nx, ny, plotnumber)
ax = fig.add_subplot(num_dims, num_dims, n) # Plot numbering in this case starts from 1 not zero (MATLAB style indexing)!

# Choose your list of colours
colors = ['red', 'green', 'blue']

for index, s in enumerate(species):

# x axis: For each in the species list look at all rows with that value in the Species column.
# Use the ith column of that subset as the x series.
# y axis: Likewisem, but use the jth column.

if i != j:
ax.scatter(df.where(df['Species'] == s).ix[:,i], df.where(df['Species'] == s).ix[:,j], color=colors[index], label=s)
else:
# Put the variable name on the i=j subplots:
ax.text(0.25, 0.5, headers[i])
pass

# Set axis labels:
ax.set_xlabel(headers[i])
ax.set_ylabel(headers[j])

# Hide axes for all but the plots on the edge:
if j < num_dims - 1: ax.xaxis.set_visible(False) if i > 0:
ax.yaxis.set_visible(False)

if i == 1 and j == 0:
ax.legend(bbox_to_anchor=(3.5, 1), loc=2, borderaxespad=0., title="Species name:")

# Add this axis to the list.
axes[j][i] = ax

n += 1

plt.subplots_adjust(left=0.1, right=0.85, top=0.85, bottom=0.1)

plt.savefig("%s/iris.png" % rootFolder, dpi=300)
plt.show()

Further so-called “classic data sets” are listed at https://en.wikipedia.org/wiki/Data_set#Classic_data_sets.




The new default colormap for matplotlib is called “viridis” and it’s great!

It’s probably not news to anyone in data visualization that the most-used “jet” colormap (sic) (sometimes referred to as “rainbow”) is a bad choice for many reasons.

  • Doesn’t work when printed black & white
  • Doesn’t work well for colourblind people
  • Not linear in colour space, so it’s hard to estimate numerical values from the resulting image

The Matlab team recently developed a new colormap called “parula” but amazingly because Matlab is commercially-licensed software no-one else is allowed to use it!
The guys at Matplotlib have therefore developed their own version, based on the principles of colour theory (covered in my own BSc lecture courses on Visualization 🙂 ) that is actually an improvement on parula. The new Matplotlib default colormap is named “viridis” and you can learn all about it in the following lecture from the SciPy 2015 conference (YouTube ):

Viridis will be the new default colour map from Matplotlib 2.0 onwards, but users of v1.5.1 can also choose to use it using the cmap=plt.cm.viridis command.
I don’t know about you, but I like it a lot and will start using it immediately!




5 Tips for making finite element models with Salome

Salome is an open source software package used to create geometric models and finite element meshes for use in numerical simulations. It is also able to perform its own numerical simulations and has post-processing capabilities built in.

Here are my 5 tips for anyone who is interested in using Salome for model and mesh creation.

1. Practice manually first

This goes without saying. Although Salome has a powerful Python-based scripting capability, it is worth practicing with manual model generation. By that I mean, clicking with your mouse in the GUI. Manual practice lets you get familiar with the quirks of the Salome workflow, which has a different mentality to many other model generator programs.

2. Use the Notebook

In Salome the Notebook is a useful tool that allows you to set size parameters in the beginning as model variables. This means that later on you can edit the notebook variable values and re-build the model using different sizes; something that cannot be done normally, while typing size values in to construct geometric objects. The Notebook is like having some of the power of scripting while still making the model manually in the GUI. A great halfway step up to full scripting.

3. Learn to use scripting

Once you are familiar with Salome’s way of thinking, automation through scripting can become a critical component of being productive. Often it is necessary to compare many similar models where just one specific parameter (say the size of some part) is being changed.

In this recent paper we had to compare the effects of different soft magnetic defect thicknesses in a permanent magnetic grain.  By scripting the model generation I was able to rapidly generate similar models from one template script.

Reversal

4. Small size adjustments can get you out of a rut

Sometimes with Salome (and actually with other modelling software too) you find an annoying problem that makes no sense. For example, you made the model with no problems but now you increased the size of part A and it doesn’t want to build anymore. These kind of problems can occur because designing such a powerful piece of software to work universally is hard. You might be attempting something that the software designers did not anticipate.

One trick that often works for me goes like this. Change the parameter again by a small, arbitrary amount. If size A is 100 nm large, try changing it to 100.001. The simulation result will be virtually identical but you might find the model generation can now function without problems.

I remember in an earlier version of GiD, doing a boolean volume operation on two objects would fail if the edges of the objects overlapped perfectly. So I always had to make sure to translate one of the objects by a tiny amount. Then the Boolean operation would usually work. The makers of GiD fixed this in later versions and most good model software can handle such things.

The same problem often occurs with meshing algorithms. Depending on the type, an algorithm may be trying to fill a certain volume with elements but is being constrained for element size. This can cause the algorithm to fail, but a small change in the specified mesh size can get it to work again.

5. Check your dumped scripts

When making a model using the GUI it is really cool to be able to dump the model as a Python script, which can then be edited and run using the script mode. As I highlighted in another post, it helps to take care that Explode functions do not reorder the exploded entities. The default behaviour is to reorder, and it is easy to forget this when using the script, leading to confusion when the wrong objects are subject to later operations.

Exploding planet




Reading an OpenOffice spreadsheet into Python

Sometimes the best way to manage simulation output data is in a spreadsheet. Or the data you want to use is already in a spreadsheet.

Before reading it into Python for numerical analysis or plotting, or reading into xmgrace to plot, etc, it is normal to export the spreadsheet to a CSV or tab-separated file. But then you have two files. And if you do something in the spreadsheet you have to export it again and this is inefficient.

Here is a way to read an .ods file into a Python array:

  1. Install ODFPY. If you are running Ubuntu this can be found in the official software repository under the package name “python-odf”.
  2. Get the ODSReader script by Marco Conti . But wait! There is a patched version here that has some additional benefits, such as the ability to include empty cells.

To use it you will need to import the ODSReader class from ODSReader.py. Here is an example by myself where I cut off some unnecessary empty lines at the end:

#!/usr/bin/python

# Read an openoffice spreadsheet into a python array. 
# Uses python-odf a.k.a. ODFpy from Ubuntu repository.
# Uses ODSReader.py by Marko Conti (Apach License 2.0)
# Perform operations on the data using numpy

import sys
import numpy as np
sys.path.append("/mnt/FILES/WORK/scripts/ODSReader")
from ODSReader_simon import ODSReader as ods

def ReadODS(inFile, sheetName, cutEmpty=True):
    """
    Returns a given sheet from a given .ods file into a python array (list of lists). 
    """
    a = ods(inFile)
    sheet = a.getSheet(sheetName)
    # cut out empty lines for precision:
    if cutEmpty:
        removeList = []
        for index, line in enumerate(sheet):
            # If any of the items are filled (true) then don' add the line to the remove list. 
            b = False
            for item in line:
                if item:
                    b = True
            if not b:
                removeList.append(index)
        # Now we know which lines are empty, we can cut them out. 
        # Do it in reverse order because otherwise you will cut the wrong indices out. 
        removeList.reverse()
        for i in removeList:
            del sheet[i]
    return sheet
a = np.array(ReadODS("file_name.ods", "sheet_name", cutEmpty=True))

The numpy array now contains an array of strings so I modified the OSDReader.py script to eval() each cell item before returning the array. Eval() is safe if you are using your own data but can in some cases be exploited for malicious purposes; there is a lot of discussion about this online.