interactive scatterplot for visually exploring pairwise correlation

View the Project on GitHub mckennapsean/scorrplot

Documentation for scorr

An implementation of the s-CorrPlot in R using C++ and OpenGL.

About scorr

The s-CorrPlot is an interactive scatterplot for visually exploring pairwise correlation coefficients between variables in large datasets. Variables are projected as points on a scatterplot with respect to some user-selected variables of interest, driven by a geometric interpretation of correlation. The correlation of all other variables to the selected one is indicated by vertical gridlines in the plot. By selecting new variables of interest, a user can create simple tours of the correlation space through animations between different projections of the data.
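The geometric interpretation mentioned above can be checked in a few lines of base R. This is a minimal sketch, not code from the scorr package: the helper `normalize` and the toy vectors `x` and `y` are assumptions made for illustration. It shows that Pearson's correlation equals the dot product (the cosine of the angle) between two variables once each is centered and scaled to unit length.

```r
# Center a vector and scale it to unit length (hypothetical helper).
normalize <- function(v) {
  v <- v - mean(v)
  v / sqrt(sum(v^2))
}

set.seed(1)
x <- runif(10)          # variable of interest (10 observations)
y <- x + runif(10)      # another variable, loosely related to x

# Pearson correlation is the dot product (cosine of the angle)
# between the two normalized vectors.
cosine <- sum(normalize(x) * normalize(y))
all.equal(cosine, cor(x, y))
```

Under this view, projecting every normalized variable onto the normalized variable of interest places each point at a horizontal position equal to its correlation, which is why the coefficients can be read off vertical gridlines.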

For further details about the s-CorrPlot, please refer to our companion paper.

Tool Overview

The call scorr(data) initializes the s-CorrPlot tool with your input dataset. The default projection is based on the first two principal component vectors of the dataset, but you can select any other variable in the plot to animate to a new projection of the data. The principal components are shown as interactive bar charts on the left of the screen. The dataset must provide a factor label for every variable, and each unique label is given its own color, as shown in the boxes along the bottom. Lastly, the right side contains a parallel coordinates plot showing a profile of the observations for each variable, along with a text display listing all variables near the cursor.

Data Format

The s-CorrPlot expects a data frame in which each row is a variable and each column is an observation, with rownames and colnames naming them respectively. A final column of nominal labels (the factor data type) is also expected, so that each variable gets an associated color in the plot. Each row variable becomes a point in the final s-CorrPlot display, and the observations for these points are displayed in the parallel coordinates view. For further details, please see the package's provided data data(NAME) as well as the data frame constructed in the example.
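As a small illustration of this layout, the sketch below builds a data frame by hand in base R. All row names, column names, values, and label names are made up for the example; only the overall shape (variables in rows, observations in columns, a final factor column) follows the description above.

```r
# Three variables (rows) measured over four observations (columns).
# All names and values here are made up for illustration.
m <- matrix(c(1.0, 2.0, 3.0, 4.0,
              4.1, 2.9, 2.2, 0.8,
              0.5, 0.7, 0.4, 0.6),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("var-a", "var-b", "var-c"),
                            c("obs1", "obs2", "obs3", "obs4")))
data <- data.frame(m)

# Final column: one nominal label per variable, stored as a factor.
data$labels <- factor(c("group1", "group2", "group1"))

str(data)
```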

Mouse Interactions

Hovering the mouse over the data points updates the parallel coordinates plot and the text display with the variables near the cursor. You can left-click a variable or a bar chart to select it as the new primary variable of interest; the plot then animates accordingly, and the selected point moves to the right-most part of the circular plot. Correlation coefficients to this primary variable can be read directly from the vertical gridlines: anti-correlated variables are furthest away, and variables with zero correlation are near the middle.

Additionally, you can right-click the bars or points in order to select them as the new secondary variable of interest, and then the s-CorrPlot will animate the projection by moving all variables vertically to their new locations. It is also possible to shift-click on variables in order to affix them in the parallel coordinates plot for quick comparison against other variables.
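The effect of the two selections can be sketched in base R. This is an illustration under stated assumptions, not the package's implementation: the helpers `normalize` and `project` are hypothetical, and the sketch assumes the vertical axis is the part of the secondary variable that is orthogonal to the primary one. Under that assumption, changing the secondary variable leaves every horizontal position (the correlation to the primary variable) untouched, so points only move vertically.

```r
# Center a vector and scale it to unit length (hypothetical helper).
normalize <- function(v) {
  v <- v - mean(v)
  v / sqrt(sum(v^2))
}

# Project every row of m onto a primary axis and onto the part of the
# secondary axis that is orthogonal to the primary (assumed construction).
project <- function(m, primary, secondary) {
  z <- t(apply(m, 1, normalize))      # one normalized variable per row
  p <- normalize(primary)
  s <- normalize(secondary)
  s <- s - sum(s * p) * p             # remove the component along p
  s <- s / sqrt(sum(s^2))
  cbind(x = drop(z %*% p), y = drop(z %*% s))
}

set.seed(2)
m <- matrix(runif(5 * 8), nrow = 5)   # 5 variables, 8 observations

a <- project(m, m[1, ], m[2, ])
b <- project(m, m[1, ], m[3, ])       # same primary, new secondary

all.equal(a[, "x"], b[, "x"])              # horizontal positions are unchanged
all(a[, "x"]^2 + a[, "y"]^2 <= 1 + 1e-12)  # points stay inside the unit circle
```

The primary variable itself lands at x = 1, which matches the description of the selected point moving to the right-most part of the circular plot.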

Keyboard Interactions

To analyze correlation with Spearman’s rank correlation, press ’s’. You can return to the default of Pearson’s correlation by pressing ’p’.

To remove all highlighting and selected variables in the display, press ’c’.

To adjust transparency of data points, press the left and right arrow keys.

To adjust the size of data points, press the up and down arrow keys.

To adjust the kernel bandwidth for density estimation, press ’<’ or ’>’.

To change the background color of the plot, press ’b’.
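For readers unfamiliar with the two correlation measures toggled by ’s’ and ’p’, the base R sketch below compares them on toy data (the vectors `x` and `y` are made up for illustration). It uses the standard `cor()` function rather than anything from scorr, and shows that Spearman's coefficient is Pearson's coefficient computed on ranks.

```r
set.seed(3)
x <- runif(10)
y <- exp(3 * x) + runif(10)   # nonlinear relation with some noise

pearson  <- cor(x, y)                       # default method = "pearson"
spearman <- cor(x, y, method = "spearman")

# Spearman's coefficient equals Pearson's coefficient computed on ranks.
all.equal(spearman, cor(rank(x), rank(y)))
```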

Queries through R Console

Once you have initialized the s-CorrPlot tool, there are several commands to interact with the display from the R console.

The primary or secondary projection can be set directly by passing in a vector with the observations stored in the columns (or, more generally, with length equal to the number of observations in the input data). These can be input using scorr.set.primary() and scorr.set.secondary(). The current vectors can be read back into R with the associated get functions, scorr.get.primary() and scorr.get.secondary().

The values used for density estimation in the current view of the tool can be output back into R using the call scorr.get.density().

To retrieve the current top n correlated or anti-correlated points (i.e. the right-most and left-most points, respectively), call scorr.get.cor(n) or scorr.get.acor(n). Each returns a data frame with n rows, whose columns give each variable's index in the input data, its correlation to the primary variable, and its rowname. Similarly, you can call scorr.get.corr(r) or scorr.get.acorr(r) to output all variables with a correlation above or below the given coefficient, respectively.
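When the interactive tool is not available, a similar top-n query can be approximated in base R. The sketch below is not the scorr implementation: the function `top_correlated`, its arguments, and its output column names are hypothetical and only mirror the description above (index, correlation to the primary variable, and rowname).

```r
# m: numeric matrix with one variable per row
# primary: vector of observations for the variable of interest
top_correlated <- function(m, primary, n, decreasing = TRUE) {
  r <- apply(m, 1, function(v) cor(v, primary))
  idx <- order(r, decreasing = decreasing)[seq_len(min(n, length(r)))]
  data.frame(index = idx, correlation = r[idx], name = rownames(m)[idx])
}

set.seed(4)
m <- matrix(runif(6 * 10), nrow = 6,
            dimnames = list(paste("name", 1:6, sep = "-"), NULL))

top_correlated(m, m[3, ], 2)                      # most correlated (row 3 itself comes first)
top_correlated(m, m[3, ], 2, decreasing = FALSE)  # most anti-correlated
```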

To highlight a set of variables, pass a vector of indices to scorr.highlight.index() or a vector of names to scorr.highlight.name().

The call scorr.close() will close the interactive tool safely.

The function scorr.plot() takes data and two variables as input and produces a static version of the s-CorrPlot, in case the interactive tool cannot be built or run on a machine. This is useful on Windows computers, where the interactive tool cannot be compiled.

scorr Commands

Example Code

This is a code example to create a random dataset and view it in scorr.

# load library
library(scorr)

# create a uniform random dataset
# 10,000 variables with 10 observations each
m = matrix(runif(10 * 10000), ncol = 10, nrow = 10000)
data = data.frame(m)

# add random labels (as factors!)
data$labels = as.factor(ceiling((0.1 + runif(10000)) * 4))
levels(data$labels) = c("A", "B", "C", "D", "E")
# name each variable "name-1", "name-2", ...
rownames(data) = paste("name", 1 : 10000, sep = "-")

# start up the s-CorrPlot interactive display
# disable density estimation since random data is spread out

# get the number of variables or points on the screen

# select a new variable of interest (row 3, without the labels column)
scorr.set.primary(data[3, 1 : (ncol(data) - 1)])

# query the currently selected primary variable of interest (as index)

# query all points for their density estimation values
# MUST have enabled density estimation earlier to do so!
density = scorr.get.density()

# query the positive correlations between 0.95 and 1 in our current projection
corr = scorr.get.corr(0.95)

# query the negative correlations between -0.95 and -1 in our current projection
acorr = scorr.get.acorr(0.95)

# query the top 100 correlations in our current projection
cor = scorr.get.cor(100)

# query the top 100 anti-correlations in our current projection
acor = scorr.get.acor(100)

# highlight our first ten variables in the plot
scorr.highlight.index(1 : 10)

# highlight some other variables by their names
scorr.highlight.name(c("name-45", "name-101"))

# create a static plot of our current view
# must have called scorr() first in order to use the get functions!
scorr.plot(data, scorr.get.primary(), scorr.get.secondary(), 0.2)

# before closing R, be sure to close the s-CorrPlot!
# you can call the function below or just press 'q' in the tool
scorr.close()


If you have any difficulties or questions, please contact [email protected].